Re: [PATCH v9 net-next 01/23] net/tcp: Prepare tcp_md5sig_pool for TCP-AO

From: Dmitry Safonov <hidden>
Date: 2023-08-11 21:03:02
Also in: lkml

Hi Eric,

On 8/8/23 10:58, Eric Dumazet wrote:

On Wed, Aug 2, 2023 at 7:27 PM Dmitry Safonov [off-list ref] wrote:

[..]

quoted

-       hash = crypto_alloc_ahash("md5", 0, CRYPTO_ALG_ASYNC);
-       if (IS_ERR(hash))
-               return;
-
-       for_each_possible_cpu(cpu) {
-               void *scratch = per_cpu(tcp_md5sig_pool, cpu).scratch;
-               struct ahash_request *req;
-
-               if (!scratch) {
-                       scratch = kmalloc_node(sizeof(union tcp_md5sum_block) +
-                                              sizeof(struct tcphdr),
-                                              GFP_KERNEL,
-                                              cpu_to_node(cpu));
-                       if (!scratch)
-                               return;
-                       per_cpu(tcp_md5sig_pool, cpu).scratch = scratch;
-               }
-               if (per_cpu(tcp_md5sig_pool, cpu).md5_req)
-                       continue;
-
-               req = ahash_request_alloc(hash, GFP_KERNEL);
-               if (!req)
-                       return;
-
-               ahash_request_set_callback(req, 0, NULL, NULL);
-
-               per_cpu(tcp_md5sig_pool, cpu).md5_req = req;
+       scratch_size = sizeof(union tcp_md5sum_block) + sizeof(struct tcphdr);
+       ret = tcp_sigpool_alloc_ahash("md5", scratch_size);
+       if (ret >= 0) {
+               tcp_md5_sigpool_id = ret;

tcp_md5_alloc_sigpool() can be called from multiple cpus,
yet you are writing over tcp_md5_sigpool_id here without any
spinlock/mutex/annotations ?

Yeah, it's writing-only: as long as there was a TCP-MD5 key in
the system, sigpool_id returned by tcp_sigpool_alloc_ahash() would stay
the same. The only time when it writes a different value - iff all
previous MD5 keys were released.

KCSAN would eventually file a report, and a compiler might very well
transform this to

tcp_md5_sigpool_id = random_value;
<window where readers might catch garbage>
tcp_md5_sigpool_id = ret;

Agree, haven't thought about the optimizing compiler nightmares.
Will add WRITE_ONCE().

[..]

quoted

+
+/**
+ * sigpool_reserve_scratch - re-allocates scratch buffer, slow-path
+ * @size: request size for the scratch/temp buffer
+ */
+static int sigpool_reserve_scratch(size_t size)
+{
+       struct scratches_to_free *stf;
+       size_t stf_sz = struct_size(stf, scratches, num_possible_cpus());

This is wrong.   num_possible_cpus() could be 2, with two cpus numbered 0 and 8.

Look for nr_cpu_ids instead.

Hmm, but it seems num_possible_cpus() was exactly what I wanted?

As free_old_scratches() does:

:	while (stf->cnt--)
:		kfree(stf->scratches[stf->cnt]);

So, that's just the number of previously allocated scratches, not
per-CPU index.

quoted

+       int cpu, err = 0;
+
+       lockdep_assert_held(&cpool_mutex);
+       if (__scratch_size >= size)
+               return 0;
+
+       stf = kmalloc(stf_sz, GFP_KERNEL);
+       if (!stf)
+               return -ENOMEM;
+       stf->cnt = 0;
+
+       size = max(size, __scratch_size);
+       cpus_read_lock();
+       for_each_possible_cpu(cpu) {
+               void *scratch, *old_scratch;
+
+               scratch = kmalloc_node(size, GFP_KERNEL, cpu_to_node(cpu));
+               if (!scratch) {
+                       err = -ENOMEM;
+                       break;
+               }
+
+               old_scratch = rcu_replace_pointer(per_cpu(sigpool_scratch, cpu),
+                                       scratch, lockdep_is_held(&cpool_mutex));
+               if (!cpu_online(cpu) || !old_scratch) {
+                       kfree(old_scratch);
+                       continue;
+               }
+               stf->scratches[stf->cnt++] = old_scratch;
+       }


I wonder why you are not simply using something around __alloc_percpu()

I presume that was a heritage of __tcp_alloc_md5sig_pool() that
allocated scratches this way. In pre-crypto-cloning versions per-CPU
ahash_requests were allocated with alloc_percpu().

But now looking closer, if I read pcpu allocator code correctly, it does
pcpu_mem_zalloc() for populating per-CPU chunks which depending on
pcpu_chunk_struct_size may be allocated with vmalloc() or kzalloc().

IOW, as long as CONFIG_NEED_PER_CPU_KM != n, which is:

: config NEED_PER_CPU_KM
:	depends on !SMP || !MMU

The memory, returned by pcpu allocator has no guarantee of being from
direct-map. Which I presume defeats the purpose of having any
scratch/temporary area for sg/crypto.

Am I missing something?

quoted

+       cpus_read_unlock();
+       if (!err)
+               __scratch_size = size;
+
+       call_rcu(&stf->rcu, free_old_scratches);
+       return err;
+}
+
+static void sigpool_scratch_free(void)
+{
+       int cpu;
+
+       for_each_possible_cpu(cpu)
+               kfree(rcu_replace_pointer(per_cpu(sigpool_scratch, cpu),
+                                         NULL, lockdep_is_held(&cpool_mutex)));

I wonder why bothering about freeing some space ? One scratch buffer
per cpu is really small.

When all TCP-MD5/TCP-AO keys removed, why not free some memory?

[..]

quoted

+       /* slow-path */
+       mutex_lock(&cpool_mutex);
+       ret = sigpool_reserve_scratch(scratch_size);
+       if (ret)
+               goto out;
+       for (i = 0; i < cpool_populated; i++) {
+               if (!cpool[i].alg)
+                       continue;
+               if (strcmp(cpool[i].alg, alg))
+                       continue;
+
+               if (kref_read(&cpool[i].kref) > 0)
+                       kref_get(&cpool[i].kref);
+               else
+                       kref_init(&cpool[i].kref);

This looks wrong. kref_init() should be called once, after allocation,
not based on kref_read().

Well, as long as it's the slow-path and cpool_mutex was grabbed, this
essentially re-allocating an entry that had it's last reference
released, but wasn't yet destroyed by __cpool_free_entry().

I think, if I interpreted it correctly, it was suggested-by Jakub:
https://lore.kernel.org/all/20230106175326.2d6a4dcd@kernel.org/T/#u (local)

[..]

quoted

+static void __cpool_free_entry(struct sigpool_entry *e)
+{
+       crypto_free_ahash(e->hash);
+       kfree(e->alg);
+       memset(e, 0, sizeof(*e));
+}
+
+static void cpool_cleanup_work_cb(struct work_struct *work)

Really this looks over complicated to me.

All this kref maintenance and cpool_mutex costs :/

Hmm, unsure: I can remove this sigpool thing and just allocate a crypto
tfm for every new setsockopt(). Currently, this allocates one tfm per
hash algorithm, and without it it would be one tfm per setsockopt()/key.

The supported scale currently is thousands of keys per-socket, which
means that with sizeof(struct crypto_ahash) == 104, that will increase
memory consumption by hundreds of kilobytes to megabytes. I thought
that's a reasonable thing to do.

Should I proceed TCP-AO patches without this manager thing?

quoted

+{
+       bool free_scratch = true;
+       unsigned int i;
+
+       mutex_lock(&cpool_mutex);
+       for (i = 0; i < cpool_populated; i++) {
+               if (kref_read(&cpool[i].kref) > 0) {
+                       free_scratch = false;
+                       continue;
+               }
+               if (!cpool[i].alg)
+                       continue;
+               __cpool_free_entry(&cpool[i]);
+       }
+       if (free_scratch)
+               sigpool_scratch_free();
+       mutex_unlock(&cpool_mutex);
+}

[..]

Thanks,
         Dmitry

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help