Thread (61 messages) 61 messages, 10 authors, 2019-10-01

RE: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption

From: Pascal Van Leeuwen <hidden>
Date: 2019-09-27 09:58:23
Also in: linux-crypto

quoted
That remark is just very stupid. The hardware ALREADY exists, and
more hardware is in the pipeline. Once this stuff is designed in, it
usually stays in for many years to come. And these are chips sold in
_serious_ quantities, to be used in things like wireless routers and
DSL, cable and FTTH modems, 5G base stations, etc. etc.
Yes, I very much mentioned routers. I believe those can happen much
more quickly.

But I would very much hope that that is not the only situation where
you'd see wireguard used.
Same here
I'd want to see wireguard in an end-to-end situation from the very
client hardware. So laptops, phones, desktops. Not the untrusted (to
me) hw in between.
I don't see why the crypto HW would deserve any less trust than, say,
the CPU itself. I would say CPU's don't deserve that trust at the moment.
quoted
No, these are just the routers going into *everyone's* home. And 5G
basestations arriving at every other street corner. I wouldn't call
that rare, exactly.
That's fine for a corporate tunnel between devices. Which is certainly
one use case for wireguard.

But if you want VPN for your own needs for security, you want it at
the _client_. Not at the router box. So that case really does matter.
Personally, I would really like it in my router box so my CPU is free
to do useful work instead of boring crypto. And I know there's nothing
untrusted in between my client and the router box, so I don't need to
worry about security there. But hey, that's just me.
And I really don't see the hardware happening in that space. So the
bad crypto interfaces only make the client _worse_.
Fully agree. We don't focus on the client side with our HW anyway.
(but than there may be that router box in between that can help out)
See?

But on to the arguments that we actually agree on:
quoted
Hey, no argument there. I don't see any good reason why the key can't
be on the stack. I doubt any hardware would be able to DMA that as-is
directly, and in any case, key changes should be infrequent, so copying
it to some DMA buffer should not be a performance problem.
So maybe that's an area for improvement: allow that to be on the stack.
It's not even just the stack. It's really that the crypto interfaces
are *designed* so that you have to allocate things separately, and
can't embed these things in your own data structures.

And they are that way, because the crypto interfaces aren't actually
about (just) hiding the hardware interface: they are about hiding
_all_ the encryption details.
Well, that's the general idea of abstraction. It also allows for 
swapping in any other cipher with minimal effort just _because_ the 
details were hidden from the application. So it may cost you some 
effort initially, but it may save you effort later.
There's no way to say "hey, I know the crypto I use, I know the key
size I have, I know the state size it needs, I can preallocate those
AS PART of my own data structures".

Because the interface is designed to be so "generic" that you simply
can't do those things, they are all external allocations, which is
inevitably slower when you don't have hardware.
Hmm, Ok, I see your point here. But most of those data structures 
(like the key) should be allocated infrequently anyway, so you can
amortize that cost over _many_ crypto operations.

You _do_ realize that performing the key schedule for e.g. AES with
AES-NI also takes quite a lot of time? So you should keep your keys
alive and not reload them all the time anyway.

But I already agreed with you that there may be cases where you just
want to call the library function directly. Wireguard just isn't one
of those cases, IMHO.
And you've shown that you don't care about that "don't have hardware"
situation, and seem to think it's the only case that matters. That's
your job, after all.
I don't recall putting it that strongly ... and I certainly never said
the HW acceleration thing is the _only_ case that matters. But it does
matter _significantly_ to me, for blatantly obvious reasons.
But however much you try to claim otherwise, there's all these
situations where the hardware just isn't there, and the crypto
interface just forces nasty overhead for absolutely no good reason.
quoted
I already explained the reasons for _not_ doing direct calls above.
And I've tried to explain how direct calls that do the synchronous
thing efficiently would be possible, but then _if_ there is hardware,
they can then fall back to an async interface.
OK, I did not fully get that latter part. I would be fine with such an
approach for use cases (i.e. fixed, known crypto) where that makes sense.
It would actually be better than calling the SW-only library directly
(which was my suggestion) as it would still allow HW acceleration as
an option ... 
quoted
quoted
So there is absolutely NO DOWNSIDE for hw accelerated crypto to just
do it right, and use an interface like this:

       if (!chacha20poly1305_decrypt_sg(sg, sg, skb->len, NULL, 0,
                                        PACKET_CB(skb)->nonce, key->key,
                                        simd_context))
               return false;
Well, for one thing, a HW API should not expect the result to be
available when the function call returns. (if that's what you
mean here). That would just be WRONG.
Right. But that also shouldn't mean that when you have synchronous
hardware (ie CPU) you have to set everything up even though it will
never be used.

Put another way: even with hardware acceleration, the queuing
interface should be a simple "do this" interface.
OK, I don't think we disagree there. I _like_ simple. As long as it
doesn't sacrifice functionality I care about.
The current crypto interface is basically something that requires all
the setup up-front, whether it's needed or not. And it forces those
very inconvenient and slow external allocations.
But you should do the setup work (if by "setup" you mean things like
cipher allocation, key setup and request allocation) only _once_ in a 
_long_ while. You can just keep using it for the lifetime of the 
application (or key, for the key setup part).

If I look at my cipher fallback  paths in the driver (the only places
where I actually get to _use_ the API from the "other" side), per 
actual indivual request they _only_ do - the rest is all preallocated
earlier:

_set_callback()
_set_crypt()
_set_ad()
_encrypt() or _decrypt()

And now that I look at that, I think the _set_callback()  could
move to the setup phase as it's always the same callback function.
Probably, in case of Wireguard, you could even move the _set_ad()
there as it's always zero and  the crypto driver is not allowed 
to overwrite it in the request struct anyway.

Also, I already agreed with you that _set_crypt(), _set_ad()
and _encrypt()/_decrypt() _could_ be conveniently wrapped into
one API call instead of 3 separate ones if we think that's worth it.

BUT ... actually ... I just looked at the actual _implementation_
and it turns out these are _inlineable_ functions defined in the
header file that _just_ write to some struct fields. So they 
should not end up being function calls at all(!!).
_Only_ the _encrypt()/_decrypt() invocation will end up with a
true (indirect) function call.

So where are all those allocations you mention? Have you ever
actually _used_ the Crypto API for anything?

Yes, if you actually want to _queue_ requests you need to use one
request struct for every queued operation, but you could just
preallocate an array of them that you cycle through. No need to do
those allocations in the hot path.

So is your problem really with the API _itself_ or with incorrect/
inefficient _use_ of the API in some places?
And I'm saying that causes problems, because it fundamentally means
that you can't do a good job for the common CPU  case, because you're
paying all those costs even when you need absolutely none of them.
Both at setup time, but also at run-time due to the extra indirection
and cache misses etc.
There is some cost sure, but is it _significant_ for any use case that
_matters_? You started bringing up optimization rules, so how about
Amdahls law?
quoted
Again, HW acceleration does not depend on the indirection _at all_,
that's there for entirely different purposes I explained above.
HW acceleration _does_ depend greatly on a truly async ifc though.
Can you realize that the world isn't just all hw acceleration?
Sure. But there's also a lot of HW acceleration _already out there_
that _could_ have been used if only the proper SW API's had existed.
Welcome to _my_ world.
Can you admit that the current crypto interface is just horrid for the
non-accelerated case?
Is agreeing that it is not perfect sufficient for you? :-)
Can you perhaps then also think that "maybe there are better models".
Sure. There's always better. There's also good enough though ...
quoted
So queue requests on one side, handle results from the other side
in some callback func off of an interrupt handler.
Actually, what you can do - and what people *have* done - is to admit
that the synchronous case is real and important, and then design
interfaces that work for that one too.
But they _do_ work for that case as well. I still haven't seen any
solid evidence that they are as horribly inefficient as you are 
implying for _real life_ use cases. And even if they are, then there's
the question whether that is the fault of the API or incorrect use 
thereof.
You don't need to allocate resources ahead of time, and you don't have
to disallow just having the state buffer allocated by the caller.

So here's the *wrong* way to do it (and the way that crypto does it):

 - dynamically allocate buffers at "init time"
Why is that so "wrong"? It sure beats doing allocations on the hot path.
But yes, some stuff should be allowed to live on the stack. Some other
stuf can't be on the stack though, as that's gone when the calling 
function exits while the background crypto processing still needs it.

And you don't want to have it on the stack initially and then have
to _copy_ it to some DMA-able location that you allocate on the fly
on the hot path if you _do_ want HW acceleration.
 - fill in "callback fields" etc before starting the crypto, whether
they are needed or not
I think this can be done _once_ at request allocation time.
But it's just one function pointer write anyway. Is that significant? 
Or: _if_ that is significant, you  shouldn't be using the Crypto API for 
that use case in the first place.
 - call a "decrypt" function that then uses the indirect functions you
set up at init time, and possibly waits for it (or calls the callbacks
you set up)

note how it's all this "state machine" model where you add data to the
state machine, and at some point you say "execute" and then either you
wait for things or you get callbacks.
Not sure how splitting data setup over a few seperate "function" calls
suddenly makes it a "state machine model" ...

But yes, I can understand why the completion handling through this
callback function seems like unnecessary complication for the SW case.
That makes sense for a hw crypto engine. It's how a lot of them work, after all.
Oh really?

Can't speak for other people stuff, but for our hardware you post a
request to it and then go off do other stuff while the HW does its thing
after which it will inform you it's done by means of an interrupt.
I don't see how this relates to the "statemachine model" above, there
is no persistent state involved, it's all included in the request.
The _only_ thing that matters is that you realize it's a pipeline that
needs to be kept filled and has latency >> throughput, just like your 
CPU pipeline.
But it makes _zero_ sense for the synchronous case. You did a lot of
extra work for that case, and because it was all a styate machine, you
did it particularly inefficiently: not only do you have those separate
allocations with pointer following, the "decrypt()" call ends up doing
an indirect call to the CPU implementation, which is just quite slow
to begin with, particularly in this day and age with retpoline etc.

So what's the alternative?

I claim that a good interface would accept that "Oh, a lot of cases
will be synchronous, and a lot of cases use one fixed
encryption/decryption model".

And it's quite doable. Instead of having those callback fields and
indirection etc, you could have something more akin to this:

 - let the caller know what the state size is and allocate the
synchronous state in its own data structures

 - let the caller just call a static "decrypt_xyz()" function for xyz
decryptrion.
Fine for those few cases where the algorithm is known and fixed.
(You do realize that the primary use cases are IPsec, dmcrypt and
fscrypt where that is most definitely _not_ the case?)

Also, you're still ignoring the fact that there is not one, single,
optimal, CPU implementation either. You have to select that as well,
based on CPU features. So it's either an indirect function call that
would be well predictable - as it's always the same at that point in
the program - or it's a deep if-else tree (which might actually be
implemented by the compiler as an indirect (table) jump ...) 
selecting the fastest implementation, either SW _or_ HW.
 - if you end up doing it synchronously, that function just returns
"done". No overhead. No extra allocations. No unnecessary stuff. Just
do it, using the buffers provided. End of story. Efficient and simple.
I don't see which "extra allocations" you would be saving here.
Those shouldn't happen in the hot path either way.
 - BUT.

 - any hardware could have registered itself for "I can do xyz", and
the decrypt_xyz() function would know about those, and *if* it has a
list of accelerators (hopefully sorted by preference etc), it would
try to use them. And if they take the job (they might not - maybe
their queues are full, maybe they don't have room for new keys at the
moment, which might be a separate setup from the queues), the
"decrypt_xyz()" function returns a _cookie_ for that job. It's
probably a pre-allocated one (the hw accelerator might preallocate a
fixed number of in-progress data structures).

And once you have that cookie, and you see "ok, I didn't get the
answer immediately" only THEN do you start filling in things like
callback stuff, or maybe you set up a wait-queue and start waiting for
it, or whatever".
I don't see the point of saving that single callback pointer write.
I mean, it's just _one_ CPU word memory write. Likely to the L1 cache.

But I can see the appeal of getting a "done" response on the _encrypt()/
_decrypt() call and then being able to immediately continue processing
the result data and having the async response handling separated off. 

I think it should actually be possible to change the API to work like
that without breaking backward compatibility, i.e. define some flag
specifying you actually _want_ this behavior and then define some
return code that says "I'm done processing, carry on please".
See the difference in models? One forces that asynchronous model, and
actively penalizes the synchronous one.

The other _allows_ an asynchronous model, but is fine with a synchronous one.
quoted
quoted
       aead_request_set_callback(req, 0, NULL, NULL);
This is just inevitable for HW acceration ...
See above. It really isn't. You could do it *after* the fact,
Before ... after ... the point was you need it. And it's a totally
insignificant saving anyway.
when
you've gotten that ticket from the hardware. Then you say "ok, if the
ticket is done, use these callbacks". Or "I'll now wait for this
ticket to be done" (which is what the above does by setting the
callbacks to zero).

Wouldn't that be lovely for a user?
Yes and no.
Because the user would _still_ need to handle the case of callbacks.
In case the request _does_ go to the HW accelerator.

So you keep the main processing path clean I suppose, saving some 
cycles there, but you still have this case of callbacks and having
multiple requests queued you need to handle as well. Which now 
becomes a separate _exception_ case.  You now have  two distinct 
processing paths you have to manage from your application.
How is that an _improvement_ for the user? (not withstanding that
it may be an improvement to SW only performance)
I suspect it would be a nice model for a hw accelerator too. If you
have full queues or have problems allocating new memory or whatever,
you just let the code fall back to the synchronous interface.
HW drivers typically _do_ use SW fallback for cases they cannot
handle. Actually, that works very nicely with the current API,
with the fallback cipher just being attached to the original
requests' callback function ... i.e. just do a tail call to 
the fallback cipher request.
quoted
Trust me, I have whole list of things I don't like about the
API myself, it's not exacty ideal for HW acceleration  either.
That';s the thing. It's actively detrimental for "I have no HW acceleration".
You keep asserting that with no evidence whatsoeever.
And apparently it's not optimal for you guys either.
True, but I accept the fact that it needs to be that way because some
_other_ HW may drive that requirement. I accept the fact that I'm not
alone in the world.
quoted
But the point is - there are those case where you _don't_ know and
_that_ is what the Crypto API is for. And just generally, crypto
really _should_ be switchable.
It's very much not what wireguard does.
And that's very much a part of Wireguard that is _broken_. I like
Wireguard for a lot of things, but it's single cipher focus is not
one of them. Especially since all crypto it uses comes from a single
source (DJB), which is frowned upon in the industry.

Crypto agility is a very important _security_ feature and the whole
argument Jason makes that it is actually a weakness is _bullshit_.
(Just because SSL _implemented_ this horribly wrong doesn't mean 
it's a bad thing to do - it's not, it's actually _necessary_. As 
the alternative would be to either continue using broken crypto
or wait _months_ for a new implementation to reach your devices
when the crypto gets broken somehow. Not good.)
And honestly, most of the switchable ones have caused way more
security problems than they have "fixed" by being switchable.
"most of the switchable ones"
You mean _just_ SSL/TLS. SSL/TLS before 1.3 just sucked security
wise, on so many levels. That has _nothing_ to do with the very
desirable feature of crypto agility. It _can_ be done properly and 
securely. (for one thing, it does not _need_ to be negotiable)
                 Linus
Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help