Re: Asynchronous crypto layer.

From: Evgeniy Polyakov <hidden>
Date: 2004-11-01 05:12:52

On Sun, 2004-10-31 at 17:56, jamal wrote:

On Sun, 2004-10-31 at 04:13, Evgeniy Polyakov wrote:

quoted

On 30 Oct 2004 19:41:27 -0400
jamal [off-list ref] wrote:

quoted

Can you explain the "rate" or "speed" parameter ?

A driver writer can set the "rate" parameter to any number from 0 to 64k,
and it will show the speed of this driver in the current mode/type/operation.

[..]

quoted

That means that this driver performs DES 6 times faster than AES, but
these should be fair numbers, somehow measured and compared to other
drivers.
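
A minimal sketch of how a per-operation "rate" could drive selection.
Everything here is an assumption made for illustration: the enum, the
struct, the helper name, and the convention that a rate of 0 means
"unsupported" do not come from the acrypto code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation ids, for illustration only. */
enum op_kind { OP_DES, OP_AES, OP_MAX };

/* Hypothetical driver description: one relative speed per operation. */
struct drv_caps {
    const char *name;
    uint16_t rate[OP_MAX];  /* 0..65535; 0 is taken to mean "unsupported" */
};

/* Return the driver advertising the highest non-zero rate for op, or NULL. */
static const struct drv_caps *pick_by_rate(const struct drv_caps *drv,
                                           size_t n, enum op_kind op)
{
    const struct drv_caps *best = NULL;

    assert((int)op >= 0 && op < OP_MAX);
    for (size_t i = 0; i < n; i++) {
        if (drv[i].rate[op] == 0)
            continue;                      /* driver cannot do this op */
        if (best == NULL || drv[i].rate[op] > best->rate[op])
            best = &drv[i];
    }
    return best;
}
```

With a table such as { "hw", {6, 1} } and { "sw", {2, 3} }, the helper
picks "hw" for OP_DES and "sw" for OP_AES. The result is only as good as
the advertised numbers, which is the fairness problem raised above.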

So you have some init code that does testing? Or is this factor of six
part of the spec provided by chip vendor?

Chip vendor, but what if the vendor lies, or the speed was measured in a
different setup than another vendor's?
And what about software speeds?

quoted

This can also be achieved with the qlen parameter: if a driver writer sets
it to a bigger value, it means the driver/hardware works faster in this mode.
But a driver writer can set qlen too high just because he wants to,
without any regard to the driver/hardware capabilities.
It is not forbidden.

It is no different from, say, the way you would do ethernet drivers:
DMA ring sizes and link speeds. It is harder for ethernet drivers if link
speeds change (Linux net scheduling assumes fixed speed ;->)

quoted

And the last and most interesting one is the following:
we create a per session initialiser parameter "session_processing_time",
which will be the sum of the time slices during which the driver performed
operations on sessions. Since we already have the "scompleted" parameter,
which is equal to the number of completed (processed) sessions, we then
have the _fair_ speed of the driver/hardware in a given mode/operation/type.

Of course the load balancer should select the device with the lowest
session_processing_time/scompleted.
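
A minimal sketch of that selection rule. The field names follow the
mail, but the struct, the helper name, and the choice to treat a device
with no completed sessions as cost 0 (so it gets tried at least once)
are assumptions, not the acrypto implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device statistics for one mode/operation/type. */
struct dev_stat {
    uint64_t session_processing_time; /* summed time slices, any fixed unit */
    uint64_t scompleted;              /* number of completed sessions */
};

/*
 * Return the index of the device with the lowest
 * session_processing_time / scompleted, or -1 if there are no devices.
 * Ratios a/b and c/d are compared as a*d < c*b to avoid division;
 * this sketch assumes the products fit in 64 bits.
 */
static int lb_pick_fastest(const struct dev_stat *devs, size_t n)
{
    int best = -1;

    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (best < 0) {
            best = (int)i;
            continue;
        }
        const struct dev_stat *b = &devs[best];
        const struct dev_stat *d = &devs[i];

        if (b->scompleted == 0)
            continue;                 /* current best already costs 0 */
        if (d->scompleted == 0) {
            best = (int)i;            /* unmeasured device: try it */
            continue;
        }
        if (d->session_processing_time * b->scompleted <
            b->session_processing_time * d->scompleted)
            best = (int)i;
    }
    return best;
}
```

Because the ratio is measured rather than declared, it does not depend on
vendor-supplied numbers, and it applies to software implementations as well.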

I think the third variant is what we want. I will think about it some more
and will implement it soon.

I think you should be able to have multiple, configurable LB algos.

The second is already implemented.
The third I added to my TODO.
Should we add "rate" or use "qlen" instead?

quoted

I haven't studied your code; however, what Eugene is pointing out is
valuable detail/feedback.

Your queuing towards the crypto chip should have the ability to
batch, i.e. a sort of Nagle-like "wait until we have 10 packets/20KB or 20
jiffies" before you send everything in the queue to the chip.
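
A minimal sketch of such a batching rule, using the thresholds from the
example above. The struct and function names are hypothetical, and a
plain tick counter stands in for jiffies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Thresholds from the example in the mail; the names are made up. */
#define BATCH_MAX_PKTS   10UL
#define BATCH_MAX_BYTES  (20UL * 1024UL)
#define BATCH_MAX_TICKS  20UL            /* stand-in for jiffies */

struct batch_queue {
    unsigned long pkts;        /* packets waiting */
    unsigned long bytes;       /* bytes waiting */
    unsigned long first_tick;  /* tick at which the oldest packet arrived */
};

/* Account for one packet of len bytes queued at tick now. */
static void batch_enqueue(struct batch_queue *q, unsigned long len,
                          unsigned long now)
{
    assert(q != NULL);
    if (q->pkts == 0)
        q->first_tick = now;   /* start the age timer on the first packet */
    q->pkts++;
    q->bytes += len;
}

/* True once any limit is reached: packet count, byte count, or age. */
static bool batch_should_flush(const struct batch_queue *q, unsigned long now)
{
    if (q->pkts == 0)
        return false;
    return q->pkts >= BATCH_MAX_PKTS ||
           q->bytes >= BATCH_MAX_BYTES ||
           now - q->first_tick >= BATCH_MAX_TICKS; /* wrap-safe, unsigned */
}

/* Reset the counters after the whole batch went to the chip. */
static void batch_flush(struct batch_queue *q)
{
    q->pkts = 0;
    q->bytes = 0;
}
```

The age limit bounds how long a lone packet can wait when no further
traffic arrives, at the cost of up to 20 ticks of added latency.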

That is exactly how a crypto driver should be written.
The driver has its queue and the number of sessions in it, so it and only it
can decide when to begin processing them in the most effective way.

This should be above driver, really.
 
You should have one or more queues where the scheduler feeds off and
shoves to hardware. Perhaps several levels:

There are already several queues, one per crypto device.
The scheduler decides which device should handle a given session,
puts the session into the selected device's queue and calls the device's
->data_ready() callback.
It is the device that can handle several sessions per one "travel" of its
queue.
Or are you saying that the scheduler can hold sessions until their number
exceeds a threshold and then post them all at once to some
device's queue?
Why is that needed? It can be implemented easily by splitting
crypto_session_alloc() into two parts like James suggests, but I do not
see a reason for it. What if no session is allocated in the near
future? Then the first session will sit in a vacuum for a long time, when it
could already have been processed, even if not efficiently.

->LB-+
     +-> device1 scheduler --> driver queues/rings.
     |
     |
     +-> device2 scheduler --> driver queues/rings.
     |
     .
     .
     +-> devicen scheduler --> driver queues/rings.

If you look at the Linux traffic control, the stuff below LB is how it
behaves. That should be generic enough to _not_ sit in the driver.
This allows for adding smart algorithms to it, e.g. QoS, rate limiting,
feedback to LB so it could make smarter decisions etc.

It does not sit in the driver.
An ethernet device has hard_start_xmit(), which takes only one skb just
because the hardware can not handle several packets at once.
Crypto devices [often] have the ability to handle several sessions at once;
that is why struct crypto_device has a queue from which the device can take
more than one session at a time.

Sessions are placed into the device's queue by the scheduler, which has some
algorithms inside (like qos). The driver can then get one session and
process it, and thus it will look like a netdevice, but it can also take
several sessions and process them at once; a netdev just can not do that.
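
A minimal sketch of that split, assuming a plain singly linked queue and
no locking. The struct layouts, the wakeups counter and the helper names
are hypothetical; only the data_ready callback and the idea of a
per-device queue come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical session, reduced to a list link and an id. */
struct crypto_session {
    struct crypto_session *next;
    int id;
};

/* Hypothetical subset of a device: its queue and its notify callback. */
struct crypto_device {
    struct crypto_session *head, *tail;
    size_t qlen;
    unsigned long wakeups;                       /* for illustration only */
    void (*data_ready)(struct crypto_device *dev);
};

/* Example callback: a real driver would kick its hardware here. */
static void example_data_ready(struct crypto_device *dev)
{
    dev->wakeups++;
}

/* Scheduler side: append to the chosen device's queue, then notify it. */
static void sched_enqueue(struct crypto_device *dev, struct crypto_session *s)
{
    assert(dev != NULL && s != NULL);
    s->next = NULL;
    if (dev->tail)
        dev->tail->next = s;
    else
        dev->head = s;
    dev->tail = s;
    dev->qlen++;
    if (dev->data_ready)
        dev->data_ready(dev);
}

/* Driver side: take up to max sessions in one pass; returns the count. */
static size_t dev_dequeue_batch(struct crypto_device *dev,
                                struct crypto_session **out, size_t max)
{
    size_t n = 0;

    while (n < max && dev->head) {
        struct crypto_session *s = dev->head;

        dev->head = s->next;
        if (dev->head == NULL)
            dev->tail = NULL;
        s->next = NULL;
        dev->qlen--;
        out[n++] = s;
    }
    return n;
}
```

A driver that can only do one session at a time calls dev_dequeue_batch()
with max set to 1 and behaves like a netdevice; hardware that can process
several sessions together passes a larger max.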

quoted

As he points out (and i am sure he can back it with data ;->), that
given the setup cost, packet size, algo and CPU and bus speed, it may
not make sense to use the chip at all ;->

Michal has numbers: pure hardware beats software in certain setups
in a fully synchronous schema, so let's make SW and HW work in parallel.

Of course SW can encrypt 64 bytes faster than they can be transferred to
an old ISA crypto card, but it is worth doing for compressing a
9000-byte jumbo frame with LZW.

Would be interesting. I have seen the numbers from Eugene and they are
quite intriguing - but they are for the sync mode.

cheers,
jamal

-- 
	Evgeniy Polyakov

Crash is better than data corruption. -- Art Grabowski

Attachments

signature.asc [application/pgp-signature] 189 bytes
