Re: [RFC] ath10k: implement dql for htt tx

From: Dave Taht <hidden>
Date: 2016-03-30 00:57:57

As a side note of wifi ideas complementary to codel, please see:

http://blog.cerowrt.org/post/selective_unprotect/

On Tue, Mar 29, 2016 at 12:49 AM, Michal Kazior [off-list ref] wrote:

On 26 March 2016 at 17:44, Dave Taht [off-list ref] wrote:

quoted

Dear Michal:

[...]

quoted

I am running behind on this patch set, but a couple quick comments.

[...]

quoted

 - no rrul tests, sorry Dave! :)

rrul would be a good baseline to have, but no need to waste your time
on running it every time as yet. It stresses out both sides of the
link so whenever you get two devices with these driver changes on them
it would be "interesting". It's the meanest, nastiest test we have...
if you can get past the rrul, you've truly won.

Consistently using tcp_fair_up with 1,2,4 flows and 1-4 stations as
you are now is good enough.

Doing a more VoIP-like test by slamming D-ITG into your test would be good...

quoted

Observations / conclusions:
 - DQL builds up throughput slowly on "veryfast"; in some tests it
doesn't get to reach peak (roughly 210mbps average) because the test
is too short

It looks like using the rate control info here for the initial and
ongoing estimates will react faster and better than dql can. I loved
the potential here for getting full rate for web traffic in the usual
2-second burst you get it in (see above blog entries).

On one hand - yes, rate control should in theory be "faster".

On the other hand DQL will react also to host system interrupt service
time. On slow CPUs (typically found on routers and such) you might end
up grinding the CPU so much you need deeper tx queues to keep the hw
busy (and therefore keep performance maxed). DQL should automatically
adjust to that while "txop limit" might not.
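To make the adaptation argument concrete, here is a toy sketch of a completion-driven byte limit. It is not the kernel's DQL algorithm, only an illustration of the idea: grow the limit when the hardware ran dry while data was being held back, shrink it when data is left standing. The class name, the growth and shrink rules, and all constants are assumptions made for this example.

```python
class ToyQueueLimit:
    """Toy byte limit driven by tx completions (loosely DQL-like, not DQL)."""

    def __init__(self, limit=3000, min_limit=1500, max_limit=1_000_000):
        self.limit = limit          # bytes allowed in flight to the hardware
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.in_flight = 0          # bytes handed to hardware, not yet completed

    def can_queue(self, nbytes):
        return self.in_flight + nbytes <= self.limit

    def queue(self, nbytes):
        self.in_flight += nbytes

    def complete(self, nbytes, backlog_waiting):
        """Call from the tx-completion path.

        backlog_waiting: True if upper layers had data held back by the limit.
        """
        self.in_flight -= nbytes
        if self.in_flight == 0 and backlog_waiting:
            # Hardware went idle although more data existed: starved, so grow.
            self.limit = min(self.max_limit, self.limit + nbytes)
        elif self.in_flight > 0:
            # Data is still standing in the hardware queue: shrink gently.
            shrink = min(self.in_flight, self.limit // 8)
            self.limit = max(self.min_limit, self.limit - shrink)


q = ToyQueueLimit(limit=3000)
q.queue(3000)
q.complete(3000, backlog_waiting=True)   # ran dry with data held back
print(q.limit)
q.queue(6000)
q.complete(1500, backlog_waiting=False)  # 4500 bytes still queued
print(q.limit)
```

A slow host that services completions late sees the first case more often, so the limit drifts upward without any knowledge of rates or txop lengths.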

Mmmm.... current multi-core generation arm routers should be fast enough.

Otherwise, point taken (possibly). Even intel i3 boxes need offloads to get to
line rate.

quoted

It is always good to test codel and fq_codel separately, particularly
on a new codel implementation. There are so many ways to get codel
wrong or add an optimization that doesn't work (speaking as someone
that has got it wrong often)

If you are getting a fq result of 12 ms, that means you are getting
data into the device with a ~12ms standing queue there. On a good day
you'd see perhaps 17-22ms for "codel target 5ms" in that case, on the
rtt_fair_up series of tests.

This will obviously depend on the number of stations you have data
queued to. Estimating codel target time requires smarter tx
scheduling. My earlier (RFC) patch tried doing that.

and I loved it. ;)

quoted

if you are getting a pure codel result of 160ms, that means the
implementation is broken. But I think (after having read your
description twice), the baseline result today of 160ms of queuing was
with a fq_codel *qdisc* doing the work on top of huge buffers,

Yes. The 160ms is with fq_codel qdisc with ath10k doing DQL at 6mbps.
Without DQL ath10k would clog up all tx slots (1424 of them) with
frames. At 6mbps you typically want/need a handful (5-10) of frames to
be queued.
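A rough back-of-the-envelope check of those numbers, assuming 1500-byte frames and counting payload bits only (no 802.11 framing, aggregation or retry overhead; the frame size and helper name are assumptions for this sketch):

```python
def queue_drain_ms(frames, rate_mbps, frame_bytes=1500):
    """Time in ms to transmit `frames` frames at `rate_mbps`, payload bits only."""
    bits = frames * frame_bytes * 8
    return bits / (rate_mbps * 1e6) * 1e3


for frames in (5, 10, 1424):
    print(f"{frames:5d} frames at 6 Mbps: {queue_drain_ms(frames, 6):8.1f} ms")
```

Under these assumptions 5-10 frames correspond to roughly 10-20 ms of queue, while all 1424 tx slots would hold close to 3 seconds of data.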

quoted

the
results a few days ago were with a fq_codel 802.11 layer, and the
results today you are comparing, are pure fq (no codel) in the 802.11e
stack, with fixed (and dql) buffering?

Yes. codel target in fq_codel-in-mac80211 is hardcoded at 20ms now
because there's no scheduling and hence no data to derive the target
dynamically.

Well, for these simple 2 station tests, you could halve it, easily.

With ECN enabled on both sides, I tend to look at the groupings of the
ECN marks in wireshark.

quoted

if so. Yea! Science!

...

One of the flaws of the flent tests is that conceptually they were
developed before the fq stuff won so big, and looking hard at the
per-queue latency for the fat flows requires either looking hard at
the packet captures or sampling the actual queue length. There is that
sampling capability in various flent tests, but at the moment it only
samples what tc provides (Drops, marks, and length) and it does not
look like there is a snapshot queue length exported from that ath10k
driver?

Exporting tx queue length snapshot should be fairly easy. 2 debugfs
entries for ar->htt.max_num_pending_tx and ar->htt.num_pending_tx.
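If such entries existed, a test harness could sample them alongside the tc statistics. The sketch below assumes two plain-integer files; the file names `htt_num_pending_tx` and `htt_max_num_pending_tx` and the directory are hypothetical, since no such debugfs entries are defined in this thread. The demo uses a temporary directory in place of debugfs.

```python
import os
import tempfile


def read_tx_occupancy(debugfs_dir,
                      pending_name="htt_num_pending_tx",      # hypothetical name
                      max_name="htt_max_num_pending_tx"):     # hypothetical name
    """Return (pending, limit, fill_ratio) read from two integer files."""
    def read_int(name):
        with open(os.path.join(debugfs_dir, name)) as f:
            return int(f.read().strip())

    pending = read_int(pending_name)
    limit = read_int(max_name)
    ratio = pending / limit if limit else 0.0
    return pending, limit, ratio


# Demo against fake files standing in for the debugfs entries.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "htt_num_pending_tx"), "w") as f:
        f.write("356\n")
    with open(os.path.join(d, "htt_max_num_pending_tx"), "w") as f:
        f.write("1424\n")
    print(read_tx_occupancy(d))
```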

K. Still running *way* behind you on getting stuff up and running. The
ath10ks I ordered were backordered, should arrive shortly.

quoted

...

As for a standing queue of 12ms at all in wifi... and making the fq
portion work better, it would be quite nice to get that down a bit
more. One thought (for testing purposes) would be to fix a txop at
1024,2048,3xxxus for some test runs. I really don't have a feel for
framing overhead on the latest standards. (I loathe the idea of
holding the media for more than 2-3ms when you have other stuff coming
in behind it...)

 Another is to hold off preparing and submitting a new batch of
packets; when you know the existing TID will take 4ms to transmit,
defer grabbing the next batch for 3ms. Etc.
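The deferral idea can be written as simple arithmetic. This sketch assumes the caller already has a rate estimate (for example from rate control or transmit history) and that one millisecond is enough to prepare the next batch; both the function names and the margin are assumptions for illustration, and framing overhead is ignored.

```python
def airtime_us(nbytes, rate_mbps):
    """Payload airtime in microseconds: bits / (Mbit/s) gives microseconds."""
    return nbytes * 8 / rate_mbps


def defer_us(batch_bytes, rate_mbps, prep_margin_us=1000):
    """How long to wait before fetching the next batch for this TID."""
    return max(0.0, airtime_us(batch_bytes, rate_mbps) - prep_margin_us)


# A 50000-byte batch at an estimated 100 Mbps takes about 4 ms on the air,
# so the next batch can be fetched about 3 ms from now.
print(airtime_us(50000, 100), defer_us(50000, 100))
```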

I don't think hardcoding timings for tx scheduling is a good idea.

wasn't suggesting that, was suggesting predicting a minimum time to
transmit based on the history.

I believe we just need a deficit-based round robin with time slices. The
problem I see is time slices may change with host CPU load. That's why
I'm leaning towards more experiments with DQL approach.

OK.
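For reference, a minimal sketch of deficit round robin where the deficit is counted in airtime rather than bytes. It assumes a fixed, known rate per station and a fixed quantum; the function name, data layout and numbers are assumptions for illustration, not anything from the ath10k or mac80211 code.

```python
from collections import deque


def drr_airtime(stations, quantum_us, rounds):
    """stations: dict name -> (rate_mbps, deque of frame sizes in bytes).

    Returns a list of (name, nbytes, airtime_us) in transmit order.
    """
    deficit = {name: 0.0 for name in stations}
    order = []
    for _ in range(rounds):
        for name, (rate_mbps, frames) in stations.items():
            if not frames:
                deficit[name] = 0.0      # idle stations do not bank airtime
                continue
            deficit[name] += quantum_us
            while frames:
                cost = frames[0] * 8 / rate_mbps   # airtime of head frame, us
                if cost > deficit[name]:
                    break
                deficit[name] -= cost
                order.append((name, frames.popleft(), cost))
            if not frames:
                deficit[name] = 0.0
    return order


stations = {
    "slow": (6, deque([1500] * 10)),     # 2000 us per frame
    "fast": (120, deque([1500] * 100)),  # 100 us per frame
}
sent = drr_airtime(stations, quantum_us=2000, rounds=1)
print(sum(1 for s in sent if s[0] == "slow"),
      sum(1 for s in sent if s[0] == "fast"))
```

With equal airtime quanta the slow station sends one frame per round while the fast one sends twenty, which is the fairness property in question. The open issue raised above is how to size the quantum when host CPU load changes.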

quoted

It would be glorious to see wifi capable of decent twitch gaming again...

quoted

 - slow+fast case still sucks but that's expected because DQL hasn't
been applied per-station

 - sw/fq has lower peak throughput ("veryfast") compared to sw/base
(this actually proves the current - and very young, to say the least - ath10k
wake-tx-queue implementation is deficient; ath10k_dql improves it and
sw/fq+ath10k_dql climbs up to the max throughput over time)


To sum things up:
 - DQL might be able to replace the explicit txop queue limiting
(which requires rate control info)

I am pessimistic. Perhaps as a fallback?

At first I too was considering DQL as a nice fallback, but the more I
think about it the more it makes sense to use it as the main source for
deriving time slices for tx scheduling.

I don't really get how dql can be applied per station in its current form.

quoted

 - mac80211 fair queuing works

:)

quoted

A few plots for quick and easy reference:

  http://imgur.com/a/TnvbQ


Michał

PS. I'm not feeling comfortable attaching 1MB attachment to a mailing
list. Is this okay or should I use something else next time?

I/you can slam results into the github blogcerowrt repo and then pull
out stuff selectively....

Good idea, thanks!

You got commit privs.


Michał
