Re: Multicast packet loss
From: Wesley Chow <hidden>
Date: 2009-02-05 13:46:49
> > Maybe it's time to change the user side, and not try to find an
> > appropriate kernel :)
> >
> > If you know you have to receive N frames per 20us unit, then it's
> > better to use non-blocking sockets and run a loop like this:
> >
> > {
> >     usleep(20); // or try to compensate if this thread is slowed
> >                 // too much by the following code
> >     for (i = 0; i < N; i++) {
> >         while (recvfrom(socket[i], ....) != -1)
> >             receive_frame(...);
> >     }
> > }
> >
> > That way, you are pretty sure the network softirq handler won't have
> > to spend time trying to wake up one thread 400,000 times per second.
> > All CPU cycles can be spent in the NIC driver and network stack.
> >
> > Your thread will make 50,000 calls to nanosleep() per second, which
> > is not really expensive, then N recvfrom() calls per iteration. It
> > should work on all past, current and future kernels.
>
> +1 to this idea. Since the last oprofile traces showed significant
> variance in the time spent in schedule(), it might be worthwhile to
> investigate the effects of the application behavior on this. It might
> also be worth adding a systemtap probe to sys_recvmsg, to count how
> many times we receive frames on a working and a non-working system.
> If the app is behaving differently on different kernels, and it's
> affecting the number of times you go to get a frame out of the stack,
> that would affect your drop rates, and it would show up in
> sys_recvmsg.
I modified our test program to spin on a non-blocking socket, and it does seem to fix the problem, at least on 2.6.28.1, which was a kernel we had problems with. The number of context switches drops drastically: from 200,000+ to fewer than 50! I haven't done comprehensive tests yet, so I don't want to officially state any results. I'm also out today, but Kenny may get a chance to play with it.

Spinning on the socket is looking like an interesting solution, but we're a bit nervous about seeing our processes constantly running at 100% CPU. Does C++ have a MachineOnFire exception?

Wes
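
For reference, here is a minimal sketch of the sleep-then-drain variant suggested in the quoted text, which might be a middle ground between blocking reads and spinning at 100% CPU. This is not our actual test program. The names make_nonblocking(), drain_socket(), poll_once() and the frame_handler callback are made up for illustration, and the 2048-byte buffer is an assumed upper bound on frame size.

```c
/* Ask for POSIX.1-2008 declarations if no feature macro was set already. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical per-frame callback; may be NULL if frames are only counted. */
typedef void (*frame_handler)(const char *buf, size_t len);

/* Put fd into non-blocking mode. Returns 0 on success, -1 on error. */
int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Read every datagram currently queued on a non-blocking socket.
 * Returns the number of frames read, or -1 on a real error.
 */
long drain_socket(int fd, frame_handler handle)
{
    char buf[2048];              /* assumed maximum frame size */
    long frames = 0;

    for (;;) {
        ssize_t n = recvfrom(fd, buf, sizeof buf, 0, NULL, NULL);
        if (n >= 0) {
            if (handle)
                handle(buf, (size_t)n);
            frames++;
            continue;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return frames;       /* queue is empty, nothing left to read */
        if (errno == EINTR)
            continue;            /* interrupted by a signal, retry */
        return -1;
    }
}

/*
 * One iteration of the loop: sleep for sleep_us microseconds (0 means
 * pure spinning), then drain each of the nsock sockets in turn.
 * Returns the total number of frames read, or -1 on error.
 */
long poll_once(const int *socks, int nsock, unsigned sleep_us,
               frame_handler handle)
{
    long total = 0;

    if (sleep_us > 0) {
        struct timespec ts;
        ts.tv_sec = sleep_us / 1000000u;
        ts.tv_nsec = (long)(sleep_us % 1000000u) * 1000L;
        nanosleep(&ts, NULL);    /* an early wakeup is harmless here */
    }
    for (int i = 0; i < nsock; i++) {
        long got = drain_socket(socks[i], handle);
        if (got < 0)
            return -1;
        total += got;
    }
    return total;
}
```

Calling poll_once() in a loop with sleep_us set to 20 corresponds to the quoted suggestion, and sleep_us set to 0 corresponds to the pure spin I tested. Whether the sleeping version keeps up at our frame rates is untested; the actual sleep granularity depends on the kernel's timer resolution.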