[USRP-users] In search of 200 MSA/sec (Windows 7)

tilla at comcast.net
Thu Mar 17 13:38:05 EDT 2016

Here is a much more detailed stack trace of UHD from send down. 

Hope this helps. 

Please let me know if there is any other introspection that I can provide for you. 

----- Original Message -----

From: "Neel Pandeya" <neel.pandeya at ettus.com> 
To: "Tilla" <tilla at comcast.net> 
Cc: "usrp-users" <usrp-users at lists.ettus.com> 
Sent: Wednesday, March 16, 2016 12:20:41 AM 
Subject: Re: [USRP-users] In search of 200 MSA/sec (Windows 7) 

Hello Tilla: 

I'm not sure what further to suggest, as you've tried almost everything. The only thing I'll mention is upgrading Windows to either version 8.1 or to Server 2012 R2. Other customers have reported improved performance when upgrading to more recent versions of Windows. And as we discussed, you can also look into upgrading the BIOS, but I think upgrading the OS will be more productive. We have also seen that performance can suffer on multi-CPU systems because of NUMA issues and data going over the QPI bus, although it seems like you're at least partially addressing these issues with your attention to CPU affinity and interrupt coalescing. One last suggestion I'd have would be to run on a single-CPU/single-socket Core i7 system, with 8 cores, and with 3.5+ GHz clock speed. I don't think we've run a Windows system at 200 Msps, at least I have not myself, but we have done it under Linux. 

--Neel Pandeya 

On 15 March 2016 at 05:25, Tilla < tilla at comcast.net > wrote: 

Thanks for both of your thoughts. 

Yes, I did apply the fast datagram setting. 

I agree with your affinity concepts. The reason I did that here is that I know the NIC is directly attached to the CPU I affinitized to. At times, I would see the converter/tx thread get scheduled on the other CPU and now all that traffic would have to go over QPI to the other CPU before being sent to the NIC. Another reason is to make it look like the second CPU isn’t there to the application, just for giggles… 

I can try to look further into the send buffers, but with all the inlines and layers of virtual functions, the Windows profiler doesn't cope well, and I believe that function was the lowest level it reported… I will see if there are more detailed profiler/compiler settings I can turn on. I have varied num_send_frames from 16 to 64 without any noticeable change in performance. 

I do have the latest NIC driver. 

I don’t perform any IO with this simple app, just read samples into a buffer and do nothing with them… 

I will look into the BIOS version. 

Power management is covered too; it is set to high performance. 

I am trying to procure some more equipment in an area with fewer constraints to test Windows Server 2012… 

Do you guys have a Windows setup that can pump 200 MSA/sec through, or just Linux? 

Thanks again, 

From: Neel Pandeya [mailto: neel.pandeya at ettus.com ] 
Sent: Monday, March 14, 2016 5:20 PM 
To: tilla < tilla at comcast.net > 
Cc: usrp-users < usrp-users at lists.ettus.com > 
Subject: Re: [USRP-users] In search of 200 MSA/sec (Windows 7) 

Hello Tilla: 

You did just about everything that I might suggest, and your system certainly sounds powerful enough to handle 200 Msps. The Intel X710 is the latest 10 GbE card from Intel, and should perform well and be able to handle this data rate. There are only a few other things that I might mention. Have you installed the latest Intel X710 driver? Would you be able to upgrade Windows? Several customers have reported that they were able to achieve improved performance and throughput by upgrading from Windows 7 to Windows 8 or 10. I'm sure you're already using SSDs, and I'm not sure if you're being limited by disk I/O, but perhaps you could try a RAID setup or a RAM disk? Have you turned off all power management, and disabled ACPI in the BIOS? And speaking of the BIOS, some customers have also reported that a BIOS upgrade improved throughput, although your system looks like it's new, so the BIOS firmware version should be very recent. I agree with Marcus that the FastSendDatagramThreshold registry setting might be helpful. I'm not sure if it behaves differently between Windows 7, 8, 10, and Server 2012. It's the only registry setting tweak that I'm aware of. 

Please let us know if you're able to make any further progress, and whether you see improved results with Windows Server 2012. 


On 12 March 2016 at 10:43, Marcus Müller < usrp-users at lists.ettus.com > wrote: 

Hi Tilla, 

Agreeing, this needs further investigation. 

> I have done all the basic NIC tuning that is frequently discussed here: jumbo packets, disable interrupt coalescing, increase buffer sizes... 
>
> I have done a huge amount of other tuning: disable numa, pcie performance mode, process affinitized to same cpu NIC is directly connected to, hyperthreading disabled, hand optimized compilation, and a boatload of other stuff. 


In my (Linux) experience, disabling the automatisms of the OS is most of the time not beneficial to performance; in fact, I remember a case where setting CPU affinity made it hard for the kernel to optimally schedule kernel drivers and userland, so that a significant increase in performance was observed after affinity was no longer set. Of course, trying to set affinity is still a very good idea; after all, you have knowledge of the application that your OS lacks. I'd definitely leave hyperthreading on; it's very rarely the reason for slowdown in programs that seem to be IO-bound. You should definitely **enable** interrupt coalescing; no way a system is going to keep up if every single packet of a 200 MS/s transmission causes a hardware interrupt (with 8000 B per packet, that'd be 100,000 interrupts per second...). 
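
For reference, the interrupt figure above can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch: it assumes 4 bytes per complex sample on the wire (16-bit I plus 16-bit Q), which is not stated explicitly in the thread, and it treats the 8000 B as pure sample payload. The function name is made up for illustration.

```python
def packets_per_second(sample_rate_sps, bytes_per_sample, payload_bytes):
    """Packets (and, without coalescing, interrupts) needed per second."""
    return sample_rate_sps * bytes_per_sample / payload_bytes

SAMPLE_RATE = 200e6      # 200 MS/s, from the thread
BYTES_PER_SAMPLE = 4     # assumption: 16-bit I + 16-bit Q per sample
PAYLOAD_BYTES = 8000     # per-packet figure quoted above

pps = packets_per_second(SAMPLE_RATE, BYTES_PER_SAMPLE, PAYLOAD_BYTES)

# Raw sample data rate on the link, ignoring packet headers
link_gbps = SAMPLE_RATE * BYTES_PER_SAMPLE * 8 / 1e9

print(f"{pps:,.0f} packets/s, {link_gbps:.1f} Gbit/s of sample data")
```

Under these assumptions the sample data alone fits within a 10 GbE link, so the per-packet overhead on the host is the more likely concern.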

The one setting that I don't find in your list is the Fast Datagram Threshold setting [1]; did you do that? 

What I take away from your execution time percentages is that most of the time is actually spent in the managed send buffers (see the boost::function1<> line, 38%); looking more closely into that would be interesting; can you do that? 
Also, agreeing, there's not much that one can do about the converter; don't be fooled by the fact that it also does byte conversions; it moves the data out of the network card buffer into the recv()-buffer, and usually you hit a memory bandwidth wall there, if you optimize the numerical operation enough. 

Best regards, 

[1] http://files.ettus.com/manual/page_transport.html#transport_udp_windows ; https://raw.githubusercontent.com/EttusResearch/uhd/master/host/utils/FastSendDatagramThreshold.reg 

On 09.03.2016 17:00, tilla--- via USRP-users wrote: 


Over the past 2 weeks, I have been working towards achieving 200 MSA/sec on win7 64 bit, UHD 3.9.2, X310 w/WBX 120 daughtercard. 

I have a pretty good processor, 3.5 GHz Xeon E5-2637 v2, plenty of memory, X710-DA2 NIC in a x8 Gen 3 slot with verified Gen3 trained speed, a pretty beefy box in general. 

I have gathered a bunch of data and was looking for some further thoughts. 

I have done all the basic NIC tuning that is frequently discussed here: jumbo packets, disable interrupt coalescing, increase buffer sizes... 

I have done a huge amount of other tuning: disable numa, pcie performance mode, process affinitized to same cpu NIC is directly connected to, hyperthreading disabled, hand optimized compilation, and a boatload of other stuff. 

This is a simple prototype of a transmit only application, 1 thread, 1 transmit buffer, just send the same buffer as fast as possible. Transmit loop is as simple as possible (bottom of page 2 in attachment). 

Attached is some performance data. 

50 MSA/sec, very good performance, occasional underflow, max cpu on a core ~35% (Figure 1 screenshots). 

100 MSA/sec, decent performance, more underflows but still "reasonable", max CPU on a core ~70% (Figure 2 screenshots). 

Hand-wavy observation based upon the above numbers: CPU load scales linearly; when the sampling rate doubles, max CPU utilization doubles, as one would expect... 

Soooooo, now when going to 200 MSA/sec, constant underflows, not much transmission at all. 

Extrapolating based upon 100 MSA/sec numbers I would need ~140% of a CPU :( (cue Charlie Brown music) 
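
The extrapolation above can be written out as a straight-line fit through the two measured points. This is just a sketch of that hand-wave: it assumes the load really stays linear beyond 100 MSA/sec, which the measurements alone do not prove, and the variable names are illustrative.

```python
# Measurements reported above: (rate in MSA/sec, max CPU load on one core in %)
measured = [(50, 35), (100, 70)]

(r0, c0), (r1, c1) = measured
slope = (c1 - c0) / (r1 - r0)   # % of a core per MSA/sec
intercept = c0 - slope * r0     # fixed overhead implied by the two points

def predicted_cpu(rate_msa):
    """Linear extrapolation of per-core CPU load to another sample rate."""
    return slope * rate_msa + intercept

needed = predicted_cpu(200)
print(f"predicted load at 200 MSA/sec: {needed:.0f}% of one core")
print("fits on a single core" if needed <= 100 else "exceeds a single core")
```

Since a single thread cannot use more than 100% of one core, a result above 100 means the single-threaded transmit path cannot keep up unless the per-sample cost drops.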

If it is the byteswapping that is the true bottleneck, I am not sure there is really anything I can do as it is already SSE. Unless I do something like AVX or AVX2... 
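
As a toy illustration of the kind of operation being discussed (this is not UHD's actual converter, which is vectorized C++ and whose exact wire layout may differ): assuming 16-bit sample values whose byte order has to be flipped between host and wire, a byte swap looks like this. The sample values are arbitrary.

```python
from array import array

# Arbitrary interleaved I/Q values as signed 16-bit integers (illustrative only)
host_samples = array("h", [1, -2, 300, -400])

wire_samples = array("h", host_samples)  # copy, so the original is untouched
wire_samples.byteswap()                  # flip the bytes of every item

# Swapping a second time restores the original data
round_trip = array("h", wire_samples)
round_trip.byteswap()

print(host_samples.tobytes().hex())
print(wire_samples.tobytes().hex())
```

Every sample has to be touched once on its way to the NIC, so at high rates this step competes for memory bandwidth as well as CPU time.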

On the recent thread titled "Throughput of PCIe Interface" we had some performance discussions. I don't have the email available right now, but someone claimed 200 MSA/sec working. I guess I am curious as to the setup. 

I would not think that byteswapping performance would vary much across operating systems... 

So I guess I am looking for any suggestions or thoughts anyone would have related to getting to 200 MSA/sec, short of changing operating systems (for now). 

I am planning to evaluate Windows Server 2012 Standard in the near future if I cannot get this working in some form, but would like to exhaust all options before investing that much time. 

Sorry for the long winded email. 


USRP-users mailing list 
USRP-users at lists.ettus.com 



-------------- next part --------------
A non-text attachment was scrubbed...
Name: UHD_FullStackTrace.pdf
Type: application/pdf
Size: 424614 bytes
Desc: not available
URL: <http://lists.ettus.com/pipermail/usrp-users_lists.ettus.com/attachments/20160317/5dcce919/attachment.pdf>
