Hello,
I currently have an application where I am burst receiving (about 80 microseconds out of every millisecond) with the Ettus X410 at the full sample rate across 4 channels. I am getting occasional issues where data is dropped (terminal messages show "D" errors). I have been able to get DPDK to work, but that does not seem to have resolved the issue. By my calculation, this comes out to a data rate of 5.12*10^9 bit/s, i.e. 5.12 Gbit/s.
The current host computer has an i9-13900KS, an NVMe SSD, and 128 GB of RAM, and I am currently using a Mellanox 100 Gbit QSFP network card.
I would say that in general I am able to save just under 100% of all the data I request from the X410; however, for our application it is critical that we do not lose any data. If I run the default CG_400 image with benchmark_rate (1 channel only), I do not get dropped data. The only significant difference between my custom host software and benchmark_rate.cpp is that I save data to a .dat file (similar to rx_samples_to_file.cpp).
I have looked at the tuning notes here: https://kb.ettus.com/Getting_Started_with_DPDK_and_UHD. I have tried DPDK, core isolation / disabling system interrupts, nice priority, and multithreading/uhd::set_thread_priority, but none of these seem to have resolved the issue.
What I have noticed is that when I get a "D" error, it corresponds to recv() returning a number of samples less than the samples per buffer, followed by a return value of 0 on the next call.
My current assumption is that the task of saving data to NVMe is creating a critical path that can't be resolved with thread prioritization or multithreading. Or maybe I am just not doing thread priority or multithreading correctly. Either way, it is strange to me that recv() can return fewer samples than the buffer size outside of a stop signal or a timeout.
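(For reference, a minimal sketch of a recv() loop that inspects the metadata UHD hands back, using the standard uhd::rx_streamer API; the streamer setup is assumed to exist elsewhere and the buffer size is illustrative, so this is a comparison point rather than a drop-in fix:)

    // Minimal sketch: log what the streamer reports when samples drop.
    // "rx_stream" is an already-configured uhd::rx_streamer::sptr.
    #include <uhd/stream.hpp>
    #include <uhd/types/metadata.hpp>
    #include <complex>
    #include <iostream>
    #include <vector>

    void recv_loop(uhd::rx_streamer::sptr rx_stream, size_t samps_per_buff)
    {
        std::vector<std::complex<short>> buff(samps_per_buff); // sc16
        uhd::rx_metadata_t md;
        while (true) {
            const size_t n = rx_stream->recv(&buff.front(), buff.size(), md, 1.0);
            if (md.error_code == uhd::rx_metadata_t::ERROR_CODE_OVERFLOW) {
                // "D" corresponds to an overflow with out_of_sequence set
                // (packets dropped in transport); plain "O" is a host overrun.
                std::cerr << (md.out_of_sequence ? "D" : "O") << std::flush;
            } else if (md.error_code == uhd::rx_metadata_t::ERROR_CODE_TIMEOUT) {
                break; // no data within the timeout
            }
            // n can legitimately be less than buff.size(): recv() returns at
            // packet/burst boundaries, and returns 0 when it only has an
            // error condition to report via md.error_code.
            if (n < buff.size())
                std::cerr << "\nshort recv: " << n << " samples, error_code="
                          << md.error_code << std::endl;
        }
    }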
Any help/suggestions are appreciated,
Joe
On 24/01/2024 13:00, jmaloyan@umass.edu wrote:
[...]
That suggests that packets are getting dropped somewhere in the network stack -- possibly at the network-card interface into the kernel.
Have you done all the things here: https://kb.ettus.com/USRP_Host_Performance_Tuning_Tips_and_Tricks including increasing the number of ring buffers on the network card?
On Wed, Jan 24, 2024 at 2:43 PM Marcus D. Leech <patchvonbraun@gmail.com> wrote:
[...]
Hi Joe,
I noticed that you have 128 GB of RAM. If you turn this into a 120 GB RAM drive, is that sufficient memory depth for your needs? If so, there is a good chance it will solve your issue.
Prior to DPDK, I tried saving to a fast SSD and always had problems at high rates (X310 rates, not X410 rates). I was always able to solve the problem by saving to a RAM drive. At one point I even wrote a separate utility to continually monitor and copy files from the RAM drive to the SSD so that the RAM drive never actually filled. Even when I toyed with DPDK (a long time ago), I had much improved behavior saving to SSD, but still not as good as saving to a RAM drive, which has always given me performance that rivals benchmark_rate.
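(Something like the following drain loop, assuming C++17 std::filesystem; the paths and the "writer is finished" heuristic are placeholders, not Rob's actual utility:)

    // Rough sketch of draining a RAM drive: periodically move any
    // finished capture files from a tmpfs mount to the SSD.
    #include <filesystem>
    #include <chrono>
    #include <thread>

    namespace fs = std::filesystem;

    int main()
    {
        const fs::path ram = "/mnt/ramdisk";   // assumed tmpfs mount
        const fs::path ssd = "/data/captures"; // assumed NVMe target
        while (true) {
            for (const auto& e : fs::directory_iterator(ram)) {
                if (!e.is_regular_file())
                    continue;
                const auto age =
                    fs::file_time_type::clock::now() - e.last_write_time();
                if (age > std::chrono::seconds(5)) { // crude "writer is done" check
                    fs::copy_file(e.path(), ssd / e.path().filename(),
                                  fs::copy_options::overwrite_existing);
                    fs::remove(e.path()); // free the RAM drive space
                }
            }
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }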
Rob
On 24/01/2024 16:11, Rob Kossler wrote:
[...]
An SSD can notionally write at about 300-500 MB/s, which amounts to up to about 80 MSPS depending on sample format, etc. (at 4 bytes per complex-int16 sample, 320 MB/s is 80 MSPS). One can optimize storage size by perhaps reducing the sample size, but any computation involved in doing that conversion must necessarily be "paid for".
Joe noted that he was writing in "bursts", where he might only save some fraction of the samples coming in, which suggests that disk-write rate may not be the overwhelming issue here, but rather the ability to safely bring all these samples into the system in something approaching real time, without dropping anything.
Increasing the ring buffer size does not seem to help either, or at least does not clear up the issue entirely, but I will keep the ring buffer size at its maximum (8192) for now. The same goes for setting the CPU governor to "performance".
I forgot to add in my previous email that occasionally (this does not always happen) data will drop in "sprints": instead of a small bit of data being dropped here and there, there may be a 10-15 second interval during which a relatively large amount of data is dropped.
If I bring the data rate down slightly by decreasing the number of samples per burst - for example, down to 85% of the preferred amount - it behaves quite reliably (over hours and hours of testing I have yet to notice a failure), so I am likely operating near a threshold. It does not get much worse when I increase the samples per burst, but as I said, I would rather not lose any data.
I do still find it peculiar that the value returned by recv() when data is dropped is less than the buffer size, and that recv() then immediately returns 0 the next time it is called. Going through the source code, I can see why it returns a value smaller than the buffer size, but I don't understand why it would return 0.
On 24/01/2024 18:13, jmaloyan@umass.edu wrote:
[...]
If the underlying socket call is returning an error, I can see the recv() call returning 0.
I second the "use a RAM disk if you can" approach. RAM is cheap compared to the time spent flushing out the gremlins in high-rate continuous (or almost continuous) record applications.
Are you sending burst commands to achieve the 80 us per 1 ms? Can you continuously recv and just drop the samples you don't want in software? A rough sketch of that idea follows.
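(Here is what such software gating might look like, assuming for illustration a 500 MS/s rate on one channel and counting samples instead of using timed commands; "enqueue_for_disk" is a hypothetical hand-off to a writer thread:)

    // Sketch: stream continuously, keep only the first 80 us of every 1 ms.
    #include <uhd/stream.hpp>
    #include <uhd/types/metadata.hpp>
    #include <algorithm>
    #include <complex>
    #include <vector>

    // Hypothetical hand-off to a separate disk-writer thread.
    void enqueue_for_disk(const std::complex<short>* samps, size_t count);

    void gated_recv(uhd::rx_streamer::sptr rx_stream, double rate)
    {
        const size_t window = static_cast<size_t>(rate * 1e-3);  // 1 ms
        const size_t keep   = static_cast<size_t>(rate * 80e-6); // 80 us
        std::vector<std::complex<short>> buff(rx_stream->get_max_num_samps());
        uhd::rx_metadata_t md;
        size_t pos = 0; // sample index within the current 1 ms window
        while (true) {
            const size_t n = rx_stream->recv(&buff.front(), buff.size(), md, 1.0);
            size_t i = 0;
            while (i < n) {
                if (pos < keep) { // inside the 80 us we want to save
                    const size_t take = std::min(keep - pos, n - i);
                    enqueue_for_disk(&buff[i], take);
                    i += take; pos += take;
                } else {          // inside the gap; discard
                    const size_t skip = std::min(window - pos, n - i);
                    i += skip; pos += skip;
                }
                if (pos == window) pos = 0; // next 1 ms window
            }
        }
    }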
Check your cables and transceivers! With an X410, I had a direct-attach copper QSFP28-to-4xSFP28 cable that would work most of the time and then give me issues; a simple reseat would usually clear it up. I moved that system one day, grabbed a new (same part number, Mellanox) cable out of convenience, and the drops were basically non-existent; I never used the old cable again.
Below is my experience with a multi-X310 continuous full-rate record system.
I used GNU Radio with UHD 3.15 to build my application (Python generated from GRC, so your results may vary). Make sure realtime scheduling is on and actually working (check with top). Look at changing the default GNU Radio real-time "nice" level to be higher than all other normal tasks, but not higher than fundamental OS/disk/interrupt tasks. Use individual streamers and separate threads for each channel (which is why I used GNU Radio).
Definitely follow https://kb.ettus.com/USRP_Host_Performance_Tuning_Tips_and_Tricks:
- The CPU governor set to "performance" is a must (and max your CPU fan for long-duration records).
- Thread priority is a must.
- Increased network buffers are a must.
- A jumbo-frame MTU of 8000-9000 bytes is a must.
- Increased ring buffers are good practice.
- DPDK has shown impressive CPU load reduction, although I did not use it.
- I never realized a quantifiable performance improvement from disabling hyper-threading or disabling the KPTI protections (although I never remember turning KPTI back on). Disabling hyper-threading limited me in my later thread-affinity optimization, so I know I had hyper-threading enabled.
Make sure your writing to disk is optimized. I used pwrite() with multiple open flags - I think O_DIRECT, O_BINARY, O_SYNC, and O_LARGEFILE - and then posix_fadvise with WILLNEED, NOREUSE, and SEQUENTIAL. I'm not saying all of them are necessary; that's just where I ended up. Write in your storage medium's block size (fill with zeros if you don't have a full block at the end), and use a multiple of that block size as your write length. Do not assume more is better; test different multiples! I think a multiple that made each write just smaller than the machine's L3 cache worked best for me. Also, store in over-the-wire format (OTW), usually complex int16 (32 bits/sample), rather than the typical CPU format of complex float (64 bits/sample). This reduces the conversion overhead, the throughput required, and the file size.
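(A sketch of what block-aligned direct writes along those lines can look like on Linux; the flag list above is the poster's recollection, so the code below uses only the flags I'm sure of, and the 4096-byte block size and file path are assumptions. O_DIRECT is what imposes the alignment handling shown:)

    // Sketch: O_DIRECT/pwrite writes in aligned, block-multiple chunks.
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    int main()
    {
        const size_t block     = 4096;        // assumed device block size
        const size_t write_len = 256 * block; // tune this multiple; bigger
                                              // is not automatically better
        int fd = open("/data/capture.dat",
                      O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        void* buf = nullptr;                  // O_DIRECT needs aligned memory
        if (posix_memalign(&buf, block, write_len)) return 1;
        std::memset(buf, 0, write_len);       // zero-fill covers a short tail

        off_t offset = 0;
        // ...copy sc16 samples into buf here, then:
        if (pwrite(fd, buf, write_len, offset) != (ssize_t)write_len)
            perror("pwrite");
        offset += write_len;

        free(buf);
        close(fd);
        return 0;
    }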
The sporadic drops you are experiencing sound familiar. I had to "strategically" lock the thread affinity for the application threads, the UHD background threads, the Mellanox interrupt processing, and the GNOME desktop interrupt-processing threads to separate but appropriate cores on a multi-node machine (I recommend isolating those nodes as well). It had numerous NVMe drives, so write benchmarking was 4-5x the required throughput. I remember the problem seemed to be more of a latency/core-hopping issue, for me at least. I also switched to a realtime kernel (Ubuntu 18). It reduced disk throughput benchmarks dramatically but helped me iron out the last few hiccups.
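(For reference, a minimal sketch of pinning the calling thread to one core and raising it to SCHED_FIFO on Linux; the core and priority numbers are machine-specific assumptions, and UHD's own uhd::set_thread_priority_safe() covers the priority half of this:)

    // Sketch: pin the calling thread to one core and make it realtime.
    #include <pthread.h>
    #include <sched.h>

    bool pin_and_prioritize(int core, int prio /* 1..99, needs privileges */)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set); // lock to one core to avoid core hopping
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
            return false;
        sched_param sp{};
        sp.sched_priority = prio;
        return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) == 0;
    }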
My system did not like to operate without drops right after boot. I always had to "warm up" the machine by running a collect for 5-10 minutes first, which always had numerous drops. After that, I could complete 30+ minute collects error-free. And when paired with code to track the number and time of dropped samples and insert zeros in real time, the system created binary files that were sample/byte/time aligned across multiple channels at full rate, regardless of whether drops occurred.
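(A sketch of that zero-insertion bookkeeping, under the assumption that the packet following an overrun carries a valid resume timestamp in its metadata; "write_samples" and "write_zeros" are hypothetical sinks, and double-precision seconds are only adequate for a sketch - uhd::time_spec_t arithmetic is safer for long runs:)

    // Sketch: keep the output file time-aligned by zero-filling drops,
    // sized from the gap between expected and reported timestamps.
    #include <uhd/stream.hpp>
    #include <uhd/types/metadata.hpp>
    #include <cmath>
    #include <complex>
    #include <vector>

    void write_samples(const std::complex<short>* s, size_t n); // hypothetical
    void write_zeros(size_t n);                                 // hypothetical

    void aligned_recv(uhd::rx_streamer::sptr rx_stream, double rate)
    {
        std::vector<std::complex<short>> buff(rx_stream->get_max_num_samps());
        uhd::rx_metadata_t md;
        double expected = -1.0; // next expected timestamp, in seconds
        while (true) {
            const size_t n = rx_stream->recv(&buff.front(), buff.size(), md, 1.0);
            if (n == 0)
                continue; // error-only return; the next good packet has a time
            if (md.has_time_spec) {
                const double t = md.time_spec.get_real_secs();
                if (expected >= 0.0 && t > expected + 0.5 / rate) {
                    // a drop happened: pad the gap with zeros
                    write_zeros(static_cast<size_t>(
                        std::round((t - expected) * rate)));
                }
                expected = t + static_cast<double>(n) / rate;
            }
            write_samples(&buff.front(), n);
        }
    }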
I also always kept a large floor fan moving air around the radios. I have no quantifiable data to support this, but it felt like I had fewer issues, and it kept the radios cool when operating in close proximity to one another.
This was done without DPDK. I tried DPDK at first, but with it I seemed to have more issues than without it, and I never tried it again after doing all the other settings optimization. My CPU base speed was 2.9 GHz with boost up to the mid-3.x GHz range; I typically recommend a base of 3.0-3.5 GHz and a boost over 4.0 GHz to stream 200 MSPS at 32 bits/sample without much pain.
I always wanted to try not using a desktop environment to see if that avoided some of the optimizations I had to do, in particular the thread-affinity/interrupt optimizations.
On Wed, Jan 24, 2024 at 4:20 PM, Marcus D. Leech <patchvonbraun@gmail.com> wrote:
[...]