[USRP-users] libusb uses only one thread?

Андрій Хома anik123wb at gmail.com
Wed Oct 25 03:52:58 EDT 2017


Hello Marcus,
sorry for the belated reply; I had unsubscribed from the mailing list and so
never received your answers. I accidentally found them here, via Google:
http://ettus.80997.x6.nabble.com/USRP-users-Buffer-overflow-tips-td7475.html#a7541
=)

270 MS/s is really *a lot* of data. You'd need a very capable computer even
> when just handling that amount of data internally, but with USB-connected
> devices, you also get a lot of interrupt handling. That will put additional
> load on your CPU. I'm therefore actually very amazed by the fact that the
> processor simply manages to deal with that! But: you must make sure you're
> not only counting the time the program itself is running, but also the time
> the CPU is stuck in kernel mode, handling the interrupts, and the data
> transfers. Did you do that?
>
Honestly, I do not know how to find this kernel time. For example, in the
profiler I see a lot of ioctl and vmxarea - do you think that is it? ioctl
is the communication with the device, and vmxarea shows up inside ioctl. But
what is vmxarea?
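
(Aside: one simple way to separate user time from kernel time for the
process itself, without a full profiler, would be getrusage - a minimal
sketch, not specific to UHD:)

    #include <sys/resource.h>
    #include <cstdio>

    // Print how much CPU time this process has spent in user mode vs. in
    // kernel mode (system calls, copies). Hardware interrupt handling is
    // not charged to the process, so /proc/interrupts or mpstat are still
    // needed for the full picture.
    static void print_cpu_times()
    {
        rusage ru;
        if (getrusage(RUSAGE_SELF, &ru) == 0) {
            std::printf("user:   %ld.%06ld s\n",
                        (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
            std::printf("kernel: %ld.%06ld s\n",
                        (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
        }
    }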

205 MS/s (or more so, 270 MS/s) is *extremely much* for a storage system.
> That is more than 16 Gb/s, or, in other words, more than three full SATAIII
> connections running at maximum utilisation. You do have an impressive set
> of SSDs, there, if you can sustain that write rate.
>
Fortunately, I have a completely different task ). The original signal is
processed in realtime using GPU magic, and only the useful information is
placed in file storage. In other words, everything that happens AFTER the
signal is received works perfectly.

How large are your buffers? I mean, with 4KB buffers, and 32b/S = 4B/S
> (assuming you use 16bit I + 16bit Q) it follows that a single 4KB packet
> can hold 1024 samples, and that at 41 MS/s, that happens roughly every 1024
> S / (41 MS/s) ~= 25 µs. For things to take "several milliseconds", your
> buffers need to be Megabytes in size.
>
If you mean "recv_frame_size" then I'm playing with this value, for
example, now I stopped at about 8000. If you mean the buffer that I pass to
"recv" - then it is equal to sample_rate, that is, one buffer holds exactly
one second.
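
To make that concrete, here is roughly what my receive path looks like
(a simplified sketch; "type=b200" and num_recv_frames=64 are placeholder
values here, not recommendations):

    #include <uhd/usrp/multi_usrp.hpp>
    #include <complex>
    #include <vector>

    int main()
    {
        // Transport tuning goes into the device args; 8000 is the value I
        // am currently using, num_recv_frames=64 is just an example value.
        uhd::device_addr_t args("type=b200,recv_frame_size=8000,num_recv_frames=64");
        uhd::usrp::multi_usrp::sptr usrp = uhd::usrp::multi_usrp::make(args);

        const double rate = 45e6;
        usrp->set_rx_rate(rate);

        // sc16 on the host side, sc12 over the wire
        // (hence convert_sc12_item32 in the profile).
        uhd::stream_args_t stream_args("sc16", "sc12");
        uhd::rx_streamer::sptr rx_stream = usrp->get_rx_stream(stream_args);

        // The buffer handed to recv() holds exactly one second of samples.
        std::vector<std::complex<short>> buff((size_t)rate);

        uhd::stream_cmd_t cmd(uhd::stream_cmd_t::STREAM_MODE_START_CONTINUOUS);
        cmd.stream_now = true;
        rx_stream->issue_stream_cmd(cmd);

        uhd::rx_metadata_t md;
        while (true) {
            size_t n = rx_stream->recv(&buff.front(), buff.size(), md, 1.5);
            if (md.error_code == uhd::rx_metadata_t::ERROR_CODE_OVERFLOW)
                continue; // UHD prints "O"; keep streaming
            // ... hand n samples to the FIFO-writer / processing thread ...
        }
        return 0;
    }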

What does help is that the OS buffers named pipes as well as the file
> system in RAM. If overflows happen roughly after your free RAM would have
> been eaten up by the 205 MS/s · 4B/S = 820 MB/s, then your storage isn't
> actually able to write data as fast as the USRPs produce it,
> and buffering by the OS simply saves you for "as long as the bucket does
> not overflow".
>
Well, I probably already answered this question above. On the other end of
the FIFO is a process that does not write the data to disk but processes it,
and does so very quickly.
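
For context, the per-device writer thread is essentially just this (a
simplified sketch; "/tmp/usrp0.fifo" is a made-up path for illustration):

    #include <fcntl.h>
    #include <unistd.h>
    #include <complex>
    #include <vector>

    // Per-device writer: stream raw samples into a named FIFO; the consumer
    // on the other end processes them without touching the disk.
    void fifo_writer(const std::vector<std::complex<short>>& buff)
    {
        int fd = open("/tmp/usrp0.fifo", O_WRONLY); // blocks until a reader appears
        if (fd < 0) return;
        const char* p = reinterpret_cast<const char*>(buff.data());
        size_t left = buff.size() * sizeof(buff[0]);
        while (left > 0) {
            ssize_t n = write(fd, p, left); // short writes are possible
            if (n <= 0) break;
            p += n;
            left -= (size_t)n;
        }
        close(fd);
    }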

As for libusb itself: I do not read English well, but after reading a bit
about libusb and digging into UHD, I saw that UHD uses the asynchronous API
but handles all events in a single thread. The developers themselves explain
this in a comment in the code:
https://github.com/EttusResearch/uhd/blob/master/host/lib/transport/libusb1_base.cpp#L65
Now it is clear why there is only one thread. Honestly, the developers have
good reason to mention this somewhere in the documentation, or better yet, to
give approximate CPU requirements. If only because this behavior (one thread)
is not obvious - that is why I had to spend so much time on it. It seems only
logical that each device would get its own thread, so that if one device
works well, the others would too, no matter how many there are, as long as
the number of CPU cores allows. I have 40 in total, so I was confused and
completely misled.
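
In other words, the structure is something like this (my own simplified
sketch of the pattern that comment describes, not UHD's actual code):

    #include <libusb-1.0/libusb.h>
    #include <sys/time.h>
    #include <atomic>
    #include <thread>

    std::atomic<bool> running{true};

    // One thread drains the event queue of a single shared libusb_context;
    // completion callbacks for the async transfers of ALL devices fire here.
    void event_thread(libusb_context* ctx)
    {
        timeval tv{};
        tv.tv_usec = 100000; // wake up every 100 ms to re-check `running`
        while (running.load())
            libusb_handle_events_timeout(ctx, &tv);
    }

    int main()
    {
        libusb_context* ctx = nullptr;
        libusb_init(&ctx);
        std::thread t(event_thread, ctx);
        // ... open all 6 devices and libusb_submit_transfer() their bulk
        // transfers; no matter how many devices there are, every completion
        // is serviced by the single thread above ...
        running = false;
        t.join();
        libusb_exit(ctx);
        return 0;
    }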

Also, you can look into using num_recv_frames in your device arguments
> to "tune" the use of the USB subsystem.
>
Yes, I am experimenting with the "num_recv_frames" and "recv_frame_size"
parameters, but unfortunately I am choosing them by blind trial and error. I
saw that increasing "recv_frame_size" helps a little, but if I specify the
maximum value, the overflow happens even earlier. The same goes for
"num_recv_frames" - it interacts directly with "recv_frame_size", but I could
not find any logic in their behavior.

The good news: I bought an "STLab PCIe to USB 3.0 (U-720)" expansion card,
connected one of the USRPs to it, and now I can run 6 devices at a
sample_rate of 45 MHz (previously it was only 34 MHz)! It seems the south
bridge could not cope with such a data stream (?), and by moving part of the
load to the north bridge I got a noticeably better result. Well, that is my
theory, anyway.


But the problem remains in some form: 45 MHz x 6 is achievable only when the
data is not processed afterwards (it goes to /dev/null); as soon as I start
the processing handlers, I get at most 7 MHz x 6, and the CPU is almost idle.
The main calculations (a Fourier transform) run on the GPU and finish very
quickly - each processing stage takes less than 10% of the time budget
allowed for realtime. So although the computation is fast, there is a lot of
data; maybe RAM bandwidth is not sufficient, or something like that? (A crude
way to sanity-check that is sketched below.)
What do you think the problem could be, and how can I try to pin it down?
Naturally, I am ready to provide any data that is needed.
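
Here is the crude RAM-bandwidth check I mentioned above (a minimal
single-threaded sketch; a serious measurement would use several threads,
e.g. the STREAM benchmark):

    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main()
    {
        // Copy 1 GiB between two buffers and report the achieved copy rate.
        // Reads + writes mean the actual bus traffic is roughly twice this.
        const size_t n = (size_t)1 << 30;
        std::vector<char> src(n, 1), dst(n);
        const auto t0 = std::chrono::steady_clock::now();
        std::memcpy(dst.data(), src.data(), n);
        const double dt = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();
        std::printf("memcpy: %.1f GB/s\n", n / dt / 1e9);
        return 0;
    }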

Once again, I apologize for the belated reply,
Andrei.

2017-10-23 10:39 GMT+03:00 Андрій Хома <anik123wb at gmail.com>:

> As an addition: I have a USB controller: Intel Corporation C610 / X99
> series chipset USB xHCI Host Controller (rev 05)
>
> 2017-10-23 0:09 GMT+03:00 Андрій Хома <anik123wb at gmail.com>:
>
>> Hello,
>> I have:
>> 6 USRP B205mini devices
>> a Z10PE-D16 WS motherboard (could the chipset matter here?)
>> an Intel Xeon E5-2430 v4 processor
>> DDR4-1866 memory (128 GB)
>>
>> As a result I get overflows ("O") when using all 6 USRPs at once.
>> I am not proficient in profiling, but I saw that only one thread is
>> created for libusb; maybe this is the bottleneck.
>>
>> Explanation of attached picture #1:
>> 1. Creating/initializing the devices.
>> 2. I create two threads for each device; they alternate in the picture:
>>    the first calls "recv" (starting with the very first thread shown). It
>> is quite resource-intensive - most of the time goes into
>> convert_sc12_item32 - and the CPU core clearly struggles to keep up.
>>    The second writes to a named FIFO (the more intermittent threads,
>> starting with the second shown). It is not resource-intensive.
>> 3. As I understand it, libusb lives in this one thread, and there is only
>> one of it for all 6 devices (see picture #2).
>>
>> Also, I have played with a USRP X310, and it comfortably handles 400 MS/s
>> (via dual 10G Ethernet), i.e. convert_sc12_item32 is fully capable of
>> processing 400 MS/s on one 2.2 GHz core, so the bottleneck is the
>> aforementioned single libusb thread.
>>
>> Did I draw the right conclusions?
>> If so, then I need to know exactly how much CPU power I need, without
>> reading tea leaves.
>> Are there any benchmarks or hardware requirements, so that I can still use
>> these devices?
>> [image: Inline image 1]
>>
>> [image: Inline image 2]
>>
>
>
Attached images:
http://lists.ettus.com/pipermail/usrp-users_lists.ettus.com/attachments/20171025/d3d3af5b/attachment.png (118 KB)
http://lists.ettus.com/pipermail/usrp-users_lists.ettus.com/attachments/20171025/d3d3af5b/attachment-0001.png (354 KB)

