[USRP-users] libusb uses only one thread?

Anon Lister listeranon at gmail.com
Sun Oct 29 17:30:23 EDT 2017


Try dragging a window around on the screen. I think it has more to do with
X11; I assume Leafpad makes a lot of drawing calls.

On Oct 29, 2017 9:27 AM, "Андрій Хома via USRP-users" <
usrp-users at lists.ettus.com> wrote:

> LeafPad causes overflows!
>
> (this is NOT a joke, though not entirely accurate)
>
> In general, I noticed that if a process is started during recording, an
> overflow occurs, including when LeafPad is launched :D Also, my "some
> processing" periodically spawns processes while it runs, which explains the
> "abnormal" drops at low CPU load. Now it remains to understand WHY starting
> a process affects the recording so much.
>
> A little more about my environment:
>
> Ubuntu 16, GNU C++ version 5.4.0 20160609; Boost_105800;
> UHD_3.11.0.x310-285-g78e9d6ba
>
> The threads responsible for recv() and for libusb event handling are set
> to REALTIME scheduling with maximum priority.
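>
> For reference, a minimal sketch of how a thread can be switched to
> SCHED_FIFO with a high priority (plain POSIX; this is only an illustration,
> not necessarily exactly how my code does it):
>
>     #include <pthread.h>
>     #include <sched.h>
>     #include <cstdio>
>
>     // Illustrative helper: put the calling thread on the realtime
>     // SCHED_FIFO policy. 'prio' must lie between
>     // sched_get_priority_min/max(SCHED_FIFO); this usually needs root
>     // or an rtprio entry in /etc/security/limits.conf.
>     static bool make_realtime(int prio)
>     {
>         sched_param sp{};
>         sp.sched_priority = prio;
>         const int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
>         if (err != 0) {
>             std::fprintf(stderr, "pthread_setschedparam failed: %d\n", err);
>             return false;
>         }
>         return true;
>     }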
>
> The recv() threads get ~25% CPU, the libusb thread ~85%.
>
> While trying to figure out the reason, I used isolcpus and numactl to put
> the recorder and "some processing" on separate processors.
>
> numactl --membind=0 --cpunodebind=0 -- chrt -r 99 start_recorder
>
> numactl --membind=1 --cpunodebind=1 -- chrt -r 99 start_some_processing
>
> NUMA node 0 is "hidden" from the system with the help of isolcpus, which
> means the following: the recorder runs on processor (0), to which nothing
> else, not even the kernel scheduler, has access. "some processing" in turn
> runs on the other processor (1) and likewise has no access to processor (0),
> where the recorder lives. With this configuration there is no overflow! But
> if you run LeafPad... overflow O_O
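>
> For reference, roughly the same binding that the numactl commands above do
> can also be done from inside the program with libnuma; a small sketch
> (assuming libnuma is installed; only an illustration):
>
>     #include <numa.h>   // link with -lnuma
>
>     int main()
>     {
>         if (numa_available() < 0)
>             return 1;             // no NUMA support on this kernel/library
>         numa_run_on_node(0);      // restrict this process to the CPUs of node 0
>         numa_set_preferred(0);    // prefer memory allocations from node 0
>         // ... start the recorder threads here ...
>         return 0;
>     }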
>
> Next, I used the "stress" utility.
>
> First, we test the CPU
>
> numactl --membind=1 --cpunodebind=1 -- chrt -r 99 stress -c 40
> -> overflow is not observed
>
> numactl --membind=0 --cpunodebind=0 -- chrt -r 99 stress -c 40
> -> still no overflow!
>
> Now, RAM
>
> numactl --membind=1 --cpunodebind=1 -- chrt -r 99 stress -m 40
> -> overflow is not observed
>
> numactl --membind=0 --cpunodebind=0 -- chrt -r 99 stress -m 40
> -> overflow detected!
>
> Okay, so now we can try to link the overflows to RAM. Correct?
>
> Does anyone have any thoughts on this?
>
> PS: Please do not suggest deleting LeafPad :D
>
> 2017-10-25 11:21 GMT+03:00 Андрій Хома <anik123wb at gmail.com>:
>
>> Let me put it another way, more structured:
>> there are six B205minis, five on regular USB 3.0 ports, one via a
>> "PCIe to USB" adapter.
>> For example, take a GNU Radio flowgraph:
>> usrp source -> null sink
>> Even though libusb works in one thread, I can get 45 MS/s from each device.
>> Now, if you take a flowgraph like this:
>> usrp source -> file sink (FIFO) -> some processing
>> then I can get at most 7 MS/s from each device.
>> "some processing" is a chain of handlers; some of them use the GPU, and
>> named FIFOs are used to pass data between them.
>> Taken separately, each link processes its data in at most 10% of the time
>> allowed to stay realtime.
>> At 5 MS/s there is almost no CPU load.
>>
>> 2017-10-25 10:52 GMT+03:00 Андрій Хома <anik123wb at gmail.com>:
>>
>>> Hello Marcus,
>>> sorry for the belated reply; I had unsubscribed from the mailing list and
>>> so never received your answers. I found them by accident via Google, here:
>>> http://ettus.80997.x6.nabble.com/USRP-users-Buffer-overflow-tips-td7475.html#a7541
>>> =)
>>>
>>> 270 MS/s is really *a lot* of data. You'd need a very capable computer
>>>> even when just handling that amount of data internally, but with
>>>> USB-connected devices, you also get a lot of interrupt handling. That will
>>>> put additional load on your CPU. I'm therefore actually very amazed by the
>>>> fact that the processor is simply managing to deal with that! But: you must
>>>> make sure you're not only counting the time the program itself is running,
>>>> but also the time the CPU is stuck in kernel mode, handling the interrupts,
>>>> and the data transfers. Did you do that?
>>>>
>>> To be honest, I do not know how to find that kernel interrupt time. For
>>> example, in the profiler I see a lot of ioctl and vmxarea; do you think
>>> that is it? ioctl is the communication with the device, and vmxarea shows
>>> up inside the ioctl. But what is vmxarea?
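>>>
>>> So far the closest I can get to "time stuck in kernel mode" is the user
>>> vs. system CPU time of my own process, e.g. via getrusage(); a small
>>> sketch of what I mean (as far as I understand, time spent in hardware
>>> interrupt handlers is not charged to any process and only shows up in
>>> /proc/interrupts or in mpstat's %irq/%soft columns):
>>>
>>>     #include <sys/resource.h>
>>>     #include <cstdio>
>>>
>>>     // Illustrative helper: print how much CPU time this process has
>>>     // spent in user space versus inside the kernel so far.
>>>     void print_cpu_split()
>>>     {
>>>         rusage ru{};
>>>         if (getrusage(RUSAGE_SELF, &ru) == 0) {
>>>             std::printf("user %ld.%06ld s, system %ld.%06ld s\n",
>>>                         (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
>>>                         (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
>>>         }
>>>     }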
>>>
>>> 205 MS/s (or more so, 270 MS/s) is *extremely much* for a storage
>>>> system. That is more than 16 Gb/s, or, in other words, more than three full
>>>> SATAIII connections running at maximum utilisation. You do have an
>>>> impressive set of SSDs, there, if you can sustain that write rate.
>>>>
>>> Fortunately, I have a completely different task :) The original signal is
>>> processed in realtime using GPU magic, and only the useful information ends
>>> up in file storage. In other words, everything that happens AFTER receiving
>>> the signal works perfectly.
>>>
>>> How large are your buffers? I mean, with 4KB buffers, and 32b/S = 4B/S
>>>> (assuming you use 16bit I + 16bit Q) it follows that a single 4KB packet
>>>> can hold 1024 Samples, and that at 41 MS/s, that happens roughly every 1024
>>>> S / (41 MS/s) ~= 25 µs. For things to take "several milliseconds", your
>>>> buffers need to be Megabytes in size.
>>>>
>>> If you mean "recv_frame_size", then I have been playing with that value;
>>> at the moment I have settled on about 8000. If you mean the buffer that I
>>> pass to "recv", it is equal to sample_rate, i.e. one buffer holds exactly
>>> one second of samples.
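>>>
>>> To make it concrete, a stripped-down sketch of my receive path (the
>>> device arguments and numbers here are only illustrative, not my exact
>>> configuration):
>>>
>>>     #include <uhd/usrp/multi_usrp.hpp>
>>>     #include <uhd/stream.hpp>
>>>     #include <complex>
>>>     #include <vector>
>>>
>>>     int main()   // link with -luhd
>>>     {
>>>         const double rate = 45e6;
>>>         // recv_frame_size / num_recv_frames go into the device arguments
>>>         uhd::usrp::multi_usrp::sptr usrp = uhd::usrp::multi_usrp::make(
>>>             "type=b200,recv_frame_size=8000,num_recv_frames=256");
>>>         usrp->set_rx_rate(rate);
>>>
>>>         uhd::stream_args_t stream_args("sc16", "sc12"); // 12-bit over the wire
>>>         uhd::rx_streamer::sptr rx = usrp->get_rx_stream(stream_args);
>>>
>>>         // one buffer holds exactly one second of samples
>>>         std::vector<std::complex<short>> buff(static_cast<size_t>(rate));
>>>
>>>         uhd::stream_cmd_t cmd(uhd::stream_cmd_t::STREAM_MODE_START_CONTINUOUS);
>>>         cmd.stream_now = true;
>>>         rx->issue_stream_cmd(cmd);
>>>
>>>         uhd::rx_metadata_t md;
>>>         while (true) {
>>>             const size_t n = rx->recv(&buff.front(), buff.size(), md, 3.0);
>>>             if (md.error_code == uhd::rx_metadata_t::ERROR_CODE_OVERFLOW)
>>>                 continue;          // this is where the "O" gets reported
>>>             // hand 'n' samples to the FIFO-writer thread here
>>>             (void)n;
>>>         }
>>>         return 0;
>>>     }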
>>>
>>> What does help is that the OS buffers named pipes as well as the file
>>>> system in RAM. If overflows happen roughly after your free RAM would have
>>>> been eaten up by the 205 MS/s · 4B/S = 820 MB/s, then your storage isn't
>>>> actually up to the task of writing data as fast as the USRPs produce it,
>>>> and buffering by the OS simply saves you for "as long as the bucket does
>>>> not overflow".
>>>>
>>> Well, I think I already answered this above: on the other end of the FIFO
>>> there is a process that does not write the data to disk but processes it,
>>> and does so very quickly.
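>>>
>>> For what it is worth, the kernel buffer of such a FIFO is only 64 KB by
>>> default on Linux; it can be enlarged per pipe (up to fs.pipe-max-size for
>>> an unprivileged process), roughly like this:
>>>
>>>     #include <fcntl.h>   // F_SETPIPE_SZ (Linux-specific; g++ defines _GNU_SOURCE)
>>>     #include <cstdio>
>>>
>>>     // Illustrative helper: ask the kernel to grow the buffer of an
>>>     // already-open pipe/FIFO fd. The size is rounded up to a power of
>>>     // two; returns the new size, or -1 on failure.
>>>     int grow_pipe_buffer(int fd, int bytes)
>>>     {
>>>         const int newsz = fcntl(fd, F_SETPIPE_SZ, bytes);
>>>         if (newsz < 0)
>>>             std::perror("fcntl(F_SETPIPE_SZ)");
>>>         return newsz;
>>>     }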
>>>
>>> As for libusb itself: I do not read English very well, but after reading a
>>> bit about libusb and digging into UHD, I saw that UHD uses the asynchronous
>>> API but handles all events in one thread. The developers explain this
>>> themselves in a comment in the code:
>>> https://github.com/EttusResearch/uhd/blob/master/host/lib/transport/libusb1_base.cpp#L65
>>> Now it is clear why there is only one thread. Honestly, the developers would
>>> do well to mention this somewhere in the documentation, or better still give
>>> approximate CPU requirements. This behaviour (one thread) is simply not
>>> obvious, which is why I lost so much time: it seems only logical that each
>>> device would get its own thread, so that if one device works well, any
>>> number of them will, as long as there are enough CPU cores. I have 40 in
>>> total, so I was confused and completely misled.
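>>>
>>> For anyone else who ends up digging here, the pattern that comment
>>> describes looks roughly like this (a generic libusb sketch, not the
>>> actual UHD code): one event thread drives the completion callbacks of
>>> every device opened on the same context.
>>>
>>>     #include <libusb-1.0/libusb.h>
>>>     #include <sys/time.h>
>>>     #include <atomic>
>>>     #include <thread>
>>>
>>>     static std::atomic<bool> running{true};
>>>
>>>     // One thread services the asynchronous transfers of *all* devices
>>>     // opened on this context; their completion callbacks run here.
>>>     static void event_loop(libusb_context* ctx)
>>>     {
>>>         timeval tv{1, 0};   // wake up periodically to notice shutdown
>>>         while (running.load())
>>>             libusb_handle_events_timeout(ctx, &tv);
>>>     }
>>>
>>>     int main()
>>>     {
>>>         libusb_context* ctx = nullptr;
>>>         libusb_init(&ctx);
>>>         // ... open the devices and submit their async transfers on 'ctx' ...
>>>         std::thread t(event_loop, ctx);
>>>         // ... streaming runs ...
>>>         running = false;
>>>         t.join();
>>>         libusb_exit(ctx);
>>>         return 0;
>>>     }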
>>>
>>> Also, you can look into using num_recv_frames in your device arguments
>>>> to "tune" the use of the USB subsystem.
>>>>
>>> Yes, I am playing with the "num_recv_frames" and "recv_frame_size"
>>> parameters, but unfortunately I am choosing them by blind trial and error.
>>> I saw that increasing "recv_frame_size" helps a little, but if I set it to
>>> the maximum value the overflow happens even earlier. The same goes for
>>> "num_recv_frames": its effect depends directly on "recv_frame_size", but I
>>> could not find any logic in their behaviour.
>>>
>>> The good news: I bought an "STLab PCIe to USB 3.0 (U-720)" expansion card,
>>> connected one of the USRPs to it, and now I can use 6 devices at a sample
>>> rate of 45 MS/s! (Previously it was only 34 MS/s.) It seems the south
>>> bridge could not cope with such a data stream (?), and by moving part of
>>> the load to the north bridge I got a good result. Well, that is the
>>> theory, anyway.
>>>
>>>
>>> But the problem remains in some form: 45 MS/s x6 is only achievable if the
>>> data is not processed afterwards (/dev/null); as soon as I bring up the
>>> handlers I get at most 7 MS/s x6, and the CPU is barely loaded. The main
>>> calculations (Fourier transforms) are done on the GPU, and all of that is
>>> very fast; each processing stage takes less than 10% of the time budget
>>> allowed for realtime. The calculations are fast, but there is a lot of
>>> data, so maybe RAM bandwidth is not enough, or something like that?
>>> What do you think the problem could be, and how can I try to pin it down?
>>> Naturally, I am ready to provide any data needed.
>>>
>>> Once again I apologize for the belated answer,
>>> Andrei.
>>>
>>> 2017-10-23 10:39 GMT+03:00 Андрій Хома <anik123wb at gmail.com>:
>>>
>>>> As an addition: my USB controller is an Intel Corporation C610/X99 series
>>>> chipset USB xHCI Host Controller (rev 05).
>>>>
>>>> 2017-10-23 0:09 GMT+03:00 Андрій Хома <anik123wb at gmail.com>:
>>>>
>>>>> Hello,
>>>>> I have:
>>>>> 6 USRP B205mini devices
>>>>> A Z10PE-D16 WS motherboard (could the chipset matter?)
>>>>> An Intel Xeon E5-2430 v4 processor
>>>>> DDR4-1866 memory (128 GB)
>>>>>
>>>>> As a result I get overflows ("O") when using all 6 USRPs at once.
>>>>> I am not proficient at profiling, but I saw that only one thread is
>>>>> created for libusb; maybe that is the bottleneck.
>>>>>
>>>>> Explanation of attached picture #1:
>>>>> 1. Devices are created/initialised.
>>>>> 2. I create two threads for each device; they alternate in the picture:
>>>>>     one for "recv" (starting from the very first). It is clearly quite
>>>>> resource-intensive; most of the time is spent in convert_sc12_item32, but
>>>>> the CPU core still seems to cope.
>>>>>     the other writes to the named FIFO (the more intermittent ones,
>>>>> starting from the second). This one is not resource-intensive.
>>>>> 3. As I understand it, this is the thread where libusb lives, and there
>>>>> is only one of them for all 6 devices (see picture #2)
>>>>>
>>>>> Also, I have played with a USRP X310, and it handles 400 MS/s without
>>>>> trouble (via dual 10G Ethernet), i.e. convert_sc12_item32 is perfectly
>>>>> capable of processing 400 MS/s on one 2.2 GHz core, so the bottleneck is
>>>>> the aforementioned single libusb thread.
>>>>>
>>>>> Did I draw the right conclusions?
>>>>> If so, then I need to know exactly how much CPU power I need, without
>>>>> reading tea leaves.
>>>>> Are there any benchmarks or hardware requirements, so that I can actually
>>>>> use these devices?
>>>>> [image: Embedded image 1]
>>>>>
>>>>> [image: Embedded image 2]
>>>>>
>>>>
>>>>
>>>
>>
>