[USRP-users] libusb uses only one thread?

Андрій Хома anik123wb at gmail.com
Sun Oct 29 09:26:03 EDT 2017


Launching LeafPad causes overflows!

(this is NOT a joke, but it is not the whole story)

In general, I noticed that if a process is started during recording, an
overflow occurs. Including if you run LeafPad :D Also, while my "some
processing" is running, processes are periodically spawned, which explains
the "abnormal" drops at low CPU load. Now it remains to understand WHY
starting a process affects the recording so much.

A little more about my environment:

Ubuntu 16, GNU C++ version 5.4.0 20160609; Boost 1.58.0;
UHD_3.11.0.x310-285-g78e9d6ba

The threads responsible for recv() and for libusb event handling are set
to REALTIME scheduling at maximum priority.

The recv() threads take ~25% CPU each; the libusb thread takes ~85%.
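
For reference, this is roughly how a thread can be switched to realtime
scheduling; a minimal POSIX sketch of one way to do it (an illustration,
not my exact code; it needs CAP_SYS_NICE or root):

    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    // Move the calling thread to SCHED_RR at the highest priority,
    // the per-thread equivalent of what "chrt -r 99" does for a process.
    static bool make_realtime()
    {
        sched_param param{};
        param.sched_priority = sched_get_priority_max(SCHED_RR);
        const int err = pthread_setschedparam(pthread_self(), SCHED_RR, &param);
        if (err != 0) {
            std::fprintf(stderr, "pthread_setschedparam failed: %d\n", err);
            return false;
        }
        return true;
    }

    int main()
    {
        return make_realtime() ? 0 : 1;
    }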

Trying to figure out the reason, I used isolcpus and numactl. As a result,
the recorder and "some processing" each got a separate processor:

numactl --membind=0 --cpunodebind=0 -- chrt -r 99 start_recorder

numactl --membind=1 --cpunodebind=1 -- chrt -r 99 start_some_processing

NUMA node 0 is "hidden" from the system with the help of isolcpus, which
means the following: the recorder runs on a processor (node 0) on which
nothing else is scheduled, not even by the kernel. "some processing" in
turn runs on the other processor (node 1) and has no access to the
processor (node 0) on which the recorder lives. With this configuration
there is no overflow! But if you run LeafPad... overflow O_O

Next, I used the "stress" utility.

First, we test the CPU:

numactl --membind=1 --cpunodebind=1 -- chrt -r 99 stress -c 40
-> no overflow observed

numactl --membind=0 --cpunodebind=0 -- chrt -r 99 stress -c 40
-> still no overflow!

Now, the RAM:

numactl --membind=1 --cpunodebind=1 -- chrt -r 99 stress -m 40
-> no overflow observed

numactl --membind=0 --cpunodebind=0 -- chrt -r 99 stress -m 40
-> overflow detected!

Okay, so now we can try to tie the overflows to RAM (memory traffic on node 0). Correct?
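
For context, "stress -m" workers essentially just allocate memory and
write to it in a loop. A rough C++ analogue of one such worker (block
size arbitrary, chosen to resemble stress's 256 MB default):

    #include <cstdlib>
    #include <cstring>

    // One "stress -m"-style worker: repeatedly allocate a block and
    // touch every byte, loading the memory controller of its NUMA node.
    int main()
    {
        const size_t block = 256UL * 1024 * 1024; // ~stress default
        for (;;) {
            char* p = static_cast<char*>(std::malloc(block));
            if (p == nullptr)
                continue;
            std::memset(p, 0xA5, block); // write every byte
            std::free(p);
        }
    }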

Does anyone have any thoughts on this?

PS: please do not suggest deleting LeafPad :D

2017-10-25 11:21 GMT+03:00 Андрій Хома <anik123wb at gmail.com>:

> Let me try to put it another way, more structured:
> I have six B205mini devices: five on ordinary USB 3.0 ports, one via a
> "PCIe to USB" hub.
> For example, take this GNU Radio flowgraph:
> usrp source -> null sink
> Even though libusb works in a single thread, I can get 45 MS/s from
> each device.
> Now, if you take this flowgraph instead:
> usrp source -> file sink (FIFO) -> some processing
> then I can get a maximum of 7 MS/s from each device (a sketch of the
> FIFO hand-off follows below).
> "some processing" is a chain of handlers; some of them use GPUs, and some
> use named FIFOs to transfer data.
> Taken separately, each link manages to process its data in at most 10%
> of the time allowed to stay realtime.
> At 5 MS/s, almost no CPU load is observed.
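>
> To make the FIFO hand-off concrete, it is essentially this (a simplified
> sketch with a made-up pipe path, not my actual code):
>
>     #include <sys/stat.h>
>     #include <fcntl.h>
>     #include <unistd.h>
>     #include <complex>
>     #include <vector>
>
>     int main()
>     {
>         const char* path = "/tmp/usrp0.fifo"; // hypothetical path
>         mkfifo(path, 0666);                   // no-op if it already exists
>         const int fd = open(path, O_WRONLY);  // blocks until a reader opens it
>
>         std::vector<std::complex<float>> buff(4096);
>         // ... fill buff from the USRP ...
>         // write() blocks when the pipe is full, i.e. when "some processing"
>         // cannot keep up; that back-pressure is what eventually shows up
>         // as overflows on the radio side.
>         write(fd, buff.data(), buff.size() * sizeof(buff[0]));
>         close(fd);
>         return 0;
>     }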
>
> 2017-10-25 10:52 GMT+03:00 Андрій Хома <anik123wb at gmail.com>:
>
>> Hello Marcus,
>> sorry for the belated reply; I unsubscribed from the mailing list and
>> in the end did not receive your answers. I found them by accident here,
>> via Google:
>> http://ettus.80997.x6.nabble.com/USRP-users-Buffer-overflow-tips-td7475.html#a7541
>> =)
>>
>> 270 MS/s is really *a lot* of data. You'd need a very capable computer
>>> even when just handling that amount of data internally, but with
>>> USB-connected devices, you also get a lot of interrupt handling. That will
>>> put additional load on your CPU. I'm therefore actually very amazed by the
>>> fact that the processor is simply managing to deal with that! But: you must
>>> make sure you're not only counting the time the program itself is running,
>>> but also the time the CPU is stuck in kernel mode, handling the interrupts,
>>> and the data transfers. Did you do that?
>>>
>> Truthfully, I do not know how to find these kernel interrupts. For
>> example, in the profiler I see a lot of ioctl and vmxarea; do you think
>> that is it? ioctl is the work with the device, and vmxarea shows up
>> inside the ioctl calls. But what is vmxarea?
>>
>> 205 MS/s (or even more so, 270 MS/s) is *an extreme amount* for a storage system.
>>> That is more than 16 Gb/s, or, in other words, more than three full SATAIII
>>> connections running at maximum utilisation. You do have an impressive set
>>> of SSDs, there, if you can sustain that write rate.
>>>
>> Fortunately, I have a completely different task :) The original signal is
>> processed in realtime using GPU magic, and only the useful information is
>> neatly placed into file storage. In other words, everything that happens
>> AFTER receiving the signal works perfectly.
>>
>> How large are your buffers? I mean, with 4KB buffers, and 32b/S = 4B/S
>>> (assuming you use 16bit I + 16bit Q) it follows that a single 4KB packet
>>> can hold 1024 Samples, and that at 41 MS/s, that happens roughly every 1024
>>> S / (41 MS/s) ~= 25 µs. For things to take "several milliseconds", your
>>> buffers need to be Megabytes in size.
>>>
>> If you mean "recv_frame_size", then I am experimenting with this value;
>> for example, right now I have settled at about 8000. If you mean the
>> buffer that I pass to "recv", it is equal to sample_rate, that is, one
>> buffer holds exactly one second.
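>>
>> Roughly like this; a simplified sketch of my recv() side (the rate and
>> device args here are placeholders, not my real values):
>>
>>     #include <uhd/usrp/multi_usrp.hpp>
>>     #include <complex>
>>     #include <vector>
>>
>>     int main()
>>     {
>>         const double rate = 45e6; // placeholder
>>         auto usrp = uhd::usrp::multi_usrp::make("type=b200"); // placeholder
>>         usrp->set_rx_rate(rate);
>>
>>         // fc32 on the host, sc12 on the wire (hence convert_sc12_item32)
>>         uhd::stream_args_t stream_args("fc32", "sc12");
>>         auto rx = usrp->get_rx_stream(stream_args);
>>
>>         uhd::stream_cmd_t cmd(uhd::stream_cmd_t::STREAM_MODE_START_CONTINUOUS);
>>         cmd.stream_now = true;
>>         rx->issue_stream_cmd(cmd);
>>
>>         // one buffer = exactly one second of samples
>>         std::vector<std::complex<float>> buff(static_cast<size_t>(rate));
>>         uhd::rx_metadata_t md;
>>         while (true) {
>>             rx->recv(buff.data(), buff.size(), md, 1.5 /*timeout, s*/);
>>             if (md.error_code == uhd::rx_metadata_t::ERROR_CODE_OVERFLOW)
>>                 continue; // this is where the 'O's come from
>>             // hand buff off to the FIFO-writer thread here
>>         }
>>     }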
>>
>> What does help is that the OS buffers named pipes as well as the file
>>> system in RAM. If overflows happen roughly after your free RAM would have
>>> been eaten up by the 205 MS/s · 4B/S = 820 MB/s, then your storage isn't
>>> actually as fast at writing the data as the USRPs are at producing it,
>>> and buffering by the OS simply saves you for "as long as the bucket does
>>> not overflow".
>>>
>> Well, in general, I probably already answered this question above. On
>> the other end of the FIFO waits a process that does not write the data
>> to disk but processes it, and does so very quickly.
>>
>> As for libusb itself: I do not read English well, but after reading a
>> bit about libusb and digging into UHD I saw that UHD uses the
>> asynchronous API, yet handles events in a single thread. The developers
>> themselves explain this in a comment in the code:
>> https://github.com/EttusResearch/uhd/blob/master/host/lib/transport/libusb1_base.cpp#L65
>> Now it is clear why there is only one thread. Honestly, the developers
>> have a reason to mention this point somewhere in the documentation, or
>> better yet, give approximate CPU requirements. If only because this
>> single-thread behavior is not obvious; that is why I had to spend so
>> much time on it, since it seems only logical that each device would be
>> given its own threads, meaning that if one device works well, the others
>> will too, no matter how many of them there are, as long as the number of
>> CPU cores allows. I have 40 in total, so I was confused and completely
>> misled.
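>>
>> For anyone else digging into this: the pattern there is one background
>> thread driving completions for every transfer on the context. Roughly
>> like this, sketched with the plain libusb-1.0 API (not UHD's actual
>> code):
>>
>>     #include <libusb-1.0/libusb.h>
>>     #include <sys/time.h>
>>
>>     // One context, one event thread: the async transfers of all
>>     // devices opened on this context complete through this one loop.
>>     void event_loop(libusb_context* ctx, const volatile bool* running)
>>     {
>>         timeval tv{};
>>         tv.tv_sec = 1; // wake up at least once per second
>>         while (*running)
>>             libusb_handle_events_timeout(ctx, &tv);
>>     }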
>>
>> Also, you can look into using num_recv_frames in your device arguments
>>> to "tune" the use of the USB subsystem.
>>>
>> Yes, I am playing with the "num_recv_frames" and "recv_frame_size"
>> parameters, but unfortunately I pick them by blind poking. I saw that
>> increasing "recv_frame_size" helps a little, but if I specify the
>> maximum value, then the overflow occurs even earlier. Same with
>> "num_recv_frames": it interacts directly with "recv_frame_size", but I
>> could not find any logic in their behavior.
>>
>> The good news: I bought an "STLab PCIe to USB 3.0 (U-720)" expansion
>> card, connected one of the USRPs to it, and now I can use 6 devices at a
>> sample rate of 45 MS/s! (previously it was only 34 MS/s). It appears the
>> south bridge could not cope with such a data stream (?), and by moving
>> part of the load to the north bridge I got a good result. Well, that is
>> the theory anyway.
>>
>>
>> But the problem remains in some form: 45 MS/s x6 is achieved only if
>> the data is not processed afterwards (/dev/null); as soon as I bring up
>> the handlers, I get at most 7 MS/s x6, and the CPU is almost idle. The
>> main calculations (Fourier transforms) are performed on the GPU, and all
>> of this runs very quickly: each processing stage takes less than 10% of
>> the time allowed for realtime. The calculations are fast, but there is a
>> lot of data; maybe there is not enough RAM bandwidth or something like
>> that?
>> What do you think the problem could be? And how can I try to pin it
>> down?
>> Naturally, I am ready to provide any data needed.
>>
>> Once again I apologize for the belated reply,
>> Andrei.
>>
>> 2017-10-23 10:39 GMT+03:00 Андрій Хома <anik123wb at gmail.com>:
>>
>>> As an addition: my USB controller is an Intel Corporation C610/X99
>>> series chipset USB xHCI Host Controller (rev 05).
>>>
>>> 2017-10-23 0:09 GMT+03:00 Андрій Хома <anik123wb at gmail.com>:
>>>
>>>> Hello,
>>>> I have:
>>>> 6 USRP B205mini devices
>>>> a Z10PE-D16 WS motherboard (could the chipset matter here?)
>>>> an Intel Xeon E5-2430 v4 processor
>>>> DDR4-1866 memory (128 GB)
>>>>
>>>> As a result I get overflows ("O") when using all 6 USRPs at once.
>>>> I am not proficient at profiling, but I saw that only one thread is
>>>> created for libusb; maybe this is the bottleneck.
>>>>
>>>> Explanation of the attached picture #1:
>>>> 1. Creating/initializing the devices.
>>>> 2. I create two threads for each device; they alternate in the picture:
>>>>    the first does "recv" (starting from the very first thread shown).
>>>> It is quite resource-intensive; most of the time is spent in
>>>> convert_sc12_item32, and it is clear the CPU core barely copes.
>>>>    The second writes to the named FIFO (the more intermittent ones,
>>>> starting from the second thread). It is not resource-intensive.
>>>> 3. As I understand it, this is the thread where libusb lives, and
>>>> there is only one of it for all 6 devices (see picture #2).
>>>>
>>>> Also, I have played with a USRP X310, and it comfortably handles
>>>> 400 MS/s (via dual 10G Ethernet), i.e. convert_sc12_item32 is fully
>>>> capable of processing 400 MS/s on one 2.2 GHz core, so the bottleneck
>>>> is the aforementioned single libusb thread.
>>>>
>>>> Did I draw the right conclusions?
>>>> If so, then I need to know exactly how much CPU power I need, without
>>>> reading coffee grounds.
>>>> Are there any benchmarks or hardware requirements, so that I can
>>>> actually use these devices?
>>>> [image: Inline image 1]
>>>>
>>>> [image: Inline image 2]
>>>>
>>>
>>>
>>
>
-------------- next part --------------
Attachment 1 (image/png, 118371 bytes):
URL: <http://lists.ettus.com/pipermail/usrp-users_lists.ettus.com/attachments/20171029/e7143f77/attachment.png>
-------------- next part --------------
Attachment 2 (image/png, 353761 bytes):
URL: <http://lists.ettus.com/pipermail/usrp-users_lists.ettus.com/attachments/20171029/e7143f77/attachment-0001.png>

