usrp-users@lists.ettus.com

Discussion and technical support related to USRP, UHD, RFNoC

Accelerated UHD drivers

John Wilson
Wed, Aug 13, 2014 7:19 PM

Hi all,

We've recently been working on getting GNU Radio running on Jetson
and low-power x86 platforms, and we're running into problems pulling
sustained samples at rates > 4 MS/s (using, e.g., rx_samples_to_file). This
happens even when writing to /dev/null or when the stream I/O operation is
removed from the source altogether.

Has anyone got any idea what might be causing the bottleneck? All of the
boards we're using have GigE connections and benchmark fine with iperf,
so it's a bit baffling. One thought that we've had is that because the UDP
interface isn't true zero-copy, we might be killing the CPU or memory bus
with memory-to-memory copies.
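
A rough sanity check of the link budget can be sketched as follows. It assumes 4 bytes per complex sample on the wire (16-bit I plus 16-bit Q) and ignores packet headers, so the real figures would be slightly higher. The function names are purely illustrative.

```python
# Back-of-the-envelope payload rate for a given sample rate.
# Assumption: 16-bit I + 16-bit Q per complex sample on the wire (4 bytes);
# packet headers are ignored, so real link usage is slightly higher.
BYTES_PER_SAMPLE = 4
GIGE_BITS_PER_SECOND = 1_000_000_000


def payload_bits_per_second(sample_rate_hz: float) -> float:
    """Return the sample payload rate in bit/s for one RX stream."""
    return sample_rate_hz * BYTES_PER_SAMPLE * 8


def link_fraction(sample_rate_hz: float) -> float:
    """Return the fraction of a 1 Gbit/s link taken by the payload."""
    return payload_bits_per_second(sample_rate_hz) / GIGE_BITS_PER_SECOND


if __name__ == "__main__":
    for rate in (4e6, 12.5e6, 25e6):
        mbit = payload_bits_per_second(rate) / 1e6
        print(f"{rate / 1e6:5.1f} MS/s -> {mbit:6.1f} Mbit/s "
              f"({link_fraction(rate):.0%} of GigE)")
```

Under these assumptions, 4 MS/s uses only a small part of a gigabit link, which fits the observation that iperf looks fine and suggests the limit is per-packet or per-sample CPU work rather than raw bandwidth.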

We're investigating using an accelerated layer to capture packets from the
USRP2 (using e.g. libpcap or PF_RING) - we would then have the USRP2
connected on a dedicated PHY. Has anyone tried anything like this yet, or can
anyone in the know advise on whether we're maybe barking up the wrong tree?

Cheers,

John

Moritz Fischer
Thu, Aug 14, 2014 8:05 AM

Hi John,

On Wed, Aug 13, 2014 at 9:19 PM, John Wilson via USRP-users
<usrp-users@lists.ettus.com> wrote:

> Has anyone got any idea what might be causing the bottleneck? All of the
> boards we're using have GigE connections on them, they benchmark/iperf okay
> so it's a bit baffling. One thought that we've had is that because the UDP
> interface isn't true zero-copy, we might be killing the CPU or memory bus
> with memory-memory copies.

Did you try to profile your setup to see where exactly you spend your cycles?
perf usually gives you a good idea of where most of the time is spent.
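
As a minimal sketch of what such a profiling run could look like, the snippet below only builds and prints the `perf record` and `perf report` command lines. The capture command, its frequency and rate values, and the output file name are placeholders rather than settings taken from this thread.

```python
# Hypothetical capture command; the frequency, rate and path are placeholders.
CAPTURE_CMD = ["rx_samples_to_file", "--freq", "100e6", "--rate", "4e6",
               "--file", "/dev/null"]


def perf_record_cmd(target, out_file="perf.data"):
    """Build a `perf record` invocation that samples call graphs of `target`."""
    return ["perf", "record", "-g", "-o", out_file, "--"] + list(target)


def perf_report_cmd(in_file="perf.data"):
    """Build the matching `perf report` invocation."""
    return ["perf", "report", "-i", in_file]


if __name__ == "__main__":
    # Print the two commands so they can be pasted into a shell.
    print(" ".join(perf_record_cmd(CAPTURE_CMD)))
    print(" ".join(perf_report_cmd()))
```

Because perf samples the running process, the report covers time spent in the kernel and in shared libraries as well, which matters if the suspicion is that the network stack is involved.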

Cheers,

Moritz

Marcus Müller
Thu, Aug 14, 2014 10:08 AM

Hi John,

cool work!
The Jetson is based on a Tegra K1, isn't it? That is a Cortex-A15, which
should be substantially more beefy than the Gumstix Overo; that is what
I have had the pleasure of working with, and that also has a hard time
sustaining rates >4 MS/s. I don't think it being multicore helps much
here, because only one core can handle incoming data at a time.

So, in addition to what Moritz has suggested, you might want to look
into using jumbo frames on your NIC[1]; that would increase latency a
bit, of course, but assuming that one of your CPU drains is
NIC-related interrupts, it might be worthwhile. Also, make sure
interrupt coalescing is enabled (if that is supported by the NIC/its
driver), and that, if possible, network packet checksums are checked on
the NIC rather than by the CPU.
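
To put a rough number on the interrupt argument, here is a small sketch of how the packet rate scales with frame size. It assumes 4 bytes per complex sample, 28 bytes of IPv4 and UDP headers, and a hypothetical 16-byte allowance for the sample-stream header. It does not address whether a given radio or NIC actually accepts the larger MTU.

```python
import math

BYTES_PER_SAMPLE = 4      # assumption: 16-bit I + 16-bit Q on the wire
IP_UDP_HEADER_BYTES = 28  # 20-byte IPv4 header + 8-byte UDP header
STREAM_HEADER_BYTES = 16  # hypothetical allowance for the sample-stream header


def samples_per_packet(mtu: int) -> int:
    """Return how many whole samples fit into one packet at this MTU."""
    payload = mtu - IP_UDP_HEADER_BYTES - STREAM_HEADER_BYTES
    return payload // BYTES_PER_SAMPLE


def packets_per_second(sample_rate_hz: float, mtu: int) -> int:
    """Return the packet rate needed to carry the given sample rate."""
    return math.ceil(sample_rate_hz / samples_per_packet(mtu))


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: {samples_per_packet(mtu)} samples/packet, "
              f"{packets_per_second(4e6, mtu)} packets/s at 4 MS/s")
```

With these assumptions, the larger frame cuts the packet rate by roughly a factor of six, and with it the number of per-packet interrupts and system calls the CPU has to service.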

I don't think sniffing packets using pcap will do you much good
performance-wise. A plain recv call still copies each packet from the
kernel into the buffer you give it; the zero-copy approach is to have the
network stack deliver received packets into a ring buffer that is mmap'ed
into your process[2].
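
For comparison, the ordinary socket path can be sketched like this. Receiving into a preallocated buffer avoids allocating a new object per packet, but the kernel still copies the payload into that buffer, so this is not zero-copy. The helper name and the loopback demo are illustrative only.

```python
import socket


def receive_into_preallocated(sock: socket.socket, buf: bytearray) -> memoryview:
    """Receive one datagram into `buf` and return a view of the filled part."""
    view = memoryview(buf)
    nbytes = sock.recv_into(view)  # the kernel copies the payload into buf
    return view[:nbytes]


if __name__ == "__main__":
    # Loopback demo: one UDP socket sends a datagram to another.
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))
    rx.settimeout(1.0)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    buf = bytearray(2048)  # reused for every packet
    tx.sendto(b"\x00\x01\x02\x03", rx.getsockname())
    data = receive_into_preallocated(rx, buf)
    print(len(data), bytes(data))
    tx.close()
    rx.close()
```

A memory-mapped packet ring removes that remaining copy, at the cost of working on raw frames instead of UDP payloads.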

Greetings,
Marcus

[1]
http://code.ettus.com/redmine/ettus/projects/uhd/wiki/Latency#Ethernet-N2xx
[2]
http://yusufonlinux.blogspot.de/2010/11/data-link-access-and-zero-copy.html

USRP-users mailing list
USRP-users@lists.ettus.com
http://lists.ettus.com/mailman/listinfo/usrp-users_lists.ettus.com

John Wilson
Fri, Aug 15, 2014 8:15 PM

Hi Moritz,

Thanks for your reply. We have tried a couple of profilers, in particular
the g++ (-pg) one and the Google perftools (gperftools) one, but we're not
getting entirely convincing results just yet. Do you have any tips on
setting up a good profiler (e.g. compiler options, packages)?

Cheers,

John


John Wilson
Fri, Aug 15, 2014 8:36 PM

Hi Marcus,

Thanks for your really detailed reply! It is indeed an A15. It comes with a
Realtek GigE part, though, so we've ordered an Intel PCIe card to plug into
it instead; there might be some performance benefits. Using jumbo frames has
been suggested as an option by one of our guys, and we'll probably crack on
with that one next week.

We've had much better results from a board based on an Intel Celeron J1900:
we can pull samples at 12.5 MS/s with CPU load at approximately 70% on one
thread (it's a quad-core device with pretty low single-threaded
performance). Those results were observed on a Supermicro X10SBA, if
anyone's interested, with a dual Intel GigE chipset. We're going to give the
Gigabyte GA-J1900-D3V a go too, which is a similar board with a dual
Realtek GigE adaptor instead. All of that's great, but we really want to
shift the samples onto the Tegra and start cranking some serious
performance out of the insane number of cores on it!

Did you use the Overo on the E100 or similar? How far did you get into all
this transport layer stuff? I'll keep you updated on what we get working
etc.

Cheers,

John


Marcus Müller
Fri, Aug 15, 2014 9:06 PM

Hi John,

the Celeron outperforming the ARM doesn't really surprise me -- x86s
generally tend to deliver a little more computational power per clock cycle
-- but the same goes for the electrical power going in.

Now, as a wild shot in the dark: the first time UHD touches the
individual samples is when it converts them from the wire format (usually
complex short) to whatever format the user requests. On the Celeron, UHD
would probably make use of a lot of SSE2 instructions for that. For the ARM,
I remember the source code saying something like "if you can, use ORC,
because NEON is not yet highly optimized".
If you don't have the liborc development files installed, UHD would then
use a NEON implementation; comparing both might be worth a shot.
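
To make the conversion step concrete, here is a plain scalar sketch of turning interleaved 16-bit I/Q samples into complex floats. This is not UHD's converter code. The big-endian wire order and the 1/32768 scale factor are assumptions chosen for illustration, and the function name is hypothetical.

```python
import array
import sys


def sc16_to_complex(raw: bytes, wire_big_endian: bool = True,
                    scale: float = 1.0 / 32768.0) -> list:
    """Convert interleaved 16-bit I/Q samples to a list of complex values."""
    if len(raw) % 4:
        raise ValueError("expected a whole number of 4-byte I/Q pairs")
    words = array.array("h")  # signed 16-bit integers in host byte order
    words.frombytes(raw)
    # Swap bytes when the wire order differs from the host order.
    if wire_big_endian != (sys.byteorder == "big"):
        words.byteswap()
    # Pair up I and Q and scale them into the range [-1.0, 1.0).
    return [complex(words[i] * scale, words[i + 1] * scale)
            for i in range(0, len(words), 2)]


if __name__ == "__main__":
    # One sample: I = 16384, Q = -16384, big-endian on the wire.
    print(sc16_to_complex(b"\x40\x00\xc0\x00"))
```

Every received sample goes through a byte swap, an integer-to-float conversion and a multiply, which is why a vectorised implementation of this loop can make a noticeable difference at several MS/s.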

Greetings,
Marcus


Philip Balister
Fri, Aug 15, 2014 10:53 PM

On 08/15/2014 04:15 PM, John Wilson via USRP-users wrote:

> Hi Moritz,
>
> Thanks for your reply. We have tried a couple of profilers, in particular
> the g++ (-pg) one and the Google perftools (gperftools) one, but we're not
> getting entirely convincing results just yet. Do you have any tips on
> setting up a good profiler (e.g. compiler options, packages)?

https://perf.wiki.kernel.org/index.php/Main_Page

No code recompilation needed.

Philip

