Kai Wolf - SW Consulting

Optimizing my home and work network setup

Kai Wolf Dec 1, 2024 Updated Dec 1, 2024

After analyzing my network setup in a previous post, I decided it was time to dig deeper and optimize my network setup, both at home and at my office. With data-intensive workflows becoming more demanding - especially in areas like deep learning and computer vision - every millisecond and megabit counts.

Following is a detailed breakdown of the changes I made, the tools I used and the performance gains I achieved.

Baseline Measurements

At home I am using a less powerful network setup for obvious reasons:

Home Network

Network-Setup-Home

ISP: DSL 250
Router: FritzBox 7590 AX
Ethernet: Directly connected to the router or
Wi-Fi: FritzBox mesh network via a Fritz!Repeater 3000
Speed test:
- Ethernet: 183 Mbps download, 39 Mbps upload, 6.5 ms latency
- Wi-Fi: 181 Mbps download, 37 Mbps upload, 9.8 ms latency

Work Network

At work I am running a somewhat more sophisticated setup:

Network-Setup-Work

ISP: Vodafone Cable 1000
Setup: 2.5 GBit Ethernet via CalDigit TS4 hub connected to my MacBook
Server Rack: Includes a 1 GBit switch and a Proxmox application server
Baseline Results (NetworkQuality):
- Uplink: 889 Mbps
- Downlink: 746 Mbps
- Responsiveness (RPM): Medium, 487 RPM
- Idle Latency: ~6 ms

Improvements and Their Impact

I upgraded my server rack with a second-hand 10G Ethernet switch from Netgear that I bought cheaply online, but first I ran some experiments with the old (managed) switch.

Leveraging Link Aggregation (LAG) at Work

I configured two Ethernet ports to share the network load (LAG) with the following results:

Uplink: 1.152 Gbps (+29.6%)
Downlink: 752 Mbps (+0.8%)
Responsiveness: Medium, 387 RPM (-20.5%)
Idle Latency: ~6 ms (unchanged)

While the uplink improved significantly, the downlink remained largely unchanged. Responsiveness dropped by roughly 20% (from 487 to 387 RPM), likely due to increased protocol overhead.
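The percentages in this post are relative changes against the baseline measurement. A minimal sketch of that arithmetic, using the work baseline and the LAG results from above (the helper name pct_change is my own choice for illustration):

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# (baseline, with LAG) pairs taken from the measurements above
measurements = {
    "uplink_mbps": (889, 1152),
    "downlink_mbps": (746, 752),
    "responsiveness_rpm": (487, 387),
}

for name, (before, after) in measurements.items():
    print(f"{name}: {pct_change(before, after):+.1f}%")
```

The same helper can be used to double-check any of the other percentages in this post.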

Upgrading to 10G Ethernet for Synology and Proxmox

At work, I added:

A 10G network expansion module (E10G22-T1-Mini) for my Synology RS422+
A cheap 10G PCIe card for the Proxmox hypervisor

With these changes I got the following results:

Results (iperf3, Proxmox)

Transfer: 2.74 GBytes
Bitrate: 2.35 Gbits/sec
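
The two iperf3 numbers describe the same measurement. As far as I can tell, iperf3 prints transfer totals in binary units (1 GByte = 1024^3 bytes) and bitrates in decimal units (1 Gbit = 10^9 bits). A small sketch of the conversion, assuming the default test duration of 10 seconds (the duration is not stated above, so treat it as an assumption):

```python
GIB = 1024 ** 3  # bytes per "GByte" in the iperf3 transfer column


def bitrate_gbits(transfer_gbytes: float, seconds: float = 10.0) -> float:
    """Convert an iperf3 transfer total into an average bitrate in Gbits/sec."""
    bits = transfer_gbytes * GIB * 8
    return bits / seconds / 1e9


print(f"{bitrate_gbits(2.74):.2f} Gbits/sec")
```

For 2.74 GBytes over 10 seconds this lands on the 2.35 Gbits/sec reported above.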

Results (NetworkQuality, Proxmox)

Uplink: 2 Gbps (+125%)
Downlink: 1 Gbps (+33.9%)
Responsiveness: Medium, 942 RPM (+93.4%)
Idle Latency: 6 ms (unchanged)

Optimizing MTU for Jumbo Frames

Adjusting the MTU from 1500 to 9000 allowed jumbo frames, increasing efficiency.

Impact on Responsiveness

Before: 942 RPM
After: 1778 RPM (+88.8%)

While throughput stayed constant, responsiveness nearly doubled, indicating a significant reduction in per-packet overhead.
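
To get a feeling for how much per-packet overhead jumbo frames remove, here is a back-of-the-envelope sketch. It assumes plain TCP over IPv4 without header options and standard Ethernet framing, and it only models header and framing overhead on the wire. It is not a measurement from my setup.

```python
ETH_OVERHEAD = 38    # bytes: preamble + SFD 8, header 14, FCS 4, inter-frame gap 12
IP_TCP_HEADERS = 40  # bytes: IPv4 header 20 + TCP header 20, no options


def payload_efficiency(mtu: int) -> float:
    """Fraction of the bytes on the wire that is actual TCP payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)


def frames_per_second(mtu: int, link_bps: float) -> float:
    """Full-sized frames per second needed to saturate the link."""
    return link_bps / ((mtu + ETH_OVERHEAD) * 8)


for mtu in (1500, 9000):
    print(
        f"MTU {mtu}: {payload_efficiency(mtu):.1%} payload, "
        f"{frames_per_second(mtu, 10e9):,.0f} frames/s at 10 Gbps"
    )
```

Under these assumptions the payload share only rises from about 95% to about 99%, but the number of frames that every device on the path has to handle per second drops by almost a factor of six.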

Introducing the OWC 10G Thunderbolt 3 Ethernet Adapter

Upgrading my connection to the Proxmox server and Synology rack with the OWC adapter yielded dramatic results.

Results (iperf3)

Proxmox:
- Transfer: 9.64 GBytes
- Bitrate: 8.288 Gbits/sec
Synology:
- Transfer: 9.54 GBytes
- Bitrate: 8.19 Gbits/sec

Results (NetworkQuality)

Proxmox:
- Uplink: 1.3 Gbps
- Downlink: 4.6 Gbps (+360%)
- Responsiveness: Medium, 421 RPM
- Idle Latency: 5.2 ms
Synology:
- Uplink: 4.2 Gbps (+373%)
- Downlink: 1.7 Gbps (+70%)
- Responsiveness: Medium, 360 RPM
- Idle Latency: 4.8 ms

Home Network Upgrades with 2.5G Dongles

For my home setup, I added:

A 2.5G USB Ethernet dongle for the Synology DS923+ and the CalDigit TS3 hub
Adjusted internal PCI sharing for stable performance

Results (iperf3)

Transfer: 2.28 GBytes
Bitrate: 1.96 Gbits/sec

Results (NetworkQuality)

Uplink: 641 Mbps
Downlink: 1.433 Gbps
Responsiveness: High, 3905 RPM (+760%, about 8.6x)
Idle Latency: 4.9 ms

The dongle upgrade transformed my basement rack’s performance, especially for responsiveness and latency, despite hitting the throughput ceiling of the 2.5G connection.

Overall Performance Gains

Metric | Baseline (Work) | Final (Work) | Improvement
Uplink (Gbps) | 0.889 | 4.2 | +372.5%
Downlink (Gbps) | 0.746 | 4.6 | +516.2%
Responsiveness (RPM) | 487 | 1778 | +265%

Metric | Baseline (Home) | Final (Home) | Improvement
Uplink (Mbps) | 183 | 641 | +250%
Downlink (Mbps) | 39 | 1443 | +3581%
Responsiveness (RPM) | 454 | 3905 | +760%

Key Learnings and Next Steps

Jumbo Frames: Enabling an MTU of 9000 dramatically improves responsiveness without impacting throughput.
10G Ethernet: Investments in 10G hardware pay off, especially for workloads like backups and the Proxmox hypervisor.
PCI Optimization: Understanding internal bus sharing is crucial for stable performance.

Next, I plan to investigate further fine-tuning options, including QoS settings for prioritizing critical traffic and a potential upgrade to fiber internet at home.

https://www.kai-wolf.me/blog/2024/12/01/network-upgrade

Analyzing network performance

Kai Wolf Nov 3, 2024 Updated Nov 3, 2024

Since I have been doing more data-intensive work lately (deep learning and computer vision models with hundreds of terabytes of data), I needed to analyze my current network setup both at home and at work and identify the bottlenecks.

Ideally, I wanted to use iperf3 to measure throughput, together with networkQuality, a more recent tool developed by Apple. It reports a more actionable metric for the current state of the network called RPM (Round-trips Per Minute), which should better reflect user experience, specifically when the network is under working conditions.
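
RPM is simply the round-trip time turned upside down: the number of round trips that fit into one minute. A minimal sketch of the conversion in both directions (the function names are mine; the sample values are the responsiveness readings from the networkQuality summaries shown further down in this post):

```python
def rpm_from_rtt_ms(rtt_ms: float) -> float:
    """Round-trips per minute for a given round-trip time in milliseconds."""
    return 60_000.0 / rtt_ms


def rtt_ms_from_rpm(rpm: float) -> float:
    """Round-trip time in milliseconds for a given RPM value."""
    return 60_000.0 / rpm


for rtt in (454.545, 487.805):
    print(f"{rtt} ms -> {rpm_from_rtt_ms(rtt):.0f} RPM")
```

Read the other way around, a responsiveness of 1000 RPM corresponds to a round-trip time of 60 ms under load.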

I also wanted both tools running as background services. Since I use a Synology NAS for backups both at home and at work, and these machines rarely change because they are critical infrastructure, they were perfect candidates.

Installing iperf3 on Synology NAS (DSM 7.2+)

To install iperf3 on the Synology NAS, we have to add another package repository called SynoCommunity. Log into DSM, open Package Center, add the repository there, and then install the package SynoCli Monitor Tools from the newly added repository. This also installs the iperf3 command-line tool. We want it to start and run automatically, even after DSM reboots. For this we can add another systemd service containing the following:

$ sudo vi /etc/systemd/system/iperf3.service
[Unit]
Description=Run iperf3 at startup
[Service]
Type=simple
ExecStart=/usr/local/bin/iperf3 -s

[Install]
WantedBy=multi-user.target

$ sudo systemctl start iperf3.service
$ sudo systemctl status iperf3.service
$ sudo systemctl enable iperf3.service

The -s flag runs iperf3 in server mode, listening for new connections. We can now test against this backend from another machine by specifying the endpoint:

$ iperf3 -c wolf-synology.fritz.box
Connecting to host wolf-synology.fritz.box, port 5201
[  7] local 192.168.178.53 port 55961 connected to 192.168.178.23 port 5201
[ ID] Interval           Transfer     Bitrate
[  7]   0.00-1.00   sec  18.0 MBytes   151 Mbits/sec
[  7]   1.00-2.00   sec  15.3 MBytes   128 Mbits/sec
[  7]   2.00-3.00   sec  20.6 MBytes   172 Mbits/sec
[  7]   3.00-4.00   sec  15.0 MBytes   126 Mbits/sec
[  7]   4.00-5.00   sec  18.6 MBytes   156 Mbits/sec
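
For repeated measurements it can be handy to let a script collect the numbers. The following is a rough sketch that calls the iperf3 client with its --json flag and reads the receiver-side average. The field names reflect my understanding of iperf3's JSON report for TCP tests, and the SAMPLE string is a heavily trimmed, made-up report, so please verify both against your iperf3 version.

```python
import json
import subprocess


def run_iperf3(host: str, seconds: int = 10) -> str:
    """Run the iperf3 client against `host` and return the raw JSON report."""
    result = subprocess.run(
        ["iperf3", "-c", host, "-t", str(seconds), "--json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def received_mbps(json_text: str) -> float:
    """Extract the receiver-side average bitrate in Mbits/sec."""
    report = json.loads(json_text)
    return report["end"]["sum_received"]["bits_per_second"] / 1e6


# Trimmed, hypothetical report used to exercise the parser offline
SAMPLE = '{"end": {"sum_received": {"bits_per_second": 151000000.0}}}'
print(f"{received_mbps(SAMPLE):.0f} Mbits/sec")

# Against a real server: received_mbps(run_iperf3("wolf-synology.fritz.box"))
```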

Installing networkQuality server

networkQuality is a tool that comes preinstalled with recent macOS versions. Every time it is called without any additional options, it uses a configuration provided by Apple:

$ networkQuality
==== SUMMARY ====
Uplink capacity: 33.018 Mbps
Downlink capacity: 167.554 Mbps
Responsiveness: Low (454.545 milliseconds | 132 RPM)
Idle Latency: 17.625 milliseconds | 3529 RPM

This is fine for measuring the uplink and downlink from the local machine to an endpoint on the internet, but for my purposes I needed to measure the performance between two endpoints in the same network (while also accounting for parallel connections, which most other tools do not bother with).

There exist at least two reference implementations (in Swift and Go). I used the latter, which can be found here. Since I wanted this to run in the background on my NAS, I needed to cross-compile the networkqualityd binary using a GCC cross-compile toolchain:

$ brew install FiloSottile/musl-cross/musl-cross
$ CC=x86_64-linux-musl-gcc \
  CXX=x86_64-linux-musl-g++ \
  GOARCH=amd64 \
  GOOS=linux \
  CGO_ENABLED=1 \
  go build -ldflags "-linkmode external -extldflags -static"

Afterwards, I copied the binary over to my Synology and created another systemd service as before:

$ sudo vi /etc/systemd/system/networkqualityd.service

[Unit]
Description=Run networkqualityd at startup
[Service]
Type=simple
ExecStart=/usr/local/bin/networkqualityd -create-cert -config-name "wolf-synology.fritz.box" -enable-http2 -listen-addr "0.0.0.0"

[Install]
WantedBy=multi-user.target

With this I now could test against a local endpoint and do some further experiments to optimize the result:

$ networkQuality -C https://wolf-synology.fritz.box:4043/.well-known/nq -k
==== SUMMARY ====
Uplink capacity: 68.958 Mbps
Downlink capacity: 108.231 Mbps
Responsiveness: Low (487.805 milliseconds | 123 RPM)
Idle Latency: 8.167 milliseconds | 7500 RPM

Clearly, these measurements leave room for improvement, which I will discuss in an upcoming post.
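
To compare runs over time, the summary can be turned into numbers with a few regular expressions. This is only a sketch based on the text output shown above. It assumes the capacities are printed in Mbps, which may not hold on faster links, and the key names in the resulting dictionary are my own.

```python
import re

SUMMARY = """==== SUMMARY ====
Uplink capacity: 68.958 Mbps
Downlink capacity: 108.231 Mbps
Responsiveness: Low (487.805 milliseconds | 123 RPM)
Idle Latency: 8.167 milliseconds | 7500 RPM
"""

PATTERNS = {
    "uplink_mbps": r"Uplink capacity:\s*([\d.]+) Mbps",
    "downlink_mbps": r"Downlink capacity:\s*([\d.]+) Mbps",
    "responsiveness_rpm": r"Responsiveness:.*?\|\s*([\d.]+) RPM",
    "idle_latency_ms": r"Idle Latency:\s*([\d.]+) milliseconds",
}


def parse_summary(text: str) -> dict[str, float]:
    """Pull the numeric values out of a networkQuality text summary."""
    values = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:  # skip lines that are missing or use other units
            values[key] = float(match.group(1))
    return values


print(parse_summary(SUMMARY))
```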

I’m available for software consultancy, training and mentoring. Please contact me, if you are interested in my services.

https://www.kai-wolf.me/blog/2024/11/03/network-performance

Adventures in Liquid

Kai Wolf Nov 13, 2022 Updated Nov 13, 2022

Liquid is an open-source template language created by Shopify, written in Ruby and used by Jekyll, this website’s static site generator. The language is mainly used to create small logical units that emit markup. For instance, to display a collection of products that we have saved in a YAML file, we could write:

<ul>
  {% for product in collection.products %}
    <li>{{ product.title }}</li>
  {% endfor %}
</ul>

Besides loops, it supports a minimal set of operations such as if/else statements and variables. Just as in CMake, the type system is rudimentary. There are no array literals, for instance, so arrays are typically built by splitting a delimited string, and values are judged as truthy or falsy in conditions.

Liquid also ships with some convenience functions (called filters) to alter text, which come in handy for the case at hand. The following example converts a given string to lowercase letters:

{{ "FooBar" | downcase }}

This will output ‘foobar’. Different operations can also be chained via so-called pipes, which should look very familiar to anyone working in a UNIX environment:

{{ "Just the first word will be printed" | split: " " | first }}

Again, this will output ‘Just’ only.

Advanced example

When I was in the midst of redesigning my website, I came across a slightly more interesting use case for Liquid: As part of my portfolio I do maintain a yaml file that contains all of my previous projects for the last decade or so.

Every entry in this file consists of the project name, an image, a short description and one or more categories that the project is classified into. As part of the consulting services that I provide, I wanted to create a list of relevant projects that I have worked on in the past. For this I had to select all projects that fall into the same category.

Categories

In other words I wanted to create the intersection from two arrays: Namely, the array with the given project categories and the array with the current page category (which can be one or more).

As it turns out, Liquid’s limited support for types, and for arrays in particular, includes no operation for intersecting two arrays. So I had to come up with another approach. A brute-force method would consist of two nested loops, iterating over both arrays, remembering which elements were already visited and removing them from the array.

However, this would have been way too many lines of code for my taste and didn’t seem appropriate. Hence, I came up with the following:

{%- for project in site.data.projects -%}
  {% assign all_cats = project.categories | concat: page.category %}
  {% assign filtered_cats = all_cats | uniq %}
  {%- if all_cats.size > filtered_cats.size -%}
   {{project.description}}
  {%- endif -%}
{%- endfor -%}

First, this creates another array all_cats that consists of both the project and page categories. Secondly, filtered_cats consists of all unique elements from the first array. All that is left to do is to check whether the original array is larger than the filtered one. As an interesting side note, the actual category isn’t even necessary anymore, as we are only interested in uniqueness.

On the one hand, this solution admittedly looks like a bit of a stretch. On the other hand, given the constraints of the API, it is actually a sufficiently elegant solution. For instance, the negation of the case above (the two sets being disjoint) only requires changing the greater-than comparison to an equality check, since uniq can never make the array longer.
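
For readers who don't speak Liquid, the same trick can be sketched in Python. The project data below is made up for illustration, and the sketch assumes that neither list contains duplicates on its own, since those would also shrink the de-duplicated array and trigger a false match.

```python
def unique(items):
    """Order-preserving de-duplication, comparable to Liquid's uniq filter."""
    return list(dict.fromkeys(items))


def shares_category(project_categories, page_categories):
    """True if both lists have at least one element in common."""
    all_cats = list(project_categories) + list(page_categories)  # concat
    filtered_cats = unique(all_cats)                             # uniq
    return len(all_cats) > len(filtered_cats)


# Hypothetical sample data
projects = [
    {"description": "Stereo camera calibration", "categories": ["cv", "cpp"]},
    {"description": "Build system migration", "categories": ["tooling"]},
]
page_categories = ["cv", "ml"]

for project in projects:
    if shares_category(project["categories"], page_categories):
        print(project["description"])
```

With this sample data only the first project is printed, because "cv" appears in both lists.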

https://www.kai-wolf.me/blog/2022/11/13/adventures-in-liquid

Signal Processing and ML Inference on the Edge

Kai Wolf Jul 19, 2021 Updated Jul 19, 2021

I gave an (online) talk last week at the C++ Meetup in Karlsruhe. The title of my talk was Signal Processing and ML Inference on the Edge. In it, I talk a bit about the constraints and requirements for on-device online computing and some interesting approaches to implementing a given algorithm using either the (already deprecated) RenderScript or Halide.

At the beginning of the talk I recapped some essential audio processing basics and introduced some typical ML-related computations before discussing different approaches using Halide for my main example. There was even a live stream of this event on YouTube (which is a first for me) that you can find here:

Following is the Agenda of this talk used for the announcement:

Agenda

The recent decade has been revolutionary regarding many technical aspects of our daily lives. We have experienced the transition from simplistic feature phones to fully featured little supercomputers carried along in our pockets, capable of performing extremely computationally heavy tasks directly on the device. Moreover, we are already in the midst of what some may call the third AI renaissance, and are now able to solve pattern recognition problems with relatively little effort that in the past could only be solved by humans. The latter development has been a game changer for a variety of applications including anomaly detection, classification or speech recognition.

Despite the advanced computational resources of modern mobile devices, it is still challenging to obtain optimal throughput and minimal latency with signal processing implementations, which typically involves exploiting device specific acceleration techniques, such as vector intrinsics. However, this is not always feasible, especially when targeting a large variety of different architectures and target devices.

We will look at heterogeneous computing frameworks to accelerate the processing of otherwise performance-intensive tasks, which get optimized across multi-core CPUs, GPUs or DSPs on the target device. Since Android’s RenderScript is deprecated, we will focus on alternatives, namely Halide.

https://www.kai-wolf.me/blog/2021/07/19/signal-processing-ml-inference-edge-devices

Bazel performance comparison with CMake

Kai Wolf Apr 16, 2021 Updated Apr 16, 2021

Google released an open-source version of its internal build system, named Bazel, six years ago. This system is advertised (and used effectively) to build billions of lines of code. My fundamental interest in this tool was fueled by two key promises according to the release notes for version 1.0: hermetic and reproducible builds. A hermetic build in this context means that everything depends on a known set of inputs, which is essential for ensuring that builds are reproducible. This is achieved through several techniques such as sandboxing. For instance, Bazel tries hard not to depend on anything on the host system.
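
To make the idea of a hermetic build step a bit more concrete, here is a small conceptual sketch in Python. It is not how Bazel is implemented; the function name action_key and the sample inputs are made up for illustration. The point is only that when a build step is keyed on nothing but its command line and its declared inputs, identical inputs always map to the same key, and any change in an input yields a different one.

```python
import hashlib


def action_key(command: str, inputs: dict[str, bytes]) -> str:
    """Derive a key for a build step from its command and declared inputs only.

    `inputs` maps a file path to that file's content (hypothetical data model).
    """
    digest = hashlib.sha256()
    digest.update(command.encode())
    for path in sorted(inputs):  # sorted, so the declaration order is irrelevant
        digest.update(b"\0" + path.encode() + b"\0")
        digest.update(hashlib.sha256(inputs[path]).digest())
    return digest.hexdigest()


compile_cmd = "g++ -O2 -c foo.cc"
key_a = action_key(compile_cmd, {"foo.cc": b"int f();", "foo.h": b"#pragma once"})
key_b = action_key(compile_cmd, {"foo.h": b"#pragma once", "foo.cc": b"int f();"})
key_c = action_key(compile_cmd, {"foo.cc": b"int g();", "foo.h": b"#pragma once"})

print(key_a == key_b)  # identical inputs, declared in a different order
print(key_a == key_c)  # one source file changed
```

Anything the step reads without declaring it (a system header, an environment variable) would not be part of the key, which is exactly the kind of hidden input that sandboxing is meant to rule out.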

Reproducibility of builds, on the other hand, is also a must for any serious software system used by real customers. Without it, there is no way to go back in time and reproduce the state of the software at any previous version. However, I personally know of only a few shops where this is really achieved. Frankly, it is not easy to set up in the first place.

I ported some internal projects as well as a customer project from CMake to Bazel in the last year. All those projects were of reasonable size (in terms of lines of code) and had several external dependencies. I think this is necessary to really get to know a new tool and push its boundaries a bit.

Build time benchmarks

One of the things I was most interested in was the build times. Suffice it to say that large C++ projects have a strong tendency to build very slowly, due to tooling and certain language features (templates) that have never really gotten up to speed since the 80s. For the build time benchmark I looked into three different use cases: fresh builds from scratch, rebuilds without any changes, and rebuilds with one file touched:

Build time benchmark

For testing I’ve used the most recent versions of both Bazel (3.7.2) and CMake (3.19). This code base in particular made heavy use of Eigen included as an external dependency, and CMake spent some time configuring this header-only library. Hence, this benchmark is arguably a bit skewed. However, this is a software project I had at hand, and at the end of the day these numbers count. Even if the fresh build is not considered, Bazel still wins in terms of performance by being twice as fast for a rebuild and 17 times as fast at detecting a change (touch) and rebuilding. This looks promising.

https://www.kai-wolf.me/blog/2021/04/16/bazel-cmake-performance-comparisons

(Advanced) C++ Design Patterns

Kai Wolf Jan 11, 2021 Updated Jan 11, 2021

Last week I gave a talk for the C++ User Group Frankfurt on (Advanced) C++ Design Patterns. Due to the pandemic, the talk was given remotely. While I still have to get used to talking to my monitor, I think overall it went pretty smoothly. There were a lot of participants, which I didn’t expect (and didn’t see for the most part). The Q&A session after the talk was pretty decent as well. I do hope, though, that I will be able to give regular talks again once this Corona thing is eventually over.

Concerning the contents of the talk, I tried to motivate using specific C++ techniques over traditional (GoF) Design Patterns, given the typical memory and runtime constraints we’re facing when using C++ for the job anyway. More specifically, I discussed how to use static polymorphism to model the software system at hand and achieve basically the same as with dynamic polymorphism, but without paying for the overhead in memory and runtime. You can find the link to the slides here.

https://www.kai-wolf.me/blog/2021/01/11/advanced-cpp-design-patterns

Linux perf tools on ARM64

Kai Wolf May 26, 2020 Updated May 26, 2020

For a current client project I need to get a thorough understanding of the performance bottlenecks for several target architectures, including ARM64. For this, I’ve already ordered an ODROID-C2 development board. This is currently the only ARM64 development board available that also supports Android as far as I am aware.

When doing performance analysis (especially under Linux) I really like using perf, as it supports a wide range of options such as hw/sw performance counters and has several useful subcommands. However, getting perf to run on a host system also requires a perf build that matches the running kernel, as perf will emit a warning otherwise. Usually this can be achieved by simply installing the matching tools packages with a package manager such as apt:

$ apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`

In the case of the ODROID-C2 things are a bit more complicated, which is the main motivation for this blog post. First of all, the kernel for this board is quite old (August 2014). Hence, we probably won’t find matching packages in the package sources of our distribution, which turned out to be the case. Instead, we need to fetch the original sources provided by ODROID and check out the correct branch for the current version of this dev board:

git clone --depth 1 https://github.com/hardkernel/linux.git -b odroidc2-v3.16.y

If we change into the subdirectory of perf from within the kernel sources and try to compile the correct version ourselves, we receive another error though:

In file included from util/event.c:3:0:
util/event.h:95:17: error: ‘PERF_REGS_MAX’ undeclared here (not in a function); \
did you mean ‘PERF_REGS_MASK’?
  u64 cache_regs[PERF_REGS_MAX];
                 ^~~~~~~~~~~~~
                 PERF_REGS_MASK
  CC       util/evsel.o

As already mentioned, this board is using a rather old kernel version. Thus, we need to fix this error ourselves by adding another define:

// tools/perf/arch/arm64/include/perf_regs.h
#define PERF_REGS_MAX	PERF_REG_ARM64_MAX

Afterwards perf can be compiled with a simple make && make install and we should be good to go.

$ perf --version
perf version 3.16.82.g2ddf

https://www.kai-wolf.me/blog/2020/05/26/linux-perf-tools

My talk at ConanDays 2020 in Madrid

Kai Wolf Feb 12, 2020 Updated Feb 12, 2020

I’m happy to announce that I’ll be giving a talk in March at the ConanDays conference in Madrid, Spain. I think this is also the first time that JFrog is holding this conference. Thus, I’m curious and excited that I get the chance to be there. The title of my talk will be Dependency Management with CMake and Conan. Following is my submitted proposal for this talk:

The story of (third-party) package dependency management for C and C++ based projects is a story full of misunderstandings, shortcomings and dead ends. This becomes most noticeable in the fact that many software projects explicitly advertise that they do not depend on any other third-party software packages. This is particularly interesting as this doesn’t seem to be an issue in many other programming languages.

On the other hand, maintaining a large chain of dependencies in a C or C++ based project is typically quite cumbersome and has always been this way due to ABI compatibility reasons and the large number of different factors that influence the binary outcome of a compilation process.

Historically, CMake’s answer to this has been to use ExternalProject (or more recently FetchContent) and gather all dependencies in what is called a superbuild strategy. However, this approach is ultimately brittle and hard to maintain for inexperienced CMake users.

Conan has made tremendous progress in the last months and years and is finally usable in CMake-based projects in what is typically called a modern CMake-y way, as packages distributed with Conan can now be consumed cleanly without the need to adjust the build configuration of a project.

This talk will give a short historic overview of former dependency management strategies, explain best practices for integrating Conan in CMake-based projects and outline what is technically now feasible once a project has been adapted to this new standard.

https://www.kai-wolf.me/blog/2020/02/12/my-talk-at-conandays-2020

Managing transitive dependencies with Conan and CMake

Kai Wolf Jan 6, 2020 Updated Jan 6, 2020

Back when I first looked into Conan in 2017, the state of the art for dependency management in the C and C++ software development world consisted of building everything from scratch and checking all binary artifacts into version control, or something similar along those lines. Fortunately, this seems to be changing now with the rise of a proper package dependency management solution for C and C++.

However, no software is free of bugs, and the same holds true for Conan. Back then, when I tried to model a simple, single dependency for a project, the tool failed me as it wasn’t able to resolve transitive dependencies. This issue in particular has since been resolved, but managing several interdependent software packages in a given project using Conan and CMake is still challenging.

I’ve recently produced and published a specially configured VTK package for a medical project I worked on for a client. For this I set up VTK so that it uses another (Conan) package published by me. Thus, both dependencies (VTK and TBB in this case) were distributed using Conan, with the first depending on the second. Both use CMake for their build system. However, VTK’s CMake configuration has a slight imperfection: it doesn’t simply link against TBB as yet another logical build target, but tries to resolve this dependency in a dedicated FindTBB.cmake script file. This eventually leads to hard-coded paths in the VTK package when it gets built on a dedicated server. Obviously, this will not work when the package gets consumed by a client, as the hard-coded paths will almost certainly be different on any other machine.

The fix for this is either patching VTK’s CMake configuration file, which may take a while or may not even be accepted upstream, or swapping the hard-coded paths to TBB for a reference to where Conan put the TBB dependency. This can be done as follows:

def cmake_fix_tbb_dependency_path(self, file_path):
    # Read the CMake file that contains the hard-coded TBB paths
    with open(file_path, 'r') as file:
        file_data = file.read()

    if file_data:
        # Package root as known to Conan, normalized to forward slashes
        tbb_root = self.deps_cpp_info["tbb"].rootpath.replace('\\', '/')
        # Replace the path literally; it is a plain string, not a regex
        file_data = file_data.replace(tbb_root, "${CONAN_TBB_ROOT}")

        with open(file_path, 'w') as file:
            file.write(file_data)

This fix is sufficient for my use case, as I can ensure that both packages will be consumed from Conan. That being said, this is far from a general solution for these types of problems. Furthermore, in my experience these types of issues come up from time to time. Fortunately, in this case it happened for a rather simple transitive dependency chain that wasn’t that hard to debug and fix. I can imagine, though, that things look a bit different as soon as the number of (interconnected) dependencies increases.
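
One detail worth keeping in mind with this kind of patching: the package root is a literal path, not a pattern. The following self-contained sketch shows the replacement on a made-up CMake line. The path, the helper name replace_package_root and the file content are all hypothetical and not taken from the actual VTK package.

```python
import re


def replace_package_root(text: str, root: str, placeholder: str) -> str:
    """Replace every literal occurrence of `root` in `text` with `placeholder`."""
    # re.escape turns the path into a literal pattern; the lambda keeps any
    # backslashes in the placeholder from being read as regex escapes.
    return re.sub(re.escape(root), lambda _match: placeholder, text)


# Hypothetical package root and CMake snippet, for illustration only
root = "C:/.conan/data/tbb/2019_u9/_/_/package/abc+123"
cmake_text = f'set(TBB_LIBRARY "{root}/lib/tbb.lib")'

fixed = replace_package_root(cmake_text, root, "${CONAN_TBB_ROOT}")
print(fixed)

# Using the raw path as a pattern silently matches nothing here,
# because "+" is a regex quantifier rather than a literal character.
print(re.sub(root, "${CONAN_TBB_ROOT}", cmake_text) == cmake_text)
```

A plain str.replace would do the same job; the regex variant is only shown to highlight why the path has to be escaped if re.sub is used.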

There will be a Conan conference in March this year that I’m planning to attend. I really appreciate the patience and effort of the people who are actively working on Conan, and hopefully I’ll be able to address a few open issues I have with this tool as well.

https://www.kai-wolf.me/blog/2020/01/06/managing-transitive-deps-with-conan-and-cmake

Heterogeneous computing on Android devices

Kai Wolf Aug 19, 2019 Updated Aug 19, 2019

In today’s technical world with all its different devices, ranging from small, intelligent wristwatches, smartphones, tablets and laptops up to large, powerful servers, there exists a wide variety of CPUs, GPUs and DSPs. All those computing elements obviously differ in their hardware capabilities. For instance, a CPU typically has far fewer cores than a GPU. There are different levels of caches, cache sizes and instruction sets, and any modern GPU contains specific computing units for special tasks such as shaders or TMUs.

By now, it should be apparent that writing highly efficient code that makes use of all those various computing capabilities quickly becomes cumbersome, if it is possible at all. One common solution to this problem is to use so-called heterogeneous computing frameworks such as OpenCL, CUDA, OpenMP, MPI or, in the case of Android, RenderScript.

These frameworks provide an abstraction over the underlying hardware and are typically programmed in a dialect of the C programming language. Using one of these frameworks, code can be written in a somewhat device- and architecture-independent manner. The framework takes care of executing the code on multiple cores using either shared or non-shared memory; in the case of MPI, it can even run on a distributed cluster of several servers. Hence, the choice of a heterogeneous computing framework depends heavily on the type of task as well as the target platform, as there is no single solution that fits all.
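The programming model behind these frameworks can be sketched in a few lines: the developer writes a small per-element function (a kernel), and a runtime decides how to distribute the work. The example below is only an illustration in Python. The names `invert` and `for_each` are made up, and a thread pool stands in for the hardware execution units, so it shows the structure of the model and not its performance.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def invert(pixel: int) -> int:
    """Per-element kernel: invert one 8-bit grayscale value."""
    return 255 - pixel


def for_each(kernel: Callable[[int], int], data: Sequence[int], workers: int = 4) -> List[int]:
    """Apply `kernel` to every element; the 'runtime' owns the scheduling."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with `data`.
        return list(pool.map(kernel, data))


if __name__ == "__main__":
    image_row = [0, 64, 128, 255]
    print(for_each(invert, image_row))
```

The kernel itself contains no threading or device logic, which is what allows a real framework to run the same code on a CPU, a GPU or a DSP.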

When it comes to Android, the variety of hardware devices is equally complex and manifold. Google’s solution to this problem is called RenderScript and was first introduced with Android 3.0 (Honeycomb). Today, RenderScript consists of a so-called compute API whose kernels are written in a C99-derived language and can target CPUs, GPUs and DSPs alike. RenderScript is designed to run on all Android devices, independent of the installed hardware. The actual portability depends on device-specific drivers, as only the vendor of a particular device knows exactly how to talk to the chips used in it. However, Android ships with a basic CPU-only driver for RenderScript to ensure baseline compatibility on all devices.

The build process for a given RenderScript kernel happens in two phases. First, an LLVM-based compiler called slang (llvm-rs-cc), which comes pre-shipped with the Android NDK, consumes any RenderScript file containing kernels and emits highly optimized, portable code in IR format (IR stands for intermediate representation) that gets pushed to the device.

RenderScript build process 1

This pipeline has the advantage of performing aggressive, machine-independent optimizations on the developer’s host machine before the portable bitcode gets pushed to the device, which saves both battery and CPU power there. Hence, the second phase of the build process can be somewhat more lightweight:

RenderScript build process 2

Here, the online JIT compiler (bcc) takes care of target-specific optimizations and code generation. The resulting code is either vectorized for a multicore CPU or, if the vendor of the device provides its own RenderScript-specific compiler, targeted at an installed GPU or DSP.

The reasoning behind this design was to provide a generic runtime library while allowing Android device manufacturers to provide their own specialized .bc library. The implementation makes heavy use of LLVM and provides a lightweight JIT that enables fast launch times as well as on-device linking.
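The second phase, with its vendor-specific compilers and the CPU fallback, can be pictured as a simple lookup. The following toy model is written in Python and makes several assumptions: none of the names (`cpu_backend`, `select_backend`, `compile_on_device`) are RenderScript APIs, and the "bitcode" is just a list of operation names.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-ins: a "bitcode" is a list of operation names, and a
# backend turns it into a description of device-specific code.
Bitcode = List[str]
Backend = Callable[[Bitcode], str]


def cpu_backend(bitcode: Bitcode) -> str:
    """Fallback that is always present, like the CPU-only driver."""
    return "cpu code for " + ", ".join(bitcode)


def select_backend(
    vendor_backends: Dict[str, Backend],
    preference: Sequence[str] = ("gpu", "dsp"),
) -> Tuple[str, Backend]:
    """Pick the first vendor-provided backend, else fall back to the CPU."""
    for name in preference:
        if name in vendor_backends:
            return name, vendor_backends[name]
    return "cpu", cpu_backend


def compile_on_device(bitcode: Bitcode, vendor_backends: Dict[str, Backend]) -> str:
    """Phase two: specialize portable bitcode for what the device offers."""
    _, backend = select_backend(vendor_backends)
    return backend(bitcode)


if __name__ == "__main__":
    portable = ["load", "invert", "store"]  # produced once on the host
    print(compile_on_device(portable, {}))  # device without a vendor driver
    print(compile_on_device(portable, {"gpu": lambda bc: "gpu code for " + ", ".join(bc)}))
```

The point of the model is that the same portable input works on every device, while a vendor can register a more specialized backend without the application changing.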

https://www.kai-wolf.me/blog/2019/08/19/heterogeneous-computing-on-android
