Maya's Programming & Electronics Blog

The tragicomedy of Linux, Raspberry Pi and hardware accelerated video on non-x86 platforms

Maya Posch Apr 8, 2024

Back in the good old days of 2019 I was still a naive youngster who looked at the specifications listed alongside the latest Raspberry Pi and similar ARM-based single-board computers (SBCs) and took the video en- and decoding specifications as gospel. Even a lowly single-core ARMv6 SoC like the BCM2835 [1] as featured on the original Raspberry Pi and Raspberry Pi Zero SBC should be able to decode 1080p h.264 video at 30 FPS, not to mention the much more powerful SoCs by Broadcom and others with support for 60 FPS, HEVC (h.265) and so on. This assumption led me to focus on such SBCs when I embarked on my NymphCast audio-visual streaming project [2] as they seemed like a cheap, powerful platform which most people had lying around.

Fast-forward to today and I have shifted away from ARM-based SBCs, back to the comforting embrace of x86-based platforms. This follows after countless attempts to make video hardware en- and decoding work on ARM-based platforms, both Raspberry Pi boards and those based on Allwinner (H- and V-series, including the S3 on the horrid PineCube [3]) and Amlogic S9-family (e.g. Odroid C2). Whether trying to hardware decode video files or encode camera feeds using the hardware encoding block in the SoC, the problems were always the same: missing or incomplete drivers, or poor API support.

The hilarious part is that these drivers are generally available from the manufacturer, but only in binary form, and restricted to an ancient Linux kernel or obsolete Android version. In the case of Raspberry Pi, the Broadcom drivers are also binary blobs, and come with a dizzying number of poorly supported APIs and a general lack of software support. Whether OpenMAX or MMAL, these APIs were left behind on 32-bit Linux (Raspberry Pi OS), and 64-bit Linux still struggles to make Video4Linux work.

After giving it yet another shot on the new Raspberry Pi OS ‘Bookworm’ on a Raspberry Pi 3B system with the new KMS driver I had to sadly conclude that the result was basically the same once more. No hardware acceleration with standard ffmpeg or ffplay unless you manually tell it which specific decoder/encoder to use, and countless forum threads filled with people sharing hacks that may or may not work. The best approach for ffmpeg appears to be to compile a custom version which, among other changes, fixes the broken rendering on the desktop. Hardware-accelerated render surfaces are still a novelty in 2024, it seems.

What I may still try is running NymphCast with the desktop disabled to see whether that’s a viable way to get reasonable h.264 playback at 1080p working, but I remain skeptical.

One of the N100-based MiniPCs which I have purchased.

Meanwhile, what has really caught my eye over the past few months as an alternative is Intel’s Alder Lake-N series of SoCs. These include a wide range of (Gracemont-based) CPU core and integrated GPU (iGPU) configurations, but what they have in common is a quite modern set of video en- and decoding options in the iGPU. This not only includes h.264 and h.265 (HEVC), but also VP9, AV1 and so on. In addition to compact mini PCs, most commonly featuring the N95 and N100 SoCs, there are also a few N100-based mini-ITX and microATX mainboards, which often even allow you to put a better GPU into the PCIe slot. This is now my go-to solution.

What does such an x86-based low-power (6-15 Watt TDP, 6 Watt for the N100) system bring you compared to an HEVC-capable SBC like the Raspberry Pi 4 and 5? The ability to run Windows, Linux, FreeBSD with full support, full driver support for every single feature in the SoC and on the mainboard, a regular BIOS, and above all hardware acceleration for video decoding of any format you can throw at it. Even better is that it just works in ffmpeg, ffplay, OBS and whatever else you may want to use.

Although I have learned along the way that the developers behind Kodi and other projects manage to get hardware accelerated decoding working on Raspberry Pi SBCs, this requires a lot of custom code and testing, and still leaves many issues for end users. Since the goal of NymphCast is to be portable across any platform that supports regular FFmpeg without having to hack in platform-specific code (since I’m the sole dev on the project), this obviously won’t fly for me. Theoretically the hardware acceleration drivers will some day end up in the mainline Linux kernel, but realistically this will happen shortly after said SoC is firmly obsolete.

Where I do like those small ARM-based SoCs is for audio-related tasks, such as running NymphCast in audio-only mode, but as I have recently found, this works great on an ESP32 microcontroller as well [4], which also boots up a lot faster. Since AV1 decoding is still a rarity on ARM-based SBCs, this makes Alder Lake-N-based systems all the more attractive, and is the reason why the media-focused HTPC I put next to the TV is based around the N100 SoC, running a recent Linux kernel.

In the tragedy that is ARM SoC driver support, this outcome makes me admittedly feel less salty about all the wasted hours, and it’s definitely been educational albeit painfully so. I hope that this post helps to disabuse at least some people of the notion that ARM SoCs and similar (hi RISC-V) with the same proprietary IP blocks and terrible driver support are at all an easy target for a casual video playback or encoding experience.

Long live x86, indeed.

Maya

[1] https://web.archive.org/web/20120513032855/https://www.broadcom.com/products/BCM2835
[2] https://github.com/mayaposch/nymphcast
[3] https://hackaday.com/2021/04/22/hands-on-with-pinecube-an-open-ip-camera-begging-for-better-kernel-support/
[4] https://mayaposch.wordpress.com/2024/02/07/porting-nymphcast-audio-and-ffmpeg-to-the-esp32-microcontroller/

http://mayaposch.wordpress.com/?p=263

Porting NymphCast Audio and FFmpeg to the ESP32 microcontroller

Maya Posch Feb 7, 2024

Last year I found myself looking at Espressif’s ESP32 microcontroller (MCU) in the context of porting the NymphCast receiver (NCS, or NymphCast Server[1]) application to it. This network-based application is responsible for accepting playback requests from clients along with media data to decode using FFmpeg and output to displays and speakers via libSDL. Although the prospect of decoding video and displaying it in real time on even a connected VGA monitor seemed like a stretch, I figured that decoding and outputting audio should be achievable.

The ESP32 MCU is a dual-core system, with each core capable of running at up to 240 MHz. It is listed as having 320 kB of SRAM, but technically only about 160 kB of this is usable for the heap due to a design limitation. Since it uses an external Flash ROM (via the QSPI bus), a large enough ROM must be available to contain the entire NCS firmware.

Similarly, since 160 kB of SRAM isn’t quite enough working memory to handle the WiFi stack, FFmpeg, NCS and related dependencies – not to mention the in-memory data buffer – I had to use an ESP32 module with enough external RAM available. For the ESP32 there is no SDRAM controller, which just leaves QSPI-based pseudo-static RAM (PSRAM). Of Espressif’s available modules this left the ESP32-WROVER-(I)B and ESP32-WROVER-(I)E. Both have 4 MB of Flash ROM and 8 MB of PSRAM (4 MB mappable), which would be what I’d have to try and fit the whole firmware into.

What expedited the porting was that the dependencies for NCS (NymphRPC [2], NyanSD [3] and libnymphcast [4]) all rely ultimately on the Poco platform abstraction library. Because this third-party dependency did not have FreeRTOS support, I had to add this myself. Along the way I also had to remove C++ exceptions in the framework, which resulted in me forking the code into the NPoco project [5]. NPoco retains the APIs of Poco, but works much better on resource-constrained systems like the ESP32 that do not have C++ exceptions enabled by default for performance reasons. This way I was able to use NymphRPC, NyanSD and libnymphcast without source changes beyond adding the NPoco headers instead of the Poco ones.

Example of a UDA1334A I2S DAC board, commonly sold as the CJMCU-1334.

At this point NCS could be mostly ported over already, which involved the creation of an ESP-IDF (Espressif IoT Development Framework) project, in which both the core NCS code as well as modules containing the dependencies could be placed. The main obstacles to porting remained FFmpeg and replacing libSDL with an ESP-IDF equivalent. For the latter, SDL was effectively replaced with a stub implementation that also replaced the audio routines with an I2S-based interface. This allows for decoded audio data to be sent via the ESP32’s I2S interface to a connected codec, like the UDA1334A or similar I2S DAC.

Getting FFmpeg to compile for the ESP32 was somewhat more involved. The first step was to determine an audio-only configuration and translate this into ESP-IDF components, followed by identifying large blocks of static data in the code and marking these for the compiler to be located in the ROM instead of in RAM, freeing up a lot of space in the latter.

With all components now in place, I was able to build the project for the first time, which resulted in a sad ‘out of space’ error for the linker as the default partition for data in the Flash ROM is rather small, on the order of 1 MB. This was corrected with a custom partitions file with the following parameters:

# Name, Type, SubType, Offset, Size, Flags
nvs, data, nvs, , 0x6000,
phy_init, data, phy, , 0x1000,
factory, app, factory, , 3900K,

This sets the app partition size to ~3.8 MB, which leaves about 10% of the partition free after a build and allows for flashing of the ESP32. Much of the effort after this was focused on resolving crashes and performance issues, some of which were due to the small size preset used with FFmpeg builds, while most were due to making all thread stacks fit into the available heap. It was clear that the PSRAM would need to be used for thread stacks as well, which, while not ideal for performance reasons, offers a lot of space for not just a heap, but also stacks.

Simple hardware prototype based around an ESP32-WROVER-E development board, UDA1334A I2S DAC and USB connector for power. Board is a wire wrapped 6×8 cm prototype board.

After modifying NPoco to allow for both stack allocation in PSRAM and pinning tasks to specific CPU cores, the playback performance of the system was tweaked. To help with this, the core speed was bumped up to 240 MHz and the QSPI bus speed to 80 MHz (for effectively 40 MB/s vs 250 MB/s for internal SRAM).
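The 40 MB/s figure follows from the bus width and clock. As a quick sanity check, assuming four data lines (the "quad" in QSPI), one bit per line per clock cycle, and no command or address overhead, the peak rate can be worked out as below. The function name is made up for this sketch.

```cpp
#include <cassert>

// Theoretical peak throughput of an SPI-style bus in MB/s.
// Assumes one bit per data line per clock cycle and ignores all
// command/address overhead, so real-world numbers will be lower.
constexpr double busPeakMBps(double clockMHz, unsigned dataLines) {
    return clockMHz * dataLines / 8.0; // bits per microsecond -> bytes per microsecond
}

// 80 MHz x 4 lines = 320 Mbit/s = 40 MB/s
static_assert(busPeakMBps(80.0, 4) == 40.0, "QSPI at 80 MHz peaks at 40 MB/s");
```

At the same clock, a plain single-line SPI bus would top out at 10 MB/s, which is why the quad mode matters for PSRAM-backed stacks and heap.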

In the current prototype version of the NymphCast Audio ESP32 (NCA-ESP32) project, music playback of a range of formats has been tested already, along with network discovery using NymphCast clients like the NymphCast Player (desktop & mobile). Features that exist in the desktop and Android version of NCS such as video decoding and NymphCast Apps (based on AngelScript) will likely not be added in any form, due to a lack of performance, I/O and RAM. Beyond this it is a fully capable audio-only version of NymphCast that should be able to handle any multimedia format that FFmpeg can decode, including RTSP and similar network streams.

Plus it finally brings FFmpeg to the ESP32, which by itself could be much more useful than Espressif’s own audio framework (ESP-ADF) which only supports a handful of codecs. The NCA-ESP32 project is available on GitHub [6] for those interested in trying it out. A 3D printed enclosure (see picture) and more are also in the works.

Maya

[1] https://github.com/MayaPosch/NymphCast
[2] https://github.com/MayaPosch/NymphRPC
[3] https://github.com/MayaPosch/NyanSD
[4] https://github.com/MayaPosch/libnymphcast
[5] https://github.com/MayaPosch/npoco
[6] https://github.com/MayaPosch/NymphCastAudio-ESP32

http://mayaposch.wordpress.com/?p=249

Porting NymphCast to the Haiku operating system

Maya Posch Jan 20, 2024

Over the past couple of days I have ported the NymphCast network streaming project [1] to Haiku, which was fairly straightforward due to its POSIX-compatibility. In this article I’ll describe my experiences with porting to what is effectively BeOS’ spiritual successor.

Haiku is attractive as a porting target for a number of reasons. The first, as mentioned, is that it is POSIX-compatible, meaning that software written for Linux, BSD, etc. can be ported with zero to minimal changes. Another is how lightweight it is: installation is very rapid and boot-up takes mere seconds, which makes it appealing for low-end and embedded systems.

Although Haiku is still under development, with the porting taking place on the Beta 4 release as well as a nightly build of Beta 5 (to be released in 2024, possibly?), the only issues that I encountered during development in VirtualBox VM instances were an odd, occasional soft-lock during periods of high I/O activity that required resetting the virtual system, and some package versioning oddities [2] that got fixed by running pkgman update in a terminal to update the system.

As the first step towards porting NymphCast, I had to ensure that I could compile the NymphRPC [3] dependency first. This one was easy, as all it took was installing the Poco dependency with pkgman install poco poco_devel and running make. In Haiku’s pkgman system, the runtime package is always just the name of the project (like poco), and if there are development files such as libraries and header files, these are in a package with _devel as suffix.

Figuring out where the header files and freshly compiled static and shared libraries of NymphRPC had to go after this was the next challenge. The documentation for this is admittedly rather poor, but I got help from one of the Haiku developers – Mr. Waddlesplash – on Mastodon [4]. When running make install on NymphRPC now, it copies the aforementioned files to /boot/system/non-packaged/develop/{headers,lib}, which makes them available to GCC and other compiler toolchains.

Of note here is that the Haiku filesystem is somewhat curious, with a number of overlays, symlinks and the like which make it seem more complicated than it is. Essentially, the entire boot drive containing Haiku and everything else is mounted on /boot, with the system folder containing (shockingly) system files, including files installed from packages (under packages) and those which were installed from elsewhere (non-packaged). This latter folder is thus similar to the /usr/local folder on Linux.

With NymphRPC now compiled and installed into its develop folders, I was able to compile libnymphcast [5], which encapsulates a lot of basic NymphCast functionality useful primarily for clients, but also for the server component. After installing libnymphcast, compiling NymphCast Server was easy enough, after changing a few preprocessor macros to append Haiku to the FreeBSD sections: like FreeBSD, Haiku’s file stat handling is 64-bit clean, so it has no separate lstat64 as Linux does.
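To give an idea of what such a macro change looks like, here is a minimal sketch, not the actual NymphCast code. The helper name sketchFileSize is made up, and the extra _LARGEFILE64_SOURCE check is my own addition so that the sketch also builds on C libraries that do not expose the 64-bit variants. __HAIKU__ is the macro that Haiku’s compilers predefine.

```cpp
#include <sys/stat.h> // first, so the C library's feature macros are set up
#include <cassert>
#include <cstdint>

// Hypothetical helper: size of `path` in bytes without following symlinks,
// or -1 on error. Haiku is simply appended to the FreeBSD branch.
int64_t sketchFileSize(const char* path) {
#if defined(__FreeBSD__) || defined(__HAIKU__) || !defined(_LARGEFILE64_SOURCE)
    // 64-bit clean platforms: plain lstat() already handles large files.
    struct stat st;
    if (lstat(path, &st) != 0) { return -1; }
#else
    // Platforms that expose a separate 64-bit stat family (e.g. glibc Linux).
    struct stat64 st;
    if (lstat64(path, &st) != 0) { return -1; }
#endif
    return static_cast<int64_t>(st.st_size);
}
```

The point is that the Haiku port needed no new code path of its own, only an extra condition on an existing one.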

I had to update the setup.sh script in NymphCast’s root folder to check for Haiku (using the output from uname -s) and install the appropriate dependencies, matching the list of dependencies that I was using already for the various flavours of Linux, FreeBSD, MSYS2, etc. One long-ish compile session (using a single core in a VirtualBox VM) later, I had a fresh Haiku build of NymphCast Server (NCS). Of note here, however, is that unlike on Linux, BSD, etc., you have to link with -lnetwork to use network sockets. This too was rather poorly documented in my experience.

One confusing issue after this was that having the NymphRPC and libnymphcast shared libraries in the development lib folder was not enough to run the binary on the Beta 5-development VM, but required me to copy them also into /boot/system/non-packaged/lib, which feels somewhat redundant. I’m not sure that this is intended behaviour.

Perhaps ironically, the last two issues I ran into while executing the NCS binary weren’t necessarily due to Haiku. Both effectively came down to my handling of IPv6 addresses, such as in the NyanSD service discovery component [6] where a logic bug existed that only showed itself when run on a system with limited IPv6 functionality (no local IPv6 address on Haiku yet, only the loopback).

The other IPv6-related issue was my use of Poco::Net::ServerSocket::bind6() in NymphRPC, that tries to set an IPv6-related socket option that Haiku does not support. Replacing this with the bind() alternative finally got the binary to run and not keel over. After putting the network interface of the VM into bridging mode, I was able to find it using NyanSD in the NymphCast Player app on my Android phone and play back a few multimedia files.

For those who are interested, the text file with Haiku porting notes is in the NymphCast doc folder [7]. With the porting complete for now, I’ll probably be looking at creating HPKGs out of NymphRPC, libnymphcast and NymphCast Server (and likely Player) next. With Haiku lacking a lot of video acceleration features, it remains to be seen how useful it will be for playback, but as a supported platform I do think that it has a lot of potential.

Maya

[1] https://github.com/MayaPosch/NymphCast
[2] https://discuss.haiku-os.org/t/solved-gcc-cannot-use-include-paths-provided-by-pkg-config/14494
[3] https://github.com/MayaPosch/NymphRPC
[4] https://mastodon.social/@waddlesplash/111767072256293340
[5] https://github.com/MayaPosch/libnymphcast
[6] https://github.com/MayaPosch/NyanSD
[7] https://github.com/MayaPosch/NymphCast/blob/master/doc/haiku_build_notes.txt

http://mayaposch.wordpress.com/?p=244

Lock-free ring buffer implementation for maximum throughput

Maya Posch Nov 12, 2021

During the development of the NymphCast project [1], a core task was to implement a ring buffer that could provide the reliability, latency and throughput required for the needs of the media file decoder during playback of, and seeking actions within high bitrate content. While the general concept of a ring buffer is exceedingly common in software development, doing this in a way that is completely thread-safe is less straightforward and common.

The general implementation of a ring buffer involves a number of pointers into a preallocated buffer of size N, indicating the buffer front (index 0), end (index N) and the current positions of the read and write pointers. In the implementation as used by NymphCast, a number of higher level counters were added to keep track of characteristics, such as the total unread and free bytes in the buffer.

This ring buffer implementation has now been generalised into its own project [2], with NymphCast-specific features stripped out, for easier analysis.

For a high-level overview of how this ring buffer (DataBuffer) works, we can look at the provided example project [3] that shows the basic structure, set-up and tear-down.

Initially, the DataBuffer class is set up with the number of bytes that should be allocated for its buffer. In this example a 1 MB buffer is used. Using DataBuffer::setSeekRequestCallback(), our custom handler for seek requests is provided to the ring buffer.

Next, we launch a new thread for the data write function – which writes new data into the buffer – and a thread for the data request function – which is called by the DataBuffer class when more data can be written. After this we call DataBuffer::start() to begin the initial buffering process.

In this start function, the data request function callback (set using DataBuffer::setDataRequestCondition()) is called by signalling its condition variable. This causes it to trigger a data write sequence, in this example simplified to the calling of another condition variable in the data write function, which writes a pre-allocated chunk of data (200 kB-sized) into the ring buffer using DataBuffer::write().

The reason for having a separate data request and data write thread is that this allows for efficient buffering, with each write call leading to the data request function being called for so long as there’s room in the buffer for another (200 kB) chunk of data.

The measures taken to make this ring buffer thread-safe can be summarised as atomic variable capture, and atomic wait. The latter refers to the waiting on an atomic variable (dataRequestPending) that indicates whether there’s an active data request action. This is used in DataBuffer::seek() as follows:

uint32_t timeout = 1000;
while (dataRequestPending) {
	std::this_thread::sleep_for(std::chrono::milliseconds(1));
	if (--timeout < 1) {
		return -1;
	}
}

This code ensures that when the buffer is expecting an incoming write request, we do not start the seeking action, as this involves resetting the buffer contents. Synchronising seeking actions like this is one of the more complicated aspects of a lock-free ring buffer. In addition, this code also implements a time-out feature of 1 second, to prevent application lock-up if the impending write somehow does not occur.
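The same wait-with-timeout pattern can be pulled out into a standalone helper. The following is a minimal sketch under my own assumptions: the function name waitWhilePending and the explicit acquire ordering are not part of DataBuffer. Since sleep_for() only guarantees a minimum sleep time, the timeout is a lower bound rather than an exact figure.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper: polls an atomic flag in 1 ms steps.
// Returns true once `pending` has been cleared by another thread,
// false if it is still set after roughly `timeoutMs` milliseconds.
bool waitWhilePending(const std::atomic<bool>& pending, uint32_t timeoutMs) {
    while (pending.load(std::memory_order_acquire)) {
        if (timeoutMs == 0) {
            return false; // gave up: the expected write never arrived
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        --timeoutMs;
    }
    return true;
}
```

A seek routine would call this first and only reset the buffer when it returns true, bailing out with an error otherwise.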

The atomic variable capture aspect comes into play with the DataBuffer::read() and DataBuffer::write() methods. The aforementioned free and unread counters are implemented as atomic variables. This provides some level of safety, but does not guarantee that the variable’s value will not be changed by another thread during the execution of either of these class methods.

A solution is to capture the variables needed into a local variable, as done e.g. in DataBuffer::read():

uint32_t locunread = unread;
uint32_t bytesSingleRead = locunread;
if ((end - index) < bytesSingleRead) { bytesSingleRead = end - index; }

In this code we capture the unread atomic variable into a local variable, which is then used to calculate the number of bytes we can safely read from the buffer. This same captured value is then used throughout the class method instead of the class variable, since the class variable may be changed by another thread at any moment after the capture and is therefore unsafe to read again.

The same atomic capture approach is used with the counter for free bytes when writing into the buffer. What this guarantees is that at the moment of capture, the value of these counters is accurate, and most importantly that even if they are written to after capture, it has no negative repercussions. The worst outcome of an increase of the value of the number of free or unread bytes is that there are more bytes to read or available to write than at the moment of capture, neither of which poses a problem.

The result of this approach is that the single writer thread can safely write into the ring buffer until it runs out of sufficient space to write another chunk. Similarly, the reader thread can read constantly and should theoretically not run out of data to read, barring throughput issues on the side of the writer that are beyond the scope of the ring buffer.
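To make the capture idea concrete, here is a stripped-down sketch of a single-writer, single-reader byte ring. This is not the DataBuffer class: the class name, the memory orderings and the partial-write behaviour are assumptions of mine. Each method reads its counter exactly once into a local variable and works only with that value.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical single-producer, single-consumer byte ring (illustration only).
class SketchRing {
public:
    explicit SketchRing(uint32_t capacity)
        : buf(capacity), unread(0), freeBytes(capacity) { assert(capacity > 0); }

    // Writer thread only. Writes as much as fits; returns the bytes written.
    uint32_t write(const uint8_t* data, uint32_t len) {
        const uint32_t locfree = freeBytes.load(std::memory_order_acquire); // capture once
        const uint32_t n = std::min(len, locfree);
        const uint32_t first = std::min(n, capacity() - writeIdx); // up to buffer end
        std::memcpy(buf.data() + writeIdx, data, first);
        std::memcpy(buf.data(), data + first, n - first);          // wrapped remainder
        writeIdx = (writeIdx + n) % capacity();
        freeBytes.fetch_sub(n, std::memory_order_relaxed);
        unread.fetch_add(n, std::memory_order_release);            // publish the data
        return n;
    }

    // Reader thread only. Reads at most `len` bytes; returns the bytes read.
    uint32_t read(uint8_t* out, uint32_t len) {
        const uint32_t locunread = unread.load(std::memory_order_acquire); // capture once
        const uint32_t n = std::min(len, locunread);
        const uint32_t first = std::min(n, capacity() - readIdx);
        std::memcpy(out, buf.data() + readIdx, first);
        std::memcpy(out + first, buf.data(), n - first);
        readIdx = (readIdx + n) % capacity();
        unread.fetch_sub(n, std::memory_order_relaxed);
        freeBytes.fetch_add(n, std::memory_order_release);         // hand space back
        return n;
    }

    uint32_t capacity() const { return static_cast<uint32_t>(buf.size()); }

private:
    std::vector<uint8_t> buf;
    uint32_t writeIdx = 0;           // touched by the writer only
    uint32_t readIdx = 0;            // touched by the reader only
    std::atomic<uint32_t> unread;    // bytes available to the reader
    std::atomic<uint32_t> freeBytes; // bytes available to the writer
};
```

If the other thread bumps a counter right after the capture, the method merely transfers fewer bytes than it could have, which is the harmless outcome described above.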

A possible improvement here would be to have the data request callback receive the total number of free bytes (as captured), so that it could scale the requested data to the capabilities of the data source. This will have to be further tested and analysed.

Maya

[1] https://github.com/MayaPosch/NymphCast
[2] https://github.com/MayaPosch/LockFreeRingBuffer
[3] https://github.com/MayaPosch/LockFreeRingBuffer/blob/master/test/test_databuffer_multi_port.cpp

http://mayaposch.wordpress.com/?p=238

Refactoring NymphRPC for zero-copy optimisation

Maya Posch Nov 11, 2021

When I originally wrote the code for what became NymphRPC [1], efficiency was not my foremost concern, but rather reliable functionality was. Admittedly, so long as you just send a couple of bytes and short strings to and from client and server, the overhead of network transmission is very likely to mask many inefficiencies. That is, until you try to send large chunks of data.

The motivation for refactoring NymphRPC came during the performance analysis of NymphCast [2] using Valgrind’s Callgrind tool. NymphCast uses NymphRPC for all its network-based communications, including the streaming of media data between client and server. This involves sending the data in chunks of hundreds of kilobytes, which is where the constant copying of data strings in NymphRPC showed itself to be a major overhead.

Specifically this showed itself (on Linux) in many calls to __memcpy_avx_unaligned_erms, largely originating from within std::string. There were multiple reasons for this, involving the copying of std::string instances into a NymphString type, copying this data again during message serialisation, and repeated copying of message data during deserialisation: first after receiving the messages from the network socket, then again during deserialisation of the message.

Finally, the old NymphRPC API was designed such that all data would be copied inside the NymphRPC types, which added convenience, but at a fairly large performance cost, as seen.

Using a benchmark program created using the Catch2 benchmarking framework [3][4] – consisting of a NymphRPC client and server – the following measurements were obtained after compilation with Visual Studio 2019 (MSVC 16) with -O2 optimisation level:

benchmark name                            samples    iterations          mean
-------------------------------------------------------------------------------
uint32                                          20             1    178.387 us
double                                          20             1    138.282 us
array     1:                                    20             1    197.452 us
array     5:                                    20             1    198.407 us
array    10:                                    20             1    204.417 us
array   100:                                    20             1    512.027 us
array  1000:                                    20             1    3.08481 ms
array 10000:                                    20             1    32.8876 ms
blob       1:                                   20             1    188.677 us
blob      10:                                   20             1    141.712 us
blob     100:                                   20             1    174.832 us
blob    1000:                                   20             1    133.617 us
blob   10000:                                   20             1    211.097 us
blob  100000:                                   20             1    362.747 us
blob  200000:                                   20             1    1.35672 ms
blob  500000:                                   20             1    3.37874 ms
blob 1000000:                                   20             1    8.19277 ms

In order to reduce the number of calls to memcpy, it was decided to move to a zero-copy approach, which effectively means that no data is copied by NymphRPC unless it’s absolutely necessary, or there is no significant difference between copying and taking the pointer address of a value.

This involved changing the NymphRPC type system to still copy simple types (integers, floating point, boolean), but only accept pointers to a std::string, character array, std::vector (‘array’ type) or std::map (‘struct’ type), with optional transfer of ownership to NymphRPC. Done this way, ideally the original non-simple value is allocated once (stack or heap), and copied once into the transfer buffer for the network socket. The serialisation itself is done into a pre-allocated buffer, avoiding the use of std::string altogether.

On the receiving end, the receiving character buffer is filled with the received data, and the parsing routine creates a pointer reference to non-simple types within the received data. In the receiving application’s code, this can then be read straight from this buffer, which in the case of NymphCast means that its internal ring buffer can copy the blocks of data straight from the received data buffer into the ring buffer with a single call to memcpy(), without any intermediate copying of the data.
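The idea of referencing data inside the receive buffer rather than copying it out can be sketched as follows. This uses a made-up wire format (a 4-byte length prefix in host byte order followed by the payload), not NymphRPC’s actual binary protocol, and the function name is hypothetical as well. It requires C++17 for std::string_view and std::optional.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical parser: returns a view onto the payload inside `rx`
// without copying it, or std::nullopt if the message is truncated.
// The view is only valid for as long as `rx` is alive and unmodified.
std::optional<std::string_view> parseBlobView(const std::vector<char>& rx) {
    uint32_t len = 0;
    if (rx.size() < sizeof(len)) {
        return std::nullopt;                    // not even a full header
    }
    std::memcpy(&len, rx.data(), sizeof(len));  // copy the simple type (4 bytes)
    if (rx.size() - sizeof(len) < len) {
        return std::nullopt;                    // payload shorter than announced
    }
    return std::string_view(rx.data() + sizeof(len), len); // pointer + length only
}
```

The consumer can then hand view.data() and view.size() to a single memcpy() into its own buffer, which is the only copy made on the receiving side. The price is that the receive buffer has to outlive every view into it.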

Running the same benchmark (adapted for the new API) with the same compilation settings yields the following results:

benchmark name                            samples    iterations          mean
-------------------------------------------------------------------------------
uint32                                          20             1    122.193 us
double                                          20             1    140.368 us
array     1:                                    20             1    173.963 us
array     5:                                    20             1    189.888 us
array    10:                                    20             1    220.653 us
array   100:                                    20             1    573.168 us
array  1000:                                    20             1    3.33472 ms
array 10000:                                    20             1    31.8041 ms
blob       1:                                   20             1    181.433 us
blob      10:                                   20             1    194.048 us
blob     100:                                   20             1    153.998 us
blob    1000:                                   20             1    174.073 us
blob   10000:                                   20             1    166.228 us
blob  100000:                                   20             1    240.223 us
blob  200000:                                   20             1    343.233 us
blob  500000:                                   20             1    716.233 us
blob 1000000:                                   20             1     2.0748 ms

Taking into account natural variation when running benchmark tests (even with network data via localhost), it can be noted that there is no significant change for simple types, and arrays (std::vector) show no major change either. For the latter type a possible further optimisation can be achieved by streamlining the determination of total binary size for the types within the array, avoiding the use of a loop. This was a compromise solution during refactoring that may deserve revisiting in the future.

The most significant change can – as expected – be observed in the character strings (‘blob’). Here entire milliseconds are shaved off for the larger transfers, making for a roughly 3.5x improvement. In the case of NymphCast, which uses 200 kB chunks, this means a reduction from about 1.4 milliseconds to 350 microseconds, or 4 times faster.

After integration of the new NymphRPC into NymphCast, this improvement was observed during a subsequent analysis with Callgrind: __memcpy_avx_unaligned_erms dropped from the top of the list of functions the application spent time in to somewhere below the noise floor, to the point of being inconsequential. In actual usage of NymphCast, the improvement was somewhat noticeable as improved response time.

Further analysis would have to be performed to characterise the improvements in memory (heap and stack) usage, but it is presumed that both are lower – along with CPU usage – due to the reduction in copies of the data, and reduction in CPU time spent on creating these copies.
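
To make the difference between copying and referencing data concrete, here is a minimal sketch. It is not taken from NymphRPC: `BlobRef`, `copyBlob` and `referenceBlob` are hypothetical names, and the sketch assumes that the referenced buffer outlives every reference to it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical non-owning reference to a blob that lives elsewhere.
struct BlobRef {
    const char* data;
    std::size_t length;
};

// Copying approach: allocates new storage and copies every byte (memcpy under the hood).
std::string copyBlob(const std::string& source) {
    return std::string(source);
}

// Referencing approach: only a pointer and a length are stored, no bytes are copied.
// The caller must keep 'source' alive and unmodified while the BlobRef is in use.
BlobRef referenceBlob(const std::string& source) {
    return BlobRef{source.data(), source.size()};
}
```

With a 200 kB chunk, the copying variant touches all 200,000 bytes for each copy made, while the referencing variant does a constant amount of work regardless of the blob size.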

Maya

[1] https://github.com/MayaPosch/NymphRPC/
[2] https://github.com/MayaPosch/NymphCast
[3] https://github.com/catchorg/Catch2
[4] https://github.com/MayaPosch/NymphRPC/blob/master/test/test_performance_nymphrpc_catch2.cpp

http://mayaposch.wordpress.com/?p=230

Why I hate testing my own code, and NymphCast Beta things

Maya Posch Sep 2, 2021

There appears to be a trend in the world of software development where it is assumed that QA departments are an unnecessary expense, and that developers are perfectly capable of testing their own code, as well as that of their colleagues. After all, if you wrote the code, or are in the same development team, you know the best way to test the code. Unfortunately this is about as wrong an assumption as one can make.

I have been constantly reminded of this during the past months as NymphCast [1] has been slowly crawling its way through the first Alpha stage. While trying to settle on the right feature set for a first release, having external input from not only the intended users, but from dedicated testers is essential.

Although your target market will give you a list of things they’d like to see, it is your QA team which will make it clear to you which of the implemented features are actually ready for prime-time, and how much air there is left in the time & sanity budget for new features or stabilising already implemented features.

In the upcoming NymphCast Beta v0.1, the core feature set has now been settled upon and a feature freeze is in place. The features that will make it in with ‘officially supported’ status include the streaming of media content to a NymphCast receiver (server) from an Android device or desktop device, as well as from a NymphCast MediaServer instance on the same local network (LAN).

Features that are at least partially implemented, but are considered ‘experimental’ include multicast/multi-room media playback, the standalone GUI (‘SmartTV mode’) and NymphCast Apps as a whole. These features will be present in the v0.1 release, but are more of a preview of features that will likely become supported in v0.2. As preview features, they are not expected to work reliably, if at all. Whether the v0.2 release will have similar experimental features in place remains to be seen.

Many of these decisions were made while working with the friend who has dedicated himself to QA, and I’m grateful for every bit of feedback and every bug report I got tossed my way.

The reason why asking a developer to ‘test their code’ doesn’t work, is because they’re biased. And even if they try really hard to test well, it’s a different mindset to test code versus writing it. The best QA person is one who doesn’t care in the slightest how you as a developer expected users to use your software, but who will gleefully use it whatever way ‘seems logical’.

As a developer, when I test my own code, I tend to focus on just a single detail of the software. This is excellent when you’re processing a bug report and trying to hunt down an issue in the code or environment, but the QA-mode, top-down look at the code is not something that comes naturally. Based on my experience, this is something which I can do with my own code, but it generally requires me to not touch the project for a few months or longer.

This makes it obvious that the main reason why developers are terrible at QA-level testing of their own code is that they know too much about the code. Whether consciously or subconsciously, this makes you dodge risky tasks, and follow UX patterns that may seem reasonable in light of how the code works, but where the average user may do something entirely different. Much like you’d do yourself when confronted with your own software a few years from now.

What this also shows is just how much of a team effort software development is. Of course one can expect a couple of software developers to kinda sorta band together and handle writing code, testing it, doing QA, creating and processing bug reports, and handling packaging and distribution, but things work just so much better when you have dedicated developers, QA folk, people who handle packaging & distribution and so on.

I’d argue that each of those are tasks which aren’t easy to switch between, especially not all on the same day. Yes, it can be made to work, but the results will always be sub-optimal.

With the NymphCast QA department currently working hard to shake out any remaining issues before the first v0.1-beta release of NymphCast, it will soon be time to engage the community for testing this and future Betas and Release Candidates. For as awesome as QA people and other dedicated testers on a project are, it’s hard to beat a large community of users for the most diverse collection of hardware, network configurations and usage patterns imaginable.

On that note, stand by for the upcoming announcement of the first NymphCast Beta. Feel free to have a peek at the software before then as well, and add your questions or feedback.

Maya

[1] https://github.com/MayaPosch/NymphCast

http://mayaposch.wordpress.com/?p=225

Purgatory or Hell: Escape from eternal Alpha status

Maya Posch Jan 17, 2021

Many of us will have laughed and scoffed at Google’s liberal use of the tag ‘Beta software’ these past years. Is the label ‘Beta’ nothing more than an excuse for any bugs and issues that may still exist in code, even when it has been running in what is essentially a production environment for years? Similarly, the label ‘Alpha’ when given to software would also seem to seek a kind of indemnity for any issues or lacking features: to dismiss any issue or complaint raised with the excuse that the software is still ‘in alpha’.

Obviously, any software project needs time to develop. Ideally it would have a clear course through the design and requirements phase, smooth sailing through Alpha phase as all the features are bolted onto the well-designed architecture, and finally the polishing of the software during the Beta and Release Candidate (RC) phases. Yet it’s all too easy to mess things up here, which usually ends up with a prolonged stay in the Alpha phase.

A common issue that leads to this is too little time spent in the initial design and requirements phase. Without a clear idea of what the application’s architecture should look like, the result is that during the Alpha phase both the features and architecture end up being designed on the spot. This is akin to wanting to build a house before the architectural plans are drawn up, and starting to build anyway because one has a rough idea of what a house looks like.

When I began work on the NymphCast project [1] a few years back, all I had was a vague idea of ‘streaming audio’, which slowly grew over time. The demise of Google’s ChromeCast Audio product gave me a hint to look at what that product did, and how people looked at it. By that time NymphCast was little more than a concept and an idea in my head, and I’m somewhat ashamed to say that it took me far too long to work out solid requirements and a workable design and architecture.

Looking back, what NymphCast was at the beginning of 2020 – when it got a sudden surge of attention after an overly enthusiastic post from me on the topic – was essentially a prototype. A prototype is somewhat like an Alpha-level construction, but never meant to be turned into a product: it’s a way to gather information for the design and requirements phase, so that a better architecture and product can be developed. Realising this was essential for me to take the appropriate steps with the NymphCast project.

With only a vague idea of one’s direction and goals while in the Alpha phase, one can be doomed to stay there for a long time, or even forever. After all, when is the Alpha phase ‘done’, when one doesn’t even have a clear definition of what ‘done’ actually means in that context? Clearly one needs to have a clear feature set, clear requirements, a clear schedule and definition of ‘done’ for all of those. Even for a hobby project like NymphCast, there is no fun in being stuck in Alpha Limbo for months or even years.

After my recent post [2] on the continuation of the NymphCast project after a brief burn-out spell, I have not yet gotten the project into a Beta stage. What I have done is frozen the feature set, and together with a friend I’m gradually going through the remaining list of Things That Do Not Work Properly Yet. Most of this is small stuff, though the small stuff is usually the kind of thing that will have big consequences on user friendliness and overall system stability. This is also the point where there are big rewards for getting issues fixed.

The refactored ring buffer class has had some issues fixed, and an issue with a Stop condition was recently resolved. The user experience on the player side has seen some bug fixes as well. This is what Alpha-level testing should be like: the hunting down of issues that impede a smooth use of the software, until everything seems in order.

The moral of this story then is that before one even writes a line of code, it’s imperative that one has a clear map of where to go and what to do, lest one becomes lost. The second moral is that it’s equally imperative to set limits. Be realistic about the features one can implement this time around. Sort the essential from the ‘nice to have’. If one does it right now, there is always a new development cycle after release into production where one gets to tear everything apart again and add new things.

Ultimately, the Alpha phase ends when it’s ‘good enough’. The Beta phase ends when the issue tracker begins to run dry. Release Candidates exist because life is full of unexpected surprises, especially when it concerns new software. Yet starting the Alpha phase before putting together a plan makes as much sense as walking into the living room at night without turning a light on because ‘you know where to walk’.

Fortunately, even after you have repeatedly bumped your shins against furniture and fallen over a chair, it’s still not too late to turn on a light and do the limping walk of shame.

Maya

[1] https://github.com/MayaPosch/NymphCast
[2] https://mayaposch.wordpress.com/2020/12/27/nymphcast-on-getting-a-chromecast-killer-to-a-beta-release/

http://mayaposch.wordpress.com/?p=219

NymphCast: on getting a ‘ChromeCast killer’ to a Beta release

Maya Posch Dec 27, 2020

It’s been a solid nine months since I first wrote about the NymphCast project [1] on my personal blog [2]. That particular blog post ended up igniting a lot of media attention [3], as it also began to dawn on me how much work would still be required to truly get it to a ‘release’ state. Amidst the stress from this, the 2020 pandemic and other factors, the project ended up slumbering for a few months as I tried to stave off burn-out on the project as a whole.

Sometimes such a break from a project is essential, to be able to step back instead of bashing one’s head against the same seemingly insurmountable problems over and over as they threaten to drown you into an ocean of despair, frustration and helplessness. You know, the usual reason why ‘grinding’, let alone a full-blown death march, is such a terrible thing in software development.

One thing I did do during that time off was to solve one particular issue that had made me rather sad during initial NymphCast development: that of auto-discovery of NymphCast servers on the local network. I had attempted to use DNS Service Discovery (DNS-SD, mDNS) for this, but ran into the issue that there is no cross-platform solution for mDNS that Just Works™. Before reading up on mDNS I had in my mind a setup where the application itself would announce its presence to the network, or to a central mDNS server on the system, as that made sense to me.

Instead I found myself dealing with a half-working solution that basically required Avahi on Linux, Bonjour on MacOS and something custom installed and configured on Windows, not to mention other desktop operating systems. On the client side things were even more miserable, with me finding only a single library for mDNS that was somewhat easy to integrate. Yet even then I had no luck making it work across different OSes, with the running server instances regularly not found, or requiring specific changes to the service name string to get a match.

The troubleshooting there was one factor that nearly made me burn out on the NymphCast project. Then, during that break I figured that I might as well write something myself to replace mDNS. After all, I just needed something that spit out a UDP Broadcast message, and something that listened for it and responded to it. This idea turned into NyanSD [4], which I wrote about before [5].

I have since integrated NyanSD into NymphCast on the server & client side, with the result that I have had no more problems with service discovery, regardless of the platform.

Other aspects of NymphCast were less troublesome, but mostly just annoying, such as getting a mobile client for NymphCast. Originally I had planned to use a single codebase for the graphical NymphCast Player application, using Qt’s Android & iOS cross-platform functionality to target desktop and mobile platforms. Unfortunately this ran into the harsh reality of Qt’s limited Android support and spotty documentation [6]. This led me to work on a standard, native Android application written in Java for the GUI and using the JNI to use the same C++ client codebase. This way I only have to port the Qt-specific code on the Android side to the Java-Android equivalent.

Status at this point is that all features for the targeted v0.1 release have been implemented, with testing ongoing. An additional feature that got integrated at the last moment was the synchronisation of music and video playback between different NymphCast devices, for multi-room playback and similar. The project also saw the addition of a MediaServer [7], which allows clients to browse the media files shared by the server, and start playback of these files on any of the NymphCast servers (receivers) on the network. I also refactored the in-memory buffer to use a simple ringbuffer instead of the previous, more complicated buffer.
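
For readers unfamiliar with the concept, a simple ring buffer can be sketched as follows. This is a generic, single-threaded illustration and not the NymphCast buffer itself: the class name `RingBuffer` and its interface are assumptions made for this example, and a real media buffer would also need thread synchronisation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal fixed-capacity byte ring buffer (single-threaded sketch).
class RingBuffer {
public:
    explicit RingBuffer(std::size_t capacity) : data_(capacity) {}

    // Writes up to 'len' bytes; returns how many bytes actually fit.
    std::size_t write(const std::uint8_t* src, std::size_t len) {
        const std::size_t n = std::min(len, data_.size() - size_);
        for (std::size_t i = 0; i < n; ++i) {
            data_[(head_ + size_ + i) % data_.size()] = src[i];
        }
        size_ += n;
        return n;
    }

    // Reads up to 'len' bytes in FIFO order; returns how many bytes were read.
    std::size_t read(std::uint8_t* dst, std::size_t len) {
        const std::size_t n = std::min(len, size_);
        if (n == 0) { return 0; }
        for (std::size_t i = 0; i < n; ++i) {
            dst[i] = data_[(head_ + i) % data_.size()];
        }
        head_ = (head_ + n) % data_.size(); // Wrap the read position around.
        size_ -= n;
        return n;
    }

    std::size_t size() const { return size_; }

private:
    std::vector<std::uint8_t> data_;
    std::size_t head_ = 0; // Index of the oldest unread byte.
    std::size_t size_ = 0; // Number of unread bytes currently stored.
};
```

The appeal of this design is that the storage is allocated once and reused: writes wrap around to the start of the buffer as soon as earlier data has been read.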

In order to get the v0.1 development branch out of Alpha and into Beta, a few more usage scenarios have to be tested, specifically the playback of large media files (100+ MB), both with a single NymphCast receiver and a group, and directly from a client as well as using a MediaServer instance. The synchronisation feature has seen some fixes recently already while testing it, but needs more testing to make it half-way usable.

A major issue I found with this synchronisation feature was the difficulty of determining local time on all the distinct devices. With the lack of a real-time clock (RTC) on Raspberry Pi SBCs in particular, I had to refactor the latency algorithm to only rely on the clock of the receiver that was used as the master receiver. This issue will likely require more tweaking over the coming time to get the de-synchronisation below 100 ms.
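
One way to rely on a single clock is sketched below. The actual NymphCast latency algorithm is not described here, so this is purely an assumption for illustration: the master measures the round-trip time to each receiver using only its own clock, estimates the one-way latency as half of that, and delays the start command for faster links so that all receivers begin at roughly the same moment. The names `oneWayLatency` and `startDelays` are hypothetical.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

using Millis = std::chrono::milliseconds;

// Assumption: the link is symmetric, so one-way latency is half the round trip.
Millis oneWayLatency(Millis roundTrip) {
    return roundTrip / 2;
}

// Given round-trip times measured on the master's clock, compute how long the
// master should wait before sending 'start' to each receiver, so that the
// command arrives everywhere at about the same time as on the slowest link.
std::vector<Millis> startDelays(const std::vector<Millis>& roundTrips) {
    std::vector<Millis> latencies;
    for (const Millis& rt : roundTrips) {
        latencies.push_back(oneWayLatency(rt));
    }
    if (latencies.empty()) { return {}; }

    const Millis slowest = *std::max_element(latencies.begin(), latencies.end());
    std::vector<Millis> delays;
    for (const Millis& latency : latencies) {
        delays.push_back(slowest - latency);
    }
    return delays;
}
```

Since no receiver clock is ever compared against another, a missing RTC or unsynchronised system time on a receiver does not affect the result. The accuracy is instead limited by how symmetric and stable the network latency is.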

I think that in the run-up to a v0.1 release, the Beta phase will be highly useful in figuring out the optimal end-user scenarios, both in terms of easy setup and configuration, as well as the day to day usage. This is the point where I pretty much have to rely on the community to get a solid idea of what are good ideas, and what patterns should be avoided.

That said, it’s somewhat exciting to see the project now finally progressing to a first-ever Beta release. Shouldn’t be more than a year or two before the first Release Candidate now, perhaps.

Maya

[1] https://github.com/MayaPosch/NymphCast
[2] https://mayaposch.blogspot.com/2020/03/nymphcast-casual-attempt-at-open.html
[3] https://mayaposch.blogspot.com/2020/03/the-fickle-world-of-software-development.html
[4] https://github.com/MayaPosch/NyanSD
[5] https://mayaposch.wordpress.com/2020/07/26/easy-network-service-discovery-with-nyansd/
[6] https://bugreports.qt.io/browse/QTBUG-83372
[7] https://github.com/MayaPosch/NymphCast-MediaServer

http://mayaposch.wordpress.com/?p=208

Easy network service discovery with NyanSD

Maya Posch Jul 26, 2020

In the process of developing an open alternative to ChromeCast called NymphCast [1], I found myself having to deal with DNS-SD (DNS service discovery) and mDNS [2]. This was rather frustrating, if only because one cannot simply add a standard mDNS client to a cross-platform C++ application, nor is setting up an mDNS record for a cross-platform service (daemon) an easy task, with the Linux world mostly using Avahi, while MacOS uses Bonjour, and Windows also kinda-sorta-somewhat using Bonjour if it’s been set up and configured by the user or third-party application.

As all that I wanted for NymphCast was to have an easy way to discover NymphCast receivers (services) running on the local network from a NymphCast client, this all turned out to be a bit of a tragedy, with the resulting solution only really working when running the server and client on Linux. This was clearly sub-optimal, and left me with the options of fighting some more with existing mDNS solutions, implementing my own mDNS server and client, or writing something from scratch.

As mDNS (and thus DNS-SD) is a rather complex protocol, and it isn’t something which I feel a desperate need to work with when it comes to network service discovery of custom services, I decided to implement a light-weight protocol and reference implementation called ‘NyanSD’, for ‘Nyanko Service Discovery’ [3].

NyanSD is a simple binary protocol that uses a UDP broadcast socket on the client and UDP listening sockets on the server side. The client sends out a broadcast query which can optionally request responses matching a specific service name and/or network protocol (TCP/UDP). The server registers one or more services, which could be running on the local system, or somewhere else. This way the server acts more as a registry, allowing one to also specify services which do not necessarily run on the same LAN.
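
The registry behaviour described above can be sketched as a simple filter over the registered services. This is not the NyanSD API or its wire format: the types `Protocol`, `Service` and `Query` and the function `matchServices` are hypothetical, and the UDP socket handling and binary encoding are left out entirely.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical protocol filter; 'Any' means the query does not filter on protocol.
enum class Protocol : std::uint8_t { Any, Tcp, Udp };

// A registered service. The host may be local or anywhere else on the network.
struct Service {
    std::string name;
    Protocol protocol;
    std::string host;
    std::uint16_t port;
};

// A client query. An empty name means 'any service name'.
struct Query {
    std::string name;
    Protocol protocol;
};

// Returns all registered services that satisfy the (optional) filters in the query.
std::vector<Service> matchServices(const std::vector<Service>& registry, const Query& query) {
    std::vector<Service> matches;
    for (const Service& service : registry) {
        const bool nameOk = query.name.empty() || query.name == service.name;
        const bool protocolOk = query.protocol == Protocol::Any || query.protocol == service.protocol;
        if (nameOk && protocolOk) {
            matches.push_back(service);
        }
    }
    return matches;
}
```

In such a setup the server would run this filter for every incoming broadcast query and send the matching entries back to the client that asked.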

The way that I envisioned NyanSD originally was merely as an integrated solution within NymphCast, so that the NymphCast server can advertise itself on the UDP port, while accepting service requests on its TCP port. As I put the finishing touches on this, it hit me that I could easily make a full-blown daemon/service solution out of it as well. With the NyanSD functionality implemented in a single header and source file, it was fairly easy to create a server that would read in service files from a standard location (/etc/nyansd/services on Linux/BSD/MacOS, %ProgramData%\NyanSD\services on Windows). This also allowed me to implement my first ever Windows service, which was definitely educational.
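
Picking the platform-specific service file location could look something like the sketch below. The function name `servicesDirectory` is made up for this example, and the fallback to C:\ProgramData when the environment variable is not set is an assumption rather than something the NyanSD server is known to do.

```cpp
#include <cstdlib>
#include <string>

// Returns the directory from which service files would be read on this platform.
std::string servicesDirectory() {
#ifdef _WIN32
    // %ProgramData% is resolved from the environment at runtime.
    const char* base = std::getenv("ProgramData");
    // Assumption: fall back to the usual default location if the variable is unset.
    const std::string root = base ? base : "C:\\ProgramData";
    return root + "\\NyanSD\\services";
#else
    // Linux, BSD and MacOS share a fixed location.
    return "/etc/nyansd/services";
#endif
}
```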

Over the coming time I’ll be integrating NyanSD into NymphCast and likely discarding the dodgy mDNS/DNS-SD attempt. It will be interesting to see whether I or others will find a use for the NyanSD server. While I think it would be a more elegant solution than the current mess with mDNS/DNS-SD and UPnP network discovery, some may disagree with this notion. I’m definitely looking forward to discussing the merits and potential improvements of NyanSD.

Maya

[1] https://github.com/MayaPosch/NymphCast
[2] https://en.wikipedia.org/wiki/Zero-configuration_networking#DNS-based_service_discovery
[3] https://github.com/MayaPosch/NyanSD

http://mayaposch.wordpress.com/?p=205

Keeping history alive with a 1959 FACOM 128B relay-based computer

Maya Posch Aug 4, 2019

Back in the 1950s, the competition was between vacuum tube (valve) based computers and their relay-based brethren. Whereas the former type was theoretically faster, vacuum tubes suffer from reliability issues, which meant that relay-based computers would be used alongside tube-based ones. Not surprisingly, Fujitsu also designed a number of such electro-mechanical computers back then. More surprisingly, they are still keeping a FACOM 128B in tip-top shape.

At Fujitsu, known in the 1950s as Fuji Tsushinki Manufacturing Corporation, Ikeda Toshio was involved in the design of first the FACOM 100, which was completed in 1954, followed by the FACOM 128A in 1956. The 128B was a 1958 upgrade of the 128A based on user experiences. Fujitsu installed a FACOM 128B at their own offices in 1959 to assist with projects ranging from the design of camera lenses to the NAMC YS-11 passenger plane, as well as calculation services.

As a successor in a long line of electro-mechanical computers (including the US’s 1944 Harvard Mark I) performance was pretty much as good as it was going to get with relays. Ratings of the FACOM 128B were listed as 0.1-0.2 seconds for addition/subtraction operations, 0.1-0.35 seconds for multiplication, with operations involving complex numbers and logarithmic operations taking in the order of seconds. Maybe not amazing by today’s (or 1970s) standards, but back then their point was to massively and consistently outperform human computers, with (ideally) unfailing accuracy.

Today, this same FACOM 128B can be found at the Toshio Ikeda Memorial Hall at Fujitsu’s Numazu Plant, where it’s lovingly maintained by the 49-year-old engineer Tadao Hamada. Working as the leader of Fujitsu’s 2006 project to pass down technology that is still historically relevant, his job is basically to keep this relay-based computer working the way it has done since it was installed in 1959.

Taking up 70 square meters in the visitor center at the Numazu Plant, the most impressive thing about the machine when it’s operating is the wave of sounds that commences and dies down again whenever an operation is executed, with its hundreds of relays opening and closing the contacts that make up its circuitry. The below video provides a good overview of this computer for your own enjoyment.

Naturally we all know how this battle between relays and tubes ended; before the 1960s began, transistors became ever more reliable and cheap, gradually taking over computing and switching roles until the era of VLSI ended the era of relays and tubes for good. Which is why it’s ever more impressive that so many decades later, Fujitsu still has this computer in pristine condition where its brethren invariably have been scrapped aside from a few lucky survivors.

http://mayaposch.wordpress.com/?p=200
