Memory Barrier

A software developer's blog

QNX Resource Manager in Rust: Design

This is the second part in the series on writing a QNX resource manager in Rust. See the first part for background information on message passing and resource managers in QNX.

This post will describe the design I came up with for writing a resource manager in Rust. It is neither complete, nor optimal, and I am certain things will change as both I and others gain some experience in writing resource managers in Rust. Nevertheless, we have to start somewhere…

Disclaimer

I am a Rust novice (or noob, as the cool kids would say). It is quite possible that the approach I have taken in implementing a resource manager is anywhere between sub-optimal and completely misguided. Comments from bona fide Rust experts (or would-be experts) are welcome, either on the blog post or in GitLab.

Structure

The code is divided in two. The qnxmsg module provides generic message handling, and will eventually become a crate of its own. The server module implements the Raspberry Pi GPIO server, which includes hardware access to the GPIOs and the implementation of the I/O message handlers.

Message Dispatch

At the heart of a resource manager is the ability to handle messages sent by clients. To that effect, the resource manager needs to

  1. receive messages on a channel;
  2. interpret the messages;
  3. act on each message type;
  4. reply to the client.

The Channel structure is a simple wrapper around a channel ID, as returned by a call to ChannelCreate(). The implementation provides a receive() function, which calls MsgReceive() on that channel, populating a MsgBuf structure. That structure is just an array of bytes, whose size anticipates the longest message that the server can handle. Once initialized with data returned from a call to MsgReceive(), the server can determine the type of the message from the first two bytes, and then reinterpret the buffer using the concrete type of the message. This reinterpretation, as provided by the get_data_as() generic function, ensures that the data received from the client is sufficiently large for the message type, preventing one common bug in resource managers.
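
To make the size check concrete, here is a minimal sketch of a length-checked reinterpretation. It is not the code from the repository: the WireMsg marker trait, the IoRead layout, the IO_READ value and the 256-byte buffer size are all assumptions made for illustration, and this version returns a copy rather than a reference.

```rust
use std::mem::size_of;

const BUF_SIZE: usize = 256; // assumed size of the longest supported message
const IO_READ: u16 = 0x101; // illustrative value; the real one comes from the QNX headers

/// Marker for plain-old-data message types for which any bit pattern is valid.
/// Hypothetical trait, introduced only for this sketch.
unsafe trait WireMsg: Copy {}

/// Illustrative layout, not the real _IO_READ structure.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct IoRead {
    msg_type: u16,
    combine_len: u16,
    nbytes: u32,
}
unsafe impl WireMsg for IoRead {}

struct MsgBuf {
    data: [u8; BUF_SIZE],
    len: usize, // number of bytes actually received from the client
}

impl MsgBuf {
    /// Stand-in for a buffer filled by MsgReceive().
    fn from_bytes(bytes: &[u8]) -> MsgBuf {
        let mut data = [0u8; BUF_SIZE];
        let len = bytes.len().min(BUF_SIZE);
        data[..len].copy_from_slice(&bytes[..len]);
        MsgBuf { data, len }
    }

    /// The message type lives in the first two bytes.
    fn msg_type(&self) -> Option<u16> {
        if self.len < 2 {
            return None;
        }
        Some(u16::from_ne_bytes([self.data[0], self.data[1]]))
    }

    /// Reinterpret the buffer as T, but only if the client sent enough bytes.
    fn get_data_as<T: WireMsg>(&self) -> Option<T> {
        if self.len < size_of::<T>() {
            return None;
        }
        // SAFETY: the length check above keeps the read inside the received
        // data, WireMsg promises that any bit pattern is a valid T, and
        // read_unaligned does not require the buffer to be aligned for T.
        Some(unsafe { std::ptr::read_unaligned(self.data.as_ptr().cast::<T>()) })
    }
}
```

A client that sends only the four header bytes gets None back from get_data_as::<IoRead>(), instead of the handler reading whatever happened to follow the header in memory.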

The buffer structure MsgBuf is an example of the benefits of Rust over C when it comes to data access. Whereas resource managers written in C are prone to out of bounds access to the message buffer, especially once it has been cast to a concrete message type, Rust makes accidental illegal access impossible. You can still crash a program with out of bounds access (not every bug is caught at build time), but overwriting adjacent memory is not an option.

The message_loop() function provides the fundamental structure of the resource manager. Each call to message_loop() receives and handles one message or one pulse (asynchronous notifications received in band with messages). Once a message or a pulse is received, the function decodes its type, and then invokes a handler for this type. These top-level handlers are implemented in the iomsg sub-module. Each handler is responsible for reinterpreting the message buffer as the concrete message type, and for invoking the resource-manager-specific implementation of that handler (e.g., the handler for the _IO_READ message invokes the implementation of read() on a Raspberry Pi GPIO pseudo-file). Some top-level handlers perform more work. For example, the top-level handler for _IO_CONNECT extracts the path safely from the message as a string.
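
The decode-then-dispatch step can be sketched roughly as follows. The Received and Action enums, the decode() function, the numeric message types and the simplified connect layout (a NUL-terminated path directly after the two-byte type) are hypothetical and stand in for the real top-level handlers.

```rust
// Illustrative values; the real ones come from the QNX headers.
const IO_CONNECT: u16 = 0x100;
const IO_READ: u16 = 0x101;
const IO_WRITE: u16 = 0x102;

/// What one receive call can yield (hypothetical type).
enum Received<'a> {
    Pulse { code: i8, value: i32 },
    Message(&'a [u8]),
}

/// The handler the loop would invoke (hypothetical type).
#[derive(Debug, PartialEq)]
enum Action {
    Pulse(i8),
    Connect(String),
    Read,
    Write,
}

fn decode(rx: Received) -> Result<Action, &'static str> {
    match rx {
        Received::Pulse { code, .. } => Ok(Action::Pulse(code)),
        Received::Message(bytes) => {
            // The first two bytes carry the message type.
            let ty = bytes.get(..2).ok_or("message too short")?;
            match u16::from_ne_bytes([ty[0], ty[1]]) {
                IO_CONNECT => {
                    // Extract the path without trusting the client:
                    // stop at the first NUL and insist on valid UTF-8.
                    let raw = &bytes[2..];
                    let end = raw.iter().position(|&b| b == 0).unwrap_or(raw.len());
                    let path = std::str::from_utf8(&raw[..end])
                        .map_err(|_| "path is not valid UTF-8")?;
                    Ok(Action::Connect(path.to_string()))
                }
                IO_READ => Ok(Action::Read),
                IO_WRITE => Ok(Action::Write),
                _ => Err("unsupported message type"),
            }
        }
    }
}
```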

The message loop implementation also handles combine messages, which are another source of bugs in traditional resource managers. Combine messages allow certain combinations of messages to be performed as a single transaction, primarily for performance. For example, a stat() call is implemented as a combination of _IO_CONNECT, _IO_STAT and _IO_CLOSE, while a pread() call is implemented as a combination of _IO_SEEK and _IO_READ. Handling combine messages can cause problems with C resource managers, as the handlers are invoked with an offset into the message buffer, and must account for that to avoid out-of-bounds access. The Rust implementation takes care of that.
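
One way to picture the offset handling is to split the combined buffer into slices, so that each handler only ever sees its own part. The sketch below assumes a simplified layout in which every part starts with a two-byte type and a two-byte combine_len, where the high bit (modelled on _IO_COMBINE_FLAG) means another part follows and the remaining bits give the length of the part. The function name and the details of the layout are assumptions, not the repository's implementation.

```rust
const COMBINE_FLAG: u16 = 0x8000; // assumption, modelled on _IO_COMBINE_FLAG

/// Split a combined buffer into (type, bytes) parts.
/// The last part (flag clear) takes whatever remains of the buffer.
fn split_combined(buf: &[u8]) -> Result<Vec<(u16, &[u8])>, &'static str> {
    let mut parts = Vec::new();
    let mut rest = buf;
    loop {
        let hdr = rest.get(..4).ok_or("truncated header")?;
        let ty = u16::from_ne_bytes([hdr[0], hdr[1]]);
        let combine = u16::from_ne_bytes([hdr[2], hdr[3]]);
        if (combine & COMBINE_FLAG) == 0 {
            parts.push((ty, rest));
            return Ok(parts);
        }
        let len = usize::from(combine & !COMBINE_FLAG);
        if len < 4 {
            return Err("part shorter than its own header");
        }
        // A length that runs past the received data is rejected here,
        // rather than being discovered by a handler reading out of bounds.
        let part = rest.get(..len).ok_or("part exceeds received data")?;
        parts.push((ty, part));
        rest = &rest[len..];
    }
}
```

Because every part is a bounds-checked slice, a handler cannot reach into the next part or past the end of the buffer, whatever the client put in combine_len.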

Message Transactions

As explained in the previous post, QNX messages are synchronous, with every interaction between the client and server consisting of two messages: one from the client to the server (the request) and one from the server to the client (the response, or reply). This two-way interaction has not had a proper name historically (or, at least, I did not see one over the years), and so I decided to refer to it as a “transaction”. The kernel identifies a transaction with a value referred to as a “receive ID”, as returned by a call to MsgReceive(). This receive ID, which I refer to as the transaction ID, is then used in subsequent calls by the server to functions such as MsgReply(), MsgError(), MsgRead() and more. Each of these functions operates on actors (the client thread) and data involved in the current transaction.

The Transaction structure carries a few bits of information beyond the ID. One is the message information structure, filled by the kernel and returned by the MsgReceive() call. This structure includes the IDs of the client (process and thread), as well as the lengths of the request and the response buffers.

Importantly, the Transaction structure keeps track of whether the server has already replied, by maintaining a state for the transaction. A common source of problems with resource manager implementation is the double-reply: in an attempt to make developer lives easier, the C resource manager framework abstracts the concept of a reply, allowing handler functions to return in a way that tells the framework to handle the reply on behalf of the handler. However, if the programmer is unaware of this, or is just not careful, a handler can reply directly, while still instructing the framework to reply on its behalf. The Rust implementation addresses this problem by forcing replies (both successful and error indications) to go through the Transaction object. Once one of the reply*() variants, or the error() function, has been invoked, the transaction is marked as terminated, asserting on any attempt to call such a function a second time.
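
The state tracking can be illustrated with a small sketch. The State enum, the field names and the method signatures are assumptions; a real implementation would call MsgReply() or MsgError() where the comments indicate.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum State {
    Open,
    Replied,
    Failed,
}

struct Transaction {
    rcvid: i32, // the receive ID returned by MsgReceive()
    state: State,
}

impl Transaction {
    fn new(rcvid: i32) -> Self {
        Transaction { rcvid, state: State::Open }
    }

    /// Successful reply; asserts if the transaction was already completed.
    fn reply(&mut self, status: i32, payload: &[u8]) {
        assert!(self.state == State::Open, "transaction {} already completed", self.rcvid);
        // A real server would call MsgReply() here.
        let _ = (status, payload);
        self.state = State::Replied;
    }

    /// Error reply; subject to the same single-reply rule.
    fn error(&mut self, errno: i32) {
        assert!(self.state == State::Open, "transaction {} already completed", self.rcvid);
        // A real server would call MsgError() here.
        let _ = errno;
        self.state = State::Failed;
    }

    fn is_complete(&self) -> bool {
        self.state != State::Open
    }
}
```

With this shape, a handler that replies and then also reports an error trips the assertion on the second call, rather than sending a second reply on the same receive ID.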

Nodes

A node is any entity that is associated with a path. In the case of the GPIO resource manager, there are three types of nodes:

  1. the directory /dev/gpio;
  2. a per-GPIO pseudo file /dev/gpio/<PIN>, where <PIN> is a number between 0 and 63;
  3. a pseudo file /dev/gpio/msg.

The msg node provides the main programmatic interface to the resource manager, while the <PIN> nodes are useful for quick-but-limited read()/write() access to each GPIO pin, e.g., turning on a pin with the shell command echo on > /dev/gpio/17.

The GPIONode structure implements each of these node types. The implementation includes a stat() function, which can be used to fill a struct stat structure, as defined by POSIX.

Nodes are associated with paths via a HashMap container. This is a sub-optimal choice that was made for expediency in completing a first-cut version of the resource manager. Since nodes need to be owned by the hash table, while at the same time being provided to functions that have to access them, the nodes are stored using an Rc<> smart pointer. In this example, nodes are not updated once created, and therefore do not require interior mutability.
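
A minimal sketch of such a table, assuming a hypothetical Node type with a NodeKind enum (the real GPIONode carries more than this):

```rust
use std::collections::HashMap;
use std::rc::Rc;

#[derive(Debug, PartialEq)]
enum NodeKind {
    Directory,
    Pin(u8),
    Msg,
}

struct Node {
    kind: NodeKind,
}

/// Build the path-to-node table for the directory, the msg node and the pins.
fn build_nodes(pins: u8) -> HashMap<String, Rc<Node>> {
    let mut table = HashMap::new();
    table.insert("/dev/gpio".to_string(), Rc::new(Node { kind: NodeKind::Directory }));
    table.insert("/dev/gpio/msg".to_string(), Rc::new(Node { kind: NodeKind::Msg }));
    for pin in 0..pins {
        table.insert(format!("/dev/gpio/{}", pin), Rc::new(Node { kind: NodeKind::Pin(pin) }));
    }
    table
}

/// The table keeps its own reference; the caller gets a cheap clone of the Rc.
fn lookup(table: &HashMap<String, Rc<Node>>, path: &str) -> Option<Rc<Node>> {
    table.get(path).cloned()
}
```

Cloning the Rc only bumps a reference count, so the table remains the owner while handlers hold on to the node for as long as they need it.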

Sessions

A “session” is another new term for an existing concept that lacks a proper name. A session corresponds to a connection from the client to the server, as it exists between an _IO_CONNECT message (typically the result of a call to open()) and an _IO_CLOSE message (typically the result of a call to close(), and also sent automatically when the client exits). A session can only be established if the server accepts the _IO_CONNECT message, which it does based on various considerations, including client permissions and limits.

The GPIOSession structure, used by this particular resource manager, keeps track of the node that was opened, what access permissions were used, and, for a node which can be read from (the directory, the pin nodes), also the offset into the read data.

Session objects are kept in a HashMap container (again, a sub-optimal choice, but it will do for now), where the key consists of the process and connection IDs of the client. Like nodes, sessions need to be kept using Rc<> smart pointers, so that they can be found in the table but then carried around as references in various handler functions. Unlike nodes in the GPIO server, though, sessions can be mutated by handlers (for updating the offset value), and therefore the table stores the sessions in RefCell<> types, which allow for interior mutability.
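
The Rc<RefCell<>> arrangement might look like the sketch below. The Session fields, the SessionTable wrapper and the advance() helper are assumptions for illustration; the real GPIOSession also records the opened node and the access permissions.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

struct Session {
    path: String, // stand-in for the reference to the opened node
    offset: u64,  // read offset, updated by handlers
}

/// (process ID, connection ID) of the client.
type SessionKey = (i32, i32);

struct SessionTable {
    map: HashMap<SessionKey, Rc<RefCell<Session>>>,
}

impl SessionTable {
    fn new() -> Self {
        SessionTable { map: HashMap::new() }
    }

    /// Called from the connect handler; refuses a duplicate session.
    fn open(&mut self, key: SessionKey, path: &str) -> bool {
        if self.map.contains_key(&key) {
            return false;
        }
        let session = Session { path: path.to_string(), offset: 0 };
        self.map.insert(key, Rc::new(RefCell::new(session)));
        true
    }

    /// Called from every other I/O handler.
    fn find(&self, key: SessionKey) -> Option<Rc<RefCell<Session>>> {
        self.map.get(&key).cloned()
    }

    /// Called from the close handler.
    fn close(&mut self, key: SessionKey) -> bool {
        self.map.remove(&key).is_some()
    }
}

/// Advance the session offset through a shared reference and
/// return the offset at which this read started.
fn advance(session: &RefCell<Session>, nbytes: u64) -> u64 {
    let mut s = session.borrow_mut();
    let start = s.offset;
    s.offset += nbytes;
    start
}
```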

Hardware Access

Access to the Raspberry Pi GPIOs is provided, as usual, via a non-cached, strongly-ordered virtual mapping of the hardware registers. The RPiGPIO structure holds the resulting virtual address. The implementation of the structure provides functions for changing the roles of each pin, writing to output pins, reading input pins, etc. These are all achieved via straight-forward bit manipulations. The use of an initialized Pin structure in the argument to these functions avoids extra checks that the pin number is valid.
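
The idea of validating the pin number once, at construction, can be sketched as follows. The register arithmetic assumes the usual Broadcom layout (function-select registers holding ten pins of three bits each, set/clear/level registers holding one bit per pin), and a plain array stands in for the mapped registers. None of the names come from the actual RPiGPIO code.

```rust
const NUM_PINS: u8 = 64; // the post exposes pins 0 to 63

/// A pin number that has already been validated.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Pin(u8);

impl Pin {
    /// The only way to obtain a Pin, so later code need not re-check the range.
    fn new(number: u8) -> Option<Pin> {
        if number < NUM_PINS {
            Some(Pin(number))
        } else {
            None
        }
    }
}

/// Function-select field of a pin: (register index, bit shift).
fn fsel_field(pin: Pin) -> (usize, u32) {
    (usize::from(pin.0 / 10), u32::from(pin.0 % 10) * 3)
}

/// Set/clear/level bit of a pin: (register index, bit mask).
fn level_bit(pin: Pin) -> (usize, u32) {
    (usize::from(pin.0 / 32), 1u32 << (pin.0 % 32))
}

/// Update the three-bit function field, leaving the other pins untouched.
/// Seven words are enough to cover pins 0 to 63 in this model.
fn set_function(fsel: &mut [u32; 7], pin: Pin, func: u32) {
    let (index, shift) = fsel_field(pin);
    let mask = 0b111u32 << shift;
    fsel[index] = (fsel[index] & !mask) | ((func << shift) & mask);
}
```

Since fsel_field() can only ever be called with a valid Pin, the array index is in range by construction, which is the kind of extra check the paragraph above says is avoided.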

Server

Everything comes together with the GPIOServer structure, which includes:

  1. a channel for receiving messages;
  2. a node table to store the different nodes by their paths;
  3. a session table for established connections from clients;
  4. the mapped hardware registers;
  5. a vector of GPIO pin structures.

The server implements the IoMsgImpl trait, which is the interface between the top-level handlers and the resource-manager-specific ones. For example, the top-level handler for the _IO_READ message invokes the trait’s read() function, implemented by the server to read the value of an input node (if the current session is associated with a pin node), or the directory (if the current session is associated with the directory node). Each server handler ends either with a reply to the client (completing the transaction), or with an error return value, which the message loop handles by propagating it to the client. It is also possible for the handler to invoke the error() function of Transaction directly, though I find the pattern of returning an error clearer.
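
The pattern of returning an error and letting the caller propagate it can be sketched like this. Only the trait name IoMsgImpl and the read() handler come from the text; the signatures, the Txn stand-in, the FakePinServer type, the EBADF value and the "on"/"off" read format are assumptions.

```rust
const EBADF: i32 = 9; // illustrative errno value

/// Minimal stand-in for the Transaction type: records the single outcome.
struct Txn {
    outcome: Option<Result<Vec<u8>, i32>>,
}

impl Txn {
    fn new() -> Self {
        Txn { outcome: None }
    }
    fn reply(&mut self, data: &[u8]) {
        assert!(self.outcome.is_none(), "double reply");
        self.outcome = Some(Ok(data.to_vec()));
    }
    fn error(&mut self, errno: i32) {
        assert!(self.outcome.is_none(), "double reply");
        self.outcome = Some(Err(errno));
    }
}

/// Interface between the top-level handlers and the server-specific ones.
trait IoMsgImpl {
    fn read(&mut self, txn: &mut Txn, nbytes: usize) -> Result<(), i32>;
}

/// Top-level handler: an Err from the server becomes an error reply.
fn handle_read<S: IoMsgImpl>(server: &mut S, txn: &mut Txn, nbytes: usize) {
    if let Err(errno) = server.read(txn, nbytes) {
        txn.error(errno);
    }
}

/// Hypothetical server exposing a single input pin.
struct FakePinServer {
    opened_for_read: bool,
    level_high: bool,
}

impl IoMsgImpl for FakePinServer {
    fn read(&mut self, txn: &mut Txn, nbytes: usize) -> Result<(), i32> {
        if !self.opened_for_read {
            return Err(EBADF); // propagated to the client by handle_read()
        }
        let text: &[u8] = if self.level_high { &b"on\n"[..] } else { &b"off\n"[..] };
        let n = nbytes.min(text.len());
        txn.reply(&text[..n]); // success: the handler completes the transaction
        Ok(())
    }
}
```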

The connect() function creates a new session and adds it to the session table. Other I/O handlers find the session in the table based on the client’s information (and fail if no session is found). The session holds a reference to the node that the client requested (identified by its path), which requires a clone of the Rc<GPIONode> object.

On top of the standard I/O messages, the server also handles the generic _IO_MSG type. This type allows for messages that don’t fit nicely in the file abstraction. It is possible to use write() to pass arbitrary bytes that can be interpreted as structured messages, but that loses type cohesion. Various flavours of UNIX came up with devctl() and ioctl() for this purpose, but these attempt to solve a problem that doesn’t exist with QNX message passing to begin with: structured messages are built into the system.

The benefit of _IO_MSG is that it has just a small header that identifies its type (allowing the server to identify it alongside other I/O messages), and then leaves the payload to be defined by the implementation. The GPIO server defines messages for setting and getting the role of each pin, for writing to output pins and for reading from input pins. Future versions will catch up with the C version, to provide PWM and event registration.

Verdict

Is Rust a good choice for implementing resource managers? I will discuss the challenges I faced when implementing this resource manager, how they affected the design, and what trade-offs I see so far in moving from C to Rust.

elahav
http://membarrier.wordpress.com/?p=532
QNX Resource Manager in Rust: Message Passing and Resource Managers

Welcome to the first part of a series of posts on how to write a QNX resource manager in Rust. This post will, in fact, not discuss Rust at all, but rather provide some background information on resource managers.

Disclaimer

I am a Rust novice (or noob, as the cool kids would say). It is quite possible that the approach I have taken in implementing a resource manager is anywhere between sub-optimal and completely misguided. Comments from bona fide Rust experts (or would-be experts) are welcome, either on the blog post or in GitLab.

No talking! No new crap! Give us the code! Now!

If you just want to take care of business, go ahead and check out the code for the Rust Raspberry Pi 4 resource manager.

Message Passing

QNX is a micro-kernel based operating system. Most of the services provided by the operating system (file system, network stack, graphics, USB) are implemented outside of the kernel, in stand-alone user-mode processes. These processes expose their services (read a file, send a packet) via inter-process communication (IPC), implemented as synchronous messages. Here, “synchronous” means that the process requesting the service (the client), sends a message to the process providing the service (the server) and waits for a reply. The server receives the message, handles it, and then replies to the client. A client thread cannot send more than one message at a time, as it is blocked until the server replies.

A channel is a kernel object that a server provides for clients to connect to. A server can have more than one channel, allowing it to provide different types of services, or different quality of service, via different channels. A channel is created with the ChannelCreate() kernel call.

A connection is a kernel object that associates a client with a server’s channel. A client can (and almost always does) have multiple connections to different channels provided by different servers. For example, every process in a QNX system has a connection to the channel provided by the system manager for memory operations (such as mmap()), a connection to a file system that provides its own executable, as well as connections for files, sockets, the compositor (for graphical applications), etc. A client connects to a server’s channel with the ConnectAttach() kernel call.

Once a client has connected to a channel, it can use the connection ID to send messages via the MsgSend*() family of kernel calls (with variants for vectorized messages that support scatter-gather). The server gets these messages by calling MsgReceive() on the channel ID. When done processing the request, the server responds with a call to MsgReply(), which allows it to provide both a return status and a response payload.
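
For readers without a QNX system at hand, the send/receive/reply rhythm can be imitated with standard Rust channels. This is only an analogy running inside one process: the Request type and both functions are made up for the sketch, and nothing here uses the QNX kernel calls.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// A request carries its payload and the means to reply to the sender.
struct Request {
    payload: Vec<u8>,
    reply_to: Sender<Vec<u8>>,
}

/// Start a server thread; returns the "connection" and a handle that
/// yields the number of requests handled once all clients are gone.
fn spawn_server() -> (Sender<Request>, thread::JoinHandle<usize>) {
    let (tx, rx) = channel::<Request>();
    let handle = thread::spawn(move || {
        let mut handled = 0;
        // Receive step: blocks until a client sends something.
        for req in rx {
            let mut reply = req.payload;
            reply.reverse(); // the "service" provided by this toy server
            // Reply step: unblocks the waiting client.
            let _ = req.reply_to.send(reply);
            handled += 1;
        }
        handled
    });
    (tx, handle)
}

/// Send step: the caller is blocked until the server replies.
fn send_sync(server: &Sender<Request>, payload: &[u8]) -> Option<Vec<u8>> {
    let (reply_tx, reply_rx) = channel();
    server
        .send(Request { payload: payload.to_vec(), reply_to: reply_tx })
        .ok()?;
    reply_rx.recv().ok()
}
```

The analogy covers only the blocking behaviour. Unlike kernel message passing, it copies the data through the heap and involves no connection or receive IDs.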

QNX messages are just raw bytes. The micro-kernel does not interpret the data being sent. It is up to the client and the server to agree on a protocol that gives meaning to these bytes as requests for service and responses from this service. For example, the C library’s getppid() function (get the ID of the parent process) is implemented as a message to the system manager’s channel, where the first two bytes have the value 0x13 (_PROC_GETSETID) and the next two bytes have the value 0x8 (_PROC_ID_GETID_NO_CRED). The system manager interprets the first two bytes as the message’s type and the next two bytes as the sub-type. It then replies with a structure, where the 4 bytes starting at offset 8 are the parent process’ ID. This is, of course, a trivial example, as messages can be much more complicated and much, much longer (up to 256 petabytes, at least in theory…).
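
The getppid() exchange described above amounts to the following byte handling. The sketch covers only the four header bytes and the reply offset mentioned in the text; the function names are made up, and the rest of the real structures is deliberately left out.

```rust
const PROC_GETSETID: u16 = 0x13; // message type, as given above
const PROC_ID_GETID_NO_CRED: u16 = 0x8; // message sub-type, as given above

/// The four header bytes of the request: type, then sub-type.
fn build_request() -> [u8; 4] {
    let t = PROC_GETSETID.to_ne_bytes();
    let s = PROC_ID_GETID_NO_CRED.to_ne_bytes();
    [t[0], t[1], s[0], s[1]]
}

/// Pull the parent process ID out of the reply: 4 bytes at offset 8.
fn parse_ppid(reply: &[u8]) -> Option<i32> {
    let b = reply.get(8..12)?;
    Some(i32::from_ne_bytes([b[0], b[1], b[2], b[3]]))
}
```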

Resource Managers

QNX native message passing is powerful and efficient, but it requires clients and servers to agree on a few details. In particular, the client needs to know the process and channel IDs of the server in order to connect to it, and both need to know the semantics of the messages and the replies. A Resource Manager helps a client and a server talk to each other by

  1. providing a path that can be used for finding the server and its channel,
  2. handling a set of pre-defined messages that correspond to POSIX file I/O operations.

In QNX, the path space is decentralized (a design, I believe, originating with Plan 9). A kernel component called the Path Manager maintains a tree of nodes, each corresponding to a resource manager that has attached its channel to a given path. For example, a serial driver can attach to /dev/ser1, while a file system with two partitions can attach both to / and /home. When a process calls open() (or any other function that takes a path), it first sends a message to the path manager, which resolves the path to a process ID and a channel ID. The process then calls ConnectAttach() to connect to the channel found by the path manager.
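
As a rough model of the resolution step, the sketch below picks the attachment with the longest matching prefix. The Mount type, the resolve() function and the matching rule are simplifying assumptions; the real path manager is considerably more involved than this.

```rust
/// One attached resource manager (hypothetical type).
struct Mount {
    prefix: &'static str,
    pid: i32,
    chid: i32,
}

/// True if `prefix` covers `path` on a path-component boundary.
fn matches_prefix(prefix: &str, path: &str) -> bool {
    if prefix == "/" {
        return path.starts_with('/');
    }
    match path.strip_prefix(prefix) {
        Some(rest) => rest.is_empty() || rest.starts_with('/'),
        None => false,
    }
}

/// Resolve a path to the attachment with the longest matching prefix.
fn resolve<'a>(mounts: &'a [Mount], path: &str) -> Option<&'a Mount> {
    mounts
        .iter()
        .filter(|m| matches_prefix(m.prefix, path))
        .max_by_key(|m| m.prefix.len())
}
```

The client would then use the resulting process and channel IDs to connect to the server, as described above.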

The second benefit of a resource manager is that it understands the POSIX I/O messages, whose format is common to all resource managers. For example, the _IO_WRITE message includes a payload of bytes to send to the server, while the _IO_READ message includes the number of bytes requested from the server. Different resource managers interpret these requests differently (e.g., read from a file, write to a serial device). The C library then implements the POSIX write() function as a thin wrapper around MsgSend() with the _IO_WRITE message type, the read() function with the _IO_READ message type, the fstat() function with the _IO_STAT message type, etc.

Finally, resource managers add a session layer on top of message passing. After a client connects to a server with a call to ConnectAttach(), it must send an _IO_CONNECT message first to the resource manager in order to establish a session. This allows the server to check for access permissions and create an open control block (OCB) for the connection. The session exists until the client sends the _IO_CLOSE message. All messages sent on this connection in between are considered as part of the session. For example, if the resource manager associates a file offset with the OCB, then all messages on this connection are subject to the same offset. In essence, the session turns a connection into a file descriptor.

Almost all drivers and services provided by QNX are implemented as resource managers, in accordance with the UNIX mantra of “everything is a file” (except for all the things that are not). The C library provides an infrastructure for writing resource managers, as described in detail here. For a concrete example, see my Raspberry Pi GPIO resource manager. This will also serve as a way to compare the C code with the Rust version, as both provide the same protocol and functionality (or, at least, will provide at some point in the future).

Next, Rust…

The next blog post will discuss how these concepts were translated into Rust, and what obstacles were encountered in trying to move from C to Rust.

elahav
http://membarrier.wordpress.com/?p=520
It’s a QNX system, I know this!

QNX is billed as a real time operating system, or RTOS. The same term is used to describe systems such as FreeRTOS, RT-Thread, Zephyr, PX5 and many more, which suggests that these operating systems are interchangeable. Unfortunately, that is a common misconception that leads to much confusion. In this post I will attempt to explain the main differences among the various systems that fall under the RTOS umbrella.

Important: At no point will I try to suggest that these differences make one OS better than another. Each of these has its place, with its own target hardware, software and audience. But if you are trying to pick a RTOS, you should be aware of the very different nature of each OS.

Hardware: From micro-controllers to server-class monsters

The choice of hardware can immediately limit the choice of RTOS, based on the level of support (if any) that the RTOS provides for the chosen board. While some people claim that you should choose your hardware based on your choice of software, rather than the other way around, things don’t always work this way, and often a project starts with a pre-selected board.

At the very low end of the spectrum we find micro-controllers with very little RAM and code space. These used to be so restricted that there was never any discussion of an operating system (think 8-bit PIC with less than 1K of memory), but modern offerings are much more capable. These run 32-bit processors and have hundreds of kilobytes to a few megabytes of RAM. Such hardware is not far off from the PCs of the early 90s, and is complex enough to warrant an operating system to manage multiple tasks.

Memory Protection

The important thing to realize with these micro-controllers is that almost none of these has a memory management unit (MMU), or even a memory protection unit (MPU). Consequently, there is no notion of process separation, or even user/kernel separation, and all code executes within the same space. Here is a little nugget from the source code of one of the RTOSs mentioned above, taken from the documentation of one of the scheduler’s functions:

Please do not invoke this function in user application.

The fact that it is possible to invoke an internal kernel function from a user application tells us a lot about this system: it means that any code executing on such a system has to be trusted. A bug in any of the applications can affect the entire system, and malicious code has no problem getting access to every bit in memory.

More capable systems, with 32-bit (and now 64-bit) processors are available with a memory protection unit (MPU). MPUs allow proper separation of user mode from kernel mode, and the isolation of user tasks, but are typically limited and much less flexible than full MMUs. In particular, MPUs tend to segment a single address space instead of supporting multiple address spaces, which makes it hard to support POSIX process semantics.

Full MMU support started on expensive mainframes and gradually made its way to cheaper and cheaper systems. Today every ARM Cortex-A processor (and some Cortex-R as well) carries a full MMU, including the one on the sub-C$20 Raspberry Pi Zero 2 (that’s US$4000, with tariffs). Nevertheless, not all RTOSs that support these processors make use of the MMU, so make sure you understand what level of isolation is provided (if that is important to you).

Multi-Processors

Like MMUs, multi-processors have made their way from very high-end systems to everyday electronics. Today, practically all smartphones (and smartwatches) sport more than one processing unit (or core) in a system on chip (SoC). Many low-end micro-controllers also have more than one core, but there is usually a significant difference between these and the kind of systems you find on phones, tablets, PCs and server machines.

Symmetric Multi-Processing (SMP) is an abstraction presented by a combination of hardware and software, in which threads can be assigned (mostly) seamlessly to different processing elements, with little to no intervention from the application code. The operating system scheduler chooses the unit on which a thread executes, and can also migrate a thread from one unit to another. The symmetry arises from the assumption that any thread can run on any unit and there is no material difference among these. This is no longer completely true on many modern systems which have different classes of processing units with different characteristics, but it is still assumed that most threads can be migrated even among such different cores.

Many modern micro-controllers also feature multiple processing units, but these do not provide SMP: typically there is no cache coherency across the cores, nor an implementation of multi-processor atomic operations. An operating system can still implement SMP at the software level, but at a much higher cost in time and complexity. As far as I can tell, the hardware vendors intend each core to have its own software stack, with, potentially, some form of communication channel among these, but not symmetric multi-processing.

Does SMP matter? The appeal of creating more threads for certain workloads and having these magically distributed to multiple cores is definitely something people are looking for. But SMP also comes at a cost, both in complexity and, especially, latency, which is critical to RTOSs.

QNX Hardware Requirements

QNX8 is a 64-bit only operating system, and requires a full MMU. These are conscious design choices. A micro-kernel design without a MMU (or at least a MPU) is, in my opinion, pure overhead, that does not result in any real benefit. It is the ability of the operating system to have key features (file system, network stack, graphics) each in its own isolated process, which provides the safety and security guarantees of such a design.

While I was able, not too long ago, to boot QNX on a system with 8MB of RAM (using QEMU, as there is no suitable hardware), the system is designed for boards with multiple gigabytes of RAM. QNX assumes a complex system with many processes, each with many threads, all requiring heap and stack memory. Modern workloads are very memory intensive (just open a web browser and check its memory footprint).

Finally, QNX8 was designed with SMP in mind, emphasizing scalability (sometimes at the cost of single-core operations). The system can be deployed on boards with up to 64 processing units and up to 16TB of RAM. This is not an operating system designed for micro-controllers.

Software: POSIX and the promise of compatibility

POSIX emerged in the 1980s as an attempt to create a standard interface for operating systems, based on some common grounds across multiple flavours of UNIX. As a standard it is far from perfect: much of the interface is quite clearly an attempt to formalize the behaviour of existing systems at the time, rather than a rational approach to operating system design. It has also fallen behind the times in certain areas.

Nevertheless, POSIX is a popular choice when it comes to picking an interface for an operating system. Many systems have a native API, but then provide some compatibility layer that implements the POSIX interface. For operating systems with a small user base, the allure of POSIX is that it opens up a world of third-party software that was written to run on other systems. Otherwise, there is little hope of establishing a rich-enough software stack for such systems.

POSIX compliance is a well-defined term. It means that a system has passed a suite of tests from the body governing POSIX (the Open Group), and has been awarded a certificate of compliance by that body. Nevertheless, the certification process is expensive and, at times, misses the point (more on that later), and many systems eschew it in favour of a vague statement regarding “POSIX compatibility”.1

But what kind of compatibility does POSIX provide? To understand that we first need to look at the various flavours of POSIX. POSIX is a common standard, but certification is based on different profiles. At the top level, a system can be certified either to the base specification, or to one of the embedded profiles. Within the embedded space, the Open Group provides 4 different profiles, PSE51, PSE52, PSE53 and PSE54. The base specification profile is quite lax, and allows a system not to support significant parts of the standard, while still being declared as conforming. For example, a system currently listed as compliant does not implement memory mapped files or the posix_spawn() call.

The embedded profiles are stricter. A system that chooses one of these is expected to pass all tests provided by the Open Group for that profile. These profiles are positioned on a non-linear scale. The differences among PSE51, 52 and 53 are relatively small, while 54 represents a major jump to what can be considered a full UNIX-like operating system. There are currently two systems listed as complying with PSE52, and none with any of the other profiles. It should be realized that PSE52 does not require the notion of a process, nor a complete file system. In fact, PSE52 assumes that the system cannot compile the conformance tests by itself.

QNX used to be certified to PSE52, but that was dropped as it is really meaningless for a system that provides a full operating system. Some years ago we attempted to get certification for PSE54 and came tantalizingly close (about 97% of the tests passing). That required quite a bit of work on areas that are really of no importance to a modern embedded operating system, such as the exact behaviour of terminals, originally designed to work with mainframes (true story: I spent an inordinate amount of time getting collate sequences in Czech working). In the end it was never a priority to finish this task, due to diminishing returns. To the best of my knowledge, no operating system is certified to PSE54.

Does POSIX compatibility matter? I believe it does only as far as it allows the much larger corpus of software written for Linux and the various *BSDs to run (perhaps with some minor modifications) on your target operating system. Note that Linux itself is not POSIX compliant, but it is close enough to allow the porting of third-party software written for Linux to FreeBSD, QNX or Haiku. On the other hand, if a system only implements the interface required by PSE52, or, as some RTOSs do, has a wrapper for the pthread API around its native threading, then there is no hope for real portability.

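Here’s a quick litmus test (error checks have been omitted for brevity). On a system with full POSIX process and shared memory semantics, the child prints both of the parent’s messages, and the parent then prints “Hello from child” for the shared mapping but still “Hello again from parent” for the private one, because a write to a MAP_PRIVATE mapping in the child is not visible to the parent. If the system you are using doesn’t return these results, then it may not be as portable as advertised: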

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <semaphore.h>
#include <sys/mman.h>
#include <sys/wait.h>

static struct
{
    sem_t sem;
    char msg[128];
} *shared, *private;

int
main(int argc, char **argv)
{
    int fd = shm_open("posix_litmus", O_RDWR | O_CREAT | O_TRUNC, 0600);
    ftruncate(fd, sizeof(*shared));

    shared = mmap(0, sizeof(*shared), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    snprintf(shared->msg, sizeof(shared->msg), "Hello from parent");

    private = mmap(0, sizeof(*private), PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    snprintf(private->msg, sizeof(private->msg), "Hello again from parent");

    sem_init(&shared->sem, 1, 0);

    pid_t pid = fork();

    if (pid == 0) {
        printf("Parent shared message: %s\n", shared->msg);
        printf("Parent private message: %s\n", private->msg);
        snprintf(shared->msg, sizeof(shared->msg), "Hello from child");
        snprintf(private->msg, sizeof(private->msg), "Hello again from child");
        sem_post(&shared->sem);
        return EXIT_SUCCESS;
    }

    sem_wait(&shared->sem);
    printf("Child shared message: %s\n", shared->msg);
    printf("Child private message: %s\n", private->msg);

    waitpid(pid, NULL, 0);

    return EXIT_SUCCESS;
}

Conclusion

Modern RTOSs running on tiny micro-controllers can achieve a lot, and may be more than adequate for your project. Just be aware of what each one provides in terms of compatibility, safety, security, latency and scalability. Each combination of hardware and software has its own strengths and weaknesses. And remember, not every system that advertises itself as POSIX allows you to say “I know this!”.2

  1. Technically, the term POSIX is a trademark that cannot be used unless a system has been certified by The Open Group. ↩
  2. This is my very first attempt at OpenGL programming. Still feeling my way around. ↩

elahav
http://membarrier.wordpress.com/?p=417
Memory Management: Changes in QNX8
Uncategorized, memory manager, qnx

Along with a new micro-kernel, QNX 8 ships with a new memory manager. This is the fourth incarnation of this component since the introduction of the QNX Neutrino operating system. To understand the changes, let’s examine the previous version of the memory manager.

Memory Management in QNX 7

The memory manager shipped with QNX 7 is the result of a project started around 2012. The goal was to produce a memory manager that is optimized for on-demand paging. While some form of on-demand paging was supported in QNX 6.x, it was more of an add-on to an existing design, rather than a first-class citizen.

Recall that on-demand paging means that physical memory is not (typically) allocated at mmap() time. Instead, the mmap() function just records what kind of memory needs to back a new range of virtual addresses. When a virtual address in this new range is first referenced (read from or written to), a page fault occurs, which causes the memory manager to allocate a physical page, initialize it with the right contents, and install a new page table entry to reflect the virtual-to-physical translation.

On-demand paging provides a few desirable properties:

  1. much cheaper mmap() calls, since mmap() does not need to allocate memory, nor manipulate page tables;
  2. copy-on-write for fork(), which is especially beneficial if fork() is followed by exec*() and the cloned address space is discarded;
  3. the ability to update virtual-to-physical translations at run time, which enables features such as page stealing and swapping;
  4. lower memory footprint, in case a process does not reference all of the memory it has asked for with calls to mmap().

The memory manager shipped with QNX 7 provides these features, and yet we replaced it in QNX 8. Why?

What’s Wrong with On-Demand Paging?

Before we move on, I would like to emphasize that the criticism of on-demand paging expressed below should be taken in context. Specifically, the context is for a low-latency real-time operating system in safety applications. On-demand paging has proved successful in other environments for decades, and I am not advocating for its replacement there.

Over-Committing Memory

Let’s start from the last point: lower memory usage. This is true only when a process allocates memory via mmap(), but does not use all of it. If a process uses all the memory it has allocated, then on-demand paging does not have an advantage in terms of memory usage.1 These perceived savings only matter if the sum total of allocations exceeds the amount of memory available in the system. In such a case, the system over-commits memory: it has promised to processes that they can have more memory than can be had. As long as these processes never (at once) use all the memory promised, then all is well, but if they happen to do so then they will start failing (manifested as a SIGBUS signal on reading or writing memory).

Linux (and other systems) have no problem over-committing memory. QNX has made the decision not to do that: if a mmap() call succeeds, then, regardless of on-demand paging, the process cannot fail to access memory it was promised. This is important to safety systems, which need to ensure that critical processes cannot fail arbitrarily due to changes to the state of the system. As long as these processes allocate their resources up-front, these resources are guaranteed to be available.

How do you implement on-demand paging without over-committing? At mmap() time, even though physical memory is not allocated, the system does account for it, using a reservation. Reserving memory is a simple matter of subtracting the number of requested pages from the total available. A call to mmap() fails if it cannot reserve memory.2
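The reservation idea can be sketched as a toy model. This is only an illustration of the accounting described above, not the QNX implementation: the struct mem_account type and the reserve_pages() and unreserve_pages() functions are hypothetical names, and the model ignores the complications mentioned in the footnote (shared mappings, file-backed mappings, swap).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for a policy that never over-commits. */
struct mem_account {
    size_t total_pages;     /* physical pages managed by the system */
    size_t reserved_pages;  /* pages already promised to mappings */
};

/* Models mmap() time: succeed only if the promise can always be kept. */
bool
reserve_pages(struct mem_account *acct, size_t npages)
{
    assert(acct->reserved_pages <= acct->total_pages);

    if (npages > acct->total_pages - acct->reserved_pages) {
        return false;   /* the real call would fail, e.g. with ENOMEM */
    }

    acct->reserved_pages += npages;
    return true;
}

/* Models munmap() time: the promise is returned to the pool. */
void
unreserve_pages(struct mem_account *acct, size_t npages)
{
    assert(npages <= acct->reserved_pages);
    acct->reserved_pages -= npages;
}
```

Note that nothing in the model depends on whether the reserved pages have been allocated yet, which is why a reservation can block a later request even when plenty of physical memory is still untouched.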

The reservation scheme solves the SIGBUS problem, but it negates any benefits of on-demand paging with respect to memory usage. The fact that physical memory has not been allocated does not mean it is available to a mmap() call that needs more memory than the system can provide.

Time to Map

On-demand paging speeds up the mmap() call, but at the cost of later page faults that perform the bulk of the work. The overall time taken by mmap() plus handling these faults is actually higher than performing the same work up front, due to:

  1. the cost of page fault handling (which is higher on more sophisticated hardware);
  2. loss of optimizations available when allocating and setting up page tables for larger chunks of memory.

From a performance point of view, on-demand paging benefits the same misbehaving programs that show lower memory usage, i.e., those that map considerably more memory than what they need.

Things get worse when trying to avoid over-committing memory. Even though the mmap() call does not allocate physical memory nor install last-level page table entries, it does need to ensure that all meta-data and any higher-level page table entries are in place for the mapping. Otherwise, a page fault can fail to resolve. This explains why the map-without-access value for QNX 7 with on-demand paging is higher than Linux’s, as the latter does nothing other than carve the virtual address space at mmap() time.

Latency

Page faults introduce variability in execution time, which can be quite high. For time-sensitive, real-time applications, such variability is unacceptable. Consequently, real-time applications tend to ensure all memory is backed at mmap() time (using the POSIX mlock() or mlockall() calls), which disables on-demand paging.

For safety systems, QNX recommends that the entire system disables on-demand paging, a feature known as superlocking. This is the equivalent of calling mlockall(MCL_CURRENT | MCL_FUTURE) on each process when it is created. In such a system, all the data and time spent on supporting on-demand paging is pure overhead.

Swapping

One of the major features of the memory manager shipped with QNX 7 is support for swap devices. Swap allows the system to behave as though there is more RAM than what is available, by stashing the contents of non-file-backed pages on the swap device, allowing these pages to be stolen and reused. We tried to use this feature on the BB10 phones circa 2013 and the results were not good. Page stealing is always costly, as it requires expensive page table manipulation. But when it also involves writing the contents of a page, and then reading them back when restoring the page, page stealing performs much worse.

Swap devices typically fall into one of two categories: storage-based, and RAM-based. The latter is feasible when using compression, as the amount of space in memory taken by page contents on the swap device is expected to be much lower than the size of the page. This is only true, however, if the page contents can be compressed efficiently. On modern systems, a considerable part of memory is taken by data that does not compress well (because it is already compressed, like MP3 files or JPEG images), or by data that cannot be moved into swap (graphics surfaces).

On the other hand, the storage devices on embedded systems tend to have a limited number of write cycles, and are subject to excessive wear if used for swap.

Enter QNX8

QNX8 features a brand new micro-kernel design. During the work on the new kernel I struggled with the question of how to incorporate support for on-demand paging. Handling recoverable page faults that occur in the context of user-mode execution is relatively straight-forward (and even easier with the new design than with the old one). However, handling such faults in the context of the kernel (i.e., faulting on user addresses inside kernel calls) is much harder.

Since the primary focus of QNX these days is on safety systems, and since we recommend superlocking on these systems anyway, it occurred to me that we may not need on-demand paging at all. I wanted to see what gains can be had by dropping support for on-demand paging from the memory manager, simplifying its design, and allowing for greater optimization opportunities.

Support for on-demand paging in the memory manager has resulted in a single-page oriented design: all code paths are meant to deal with one page at a time. While there are some opportunities for batching operations, they still end up doing quite a bit of work for each page individually, simply because of the way the code is structured. When applying superlocking, these operations are repeated for each page in every mmap() call: carve a virtual address range, allocate meta-data structures, allocate physical memory, initialize the memory, set up page-table entries. A single-page oriented design ends up doing these individually per page instead of each operation on the range.

At this point, I wish I could claim that I invented some sophisticated algorithms for memory management that improve performance dramatically. The mundane reality, though, is that I simply did the following:

  1. Remove any code and data structure required only for the support of on-demand paging, page stealing and swapping.
  2. Consolidate and optimize loops to facilitate the batching of all operations involved in a mmap() call.

The results of changing the memory manager to a range-oriented design were beyond what I had hoped for. The memory manager is much simpler (and thus easier to analyze for safety), and is much faster for any mmap() operation that involves more than one page. The graph below shows the time it takes to map a 1MB region using anonymous memory (i.e., where the source physical backing is anywhere in allocatable RAM and where the memory is zero-initialized). The tests were conducted on a SolidRun Honeycomb board, with 16 ARMv8 A72 cores at 2GHz. Not only is QNX8 faster than QNX7, it is also faster than Linux, if you assume that the memory is actually used, not just mapped.3

The change didn’t just have an effect on micro-benchmarks, though. My Dell desktop (Gen 12 Intel i7) boots to a browser displaying a page in 700ms vs close to 3 seconds with on-demand paging. On a SolidRun Honeycomb board, parallel compilation (using make -j) now takes advantage of up to 12 cores, where before it peaked at 4.

This change to the memory manager does not remove any functionality, other than support for swap devices.4 The system still supports anonymous, physical and file-backed memory objects, shared and private mappings, the handling of fork(), exec*() and posix_spawn(), etc. It just does it with all physical memory backed at mmap() time, and with this backing remaining constant until it is unmapped.

So what are the downsides?

Any program that is written to ask for significantly more memory than it needs suffers from superlocking. While QNX 8 is much faster in the superlocking case than QNX 7, it is slower compared with on-demand paging when a program asks for 1GB of memory and touches 1MB. This is actually quite common in the Linux world (I’ve seen VS Code reporting memory usage of over 1TB, which clearly it doesn’t use). The answer to the problem, especially in the context of safety systems, is “don’t do that”. Your program either needs this much memory, or it doesn’t. And yes, the analysis may be hard, but it needs to be done.

A special case of this problem is stacks. The default thread stack size on QNX is 256KB. Traditionally, stacks are allocated lazily, and C/C++ programmers are not used to analyzing and specifying the required stack sizes (which sometimes are not even known, as in the case of recursion controlled by some program state). Without on-demand paging the stacks are fully backed at thread-creation time, with a one-size-fits-all allocation of 64 pages. With the proliferation of threads, this has an impact both on memory usage and thread creation time. For safety systems we can still make the claim that you should analyze your code to determine the maximum stack size for each thread, and then create the thread with the appropriate stack size. For non-safety systems this is a much harder argument to make, especially in terms of cost/benefit.
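Creating a thread with an explicit stack size takes only a few lines with the standard pthread attribute calls. The following is a minimal sketch: run_with_stack() and set_flag() are hypothetical names, and the size passed in is assumed to come from your own analysis of the thread's code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/*
 * Run fn(arg) on a new thread created with a stack size attribute of
 * 'stack_size' bytes, and wait for it to finish. Returns 0 on success or
 * an error number from the pthread calls.
 */
int
run_with_stack(size_t stack_size, void *(*fn)(void *), void *arg)
{
    assert(fn != NULL);

    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0) {
        return rc;
    }

    /* May fail with EINVAL if the size is below PTHREAD_STACK_MIN. */
    rc = pthread_attr_setstacksize(&attr, stack_size);
    if (rc == 0) {
        pthread_t tid;
        rc = pthread_create(&tid, &attr, fn, arg);
        if (rc == 0) {
            rc = pthread_join(tid, NULL);
        }
    }

    pthread_attr_destroy(&attr);
    return rc;
}

/* Example worker: stores 1 through its argument to show that it ran. */
void *
set_flag(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}
```

A call such as run_with_stack(128 * 1024, set_flag, &flag) then backs 128KB for that thread rather than the default.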

Copy-on-write is no longer available for fork(), which means that fork() followed by exec*() does too much: all private mappings are copied at fork() time, and then lost as a new address space is created. This can be solved by replacing fork()+exec*() with posix_spawn().
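As a rough sketch of that replacement, the hypothetical helper below starts a program with posix_spawnp() and waits for it, without ever cloning the caller's address space. It assumes the program can be found via PATH and that the child should inherit the caller's environment.

```c
#include <assert.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Start the program named by argv[0] and wait for it to exit. Returns the
 * child's exit status, or -1 if it could not be started or did not exit
 * normally.
 */
int
spawn_and_wait(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid;
    if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ) != 0) {
        return -1;
    }

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status)) {
        return -1;
    }

    return WEXITSTATUS(status);
}
```

The two NULL arguments are the file actions and spawn attributes, which cover most of what programs usually do between fork() and exec*(), such as closing or redirecting file descriptors.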

Will on-demand paging make a comeback in QNX8? Maybe, though probably in a limited way (perhaps only for stacks). For now, however, the benefits of the new memory manager appear to outweigh its shortcomings considerably, especially when you look at complete, real-world systems.

  1. There are still ways in which on-demand paging can save on memory, such as with copy-on-write pages that are only read but not written. ↩
  2. Reservation is actually more complicated, as shared mappings only need to be reserved once, file-backed read-only mappings do not require reservation as they can be replaced, swap space needs to be taken into account, etc. ↩
  3. Results are for Linux 5.10. I have also tested the latest kernel at the time the post was written, which is 6.12. It shows somewhat worse results than 5.10, but not enough to change the overall picture. ↩
  4. One interesting side-effect of superlocking is that writes to pages are not detected, which necessitates treating all shared-writable pages as dirty. While not technically incorrect, this limitation restricts the usefulness of such mappings. ↩

elahav
http://membarrier.wordpress.com/?p=367
QNX Raspberry Pi Book
Uncategorized

It took a while, but QNX 8 is now free for non-commercial use. Along with access to the software development platform (SDP), users get a Raspberry Pi 4 image, which serves as the basis for learning about the OS, prototyping, research activity and hobbyist projects. If you want to get acquainted with QNX, you can try the book I wrote for use with this image. Source code, PDF and HTML versions are available here.

elahav
http://membarrier.wordpress.com/?p=361
Memory Management: Shared-Memory Objects
Uncategorized, memory manager, os, qnx

In the post “Virtual Memory”, we saw how each process has its own view of memory, and how processes are isolated from each other. We also saw how access controls can be used to allow for the safe sharing of data in various scenarios, such as:

  1. code and data sharing for executables and libraries;
  2. a process publishing data that can be used by other processes;
  3. a cheap form of inter-process communication.

In this post we will take a closer look at shared-memory objects as defined by POSIX, as well as various extensions implemented by QNX. Such an object represents a subset of the system’s memory that has been allocated for it. It is created at runtime, populated with memory, and can be accessed by multiple processes. The object exists until all references to it are removed, or until the system is shut down. Shared-memory objects are used in scenarios 2 and 3 above, while scenario 1 makes use of file-backed memory.

Opening an Object

The shm_open() function is used to establish a connection between a process and a shared memory object. The function looks and feels very much like an open() call, but it is important to note that the string argument it accepts is not a path, but just a name that can be separated by slashes. POSIX leaves it open as to how this name is interpreted, except that two calls with the same name that starts with a slash must result in connections to the same underlying object:

fd1 = shm_open("/my/shared/memory", O_RDWR);
fd2 = shm_open("/my/shared/memory", O_RDONLY);

In this case fd1 and fd2 must be file descriptors connected to the same object, even when the call is made in different processes. On the other hand, the following calls make no such guarantee:

fd1 = shm_open("another/shared/memory", O_RDWR);
fd2 = shm_open("another/shared/memory", O_RDONLY);

A new object can be created with the O_CREAT flag. The object has the same user and group ID as the creator and permission bits as specified by the call. These provide the creator with a way to control (albeit coarsely) which processes can open the object and in which mode. For example, the following call creates a new object which can be read and written by processes with the same user ID as the caller, read by processes with the same group ID, and is inaccessible to all other processes:

fd = shm_open("/my/shared/memory", O_RDWR | O_CREAT, 0640);

On QNX, and a few other systems, the special name SHM_ANON can be used to create an object without a name. Such an object can still be shared with other processes in ways that will be described below. This API provides the benefits of avoiding potential collisions in the name space, and of not exposing the object to processes that do not need to know about it.

While POSIX makes it clear that shared-memory object names need not appear in the path space, QNX provides access to all named objects under the /dev/shmem prefix. While this is a useful feature for debugging (such as using cat on the command-line to dump the contents of a shared memory object), it has also led to some confusion, as people treat /dev/shmem as a cheap, memory-resident file system. It is not.

Operations

Populate

A shared memory object is created empty: no memory is yet associated with it. The object needs to be sized and populated before anything useful can be done with it. The ftruncate() call sets a new size for the object, and should be called at least once before the object is used.

fd = shm_open("/my/shared/memory", O_RDWR | O_CREAT, 0640);
ftruncate(fd, 64 * 1024UL);

While the ftruncate() call sets a new size on the object, it may not populate the object with memory just yet. The implementation depends on the memory manager. On QNX, for reasons that will be explained in a future post, it does, but on other systems the object may be populated on demand, as various locations within the object are accessed.

The physical memory used to populate the object has no special characteristics. It can come from anywhere in RAM and need not be contiguous. Sometimes, however, we need to create objects with more control over the backing memory. QNX provides the shm_ctl() call as an alternative to using ftruncate() to populate a shared-memory object. By specifying different flags to the function we can create objects that are:

  • physically-contiguous;
  • backed by a specific range of physical addresses;
  • populated from typed-memory.

Such shared-memory objects are better suited for interaction with hardware. The mechanism allows for such memory to be shared in a controlled manner. For example, a DMA buffer from a typed-memory pool controlled by a network stack can be exposed to a non-privileged process for writing, without giving that process general access to the pool.

A special flag to shm_ctl() allows for an object to be sealed, which means that its physical layout can no longer be changed (and in particular it cannot be resized). The object remains writable by any process that has a suitable file descriptor. Sealing protects processes that have already mapped the object from faults caused by changes to the physical layout.

Map

The primary operation on a shared memory object is to map it into the process’ address space. Mapping exposes the object to the process as a range of virtual addresses, which can then be accessed via load and store instructions. As mentioned in the post about the mmap() call, the object can be mapped as shared, in which case any writes via store instructions are reflected in the underlying object and are made visible to other mappings of the same object. Likewise, mapping as shared means that any updates to the object via other shared mappings are seen by this mapping. The object can also be mapped as private, in which case the mapping reflects a snapshot of the object’s contents, but is otherwise disconnected from the object itself.

A typical mapping of a shared memory object looks like this:

ptr = mmap(0, 64 * 1024UL, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

It is also possible to map just a portion of the object, by setting the offset and size arguments, as discussed in a previous post.
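The difference between shared and private mappings can be checked with a small experiment. The sketch below uses a hypothetical function, compare_mappings(), which maps one object twice as shared and once as private, writes through one shared mapping and through the private one, and reports whether the writes ended up where the description above says they should.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 0 if a write through a shared mapping is visible through a second
 * shared mapping while a write through a private mapping is not, 1 if the
 * mappings behave differently, and -1 if any call fails.
 */
int
compare_mappings(const char *name, size_t len)
{
    assert(name != NULL && len >= 2);

    int fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1) {
        return -1;
    }
    shm_unlink(name);           /* the descriptor keeps the object alive */

    int result = -1;
    if (ftruncate(fd, (off_t)len) == 0) {
        char *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        char *b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

        if (a != MAP_FAILED && b != MAP_FAILED && p != MAP_FAILED) {
            a[0] = 'S';         /* reaches the object itself */
            p[1] = 'P';         /* stays in the private mapping's copy */
            result = (b[0] == 'S' && b[1] == 0 && p[1] == 'P') ? 0 : 1;
        }

        if (a != MAP_FAILED) munmap(a, len);
        if (b != MAP_FAILED) munmap(b, len);
        if (p != MAP_FAILED) munmap(p, len);
    }

    close(fd);
    return result;
}
```

The check relies on a newly sized object reading as zeros, which is why b[1] is compared against 0.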

Destroy

There is no explicit way to destroy a shared-memory object. Instead, the lifetime of such an object is controlled by references to it from three sources:

  1. its name (if it has one);
  2. any open file descriptors;
  3. any shared mappings.

The shm_unlink() function deletes the name of the object, such that it can no longer be opened via a call to shm_open(). However, the function does not destroy the object. Conversely, the object persists if all file descriptors are closed and all shared mappings are removed, but the name is not unlinked. This is true even if the process that created the object exits. While this persistence can be a useful feature, it can also lead to memory leaks that are hard to diagnose, as the object may consume memory without any mappings from any address space in the system.
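The three kinds of reference can be observed one at a time. In this sketch (object_outlives_name() is a hypothetical function), the name is removed first, then the descriptor is closed, leaving the mapping as the last thing keeping the object alive.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 0 if the object stays usable after its name is removed and its
 * descriptor is closed, 1 if not, and -1 if the object cannot be set up.
 */
int
object_outlives_name(const char *name)
{
    assert(name != NULL && name[0] == '/');

    long page = sysconf(_SC_PAGESIZE);
    int fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1) {
        return -1;
    }

    int result = -1;
    if (page > 0 && ftruncate(fd, page) == 0 && shm_unlink(name) == 0) {
        /* The name is gone, so opening by name must now fail. */
        int again = shm_open(name, O_RDWR, 0);
        int name_gone = (again == -1 && errno == ENOENT);
        if (again != -1) {
            close(again);
        }

        /* The descriptor still refers to the object and can be mapped. */
        char *ptr = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        close(fd);
        fd = -1;

        if (ptr != MAP_FAILED) {
            /* Only the mapping is left, and the memory is still there. */
            ptr[0] = 42;
            result = (name_gone && ptr[0] == 42) ? 0 : 1;
            munmap(ptr, (size_t)page);  /* last reference: object goes away */
        }
    } else {
        shm_unlink(name);       /* do not leave a half-made object behind */
    }

    if (fd != -1) {
        close(fd);
    }
    return result;
}
```

Leaving out the shm_unlink() call is what produces the leak described above: the object would then survive the munmap() and even the exit of the process.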

What about read() and write()?

POSIX does not define the outcome of using read() and write() calls on shared memory objects. It works on QNX, but is not very efficient (as the memory manager needs to create and destroy its own mappings of the object). Stick to mmap().

Sharing Objects

With shared memory objects, one process creates the object, and then other processes gain access to it. To do that, these other processes need to be aware of the object and how to obtain such access. There are several methods that can be used to let processes know about an object.

Unique Name

As we saw before, the shm_open() function takes a name argument that identifies the object. If that name starts with a slash then it is globally-unique, and can be opened by any process with the right access permissions. The easiest way for processes to find a particular shared-memory object is therefore to decide, a priori, on a name for that object, and code that into all the programs that make use of that object.

The problem with such an approach is that the name may clash with that used for another object in the system. Such a clash can happen by chance. It can also be the result of a malicious process creating one in order to block the operation of other processes, or to intercept the data.

There are a few ways to avoid this problem:

  1. Generate a unique name at creation time. If a clash with an existing name is detected, a different name is chosen (similar to how functions like mkstemp() work). This option requires the name to be communicated to other processes via mechanisms such as IPC (messages, pipes, sockets), environment variables, files, etc.
  2. Rely on a system-specific mechanism that provides process-specific “paths” within the shared-memory namespace.

Regardless of the solution, the reliance on global names is both error-prone and potentially insecure.
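The first option can be sketched with the O_EXCL flag, which makes shm_open() fail rather than open an object that already exists. The helper name shm_create_unique() and the name format are assumptions made for this example; the generated names are easy to predict, so this only avoids accidental clashes and ensures the caller never ends up attached to somebody else's object.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a new shared-memory object under a name that is not in use. On
 * success the descriptor is returned and the chosen name is left in 'name';
 * on failure -1 is returned with errno set.
 */
int
shm_create_unique(char *name, size_t name_size)
{
    assert(name != NULL && name_size > 0);

    for (unsigned attempt = 0; attempt < 100; attempt++) {
        int n = snprintf(name, name_size, "/shm_%ld_%u",
                         (long)getpid(), attempt);
        if (n < 0 || (size_t)n >= name_size) {
            errno = ENAMETOOLONG;
            return -1;
        }

        /* With O_EXCL an existing object is reported, never opened. */
        int fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd != -1) {
            return fd;
        }
        if (errno != EEXIST) {
            return -1;
        }
    }

    errno = EEXIST;
    return -1;
}
```

The caller still has to pass the resulting name to the other processes, and remove it with shm_unlink() once they have all opened the object.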

File Descriptor Inheritance

When a new process is created via fork() or posix_spawn(), it inherits the file descriptors of its parent (unless tagged to avoid such inheritance, or to be closed as part of the operation). Consequently, a parent process can grant access to a shared memory object to any of its children, simply by allowing them to inherit its file descriptor.
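A minimal sketch of inheritance with fork() follows. The function read_from_child() and the MSG_SIZE constant are hypothetical. Since SHM_ANON is not available everywhere, the example creates a named object and unlinks it straight away, so that the inherited descriptor is the only way for the child to reach it.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MSG_SIZE 64

/*
 * Let a forked child write into a shared-memory object through the inherited
 * descriptor, then copy what it wrote into 'out' (MSG_SIZE bytes). Returns 0
 * on success and -1 on failure.
 */
int
read_from_child(const char *name, char *out)
{
    assert(name != NULL && out != NULL);

    int fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1) {
        return -1;
    }
    shm_unlink(name);   /* from here on only descriptors reach the object */

    if (ftruncate(fd, MSG_SIZE) == -1) {
        close(fd);
        return -1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        close(fd);
        return -1;
    }

    if (pid == 0) {
        /* Child: 'fd' refers to the same object as it does in the parent. */
        char *msg = mmap(NULL, MSG_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (msg == MAP_FAILED) {
            _exit(1);
        }
        strcpy(msg, "hello from child");
        _exit(0);
    }

    int status;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        char *msg = mmap(NULL, MSG_SIZE, PROT_READ, MAP_SHARED, fd, 0);
        if (msg != MAP_FAILED) {
            memcpy(out, msg, MSG_SIZE);
            out[MSG_SIZE - 1] = '\0';
            munmap(msg, MSG_SIZE);
            result = 0;
        }
    }

    close(fd);
    return result;
}
```

The parent maps the object only after the child has exited, which also shows that the contents belong to the object and not to any one process.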

Send a File Descriptor

It is possible for a process to pass an open file descriptor to another process over a Unix Domain Socket (UDS). This method avoids the need to communicate a name and then have the other process open that name. Consequently, the object can be created without a name and shared explicitly only with those processes the creator vets. Since file descriptors are unique to each process, the transfer creates a clone of the sender’s file descriptor in the receiving process.

Unfortunately, this method requires that the processes interact using UDS, and is quite cumbersome to use. Implementation-wise, the kernel needs to parse the messages sent over UDS to know that a file descriptor is being passed and take the necessary action to clone it for the destination process.

File-descriptor sending over UDS is further complicated in a micro-kernel-based operating system, such as QNX. The mechanism requires a third process, namely the one that implements UDS, to be involved in the cloning of a file descriptor. Since such cloning is a potential security nightmare, the system needs to ensure that the UDS process cannot just clone any file descriptor in any process.
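For reference, this is roughly what descriptor passing looks like from the application side, using the SCM_RIGHTS control message. The send_fd() and recv_fd() helpers are hypothetical names, and the sketch assumes an already connected Unix-domain stream socket. The amount of boilerplate gives a sense of why the method is described as cumbersome.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send 'fd' over the connected Unix-domain socket 'sock'. 0 on success. */
int
send_fd(int sock, int fd)
{
    assert(sock >= 0 && fd >= 0);

    char byte = 0;      /* one byte of ordinary data carries the message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr;     /* forces suitable alignment */
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    memset(&ctl, 0, sizeof(ctl));

    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    /* The descriptor travels in a control message, not in the data. */
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor from 'sock'. Returns the new descriptor or -1. */
int
recv_fd(int sock)
{
    assert(sock >= 0);

    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;

    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
    return fd;
}
```

The number returned by recv_fd() generally differs from the one that was sent, as it is a new descriptor in the receiver's own table that refers to the same underlying object.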

Shared-Memory Handles

On QNX, shared-memory handles provide a simple, secure way to give other processes access to shared-memory objects. A process, which already has a file descriptor to the object, registers a handle to the object with the system. The handle is specific to the source process (the one that registers it) and to the destination process (the process that will be given the handle). The handle can be communicated to the target process via any mechanism, as it is simply a 64-bit value. When the target process receives this handle, it converts it to a file descriptor.

In the following example, a source process (PID 1234) creates a 1MB shared-memory object, obtains a read-only handle for the target process (PID 5678), and then sends that handle to that process over a pipe connection.

fd = shm_open(SHM_ANON, O_RDWR | O_CREAT, 0600);
ftruncate(fd, 1024 * 1024UL);
shm_create_handle(fd, 5678, O_RDONLY, &handle, 0);
write(pipe[1], &handle, sizeof(handle));

The target process receives the handle and converts it into a local file descriptor:

read(pipe[0], &handle, sizeof(handle));
fd = shm_open_handle_pid(handle, O_RDONLY, 1234);
ptr = mmap(0, 1024 * 1024UL, PROT_READ, MAP_SHARED, fd, 0);

Note that shm_open_handle_pid() will fail if any of the following is true:

  1. the handle is not registered;
  2. the handle was already consumed once;
  3. the target process attempts to open the handle with permissions greater than those assigned by the source process;
  4. the process that registered the handle doesn’t match the source ID.

The file descriptor obtained by the target is no different than a descriptor obtained with a call to shm_open(). As such, the target process can map the descriptor at any time, and as many times as it wants. It can also register its own handle for a third process. The original source process can prevent that by registering the handle such that it cannot be converted to a file descriptor:

shm_create_handle(fd, 5678, O_RDONLY, &handle, SHM_CREATE_HANDLE_OPT_NOFD);

The target process can only map this handle directly, and only once:

ptr = mmap_handle(0, 1024 * 1024UL, PROT_READ, MAP_SHARED, handle, 0);

Finally, access to shared-memory objects can be revoked by the process that created the object, causing all mappings to the object, either for a particular process or all of them, to be invalidated.

elahav
http://membarrier.wordpress.com/?p=350
Memory Management: The mmap() Call
Uncategorized, memory manager, os, qnx

(If you have not done so already, you may wish to read the previous two posts about memory management, as they contain information that is relevant to the following discussion: the basics and virtual memory.)

This post may read a bit like a manual page (especially the description of the arguments to the function), but since mmap() plays such a vital part in memory management, I thought it would be good to describe it in detail.

The Role of mmap()

As this post will show, mmap() is the Swiss Army Knife of memory management: the function does many different things, depending on the myriad of combinations of options passed to it. We can, nevertheless, express the role of mmap() in simple terms: to carve a range of addresses out of the process’ virtual address space for a specific purpose. What that specific purpose is depends on the arguments passed to the function, which will be discussed in detail below.

Almost all interaction between a user process and the system’s memory manager happens via calls to the mmap() function, along with its counterpart munmap(). This may come as a surprise to some people, who expect calls such as malloc() (or new in C++) to provide the primary interface to the memory manager. That is, however, a common misconception: malloc() does not allocate memory from the system. What malloc() does is allocate memory from the process’ heap, a pool of memory that has already been allocated from the system and now belongs to the process. It is only when the heap does not have sufficient memory to satisfy a malloc() request that it needs to grow by allocating system memory, which is done via a call to mmap().1

The Prototype

Let’s take a look at the arguments to mmap() and its return value:

void *mmap(void *addr, size_t len, int prot, int flags, int fd, off_t off);

Requested Address

The addr argument can be used to request a virtual address for the range carved by the call. In most cases, this will be set to NULL, which means that the memory manager can pick any free range of sufficient size within the address space. If the value is not NULL, then it is taken as a hint by the memory manager, but need not be respected.

A special case occurs when the MAP_FIXED flag is set in the flags argument. In this case addr is no longer a hint. If it is a valid value (i.e., page-aligned, within the address space’s limits) then it will be used for the range returned by mmap(). Note that if this range overlaps any other address range that is in use (i.e., carved by an earlier call to mmap()) then the overlapping parts are first unmapped.

Length

The len argument specifies the size of the carved address range, in bytes. Recall that virtual to physical translations always occur at the granularity of a single page. While the mmap() interface allows for the length to be any value, the argument will be rounded up to the next multiple of a page size. Keep the following in mind when using mmap() to access physical memory:

  1. If allocating memory, the actual amount allocated will be bigger than requested if the length is not a multiple of a page size. A request to allocate 4 bytes via mmap() will allocate 4KB (or more, depending on the system).
  2. If mapping existing physical memory, the window exposed to the process may be bigger than requested. This can have implications on safety and security. For example, if the hardware places GPIO control registers and I2C control registers within the same 4KB range of physical addresses, it is not possible to expose just the GPIO registers to a process, without also giving it control over I2C. The same applies to shared memory, DMA buffers, etc. Isolation is only guaranteed at the granularity of a page.
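
The rounding itself is simple arithmetic. A minimal sketch, with a hypothetical helper name, that queries the page size at run time rather than assuming 4KB:

```c
#include <stddef.h>
#include <unistd.h>

/* Round a requested length up to the next multiple of the page size. */
size_t round_up_to_page(size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (len + page - 1) / page * page;
}
```

With a 4KB page size, round_up_to_page(4) yields 4096 and round_up_to_page(4097) yields 8192.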
Protection Flags

As mentioned in virtual memory, a translation entry in a page table can have different bits that control access to the underlying physical memory. These access control bits are populated according to the protection flags passed to the mmap() call that spans the virtual address range for these translation entries (recall that each translation entry represents a page-sized sub-range of the virtual address space). It is no surprise, therefore, that these flags correspond to the different access control schemes, with PROT_READ, PROT_WRITE and PROT_EXEC controlling read, write and executable access, respectively.

These flags can be combined in various ways using bitwise-OR operators, though most hardware architectures have PROT_WRITE imply PROT_READ as well. Also, many modern systems deny a combination of all three at once, to avoid the security hazard of modifiable code. Code that must be generated or modified at run time is first written through a PROT_READ|PROT_WRITE mapping, and the protection is then changed to PROT_READ|PROT_EXEC using the mprotect() call. Further restrictions on protection are imposed by the use of file descriptors, as will be discussed below.
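
The write-then-execute sequence might look like the sketch below. The function name load_code() is hypothetical, the sketch is limited to a single page, and it stops short of jumping into the code: on some architectures the instruction cache must also be synchronized first, which is outside the scope of mmap().

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANON on glibc under strict -std modes */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Copy len bytes of machine code into a fresh one-page mapping and make it
 * executable. Returns the mapping, or NULL on failure. */
void *load_code(const void *code, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    if (len > page)
        return NULL; /* keep the sketch to a single page */

    /* Step 1: writable, not executable. */
    void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    memcpy(p, code, len);

    /* Step 2: executable, no longer writable. */
    if (mprotect(p, page, PROT_READ | PROT_EXEC) != 0) {
        munmap(p, page);
        return NULL;
    }
    return p;
}
```

At no point is the page both writable and executable, which is the property that systems denying all three flags at once are trying to enforce.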

The special PROT_NONE symbol can be used to signify “no protection flags”. This option is useful when creating a range of virtual addresses that is initially inaccessible, and does not need to be backed by any memory. A common use case of this option is to reserve a large virtually-contiguous range of addresses to be populated later, using the MAP_FIXED option.

Flags

The various MAP_* flags control the behaviour of mmap() in many different ways. POSIX defines the semantics for some of these. Many *NIX systems have common extensions, while some (including QNX) provide flags that are only meaningful on that system.

Any call to mmap() must specify exactly one of the MAP_PRIVATE and MAP_SHARED flags, and the distinction of private vs shared is one of the most important aspects to understand when dealing with memory management. The MAP_SHARED flag is used to expose either an existing, or a new memory object to an address space (where object can be a file, shared memory or memory-mapped registers). The matching translation entries connect the virtual addresses directly to the physical memory that is defined by the object. Any updates to memory via store instructions are reflected in all shared mappings of the same object in all other address spaces. The update is also reflected in the object itself: for example, a store to a mapped file will write to the file, while a store to mapped hardware registers will write to those registers.2

On the other hand, when using the MAP_PRIVATE flag, the underlying physical memory belongs to the process. Any updates made via store instructions remain isolated within this process, and are not observed by other processes. When mapping a memory object as private, the process obtains a copy of the contents of that object, backed by newly-allocated physical memory. It is therefore almost always wrong to map memory used for interacting with hardware as private.

The MAP_ANON flag (or its alias, MAP_ANONYMOUS) is one of the most commonly-used, and has been provided by many systems for decades; yet it has only been added to POSIX with Issue 8 in 2024. This flag is used to allocate memory from the system that is not backed by any existing object, and in which the value of all bytes is initially 0. When this flag is combined with MAP_PRIVATE it provides the primary interface for allocating memory from the system to a process’ heap. When combined with MAP_SHARED it can be used as a quick way to share memory between a parent process and a child process created with a call to fork(), but without exposing it to any other process.
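
The difference between the two combinations shows up as soon as fork() is involved. In this sketch (the function name is made up for illustration) a child process stores 42 into an anonymous mapping, and the parent then reads the same virtual address:

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANON on glibc under strict -std modes */
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map one int with the given sharing flag (MAP_SHARED or MAP_PRIVATE),
 * let a child process store 42 in it, and return what the parent sees.
 * Returns -1 on error. */
int value_seen_by_parent(int sharing)
{
    /* Anonymous memory starts out as zero. */
    int *val = mmap(NULL, sizeof *val, PROT_READ | PROT_WRITE,
                    sharing | MAP_ANON, -1, 0);
    if (val == MAP_FAILED)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {
        *val = 42; /* the child's store */
        _exit(0);
    }

    int seen = -1;
    if (pid > 0 && waitpid(pid, NULL, 0) == pid)
        seen = *val;

    munmap(val, sizeof *val);
    return seen;
}
```

With MAP_SHARED the parent reads 42; with MAP_PRIVATE it still reads 0, because the child's store went to the child's own copy.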

MAP_FIXED is a POSIX flag whose semantics were described above.

On a QNX system, MAP_PHYS is a common flag used by processes interacting with hardware. When combined with MAP_ANON this flag requests contiguous physical memory for backing the virtual range. Without MAP_ANON, the offset field is interpreted as a physical address to be mapped by a process with the necessary privileges (i.e., that has been granted access to the corresponding physical range via system abilities). I hope to get rid of this flag in a future release, as there are much better ways to handle either use case on a modern QNX system.

File Descriptor

A file descriptor is used to identify the object being mapped. Common cases include:

  1. a file, as obtained with a call to open();
  2. a shared-memory object, as obtained with a call to shm_open();
  3. a typed-memory object, as obtained with a call to posix_typed_mem_open();
  4. -1 for anonymous memory.

When using a file descriptor and the MAP_SHARED flag (see above), the requested protection flags must be a subset of the protection flags used to open the file. For example, a file opened with the O_RDONLY flag cannot be mapped as shared with PROT_WRITE, as that would provide a way for the process to bypass file access control. On the other hand, no such restriction is required when mapping the file as private, as writing to the memory does not affect the file.
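
This restriction is easy to observe. The sketch below uses a scratch file under /tmp (an assumption, as are the function name and the file name template), opens it read-only, and asks for a writable mapping with either sharing flag:

```c
#define _DEFAULT_SOURCE /* exposes mkstemp() on glibc under strict -std modes */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Try to map the first page of a read-only file descriptor as writable,
 * using the given sharing flag. Returns 0 on success, or the errno value
 * reported by the failing call. */
int map_rdonly_file_writable(int sharing)
{
    char path[] = "/tmp/mmap-demo-XXXXXX"; /* hypothetical scratch file */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    int err;

    /* Create a one-page file, then reopen it read-only. */
    int wfd = mkstemp(path);
    if (wfd < 0)
        return errno;
    err = (ftruncate(wfd, (off_t)page) != 0) ? errno : 0;
    close(wfd);

    int fd = -1;
    if (err == 0) {
        fd = open(path, O_RDONLY);
        if (fd < 0)
            err = errno;
    }
    unlink(path); /* an open descriptor keeps the file alive */
    if (err != 0)
        return err;

    void *p = mmap(NULL, page, PROT_READ | PROT_WRITE, sharing, fd, 0);
    if (p == MAP_FAILED)
        err = errno;
    else
        munmap(p, page);

    close(fd);
    return err;
}
```

POSIX specifies EACCES for the MAP_SHARED case, while the MAP_PRIVATE request succeeds.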

Offset

When using a file descriptor, the offset specifies the location within the corresponding object to map. For example, when mapping a file and providing an offset of 0x2000, reading the first byte from the returned virtual address reflects the contents of the file 8192 bytes from its beginning. Note that while it is possible to specify an offset that is not page-aligned, the call will expose the contents of the file starting at the rounded-down page-aligned offset. If a process maps at offset 0x2100, it will see the file contents starting at offset 0x2000 (assuming a 4KB page size).

Return Value

A successful call to mmap() returns the first virtual address in the range carved as a result of this call. If the returned address is V, then the resulting range is [V, V+L)3, where L is the requested length, rounded up to a page size. The returned address is almost always page-aligned. The exception is when the requested offset is itself not page-aligned, in which case the result will be congruent to the offset modulo the page size. As mentioned before, however, the mapped range is actually page-aligned. For example, a request to map at offset 0x2100 can return the virtual address 0x5320100, but in fact the mapped range starts at 0x5320000.
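
POSIX allows an implementation to reject an unaligned offset with EINVAL instead, so portable code often performs the rounding itself. The sketch below is one way to do so; struct file_window, map_window() and the demo function are hypothetical names, and the scratch file lives under /tmp by assumption.

```c
#define _DEFAULT_SOURCE /* exposes mkstemp() on glibc under strict -std modes */
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct file_window {
    void  *base; /* page-aligned start of the mapping, for munmap() */
    size_t size; /* number of bytes mapped, starting at base */
    char  *data; /* address that corresponds to the requested offset */
};

/* Map len bytes of fd starting at off, which need not be page-aligned.
 * Returns 0 on success, -1 on failure (errno is set by mmap()). */
int map_window(int fd, off_t off, size_t len, struct file_window *w)
{
    size_t page  = (size_t)sysconf(_SC_PAGESIZE);
    size_t delta = (size_t)off % page; /* how far off is into its page */
    off_t  start = off - (off_t)delta; /* offset rounded down to a page */

    void *base = mmap(NULL, len + delta, PROT_READ, MAP_PRIVATE, fd, start);
    if (base == MAP_FAILED)
        return -1;

    w->base = base;
    w->size = len + delta;
    w->data = (char *)base + delta;
    return 0;
}

/* Demo: fill a scratch file so that byte i holds (i >> 8), map it at offset
 * 0x2100 and return the first byte seen through the window (-1 on error). */
int first_byte_at_0x2100(void)
{
    char path[] = "/tmp/mmap-off-XXXXXX"; /* hypothetical scratch file */
    unsigned char buf[0x3000];
    int result = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the open descriptor keeps the file alive */

    for (size_t i = 0; i < sizeof buf; i++)
        buf[i] = (unsigned char)(i >> 8);

    struct file_window w;
    if (write(fd, buf, sizeof buf) == (ssize_t)sizeof buf &&
        map_window(fd, 0x2100, 16, &w) == 0) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        /* The address is congruent to the offset modulo the page size. */
        if ((uintptr_t)w.data % page == 0x2100 % page)
            result = (unsigned char)w.data[0];
        munmap(w.base, w.size);
    }
    close(fd);
    return result;
}
```

The demo returns 0x21, the byte stored at file offset 0x2100, even though the mapping itself starts at the rounded-down offset.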

When using MAP_FIXED the function either returns the requested address, or fails.

On failure, the returned value is the constant MAP_FAILED. This is usually not NULL, but rather some illegal virtual address (e.g., (void *)-1UL on a QNX system). I touched on the reason for that in a previous blog post: the virtual address 0 is in fact valid, and while it will never be returned by a regular call to mmap(), it can be used along with MAP_FIXED.
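
In code, this means the result must be compared against MAP_FAILED and never against NULL. A minimal sketch, with a hypothetical wrapper name:

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANON on glibc under strict -std modes */
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Allocate len bytes of private anonymous memory. On success, stores the
 * address in *out and returns 0. On failure, returns the errno value set
 * by mmap() and leaves *out untouched. */
int map_anon(size_t len, void **out)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED) /* not "p == NULL" */
        return errno;
    *out = p;
    return 0;
}
```

A zero-length request is a convenient way to exercise the failure path, since POSIX requires it to fail with EINVAL.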

Semantics

When discussing the semantics of a specific combination of options to the mmap() call, we need to look at four different aspects:

  1. the virtual address range returned by the call;
  2. the physical memory that backs the virtual range (if any);
  3. the contents of the physical memory on the first load instruction from an address in the range;
  4. whether the memory is mapped private or shared, which determines the effect of store instructions on addresses in the range.

Instead of going through an exhaustive list of argument combinations, we will look at the semantics of several examples for some of the more common scenarios. All of the examples assume a page size of 4KB. For brevity, we assume that the macro KB multiplies a value by 1024, that the macro MB multiplies by 1024 * 1024, and so on.
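
One possible definition of these macros is shown below. The cast to size_t is my own choice: it keeps an expression such as GB(4) from overflowing a 32-bit int on a 64-bit system.

```c
#include <stddef.h>

/* Size helpers used by the examples. */
#define KB(x) ((size_t)(x) * 1024)
#define MB(x) ((size_t)(x) * 1024 * 1024)
#define GB(x) ((size_t)(x) * 1024 * 1024 * 1024)
```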

Heap Memory
ptr = mmap(NULL, MB(10), PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);

This call is used to allocate 10MB of private-anonymous memory, as used by the process’ heap.

  1. The virtual address returned is anywhere in the address space that has an unused 10MB range.
  2. The physical memory used to back the range can come from anywhere in the range of RAM available to the system’s physical allocator. The physical pages backing the virtual range need not be contiguous.
  3. The entire range is zero-initialized.
  4. The memory is private to the process. Unrelated processes will not be able to see this memory at all. Child processes created with fork() will see a copy of the contents as last written by the parent, but any future write (either by child or parent) will not be observed by the other process.
Shared Memory
fd = shm_open("myshmem", O_RDWR | O_CREAT | O_TRUNC, 0600);
ftruncate(fd, KB(64));
ptr1 = mmap(NULL, KB(64), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
strcpy(ptr1, "hello");
ptr2 = mmap(NULL, KB(4), PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
strcpy(ptr2, "world");

This code creates a shared memory file and then expands it to 64KB. The first call to mmap() has the following semantics:

  1. The virtual address returned is anywhere in the address space that has an unused 64KB range.
  2. If the shared-memory file has already been assigned physical memory, then this physical memory is used to back the range. Otherwise, new physical memory can be allocated from anywhere in RAM, associated with the shared-memory file, and then used to back the virtual range.
  3. The entire range is zero-initialized, as this mapping follows the first time the object was populated (assuming no race conditions with other threads/processes that manage to map the object in between).
  4. The memory is mapped as shared. This means that the following strcpy() call updates the shared-memory object. Any other shared mappings of this object, and any future private mappings, will see “hello” at offset 0.

The second call to mmap() is a private mapping of the first 4KB of the object.

  1. The virtual address returned is anywhere in the address space that has an unused 4KB range.
  2. The physical memory consists of a newly allocated page, which is not assigned to the object, and can come from anywhere in RAM.
  3. The initial contents of the memory match what was written to the object so far. In this example, the first 5 bytes will contain the string “hello”.
  4. Since the memory is mapped as private, the second strcpy() call will overwrite the copy of the contents with “world”, but will not be reflected in the object.
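
The behaviour described above can be checked with a self-contained version of the example. The object name and the function name are made up, and on older C libraries shm_open() may require linking with -lrt.

```c
#define _DEFAULT_SOURCE /* exposes ftruncate() on glibc under strict -std modes */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define KB(x) ((size_t)(x) * 1024)

/* Returns 1 if the shared and private views behave as described,
 * 0 if they do not, and -1 if any call fails. */
int shared_vs_private_views(void)
{
    const char *name = "/mmap-demo-shmem"; /* hypothetical object name */
    int result = -1;

    int fd = shm_open(name, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    shm_unlink(name); /* the open descriptor keeps the object alive */

    if (ftruncate(fd, (off_t)KB(64)) != 0) {
        close(fd);
        return -1;
    }

    char *shared = mmap(NULL, KB(64), PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (shared == MAP_FAILED) {
        close(fd);
        return -1;
    }
    strcpy(shared, "hello"); /* updates the object itself */

    char *priv = mmap(NULL, KB(4), PROT_READ | PROT_WRITE,
                      MAP_PRIVATE, fd, 0);
    if (priv != MAP_FAILED) {
        /* The private view starts as a copy of the object's contents. */
        int saw_hello = strcmp(priv, "hello") == 0;

        /* A store to the private view does not reach the object. */
        strcpy(priv, "world");
        int isolated = strcmp(shared, "hello") == 0;

        result = saw_hello && isolated;
        munmap(priv, KB(4));
    }

    munmap(shared, KB(64));
    close(fd);
    return result;
}
```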
File Mapping
fd = open("/system/lib/libfoo.so.1", O_RDONLY);
ptr1 = mmap((void *)0x1000000, KB(40), PROT_READ | PROT_EXEC, MAP_SHARED, fd, KB(16));
ptr2 = mmap((void *)0x80000, KB(12), PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, KB(84));

This example shows the typical mappings of a dynamic linker when it loads a shared library. The file is opened as read-only, and therefore cannot be mapped as shared for writing. For the first mapping:

  1. The virtual address returned may be 0x1000000, if the range [0x1000000, 0x1000000+40KB) is currently unused, if the system does not force ASLR, and if the memory manager feels like satisfying the request. Otherwise, it can come from anywhere in the address space with a 40KB hole.
  2. The physical pages are those associated with a file object handled by the memory manager. Such an object may already exist, if the file is currently mapped by another process, or if the memory manager caches this file. The physical pages themselves can be from anywhere in RAM.
  3. The contents of the mapping match those of the file, starting at offset 16KB. This is probably the location of a code section.
  4. The file is mapped as shared, which means that there is no need to allocate private memory for it. On systems that provide on-demand paging this may not be important, as the memory is not writable. Nevertheless, on systems that privatize memory upfront (such as QNX 8), using a shared mapping here avoids a redundant memory allocation.

For the second mapping:

  1. The virtual address returned may be 0x80000, subject to the same conditions as the first mapping.
  2. The physical pages are newly allocated (either up-front or on demand, when the memory is written) and can come from anywhere in RAM.
  3. The initial contents of memory are those of the file, starting at offset 84KB. This is likely a data segment, containing values initialized to non-zero values (as with the C statement static int foo = 12;).
  4. The memory is mapped as private, to allow the process to update such data, without the updates being seen by other processes using the same shared library.
Reserving Virtual Address Ranges
ptr1 = mmap(NULL, GB(4), PROT_NONE, MAP_SHARED, -1, 0);
ptr2 = mmap((char *)ptr1 + MB(8), MB(30), PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0);

For the first mmap() call:

  1. The virtual address range can come from anywhere in the address space, where a 4GB hole is found.
  2. No physical memory is used to back the address range.
  3. There is no notion of initial contents. Any access to the range is going to fault.
  4. The range is mapped as shared, but that’s irrelevant in this case.

For the second mmap() call:

  1. The virtual address is 8MB from the value returned for the first call.
  2. Memory is backed by newly-allocated physical pages.
  3. The memory is zero-initialized.
  4. Memory is mapped as private.

What is the purpose of such code? The need sometimes arises to have a contiguous range of virtual addresses dedicated to some function. The first call ensures that this address range cannot be accidentally used to satisfy requests that do not specify MAP_FIXED. In this particular example, we created a heap where 64-bit addresses can be packed into 32-bit values: the base is known to be the address returned by the first call, and every address in the 4GB range is that base plus an offset that fits in 32 bits.
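
The reserve-then-populate pattern and the pointer packing can be sketched as follows. All four function names are hypothetical. One assumption differs from the calls above: the reservation uses MAP_PRIVATE | MAP_ANON, which is the spelling most systems accept for an anonymous PROT_NONE range.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANON on glibc under strict -std modes */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Reserve size bytes of address space with no access and no backing memory.
 * Assumption: a private anonymous mapping is used for the reservation. */
void *reserve_range(size_t size)
{
    void *base = mmap(NULL, size, PROT_NONE, MAP_PRIVATE | MAP_ANON, -1, 0);
    return base == MAP_FAILED ? NULL : base;
}

/* Back len bytes at base + off with zero-filled read-write memory.
 * off and len are expected to be multiples of the page size. */
void *commit_range(void *base, size_t off, size_t len)
{
    void *p = mmap((char *)base + off, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* With a reservation of at most 4GB, any address inside it can be stored
 * as a 32-bit offset from the base. */
uint32_t pack_ptr(const void *base, const void *p)
{
    return (uint32_t)((const char *)p - (const char *)base);
}

void *unpack_ptr(void *base, uint32_t off)
{
    return (char *)base + off;
}
```

MAP_FIXED is safe here precisely because the target addresses lie inside a range that the same code reserved earlier, so nothing unrelated can be replaced.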

Accessing Hardware Registers
ptr = mmap(NULL, KB(4), PROT_READ | PROT_WRITE | PROT_NOCACHE, MAP_PHYS | MAP_SHARED, -1, 0xfe200000);

This call is the QNX version for mapping a specific physical address, in this case the GPIOs on a Raspberry Pi 4. A common alternative on other systems is to open a special device, /dev/mem, and map the resulting file descriptor at an offset equal to the address.

  1. The virtual address range can come from anywhere in the address space, where a 4KB hole is found.
  2. The physical memory is the 4KB range that starts at address 0xfe200000.
  3. The value of the memory is whatever is presented by the hardware.
  4. The range is mapped as shared, non-cached, which means that writes to memory go directly to the hardware registers.4

It goes without saying, but will be said nevertheless, that the caller must have privileges to map this physical address, and that the system had better lock down such access only to processes that need it, and then only to the required physical range. A process that can map arbitrary physical memory has unlimited access to the system, which pretty much means that the system offers no safety or security guarantees.

A better alternative to this call is to have the system define a typed-memory object for the hardware device:

fd = posix_typed_mem_open("/gpio", O_RDWR, POSIX_TYPED_MEM_ALLOCATE);
ptr = mmap(NULL, KB(4), PROT_READ | PROT_WRITE | PROT_NOCACHE, MAP_SHARED, fd, 0);

This scheme avoids the need for the driver to know where the GPIOs are located in memory. Moreover, the mmap() call “allocates” the physical range, which means that it cannot be used by other processes. QNX also provides ways to “package” the result into a shared memory object, and provide controlled access to other processes. We will discuss this option in detail in the next post.

  1. Traditional UNIX systems used the sbrk() call for obtaining system memory, but that call is inadequate, as it assumes that the heap is contiguous in the virtual address space, which it need not be. ↩
  2. As long as the registers are mapped properly, which typically means non-cached. Memory-ordering restrictions may also apply. ↩
  3. For those not familiar with the notation, [a, b) means a range that starts at a (i.e., includes a) and ends just before b (i.e., excludes b). ↩
  4. I dislike the PROT_NOCACHE “protection” flag, which really should have been MAP_NOCACHE. I may change that in a future release. ↩
elahav
http://membarrier.wordpress.com/?p=329
Memory Management: Virtual Memory
Uncategorized, memory manager, os

In the previous post we saw how the memory management unit (MMU) uses page tables to translate virtual addresses into physical ones. We will now consider the various features that such a translation enables in an operating system. In the discussion below, it is important to remember that the granularity of translation is a single page (e.g., 4KB) and not a single byte or a single word. This fact has some important consequences that will be discussed in a future post.

Virtual Address Spaces

We can create different views of memory using different sets of page tables: one set translates the virtual address 0x1000 to the physical address 0x2000, and another set translates the virtual address 0x1000 to the physical address 0x7000. In many systems, such a view has a one-to-one relationship with a process, and is known as the process’ virtual address space (or sometimes just address space). In almost all cases, if a virtual address has a translation in two different address spaces, that translation results in two different physical addresses. (A notable exception is a shared mapping following a fork() call, in which the translation is guaranteed to be the same in the parent and the child.)
The use of a per-process virtual address space provides several advantages to a system that supports this feature over a system that doesn’t:

Isolation

As mentioned in “Memory Management: The Basics”, once the MMU has been turned on, all load and store instructions are restricted to using virtual addresses. This means that a process can only access physical memory that is the destination of some translation in its page tables. Moreover, the page tables are handled by the operating system’s memory manager, and are not directly exposed to the process. Consequently, a process has no access to physical memory to which it has not been granted access by the operating system.1 The latter can ensure that no other address space has a translation to physical memory that holds the process’ private data. Note that the term “private” here does not necessarily mean “secret”, but rather all data that is not explicitly shared by the program. By default this includes all program data that is on stacks, the heap or the process’ data segment.
Process isolation is one of the most important features of a system. Without it, any bug and any exploit in any code can affect the entire system. Isolation dramatically enhances the system’s safety and security.

Linear View of Memory

The view of memory provided by a virtual address space to a process is linear: the memory starts (typically) at virtual address 0, and extends, usually without gaps, to the limits allowed by hardware and the operating system. On a 64-bit QNX system, for example, every process sees an address space that is 512GB in size. By contrast, physical memory is never linear from the point of view of a single process:

  1. Physical memory addresses need not start at 0 and can be broken up into multiple disjoint ranges.
  2. Physical memory is a shared resource, which means that memory allocated to one process creates a gap that cannot be used by another.

The linear view of memory provided by a virtual address space makes programming for a system that supports this feature considerably easier than programming for a system that doesn’t. The programmer need not be concerned with the topology of the underlying physical memory, nor with the existence of other processes in the system. Consider a request to allocate 10MB of memory for some large array. In a system that does not implement virtual address spaces, once memory has been fragmented by other processes such that no contiguous 10MB range exists, the request cannot be satisfied, even if the system still has gigabytes of free memory. By contrast, as long as there is a big enough hole in the address space, the memory manager can satisfy the request using several disjoint physical pages. For the most part,2 the program cannot tell the difference.
Linking is also made easier by the linear view, which allows code and data for the program to reside anywhere within the limits provided by a virtual address space. This flexibility opens up the option to load the data and the code into different addresses on each invocation of the program, and in particular into random locations, a feature known as address-space layout randomization (ASLR).

Access Control and Sharing

So far we have only considered page tables as a way to translate a virtual address into a physical one. But an entry in a page table provides more than just this translation. Each entry can also be tagged with various attribute bits. The most important of these are the bits that determine the type of access the entry provides:

  • No access: the same as not having a translation at all. Any access to the virtual address results in a fault.
  • Read only: a load instruction on the relevant virtual address succeeds, but a store results in a fault.
  • Read-write: both a load instruction and a store instruction on the relevant virtual address succeed.3
  • Executable: the processor can fetch instructions from the corresponding virtual addresses. Typically combined with read-only to prevent an attacker from modifying the process’ code.

The ability to expose memory with restricted access to different address spaces allows for safe sharing of memory. When sharing memory, two (or more) address spaces have translations for the same physical memory (though the virtual address can be different for each address space). Memory access attributes in the page tables enforce restrictions on sharing that prevent one process from compromising another. Common patterns for sharing include:

  • Shared code and data (especially in libraries): each process that requires access to such memory maps it as read-only/executable (code) or read-only (data). The operating system prevents a process from getting write access to the memory. Without access control, one process could modify the code used by another process and shared libraries would not be a viable option.
  • Producer-consumer: one process (the producer) maps memory as read-write, while another (the consumer) maps the same physical memory as read-only.
  • IPC via shared memory: two processes that trust each other can map the same physical memory as read-write and use that as an efficient way to interact.
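
The producer-consumer pattern can be illustrated with the mmap() call, using two mappings of the same object. In this sketch both mappings live in one process and merely stand in for the two address spaces; the scratch file under /tmp and the function name are assumptions made for the example.

```c
#define _DEFAULT_SOURCE /* exposes mkstemp() on glibc under strict -std modes */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if data stored through the producer's mapping is visible
 * through the consumer's read-only mapping, 0 if not, -1 on error. */
int producer_consumer_views(void)
{
    char path[] = "/tmp/mmap-pc-XXXXXX"; /* hypothetical scratch file */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the open descriptor keeps the file alive */
    if (ftruncate(fd, (off_t)page) != 0) {
        close(fd);
        return -1;
    }

    /* Producer view: read-write. Consumer view: read-only. Both translate
     * to the same physical page, at different virtual addresses. */
    char *prod = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *cons = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);

    int visible = -1;
    if (prod != MAP_FAILED && cons != MAP_FAILED) {
        strcpy(prod, "sample");
        visible = (prod != cons) && strcmp(cons, "sample") == 0;
    }

    if (prod != MAP_FAILED)
        munmap(prod, page);
    if (cons != MAP_FAILED)
        munmap(cons, page);
    return visible;
}
```

A store through the consumer's address would fault, which is exactly the restriction that the access bits in its translation entries enforce.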
On-Demand Paging

The association of a virtual address with a physical one via a page table translation is referred to as backing: when a translation is added to the page table, virtual address V is backed by physical address P. An attempt to access a virtual address that is not backed by any physical address results in an exception, known as a page fault, which is handled by the operating system kernel. Typically such a page fault results in a signal delivered to the process, which, by default, terminates it.
When an address space is created, no virtual address is backed. The new process starts populating the address space, initially with the code and data provided by the program, then with any shared libraries it is linked against, and finally with explicit memory calls from the program itself. Every operation that creates a range of used virtual addresses in the address space can also back these immediately with the relevant memory (e.g., the physical memory that holds the program’s code, or zero-initialized memory to populate the process’ heap). An alternative approach is to defer the backing of virtual addresses. In this case, the memory manager simply records the fact that a given range of virtual addresses needs to be backed by physical memory with the necessary properties. The first access by a process to such a virtual address results in a page fault, but since the memory manager knows that the virtual address is in use it can resolve the exception by backing the virtual address with the right physical page. Once backing has occurred, the faulting thread within the process can be resumed, repeating the memory access that caused the fault.
On-demand paging is not restricted to the first access to a new virtual address. The memory manager can sever the association of a virtual address to a physical one at any point, by invalidating the translation in the page table.4 Doing so is safe as long as the association can be restored in the future in a way that is transparent to the process. A virtual address that is backed by memory that serves the contents of a file (see below) can be restored easily by allocating a new physical page and copying the contents of the file into it. A page that contains private data written by the process requires the memory manager to store these contents somewhere if they are to be restored at a later time. (That somewhere is known as a swap device.) If the process never again accesses any virtual address that was backed by the old physical page then there is nothing else that needs to be done. If it does access the virtual address, then a new page fault occurs, which is again dealt with by the memory manager.
Breaking the association between a virtual address and a physical page allows the memory manager to “steal” physical pages and use these to serve other virtual addresses, potentially in other address spaces. This feature allows the system to behave as though it has more physical memory than what is actually available, as long as simultaneous demand for memory by all processes is below the amount of physical memory installed.
There are cases where the association between a virtual address and a physical one must not be broken, typically when the memory is used to interact with hardware. In such cases the association needs to be fixed, which is known as locking the memory.
While on-demand paging offers great flexibility to the system, it also comes at a cost. If the system does not guarantee the availability of memory at the time a virtual address range is initialized, then a future access can find that there is no physical memory to back the address, causing a terminal fault that usually terminates the process. On the other hand, guaranteeing memory availability significantly complicates the memory manager, and also requires considerable work at the time the virtual address range is initialized, negating some of the performance benefits of deferred backing.
A second deficiency of on-demand paging is that it creates variability in process execution time. A simple load instruction can cost orders of magnitude more when it faults than when it doesn’t. Such variability can be disastrous for a real-time operating system.5
Finally, the cost of exception handling has grown significantly in recent years. As processors become faster by optimizing sequential execution, they become slower (relatively, but sometimes even absolutely) at handling events that interfere with these long sequences. Removing support for on-demand paging in QNX 8 has resulted, somewhat counter-intuitively, in significant performance improvements for various memory operations. This will be discussed in more detail in a future post.

Files as Memory

Our definition of memory includes any device that can be accessed via load and store instructions using a unique physical address. Specifically, this definition excludes access to files in a file system. For a file system that resides on a separate storage device, a device driver is required to read data from the files and write the data back; but even a memory-resident file system cannot (typically) provide access to files via simple load and store instructions, as the layout of data in the file need not correspond to the layout of the physical memory: file data can be interleaved with metadata, and be located in disjoint and out-of-order memory addresses.
Virtual addressing provides a way to expose files as memory. The mmap() call (on which we will spend considerable time in a later post) can be used to create an association between a range of virtual addresses and some open file. Because the file is not part of physical memory, this association cannot manifest itself as translations in the page table. Instead, the system allocates physical memory, reads file data into that memory, and then provides the translation from the virtual addresses to the respective physical pages. A load instruction from such a virtual address thus results in the matching file data being read from memory into a register.
The behaviour of a store instruction depends on the type of mapping. A private mapping of a file creates a process-local copy of the file’s contents. Any updates to memory via these virtual addresses affect only that process’ view of the mapped file. A shared mapping, on the other hand, causes store instructions to reflect memory writes in the underlying file, eventually updating the contents of the file in the storage device.
Memory-mapped files pose significant challenges to micro-kernel-based operating systems, such as QNX, where the memory manager and the file system(s) are separate processes. A monolithic kernel has the advantage of hosting both, simplifying the design and providing better opportunities for optimizations, in particular via a unified memory/file cache.

The following diagram illustrates some of the concepts described in this post:

  1. Assuming no bugs in the operating system or hardware vulnerabilities. ↩
  2. There are cases, especially when dealing with hardware, where the program expects contiguous virtual addresses to be backed by contiguous physical memory. Such a requirement needs a specialized interface to the memory manager. ↩
  3. While write-only memory is semantically valid, I am unaware of hardware that allows such access. ↩
  4. And also invalidating the TLB, which will be discussed in a future post. ↩
  5. Yes, caches also introduce variability, but to a much lesser extent. ↩
elahav
http://membarrier.wordpress.com/?p=311
Memory Management: The Basics
Uncategorized, memory manager, os, qnx

I have recently re-designed the memory manager for QNX 8, getting rid of certain features and improving performance. The description of these changes, the trade-offs and the challenges encountered, requires some knowledge of memory management in an operating system. In this series of blog posts I hope to provide the necessary background that would allow anyone interested in the subject to understand these changes and the justification for the new design.

The central processing unit (CPU) and the memory are the only two components that are absolutely necessary to every computer. No program, no matter how trivial, can execute on the CPU before its code and data have been loaded into memory. In a computer system capable of running multiple instances of various programs (processes), both the CPU and the memory become shared resources, requiring some form of management to ensure the fair allocation of each resource. An operating system manages both resources, with the scheduler responsible for allocating CPU time to the threads in each process, and the memory manager allocating memory.

What is Memory?

“What is memory?” is one of those questions that can be answered in many different ways. In the context of this post, memory is any data facility that can be read from and/or written to by the CPU, using load and store instructions, respectively. Each of these instructions takes an address as an argument, which identifies the location in memory accessed by the instruction. The smallest granularity of such an address is typically one byte, though it can be restricted to larger sizes in some cases, with 2, 4, 8 and 16 bytes being common values.

By this definition, a computer system can have multiple sources of memory, including RAM (random-access memory), ROM (read-only memory), and NVRAM (non-volatile RAM). Memory-mapped registers are also part of memory, allowing access to devices (e.g., disk drives, network cards, display adapters) through load and store instructions, rather than via dedicated I/O instructions.

On the other hand, storage devices that cannot be accessed by load/store are not part of memory. Data stored in a file on such a device (SSD, NVMe, MMC, SD card, etc.) needs to be transferred into memory before it becomes available to the CPU. We will see later how virtual memory can integrate such devices into the memory system.

Processor caches, while forming a crucial part of the way a CPU interacts with hardware, also fall outside this definition of memory. Caches cannot be addressed separately, instead servicing load and store instructions to subsets of the addresses available via the system’s memory. In most cases caches are transparent to programs executing on the system.

The following pseudo-assembly code shows a typical interaction of a CPU with memory, reading a value into a register, manipulating it, and then writing the result back to memory:

load reg1, 0x1000
add reg1, 1
store reg1, 0x1000

The result of this sequence is that the value stored at address 0x1000 is incremented by 1, and any future[1] load from this address observes the new value. Note that there are further memory accesses in this sequence: each instruction needs to be loaded from memory before it can be executed by the CPU.

Memory Translation

Readers interested in micro-controllers that provide no hardware facilities for memory management, or only a memory protection unit (MPU), can stop reading at this point. The remainder of this post, along with the rest of the series, assumes that the computer system provides a memory management unit (MMU), which is essential for implementing the features discussed below.

The primary task of the MMU is to translate memory addresses, such that the address specified in a load or a store instruction can be different from the one accessed by the hardware. The address used by the instruction is then called the virtual address, while the one seen by the hardware is the physical address.[2] In the example above, the virtual address 0x1000 can be translated to the physical address 0x2000, which means that the memory location manipulated by the sequence of instructions resides at address 0x2000 on the hardware memory module accessed by these instructions.
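To make the example concrete, here is a toy model of the translation step, written as a sketch rather than a description of any real MMU. The 4KB page size, the dictionary standing in for RAM, the single flat page table and the initial value 41 are all assumptions made for illustration:

```python
PAGE_SIZE = 0x1000            # 4KB pages, an assumption for this sketch

physical_memory = {}          # physical address -> value; stands in for RAM
page_table = {0x1: 0x2}       # virtual page 1 -> physical page 2 (hypothetical)


def translate(virtual_address):
    """Return the physical address that backs a virtual address."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    return page_table[page] * PAGE_SIZE + offset


def load(virtual_address):
    """Model a load instruction: translate, then read (unset memory reads as 0)."""
    return physical_memory.get(translate(virtual_address), 0)


def store(virtual_address, value):
    """Model a store instruction: translate, then write."""
    physical_memory[translate(virtual_address)] = value


physical_memory[0x2000] = 41  # initial contents of the physical location

# The load/add/store sequence from earlier, using the virtual address 0x1000.
reg1 = load(0x1000)
reg1 += 1
store(0x1000, reg1)

print(hex(translate(0x1000)))     # 0x2000
print(physical_memory[0x2000])    # 42
```

The program only ever names 0x1000, yet the value that changes lives at physical address 0x2000.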

In order to perform address translation, the MMU uses page tables, a set of mappings from virtual to physical addresses. These page tables are typically hierarchical: the virtual address is broken into multiple fields, each allowing the next level of page tables to be found from the previous one, until the last level that provides the physical address of a page. A page is a contiguous range of physical addresses, commonly 4KB or 64KB in size (though larger sizes are also possible). Virtual addresses that share the same upper bits, and differ only in the low-order bits that form the offset within a page (the lower 12 bits for 4KB pages, or the lower 16 bits for 64KB pages), thus map to the same physical page.
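The walk through a hierarchy of tables can be sketched as follows. The layout is hypothetical and chosen only for illustration: 4KB pages, two levels, and 9 index bits per level, with nested dictionaries standing in for the tables that real hardware reads from memory:

```python
OFFSET_BITS = 12   # 4KB pages: the low 12 bits are the offset within the page
INDEX_BITS = 9     # 512 entries per table (assumed for this sketch)
LEVELS = 2         # two levels of tables (assumed for this sketch)


def split(virtual_address):
    """Break a virtual address into one table index per level plus a page offset."""
    offset = virtual_address & ((1 << OFFSET_BITS) - 1)
    page_number = virtual_address >> OFFSET_BITS
    indices = []
    for level in range(LEVELS):
        shift = INDEX_BITS * (LEVELS - 1 - level)
        indices.append((page_number >> shift) & ((1 << INDEX_BITS) - 1))
    return indices, offset


def walk(root_table, virtual_address):
    """Follow the indices from the root table down to the physical page."""
    indices, offset = split(virtual_address)
    entry = root_table
    for index in indices:
        entry = entry[index]   # a missing entry (KeyError) models a failed translation
    return entry + offset      # the last-level entry is the physical page address


# Hypothetical tables: the last level holds physical page addresses.
root_table = {
    0: {1: 0x2000},
    3: {5: 0x40000},
}

print(split(0x605010))                    # ([3, 5], 16)
print(hex(walk(root_table, 0x1000)))      # 0x2000
print(hex(walk(root_table, 0x1ABC)))      # 0x2abc, same physical page as above
print(hex(walk(root_table, 0x605010)))    # 0x40010
```

The addresses 0x1000 and 0x1ABC differ only in their lower 12 bits, so both resolve to the physical page at 0x2000.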

The page tables themselves reside in memory, and it is the task of the operating system’s memory manager to create, change, and destroy these as needed. In a manner analogous to how the CPU only executes instructions provided to it by programs, the MMU only performs translations according to the page tables provided to it by software.

It is crucial to understand that once the MMU has been turned on, all memory accesses in the system use virtual addresses. No software component in the system, including privileged code such as the operating system kernel, can specify physical addresses to load and store instructions. This fact means that any memory location to which software requires access must first have a translation installed in the relevant page tables.[3]
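As a rough illustration of this rule, the toy model below refuses any access through a virtual address that has no translation, and succeeds only once a mapping is installed. The class and method names, the 4KB page size and the addresses are all hypothetical. Real hardware raises a fault exception to the kernel rather than a Python exception:

```python
PAGE_SIZE = 0x1000  # 4KB pages, an assumption for this sketch


class PageFault(Exception):
    """Raised when a virtual address has no translation installed."""


class ToyMMU:
    def __init__(self):
        self.page_table = {}   # virtual page number -> physical page number
        self.ram = {}          # physical address -> value

    def map_page(self, virtual_page, physical_page):
        """What the memory manager does: install a translation."""
        self.page_table[virtual_page] = physical_page

    def _translate(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page not in self.page_table:
            raise PageFault(hex(virtual_address))
        return self.page_table[page] * PAGE_SIZE + offset

    def store(self, virtual_address, value):
        self.ram[self._translate(virtual_address)] = value

    def load(self, virtual_address):
        return self.ram.get(self._translate(virtual_address), 0)


mmu = ToyMMU()

# Software wants to reach physical address 0x90000000 (say, a device register),
# but it can only name virtual addresses, and none is mapped yet.
try:
    mmu.store(0x5000, 1)
    faulted = False
except PageFault:
    faulted = True

mmu.map_page(0x5, 0x90000)   # virtual page 5 -> physical page 0x90000
mmu.store(0x5000, 1)         # now the same store succeeds

print(faulted)                  # True
print(mmu.ram[0x90000000])      # 1
```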

Address translation paves the way to many different features provided by the memory manager. These will be discussed in the next post.

  1. Entire books have been written on the exact meaning of “future” in this case, but the topic is outside the scope of this post. ↩
  2. Many modern systems provide more than one level of translation, which is especially important to hypervisors. In this case the virtual address can be translated into one or more intermediate addresses, before the final translation to a physical address. ↩
  3. Including the page tables themselves, which makes access to these particularly challenging for the memory manager. ↩
elahav
http://membarrier.wordpress.com/?p=294
Qt Tutorial
Uncategorized

I have been a fan of the Qt cross-platform toolkit ever since I first used it in 1999. While I have never developed Qt applications professionally, it has been my go-to UI framework for personal projects over the years. The fact that Qt is fully supported on QNX is a nice bonus, and I have used it extensively on my desktop, and, most recently, in the treadmill project.

Now that my son is learning C++ in school, I thought it would be a good opportunity to introduce him to Qt. I started C++ programming with UI frameworks (Borland OWL and MFC). A graphical user interface toolkit lends itself quite nicely to object-oriented design, while also providing instant, visible results. This has always seemed to me to be a much better way to understand OOD than abstract examples.

Looking at the available tutorials, I was unable to find one that I really liked. They all seemed to be skipping over the basics. I decided instead to write my own, with an emphasis on simplicity and thoroughness. The tutorial takes baby steps, while explaining how things work at each of these steps.

The source code for the tutorial, both for the text and the examples, is available in this GitLab project. You can also just grab the PDF version.

elahav
http://membarrier.wordpress.com/?p=277