
Bannalia: trivial notes on themes diverse

Boost.MultiIndex refactored

Boost.MultiIndex was launched as part of Boost 1.32 in November 2004. The library is still actively maintained and in use by some notable projects such as Bitcoin Core, CERN ATLAS, ClickHouse, Folly and Redpanda, to name a few.

Back in 2004, variadic templates and typelists were emulated in C++03 with the help of libraries like Boost.Preprocessor and Boost.MPL. These libraries were groundbreaking at the time but they have been largely obsoleted by language features available since C++11. Given that Boost.MultiIndex is no longer usable in C++03 (some internal dependencies have moved in the last few years to requiring C++11 as a minimum), it was about time to give the library an upgrade.

Starting in Boost 1.91 (target date April 2026), all the internal machinery of Boost.MultiIndex dependent on Boost.Preprocessor and Boost.MPL is refactored to use C++11 variadic templates and Boost.Mp11:

  • All type lists accepted or provided by the library (indexed_by, tag, nested typedefs index_specifier_type_list, index_type_list, iterator_type_list and const_iterator_type_list) are no longer based on Boost.MPL but instead they are now Boost.Mp11 lists.
  • composite_key and associated class templates (composite_key_equal_to, composite_key_compare, composite_key_hash) have been made truly variadic (previously the maximum number of template arguments was limited by the macro BOOST_MULTI_INDEX_LIMIT_COMPOSITE_KEY_SIZE). 

The upgrade should be transparent to end users in the overwhelming majority of cases, although we discuss some potential backwards compatibility issues later.

Reduction in lengths of type and symbol names

Consider:

using namespace boost::multi_index;

struct element
{
  int x, y;
};

using container = multi_index_container<
  element,
  indexed_by<
    random_access<tag<struct i0>>,
    ordered_unique<tag<struct i1>, key<&element::x, &element::y>>
  >
>;

container c;
auto&     idx = c.get<0>(); // first index of the container

Prior to Boost 1.91, typeid(c).name() and typeid(idx).name() were the following in Visual Studio (after formatting):

class boost::multi_index::multi_index_container<
  struct element,
  struct boost::multi_index::indexed_by<
    struct boost::multi_index::random_access<
      struct boost::multi_index::tag<
        struct i0,
        struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
        struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
        struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
        struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
        struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
        struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
        struct boost::mpl::na
      >
    >,
    struct boost::multi_index::ordered_unique<
      struct boost::multi_index::tag<
        struct i1,
        struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
        struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
        struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
        struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
        struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
        struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
        struct boost::mpl::na
      >,
      struct boost::multi_index::composite_key<
        struct element,
        struct boost::multi_index::member<struct element, int, 0>,
        struct boost::multi_index::member<struct element, int, 4>,
        struct boost::tuples::null_type, struct boost::tuples::null_type,
        struct boost::tuples::null_type, struct boost::tuples::null_type,
        struct boost::tuples::null_type, struct boost::tuples::null_type,
        struct boost::tuples::null_type, struct boost::tuples::null_type
      >,
      struct boost::mpl::na
    >,
    struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
    struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
    struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
    struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
    struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
    struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na
  >,
  class std::allocator<struct element>
>

class boost::multi_index::detail::random_access_index<
  struct boost::multi_index::detail::nth_layer<
    1,
    struct element,
    struct boost::multi_index::indexed_by<
      struct boost::multi_index::random_access<
        struct boost::multi_index::tag<
          struct i0,
          struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
          struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
          struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
          struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
          struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
          struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
          struct boost::mpl::na
        >
      >,
      struct boost::multi_index::ordered_unique<
        struct boost::multi_index::tag<
          struct i1,
          struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
          struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
          struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
          struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
          struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
          struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
          struct boost::mpl::na
        >,
        struct boost::multi_index::composite_key<
          struct element,
          struct boost::multi_index::member<struct element, int, 0>,
          struct boost::multi_index::member<struct element, int, 4>,
          struct boost::tuples::null_type, struct boost::tuples::null_type,
          struct boost::tuples::null_type, struct boost::tuples::null_type,
          struct boost::tuples::null_type, struct boost::tuples::null_type,
          struct boost::tuples::null_type, struct boost::tuples::null_type
        >,
        struct boost::mpl::na
      >,
      struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
      struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
      struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
      struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
      struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na,
      struct boost::mpl::na, struct boost::mpl::na, struct boost::mpl::na
    >,
    class std::allocator<struct element>
  >,
  struct boost::mpl::vector1<struct i0>
>

Those many boost::mpl::nas are default template arguments used by Boost.MPL to emulate variadic class templates; similarly, boost::tuples::null_type is the default argument for non-variadic boost::tuple. With the upgrade, the corresponding type names are: 

class boost::multi_index::multi_index_container<
  struct element,
  struct boost::multi_index::indexed_by<
    struct boost::multi_index::random_access<
      struct boost::multi_index::tag<struct i0>
    >,
    struct boost::multi_index::ordered_unique<
      struct boost::multi_index::tag<struct i1>,
      struct boost::multi_index::composite_key<
        struct element,
        struct boost::multi_index::member<struct element, int, 0>,
        struct boost::multi_index::member<struct element, int, 4>
      >,
      void
    >
  >,
  class std::allocator<struct element>
>

class boost::multi_index::detail::random_access_index<
  struct boost::multi_index::detail::nth_layer<
    1,
    struct element,
    struct boost::multi_index::indexed_by<
      struct boost::multi_index::random_access<
        struct boost::multi_index::tag<struct i0>
      >,
      struct boost::multi_index::ordered_unique<
        struct boost::multi_index::tag<struct i1>,
        struct boost::multi_index::composite_key<
          struct element,
          struct boost::multi_index::member<struct element, int, 0>,
          struct boost::multi_index::member<struct element, int, 4>
        >,
        void
      >
    >,
    class std::allocator<struct element>
  >,
  struct boost::multi_index::tag<struct i0>
>

Terser type names are beneficial when inspecting compile error messages related to the use of the library. Internal symbol names are also drastically reduced, which can improve compile and link times.
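
To make the origin of the padding concrete, here is a toy sketch contrasting the C++03-style emulation (a fixed number of defaulted template parameters) with a C++11 variadic template. This is not Boost.MPL's actual code: na, old_list, new_list, old_size and new_size are hypothetical names used for illustration only, and the limit of four parameters is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for boost::mpl::na: marks "no argument here".
struct na {};

// C++03-style emulation: a fixed number of parameters, all defaulted.
// Every instantiation carries all four arguments in its type name.
template<
  typename T0 = na, typename T1 = na, typename T2 = na, typename T3 = na>
struct old_list {};

// C++11 variadic version: the type name mentions only actual arguments.
template<typename... Ts>
struct new_list {};

// Number of real (non-na) elements of an old_list.
template<typename L> struct old_size;
template<typename T0, typename T1, typename T2, typename T3>
struct old_size<old_list<T0, T1, T2, T3>>
{
  static constexpr std::size_t value =
    !std::is_same<T0, na>::value + !std::is_same<T1, na>::value +
    !std::is_same<T2, na>::value + !std::is_same<T3, na>::value;
};

// Number of elements of a new_list: no padding to filter out.
template<typename L> struct new_size;
template<typename... Ts>
struct new_size<new_list<Ts...>>
{
  static constexpr std::size_t value = sizeof...(Ts);
};

struct i0 {};

// old_list<i0> is really old_list<i0, na, na, na>, which is what shows up
// in typeid names, error messages and mangled symbols.
static_assert(
  std::is_same<old_list<i0>, old_list<i0, na, na, na>>::value,
  "defaulted arguments are part of the type");
```

With the variadic form, new_list<i0> is just that, which mirrors the shorter names shown above.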

Faster compilation

We have measured compile times for a synthetic example program using Boost.MultiIndex 1.90 and the upcoming 1.91 version under Clang 20, GCC 15 and Visual Studio 2022 (benchmark setup here). The new version is faster in all three compilers by around 20% (Clang 1.19x, GCC 1.25x, Visual Studio 1.20x). Your mileage of course may vary.

Backwards compatibility

We foresee that most existing users of Boost.MultiIndex won't be affected by the upgrade beyond the collateral benefits described above. Some changes to user code may be needed, though, in rare situations:

  • If you were using Boost.MPL to synthesize or analyze the typelists featured in the library, your code will stop working as these are now Boost.Mp11 lists. If you are not in a position to make the necessary changes, the old Boost.MPL-based frontend can be restored by globally defining the macro BOOST_MULTI_INDEX_ENABLE_MPL_SUPPORT.
  • composite_key::key_extractors returns a std::tuple instead of a boost::tuple (and similarly for composite_key_equal_to, composite_key_compare and composite_key_hash). This change is needed because boost::tuple has a limit on the number of arguments it accepts, which is no longer the case for composite_key. If you're using key_extractors, chances are you may need to modify your code (for instance, std::tuple does not provide member function extractors of the form t.get<N>()).
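
As a rough illustration of the second point, the sketch below accesses the elements of a std::tuple of extractors with the free function std::get<N>(t), which is what replaces the member form t.get<N>(). The extractor types get_x and get_y and the function sum_of_keys are hypothetical stand-ins and not part of Boost.MultiIndex.

```cpp
#include <cassert>
#include <tuple>

struct element
{
  int x, y;
};

// Hypothetical stand-ins for the key extractors held by a composite_key.
struct get_x { int operator()(const element& e) const { return e.x; } };
struct get_y { int operator()(const element& e) const { return e.y; } };

// std::tuple has no member get<N>(): use the free function std::get<N>(t).
inline int sum_of_keys(const std::tuple<get_x, get_y>& t, const element& e)
{
  return std::get<0>(t)(e) + std::get<1>(t)(e);
}
```

For example, sum_of_keys(std::tuple<get_x, get_y>{}, element{3, 4}) evaluates both extractors on the element and adds the results.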

Call to action

If you're a Boost.MultiIndex user, please test your project with the new version of the library to ensure there won't be any issues with the upgrade; Boost 1.91 will ship in April 2026, so as of this writing there's still plenty of time to fix any detected problem. The simplest way to do the test is to clone the develop branch of boostorg/multi_index and add its include directory to your include list before the path to your local installation of Boost. Please report your results through the usual channels. Thank you!

 

Comparing the run-time performance of Fil-C and ASAN

After the publication of the experiments with Boost.Unordered on Fil-C, some readers asked for a comparison of run-time performances between Fil-C and Clang's AddressSanitizer (ASAN).

Warning

Please do not construe this article as implying that Fil-C and ASAN are competing technologies within the same application space. Whereas ASAN is designed to detect bugs resulting in memory access violations, Fil-C sports a stricter notion of memory safety including UB situations where a pointer is directed to a valid memory region that is nonetheless out of bounds with respect to the pointer's provenance. That said, there's some overlap between the two tools, so it's only natural to ask about their relative impact on execution times.

Results

Our previous benchmarking repo has been updated to include results for plain Clang 18, Clang 18 with ASAN enabled, and Fil-C v0.674, all with release mode settings. The following figures show execution times in ns per element for Clang/ASAN (solid lines) and Fil-C (dashed lines) for three Boost.Unordered containers (boost::unordered_map, boost::unordered_flat_map and boost::unordered_node_map) and four scenarios.


[Figures: running insertion, running erasure, successful lookup, unsuccessful lookup.]

In summary:

  • Insertion:
    Fil-C is between 1.8x slower and 4.1x faster than ASAN (avg. 1.3x faster).
  • Erasure:
    Fil-C is between 1.3x slower and 9.2x faster than ASAN (avg. 1.9x faster).
  • Successful lookup:
    Fil-C is between 2.5x slower and 1.9x faster than ASAN (avg. 1.6x slower).
  • Unsuccessful lookup:
    Fil-C is between 2.6x slower and 1.4x faster than ASAN (avg. 1.9x slower).

So, results don't allow us to establish a clear-cut "winner". When allocation/deallocation is involved, Fil-C seems to perform better (except for insertion when the memory working set gets past a certain threshold). For lookup, Fil-C is generally worse, and, again, the gap increases as more memory is used. A deeper analysis would require knowledge of the internals of both tools that I, unfortunately, lack.

(Update Nov 16, 2025) Memory usage

By request, we've measured peak memory usage in GB (as reported by time -v) for the three environments and three scenarios (insertion, erasure, combined successful and unsuccessful lookup) involving five different containers from Boost.Unordered and Abseil, each holding 10M elements. The combination of containers doesn't allow us to discern how closed vs. open addressing affect memory usage overheads in ASAN and Fil-C.

ASAN uses between 2.1x and 2.6x more memory than regular Clang, whereas Fil-C ranges between 1.8x and 3.9x. Results are again a mixed bag. Fil-C performs worst for the erasure scenario, perhaps because of delays in memory reclamation by this technology's embedded garbage collector.

Some experiments with Boost.Unordered on Fil-C

Fil-C is a C and C++ compiler built on top of LLVM that adds run-time memory-safety mechanisms preventing out-of-bounds and use-after-free accesses. This naturally comes at a price in execution time, so I was curious about how much of a penalty that is for a performance-oriented, relatively low-level library like Boost.Unordered.

Compiling and testing

From the user's perspective, Fil-C is basically a Clang clone, so it is fairly easy to integrate in previously existing toolchains. This repo shows how to plug Fil-C into Boost.Unordered's CI, which runs on GitHub Actions and is powered by Boost's own B2 build system. The most straightforward way to make B2 use Fil-C is by having a user-config.jam file like this:

using clang : : fil++ ;

which instructs B2 to use the clang toolset with the only change that the compiler name is not the default clang++ but fil++.

We've encountered only minor difficulties during the process:

  • In the environment used (Linux x64), B2 automatically includes --target=x86_64-pc-linux as part of the command line, which confuses the adapted version of libc++ shipping with Fil-C. This option had to be overridden with --target=x86_64-unknown-linux-gnu (which is the default for Clang).
  • As of this writing, Fil-C does not accept inline assembly code (asm or __asm__ blocks), which Boost.Unordered uses to provide embedded GDB pretty-printers. The feature was disabled with the macro BOOST_ALL_NO_EMBEDDED_GDB_SCRIPTS. 

Other than this, the extensive Boost.Unordered test suite compiled and ran successfully, except for some tests involving Boost.Interprocess, which uses inline assembly in some places. CI completed in around 2.5x the time it takes with a regular compiler. It is worth noting that Fil-C happily accepted SSE2 SIMD intrinsics crucially used by Boost.Unordered.

Run-time performance

We ran some performance tests compiled with Fil-C v0.674 on a Linux machine, release settings (benchmark code and setup here). The figures show execution times in ns per element for Clang 15 (solid lines) and Fil-C (dashed lines) and three containers: boost::unordered_map (closed-addressing hashmap), and boost::unordered_flat_map and boost::unordered_node_map (open addressing).


[Figures: running insertion, running erasure, successful lookup, unsuccessful lookup.]

Execution with Fil-C is around 2x-4x slower, with wide variations depending on the benchmarked scenario and container of choice. Closed-addressing boost::unordered_map is the container experiencing the largest degradation, presumably because it does the most pointer chasing.

Bulk operations in Boost.Bloom

Starting in Boost 1.90, Boost.Bloom will provide so-called bulk operations, which, in general, can speed up insertion and lookup by a sizable factor. The key idea behind this optimization is to separate in time the calculation of a position in the Bloom filter's array from its actual access. For instance, if this is the algorithm for regular insertion into a Bloom filter with k = 1 (all the snippets in this article are simplified, illustrative versions of the actual source code):

void insert(const value_type& x)
{
  auto h = hash(x);
  auto p = position(h);
  set(p, 1);
}

then the bulk-mode variant for insertion of a range of values would look like:

void insert(const std::array<value_type, N>& x)
{
  std::size_t positions[N];
  
  // pipeline position calculation and memory access
  
  for(std::size_t i = 0; i < N; ++i) {
    auto h = hash(x[i]);
    positions[i] = position(h);
    prefetch(positions[i]);
  }
  
  for(std::size_t i = 0; i < N; ++i) {
    set(positions[i], 1);
  }
}

By prefetching the address of positions[i] way in advance of its actual usage in set(positions[i], 1), we make sure that the latter is accessing a cached value and avoid (or minimize) the CPU stalling that would result from reaching out to cold memory. We have studied bulk optimization in more detail in the context of boost::concurrent_flat_map. You can see actual measurements of the performance gains achieved in a dedicated repo; as expected, gains are higher for larger bit arrays not fitting in the lower levels of the CPU's cache hierarchy.
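
The two-pass structure can be tried out with a self-contained toy filter. This is only a sketch under stated assumptions, not Boost.Bloom's implementation: toy_filter is a hypothetical class, positions are derived from std::hash, the filter needs at least one bit, N must be greater than zero, and prefetching relies on the GCC/Clang builtin __builtin_prefetch (on other compilers the prefetch step does nothing).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Toy k = 1 Bloom filter over ints (hypothetical, for illustration only).
class toy_filter
{
public:
  explicit toy_filter(std::size_t num_bits): words_((num_bits + 63) / 64) {}

  // regular insertion: compute the position and access it right away
  void insert(int x) { set(position(x)); }

  // bulk insertion: separate position calculation from memory access
  template<std::size_t N>
  void insert(const std::array<int, N>& xs)
  {
    std::size_t positions[N];

    // first pass: compute positions and request their cache lines
    for(std::size_t i = 0; i < N; ++i) {
      positions[i] = position(xs[i]);
      prefetch(positions[i]);
    }

    // second pass: the memory has had time to arrive in cache
    for(std::size_t i = 0; i < N; ++i) set(positions[i]);
  }

  bool may_contain(int x) const
  {
    std::size_t p = position(x);
    return (words_[p / 64] >> (p % 64)) & 1;
  }

private:
  std::size_t position(int x) const
  {
    return std::hash<int>()(x) % (words_.size() * 64);
  }

  void set(std::size_t p)
  {
    words_[p / 64] |= std::uint64_t(1) << (p % 64);
  }

  void prefetch(std::size_t p) const
  {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(&words_[p / 64]);
#else
    (void)p; // no portable prefetch available: do nothing
#endif
  }

  std::vector<std::uint64_t> words_;
};
```

Both insertion paths set exactly the same bits; the bulk overload only reorders the work so that memory requests are issued before they are needed.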

From an algorithmic point of view, the most interesting case is that of lookup operations for k > 1, since the baseline non-bulk procedure is not easily amenable to pipelining:

bool may_contain(const value_type& x)
{ 
  auto h = hash(x);
  for(int i = 0; i < k; ++i) {
    auto p = position(h);
    if(check(p) == false) return false;
    if(i < k - 1) h = next_hash(h);
  }
  return true;
}

This algorithm is branchful and can take anywhere from 1 to k iterations, the latter being the case for elements present in the filter and false positives. For instance, this diagram shows the number of steps taken to look up n = 64 elements on a filter with k = 10 and FPR = 1%, where the successful lookup rate (proportion of looked up elements actually in the filter) is p = 0.1:

As can be seen, for non-successful lookups may_contain typically stops at the first few positions: the average number of positions checked (grayed cells) is \[n\left(pk +(1-p)\frac{1-p_b^{k}}{1-p_b}\right),\] where \(p_b=\sqrt[k]{\text{FPR}}\) is the probability that an arbitrary bit in the filter's array is set to 1. In the example used, this results in only 34% of the total nk = 640 positions being checked.
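
The formula follows from a geometric series: an unsuccessful lookup reaches step j only if the previous j - 1 bits were all set, which happens with probability \(p_b^{j-1}\), and \(\sum_{j=1}^{k}p_b^{j-1}=\frac{1-p_b^{k}}{1-p_b}\). A quick numeric check of the 34% figure can be sketched as follows (checked_fraction is a hypothetical helper written for this purpose):

```cpp
#include <cassert>
#include <cmath>

// Expected fraction of the n*k positions checked by the non-bulk lookup,
// following the formula in the text (n cancels out of the ratio).
inline double checked_fraction(int k, double fpr, double p)
{
  double pb = std::pow(fpr, 1.0 / k);            // p_b = FPR^(1/k)
  double unsuccessful =                          // expected steps taken by
    (1.0 - std::pow(pb, k)) / (1.0 - pb);        // an unsuccessful lookup
  return (p * k + (1.0 - p) * unsuccessful) / k; // successful ones take k
}
```

Calling checked_fraction(10, 0.01, 0.1) gives approximately 0.34, in agreement with the example.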

Now, a naïve bulk-mode version could look as follows:

template<typename F>
void may_contain(
  const std::array<value_type, N>& x,
  F f) // f is fed lookup results
{ 
  std::uint64_t hashes[N];
  std::size_t   positions[N];
  bool          results[N];
  
  // initial round of hash calculation and prefetching
  
  for(std::size_t i = 0; i < N; ++i) {
    hashes[i] = hash(x[i]);
    positions[i] = position(hashes[i]);
    results[i] = true;
    prefetch(positions[i]);
  }
  
  // main loop
  
  for(int j = 0; j < k; ++j) {
    for(std::size_t i = 0; i < N; ++i) {
      if(results[i]) { // conditional branch X
        results[i] &= check(positions[i]);
        if(results[i] && j < k - 1) {
          hashes[i] = next_hash(hashes[i]);
          positions[i] = position(hashes[i]);
          prefetch(positions[i]);
        }
      }
    }
  }
  
  // feed results
  
  for(std::size_t i = 0; i < N; ++i) {
    f(results[i]);
  }
}

This simply stores partial results in an array and iterates row-first instead of column-first:

The problem with this approach is that, even though it calls check exactly the same number of times as the non-bulk algorithm, the conditional branch labeled X is executed \(nk\) times, and this has a huge impact on the CPU's branch predictor. Conditional branches could in principle be eliminated altogether:

for(int j = 0; j < k; ++j) {
  for(std::size_t i = 0; i < N; ++i) {
    results[i] &= check(positions[i]);
    if(j < k - 1) { // this check is optimized away at compile time
      hashes[i] = next_hash(hashes[i]);
      positions[i] = position(hashes[i]);
      prefetch(positions[i]);
    }
  }
}

but this would result in \(nk\) calls to check for ~3 times more computational work than the non-bulk version.

The challenge then is to reduce the number of iterations on each row to only those positions that still need to be evaluated. This is the solution adopted by Boost.Bloom:

template<typename F>
void may_contain(
  const std::array<value_type, N>& x,
  F f) // f is fed lookup results
{ 
  std::uint64_t hashes[N];
  std::size_t   positions[N];
  std::uint64_t results = 0; // mask of N bits
  
  // initial round of hash calculation and prefetching
  
  for(std::size_t i = 0; i < N; ++i) {
    hashes[i] = hash(x[i]);
    positions[i] = position(hashes[i]);
    results |= 1ull << i;
    prefetch(positions[i]);
  }
  
  // main loop
  
  for(int j = 0; j < k; ++j) {
    auto mask = results;
    if(!mask) break;
    do{
      auto i = std::countr_zero(mask);
      auto b = check(positions[i]);
      results &= ~(std::uint64_t(!b) << i);
      if(j < k - 1) { // this check is optimized away at compile time
        hashes[i] = next_hash(hashes[i]);
        positions[i] = position(hashes[i]);
        prefetch(positions[i]);
      }
      mask &= mask - 1; // reset least significant 1
    } while(mask);
  }
  
  // feed results
  
  for(std::size_t i = 0; i < N; ++i) {
    f(results & 1);
    results >>= 1;
  }
}

Instead of an array of partial results, we keep these as a bitmask, so that we can skip groups of terminated columns in constant time using std::countr_zero. For instance, in the 7th row the main loop does 11 iterations instead of n = 64.
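
The bit-skipping idiom at the heart of the main loop can be isolated into a small, self-contained sketch. The helper names lowest_set and for_each_set_bit are hypothetical; std::countr_zero requires C++20, so a plain loop is used as a fallback for older language modes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>
#if __cplusplus >= 202002L
#include <bit>
#endif

// Index of the least significant 1 bit of a nonzero mask.
inline int lowest_set(std::uint64_t m)
{
#if __cplusplus >= 202002L
  return std::countr_zero(m);
#else
  int i = 0; // pre-C++20 fallback: linear scan
  while(!(m & 1)) { m >>= 1; ++i; }
  return i;
#endif
}

// Calls f(i) for each i whose bit is set in mask, lowest index first.
// The number of iterations equals the number of set bits, not 64.
template<typename F>
void for_each_set_bit(std::uint64_t mask, F f)
{
  while(mask) {
    f(lowest_set(mask));
    mask &= mask - 1; // reset least significant 1
  }
}
```

For a mask with bits 0, 3, 5 and 63 set, f is invoked exactly four times, with those indices in increasing order.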

In summary, the bulk version of may_contain only does n more conditional branches than the non-bulk version, plus \(n(1-p)\) superfluous memory fetches —the latter could be omitted at the expense of \(n(1-p)\) additional conditional branches, but benchmarks showed that the version with extra memory fetches is actually faster. These are measured speedups of bulk vs. non-bulk lookup for a boost::bloom::filter<int, K> containing 10M elements under GCC, 64-bit mode:

array size   K    p = 1   p = 0   p = 0.1
8M           6    0.78    2.11    1.43
12M          9    1.54    2.27    1.38
16M          11   2.08    2.45    1.46
20M          14   2.24    2.57    1.43

(More results here.)

Conclusions

Boost.Bloom will introduce bulk insertion and lookup capabilities in Boost 1.90, resulting in speedups of up to 3x, though results vary greatly depending on the filter configuration and its size, and bulk mode may even perform worse than the regular case in some situations. We have shown how bulk lookup is implemented for the case k > 1, where the regular, non-bulk version is highly branched and so not readily amenable to pipelining. The key technique, based on iteration reduction with std::countr_zero, can be applied outside the context of Boost.Bloom to implement efficient pipelining of early-exit operations.

Maps on chains

(From a conversation with Vassil Vassilev.) Suppose we want to have a C++ map where the keys are disjoint, integer intervals of the form [a, b]:

struct interval
{
  int min, max;
};

std::map<interval, std::string> m;

m[{0, 9}] = "ABC";
m[{10, 19}] = "DEF";

This looks easy enough: we just have to write the proper comparison operator for intervals, right?

bool operator<(const interval& x, const interval& y)
{
  return x.max < y.min;
}

But what happens if we try to insert an interval which is not disjoint with those already in the map?

m[{5, 14}] = "GHI"; // intersects both {0, 9} and {10, 19}

The short answer is that this is undefined behavior, but let's try to understand why. C++ associative containers depend on the comparison function (typically, std::less<Key>) inducing a so-called strict weak ordering on elements of Key. In layman's terms, a strict weak order < behaves as the "less than" relationship does for numbers, except that there may be incomparable elements x, y such that x ≮ y and y ≮ x; for numbers, this only happens if x = y, but in the case of a general SWO we allow for distinct, incomparable elements as long as they form equivalence classes. A convenient way to rephrase this condition is to require that incomparable elements are totally equivalent in how they compare to the rest of the elements, that is, they're truly indistinguishable from the point of view of the SWO. Getting back to our interval scenario, we have three possible cases when comparing [a, b] and [c, d]:

  • If b < c, the intervals don't overlap and [a, b] < [c, d].
  • If d < a, the intervals don't overlap and [c, d] < [a, b].
  • Otherwise, the intervals are incomparable. This can happen when [a, b] and [c, d] overlap partially or when they are exactly the same interval.

What we have described is a well known relationship called interval order. The problem is that the interval order is not a strict weak order. Let's depict a Hasse diagram for the interval order on integer intervals [a,b] between 0 and 4:

A Hasse diagram works like this: given two elements x and y, x < y iff there is a path going upwards that connects x to y. For instance, the fact that [1, 1] < [3, 4] is confirmed by the path [1, 1] → [2, 2] → [3, 4]. But the diagram also serves to show why this relationship is not a strict weak order: for it to be so, incomparable elements (those not connected) should be indistinguishable in that they are connected upwards and downwards with the same elements, and this is clearly not the case (in fact, it is not the case for any pair of incomparable elements). In mathematical terms, our relationship is of a more general type called a strict partial order.
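
The failure can also be checked mechanically with the intervals from the earlier example: [5, 14] is incomparable with both [0, 9] and [10, 19], yet these two are not incomparable with each other, so incomparability is not an equivalence relation. In this minimal sketch, interval_less and incomparable are hypothetical helper names:

```cpp
#include <cassert>

struct interval
{
  int min, max;
};

// The interval order: x precedes y iff x lies entirely to the left of y.
inline bool interval_less(const interval& x, const interval& y)
{
  return x.max < y.min;
}

// Two intervals are incomparable if neither precedes the other.
inline bool incomparable(const interval& x, const interval& y)
{
  return !interval_less(x, y) && !interval_less(y, x);
}
```

A strict weak order would require incomparable(a, c) whenever incomparable(a, b) and incomparable(b, c) hold; with a = [0, 9], b = [5, 14] and c = [10, 19] that requirement is violated.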

Going back to C++, associative containers assume that the elements inserted form a linear arrangement with respect to <: when we try to insert a new element y that is incomparable with some previously inserted element x, the properties of strict weak orders allow us to determine that x and y are equivalent, so nothing breaks (the insertion fails as a duplicate for a std::map, or y is added next to x for a std::multimap).

There's a way to accommodate our interval scenario with std::map, though. As long as the elements we are inserting belong to the same connecting path or chain, std::map can't possibly "know" if our relationship is a strict weak order or not: it certainly looks like one for the limited subset of elements it has seen so far. Implementation-wise, we just have to make sure we're not comparing partially overlapping intervals:

struct interval_overlap: std::runtime_error
{
  interval_overlap(): std::runtime_error("interval overlap"){}
};

bool operator<(const interval& x, const interval& y)
{
  if(x.min == y.min) {
    if(x.max != y.max) throw interval_overlap();
    return false;
  }
  else if(x.min < y.min) {
    if(x.max >= y.min) throw interval_overlap();
    return true;
  }
  else /* x.min > y.min */
  {
    if(x.min <= y.max) throw interval_overlap();
    return false;
  }
}

std::map<interval, std::string> m;

m[{0, 9}] = "ABC";
m[{10, 19}] = "DEF";
m[{5, 14}] = "GHI"; // throws interval_overlap

So, when we try to insert an element that would violate the strict weak ordering constraints (that is, it lies outside the chain connecting the intervals inserted so far), an exception is thrown and no undefined behavior is hit. A strict reading of the standard would not allow this workaround, as it is required that the comparison object for the map induce a strict weak ordering for all possible values of Key, not only those in the container (or that is my interpretation, at least): for all practical purposes, though, this works and will foreseeably continue to work for all future revisions of the standard.

Bonus point. Thanks to heterogeneous lookup, we can extend our use case to support lookup for integers inside the intervals:

struct less_interval
{
  using is_transparent = void;

  bool operator()(const interval& x, const interval& y) const
  {
    // as operator< before
  }

  bool operator()(int x, const interval& y) const
  {
    return x < y.min;
  }
  
  bool operator()(const interval& x, int y) const
  {
    return x.max < y; 
  }    
};

std::map<interval, std::string, less_interval> m;
  
m[{0, 9}] = "ABC";
m[{10, 19}] = "DEF";

std::cout << m.find(5)->second << "\n"; // prints "ABC"
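
For experimentation, the pieces above can be put together into one compilable unit, with the body of the interval/interval comparison filled in from the throwing operator< shown earlier. The alias interval_map and the helper lookup are hypothetical additions for illustration; heterogeneous lookup in std::map requires C++14.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

struct interval
{
  int min, max;
};

struct interval_overlap: std::runtime_error
{
  interval_overlap(): std::runtime_error("interval overlap"){}
};

struct less_interval
{
  using is_transparent = void;

  // same logic as the throwing operator< above
  bool operator()(const interval& x, const interval& y) const
  {
    if(x.min == y.min) {
      if(x.max != y.max) throw interval_overlap();
      return false;
    }
    else if(x.min < y.min) {
      if(x.max >= y.min) throw interval_overlap();
      return true;
    }
    else /* x.min > y.min */
    {
      if(x.min <= y.max) throw interval_overlap();
      return false;
    }
  }

  bool operator()(int x, const interval& y) const { return x < y.min; }
  bool operator()(const interval& x, int y) const { return x.max < y; }
};

using interval_map = std::map<interval, std::string, less_interval>;

// Returns the string mapped to the interval containing n, or "" if none.
inline std::string lookup(const interval_map& m, int n)
{
  auto it = m.find(n); // heterogeneous lookup with an int
  return it == m.end() ? std::string() : it->second;
}
```

Unlike the snippet above, lookup checks the iterator against end() before dereferencing it, which matters for integers not covered by any interval.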

Exercise for the reader: Can you formally prove that this works? (Hint: define a strict weak order on ℕ ∪ I, where ℕ is the set of natural numbers and I is a collection of disjoint integer intervals.)

WG21, Boost, and the ways of standardization
Goals of standardization

Standardization, in a form resembling our contemporary practices, began in the Industrial Revolution as a means to harmonize incipient mass production and its associated supply chains through the concept of interchangeability of parts. Some early technical standards are the Gribeauval system (1765, artillery pieces) and the British Standard Whitworth (1841, screw threads). Taylorism expanded standardization efforts from machinery to assembly processes themselves with the goal of increasing productivity (and, it could be said, achieving interchangeability of workers). Standards for metric systems, such as that of Revolutionary France (1791), were deemed "scientific" (as befitted the enlightenment spirit of the era) in that they were defined by exact, reproducible methods, but their main motivation was to facilitate local and international trade rather than support the advancement of science. We see a common theme here: standardization normalizes or leverages technology to favor industry and trade, that is, technology precedes standards.

This approach is embraced by 20th century standards organizations (DIN 1917, ANSI 1918, ISO 1947) through the advent of electronics, telecommunications and IT, and up to our days. Technological advancement, or, more generally, innovation (a concept coined around 1940 and ubiquitous today) is not seen as the focus of standardization, even though standards can promote innovation by consolidating advancements and best practices upon which further cycles of innovation can be built —and potentially be standardized in their turn. This interplay between standardization and innovation has been discussed extensively within standards organizations and outside. The old term "interchangeability of parts" has been replaced today by the more abstract concepts of compatibility, interoperability and (within the realm of IT) portability.

Standardizing programming languages

Most programming languages are not officially standardized, but some are. As of today, these are the ISO-standardized languages actively maintained by dedicated working groups within the ISO/IEC JTC1/SC22 subcommittee for programming languages:

  • COBOL (WG4)
  • Fortran (WG5)
  • Ada (WG9)
  • C (WG14)
  • Prolog (WG17)
  • C++ (WG21)

What's the purpose of standardizing a programming language? SC22 has a sort of foundational paper which centers on the benefits of portability, understood as both portability across systems/environments and portability of people (a rather blunt allusion to old-school Taylorism). The paper does not mention the subject of implementation certification, which can play a significant role for languages such as Ada that are used in heavily regulated sectors. More importantly for our discussion, neither does it mention what position SC22 holds with respect to innovation: regardless, we will see that innovation does indeed happen within SC22 workgroups, in what represents a radical departure from classical standardization practices.

WG21

C++ was mostly a one-man effort from its inception in the early 80s until the publication of The Annotated C++ Reference Manual (ARM, 1990), which served as the basis for the creation of an ANSI/ISO standardization committee that would eventually release its first C++ standard in 1998. Bjarne Stroustrup cited avoidance of compiler vendor lock-in (a variant of portability) as a major reason for having the language standardized, a concern that made much sense in a scene then dominated by company-owned languages such as Java.

Innovation was seen as WG21's business from its very beginning: some features of the core language, such as templates and exceptions, were labeled as experimental in the ARM, and the first version of the standard library, notably including Alexander Stepanov's STL, was introduced by the committee in the 1990-1998 period with little or no field experience. After a minor update to C++98 in 2003, the innovation pace picked up again in subsequent revisions of the standard (2011, 2014, 2017, 2020, 2023), and the current innovation backlog does not seem to falter; if anything, we could say that the main blocker for innovation within the standard is lack of human resources in WG21 rather than lack of proposals.

Innovation vs. adoption

Not all new features in the C++ standard have originated within WG21. We must distinguish here between the core language and the standard library:

  • External innovation in the core language is generally hard as it requires writing or modifying a C++ compiler, a task outside the capabilities of many even though this has been made much more accessible with the emergence of open-source, extensible compiler frameworks such as LLVM. As a result, most innovation activity here happens within WG21, with some notable exceptions like Circle and Cpp2. Others have chosen to depart from the C++ language completely (Carbon, Hylo), so their potential impact on C++ standardization is remote at best.
  • As for the standard library, the situation is more varied. These are some examples:
In general, the trend for the evolution of the standard library seems to be towards proposing new components straight into the standard with very little field experience.

Pros and cons of standardization

The history of C++ standardization has met with some resounding successes (STL, templates, concurrency, most vocabulary types) as well as failures (exported templates, GC support, exception specifications, std::auto_ptr) and in-between scenarios (std::regex, ranges).

Focusing on the standard library, we can identify benefits of standardization vs. having a separate, non-standard component:

  • The level of exposure to C++ users increases dramatically. Some companies have bans on the usage of external libraries, and even if no bans are in place, consuming the standard library is much more convenient than having to manage external dependencies —though this is changing.
  • Standardization ensures a high level of (system) portability, potentially beyond the reach of external library authors without access to exotic environments.
  • For components with high interoperability potential (think vocabulary types), having them in the standard library guarantees that they become the tool of choice for API-level module integration.

But there are drawbacks as well that must be taken into consideration:

  • The evolution of a library halts or reduces significantly once it is standardized. One major factor for this is WG21's self-imposed restriction to preserve backwards compatibility, and in particular ABI compatibility. For example:
Another factor contributing to library freeze may be the lack of motivation from the authors once they succeed in getting their proposals accepted, as the process involved is very demanding and can last for years.
  • Some libraries cover specialized domains that standard library implementors cannot be expected to master. Some cases in point:
    • Current implementations of std::regex are notoriously slower than Boost.Regex, a situation aggravated by the need to keep ABI compatibility.
    • Correct and efficient implementations of mathematical special functions require ample expertise in the area of numerical computation. As a result, Microsoft's standard library implements these as mere wrappers over Boost.Math, and libc++ seems to be following suit. This is technically valid, but raises the question of what the standardization of these functions was useful for to begin with.
  • Additions to the upcoming standard (as of this writing, C++26) don't benefit users immediately because the community typically lags behind by two or three revisions of the language.

So, standardizing a library component is not always the best course of action for the benefit of current and future users of that component. Back in 2001, Stroustrup remarked that "[p]eople sometime forget that a library doesn't have to be part of the standard to be useful", but, to this day, WG21 does not seem to have formal guidelines as to what constitutes a worthy addition to the standard, or how to engage with the community in a world of ever-expanding and more accessible external libraries. We would like to contribute some modest ideas in that direction.

An assessment model for library standardization

Going back to the basic principles of standards, the main benefits to be derived from standardizing a technology (in our case, a C++ library) are connected to higher compatibility and interoperability as a means to increase overall productivity (assumedly correlated to the level of usage of the library within the community). Leaving aside for the moment the size of the potential target audience, we identify two characteristics of a given library that make it suitable for standardization:

  • Its portability requirements, defined as the level of coupling that an optimal implementation has with the underlying OS, CPU architecture, etc. The higher these requirements the more sense it makes to include the library as a mandatory part of the standard.
  • Its interoperability potential, that is, how much the library is expected to be used as part of public APIs interconnecting different program modules vs. as a private implementation detail. A library with high interoperability potential is maximally useful when included in the common software "stack" shared by the community.

So, the baseline standardization value of a library, denoted V0, can be modeled as:

V0 = aP + bI,

where P denotes the library's portability requirements and I its interoperability potential. The figure shows the baseline standardization value of some library domains within the P-I plane. The color red indicates that this value is low, green that it is high.

A low baseline standardization value for a library does not mean that the library is not useful, but rather that there is little gain to be obtained from standardizing it as opposed to making it available externally. The locations of the exemplified domains in the P-I plane reflect the author's estimation and may differ from that of the reader.

Now, we have seen that the adoption of a library requires some prior field experience, defined as

E = T·U,

where T is the age of the library and U is its average number of users.

  • When E is very low, the library is not mature enough and standardizing it can result in a defective design that will be much harder to fix within the standard going forward; this effectively decreases the net value of standardization.
  • On the contrary, if E is very high, which is correlated to the library having already reached its maximum target audience, the benefits of standardization are vanishingly small: most people are already using the library and including it into the official standard has little value added —the library has become a de facto standard.

So, we may expect to attain an optimum standardization opportunity S between the extremes E = 0 and Emax.

Finally, the net standardization value of a library is defined as

V = V0·S·Umax,

where Umax is the library's maximum target audience. Being a conceptual model, the purpose of this framework is not so much to establish a precise evaluation formula as to help stakeholders raise the right questions when considering a library for standardization:

  • How high are the library's portability requirements?
  • How high its interoperability potential?
  • Is it too immature yet? Does it have actual field experience?
  • Or, on the contrary, has it already reached its maximum target audience?
  • How big is this audience?
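
To make the moving parts of the model concrete, here is a minimal C++ sketch. Everything in it is an assumption made for illustration only: the names, the default weights a = b = 0.5, the scaling of P and I to [0,1], and in particular the parabolic shape chosen for S (the model only states that S peaks somewhere between E = 0 and Emax). The net value is taken to be the product of V0, S and Umax.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical record mirroring the symbols of the model.
struct library_profile
{
  double portability;      // P, assumed scaled to [0,1]
  double interoperability; // I, assumed scaled to [0,1]
  double age;              // T, e.g. in years
  double average_users;    // U
  double max_audience;     // Umax
};

// V0 = aP + bI; the default weights are an arbitrary choice.
inline double baseline_value(
  const library_profile& l, double a = 0.5, double b = 0.5)
{
  return a * l.portability + b * l.interoperability;
}

// E = T·U
inline double field_experience(const library_profile& l)
{
  return l.age * l.average_users;
}

// S as a function of E: zero at E = 0 and at E = e_max, maximal halfway.
// The parabola is assumed; any curve with an interior optimum would do.
inline double opportunity(double e, double e_max)
{
  assert(e_max > 0.0);
  const double r = std::min(std::max(e / e_max, 0.0), 1.0);
  return 4.0 * r * (1.0 - r);
}

// Net value taken as the product V0 · S · Umax.
inline double net_value(const library_profile& l, double e_max)
{
  return baseline_value(l) * opportunity(field_experience(l), e_max) *
         l.max_audience;
}
```

With these choices, a library whose field experience sits at either extreme gets a net value of zero regardless of its baseline value, which is the qualitative behavior the model describes.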

Boost, the standard and beyond

Boost was launched in 1998 upon the idea that "[a] world-wide web site containing a repository of free C++ class libraries would be of great benefit to the C++ community". Serving as a venue for future standardization was mentioned only as a secondary goal, yet very soon many saw the project as a launching pad towards the standard library, a perception that has changed since. We analyze the different stages of this 25+-year-old project in connection with its contributions to the standard and to the community.

Golden era: 1998-2011

In its first 14 years of existence, the project grew from 0 to 113 libraries, for a total uncompressed size of 324 MB. Out of these 113 libraries, 12 would later be included in C++11, typically with modifications (Array, Bind, Chrono, EnableIf, Function, Random, Ref, Regex, SmartPtr, Thread, Tuple, TypeTraits); it may be noted that, even at this initial stage, most Boost libraries were not standardized or meant for standardization. From the point of view of the C++ standard library, however, Boost was the first contributor by far. We may venture some reasons for this success:

  • There was much low-hanging fruit in the form of small vocabulary types and obvious utilities.
  • Maybe due to a combination of scarce competition and sheer luck, Boost positioned itself very quickly as the go-to place for contributing and consuming high-quality C++ libraries. This ensured a great deal of field experience with the project.
  • Many of the authors of the most relevant libraries were also prominent figures within the C++ community and WG21.

Middle-age issues: 2012-2020

By 2020, Boost had reached 164 libraries totaling 717 MB in uncompressed size (so, the size of the average library, including source, tests and documentation, grew by a factor of 1.5 with respect to 2011). Five Boost libraries were standardized between C++14 and C++20 (Any, Filesystem, Math/Special Functions, Optional, Variant): all of these, however, were already in existence before 2012, so the rate of successful new contributions from Boost to the standard decreased effectively to zero in this period. There were some additional unsuccessful proposals (Mp11).

The transition of Boost from the initial ramp-up to a more mature stage met with several scale problems that negatively impacted the public perception of the project (and, to some extent that we haven't been able to determine, its level of usage). Of particular interest is a public discussion that took place in 2022 on Reddit and touched on several issues more or less recognized within the community of Boost authors:

  • The default/advertised way to consume Boost as a monolithic download introduces a bulky, hard to manage dependency on projects.
  • B2, Boost's native build technology, is unfamiliar to users more accustomed to widespread tools such as CMake.
  • Individual Boost libraries are perceived as bloated in terms of size, internal dependencies and compile times. Alternative competing libraries are self-contained, easier to install and smaller as they rely on newer versions of the C++ standard.
  • Many useful components are already provided by the standard library.
  • There are great differences between libraries in terms of their quality; some libraries are all but abandoned.
  • Documentation is not good enough, in particular if compared to cppreference.com, which is regarded as the gold standard in this area.

A deeper analysis reveals some root causes for this state of affairs:

  • Overall, the Boost project is very conservative and strives not to break users' code on each version upgrade (even though, unlike the standard, backwards API/ABI compatibility is not guaranteed). In particular, many Boost authors are reluctant to increase the minimum C++ standard version required for their libraries. Also, there is no mechanism in place to retire libraries from the project.
  • Supporting older versions of the C++ standard locks in some libraries with suboptimal internal dependencies, the most infamous being Boost.MPL, which many identify (with or without reason) as responsible for long compile times and cryptic error messages.
  • Boost's distribution and build mechanisms were invented in an era where package managers and build systems were not prevalent. This works well for smaller footprints but presents scaling problems that were not foreseen at the beginning of the project.
  • Ultimately, Boost is a federation of libraries with different authors and sensibilities. This fact accounts for the various levels of documentation quality, user support, maintenance, etc.

Some of these characteristics are not negative per se, and have in fact resulted in an extremely durable and available service to the C++ community that some may mistakenly take for granted. Supporting "legacy C++" users is, by definition, neglected by WG21, and maintaining libraries that were already standardized is of great value to those who don't live on the edge (and, in the case of the std::regex fiasco, those who do). Confronted with the choice of serving the community today vs. tomorrow (via standardization proposals), the Boost project took, perhaps unplannedly, the first option. This is not to say that all is good with the Boost project, as many of the problems found in 2012-2020 are strictly operational.

Evolution: 2021-2024 and the future

Boost 1.85 (April 2024) contains 176 libraries (7% increase with respect to 2020) and has a size of 731 MB (2% increase). Only one Boost component has partially contributed to the C++23 standard library (boost::container::flat_map), though there have been some unsuccessful proposals (the most notable being Boost.Asio).

In response to the operational problems we have described before, some authors have embarked on a number of improvement and modernization tasks:

  • Beginning in Boost 1.82 (Apr 2023), some core libraries announced the upcoming abandonment of C++03 support as part of a plan to reduce code base sizes, maintenance costs, and internal dependencies on "polyfill" components. This initiative has a cascading effect on dependent libraries that is still ongoing.
  • Alongside C++03 support drop, many libraries have been updated to reduce the number of internal dependencies (that even were, in some cases, cyclic). The figure shows the cumulative histograms of the number of dependencies for Boost libraries in versions 1.66 (2017), 1.75 (2020) and 1.85 (2024):
  • Official CMake support for the entire Boost project was announced in Oct 2023. This support also allows for downloading and building of individual libraries (and their dependencies).
  • On the same front of modular consumption, there is work in progress to modularize B2-based library builds, which will enable package managers such as Conan to offer Boost libraries individually. vcpkg already offers this option.
  • Starting in July 2023, boost.org includes a search widget indexing the documentation of all libraries. The ongoing MrDocs project seeks to provide a Doxygen-like tool for automatic C++ documentation generation that could eventually support Boost authors (library docs are currently written more or less manually in a plethora of languages such as raw HTML, Quickbook, Asciidoc, etc.). There is a new Boost website in the works scheduled for launch during mid-2024.

Where is Boost headed? It must be stressed again that the project is a federation of authors without a central governing authority in strategic matters, so the following should be taken as an interpretation of detected current trends:

  • Most of the recently added libraries cover relatively specific application-level domains (networking/database protocols, parsing) or else provide utilities likely to be superseded by future C++ standards, as is the case with reflection (Describe, PFR). One library is a direct backport of a C++17 standard library component (Charconv). Boost.JSON provides yet another solution in an area already rich with alternatives external to the standard library. Boost.LEAF proposes an approach to error handling radically different to that of the latest standard (std::expected). Boost.Scope implements and augments a WG21 proposal currently on hold (<experimental/scope>).
  • In some cases, standard compatibility has been abandoned to provide faster performance or richer functionality (Container, Unordered, Variant2).
  • No new library supports C++03, which reduces drastically their number of internal dependencies (except in the case of networking libraries depending on Boost.Asio).
  • On the other hand, most new libraries are still conservative in that they only require C++11/14, with some exceptions (Parser and Redis require C++17, Cobalt requires C++20).
  • There are some experimental initiatives like the proposal to serve Boost libraries as C++ modules, which has been met with much interest and support from the Visual Studio team. An important appeal of this idea is that it will allow compiler vendors and the committee to obtain field experience from a large, non-trivial codebase.

The current focus of Boost seems then to have shifted from standards-bound innovation to higher-level and domain-specific libraries directly available to users of C++11/14 and later. More stress is increasingly being put on maintenance, reduced internal dependencies and modular availability, which further cements the thesis that Boost authors are more concerned about serving the C++ community from Boost itself than eventually migrating to the standard. There is still a flow of ideas from Boost to WG21, but they do not represent the bulk of the project activity.

Conclusions

Traditionally, the role of standardization has been to consolidate previous innovations that have reached maturity so as to maximize their potential for industry vendors and users. In the very specific case of programming languages, and WG21/LEWG in particular, the standards committee has taken on the role of innovator and is pushing the industry rather than adopting external advancements or coexisting with them. This presents some problems related to lack of field experience, limitations to internal evolution imposed by backwards compatibility and an associated workload that may exceed the capacity of the committee. Thanks to open developer platforms (GitHub, GitLab), widespread build systems (CMake) and package managers (Conan, vcpkg), the world of C++ libraries is richer and more available than ever. WG21 could reconsider its role as part of an ecosystem that thrives outside and alongside its own activity. We have proposed a conceptual evaluation model for standardization of C++ libraries that may help in the conversations around these issues. Boost has shifted its focus from being a primary venue for standardization to serving the C++ community (including users of previous versions of the language) through increasingly modular, high level and domain-specific libraries. Hopefully, the availability and reach of the Boost project will help gain much needed field experience that could eventually lead to further collaborations with and contributions to WG21 in a non-preordained way.

A case in API ergonomics for ordered containers

Suppose we have a std::set<int> and would like to retrieve the elements between values a and b, both inclusive. This task is served by operations std::set::lower_bound and std::set::upper_bound:

std::set<int> x=...;

// elements in [a,b]
auto first = x.lower_bound(a);
auto last = x.upper_bound(b);

while(first != last) std::cout<< *first++ <<" ";

Why do we use lower_bound for the first iterator and upper_bound for the second? The well-known STL convention is that a range of elements is determined by two iterators first and last, where first points to the first element of the range and last points to the position right after the last element. This is done so that empty ranges can be handled without special provisions (first == last).

Now, with this convention in mind and considering that

  • lower_bound(a) returns an iterator to the first element not less than a,
  • upper_bound(b) returns an iterator to the first element greater than b,

we can convince ourselves that the code above is indeed correct. The situations where one or both of the interval endpoints are not inclusive can also be handled:

// elements in [a,b)
auto first = x.lower_bound(a);
auto last  = x.lower_bound(b);

// elements in (a,b]
auto first = x.upper_bound(a);
auto last  = x.upper_bound(b);

// elements in (a,b)
auto first = x.upper_bound(a);
auto last  = x.lower_bound(b);

but getting them right requires some thinking.
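
The four combinations can be captured in one small helper. The following is only a sketch: `elements_between` and its boolean flags are hypothetical names invented for this illustration, and the helper assumes a < b, since the iterator pairs above are not guaranteed to form a valid range otherwise.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical helper: copies the elements of s lying between a and b.
// include_a / include_b say whether each endpoint belongs to the interval.
inline std::vector<int> elements_between(
  const std::set<int>& s, int a, int b, bool include_a, bool include_b)
{
  assert(a < b); // assumed precondition: the interval is not reversed

  // a included: first element not less than a
  // a excluded: first element greater than a
  auto first = include_a ? s.lower_bound(a) : s.upper_bound(a);

  // b included: first element greater than b (one past the last match)
  // b excluded: first element not less than b
  auto last = include_b ? s.upper_bound(b) : s.lower_bound(b);

  return std::vector<int>(first, last);
}
```

For a set holding 1 to 5, the call elements_between(s, 2, 4, true, false) selects the interval [2,4) and yields 2 and 3.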

Boost.MultiIndex introduces the operation range to handle this type of queries:

template<typename LowerBounder,typename UpperBounder>
std::pair<iterator,iterator>
range(LowerBounder lower, UpperBounder upper);

lower and upper are user-provided predicates that determine whether an element is not to the left and not to the right of the considered interval, respectively. The formal specification of LowerBounder and UpperBounder is quite impenetrable, but using this facility, in particular in combination with Boost.Lambda2, is actually straightforward:

// equivalent to std::set<int>
boost::multi_index_container<int> x=...;

using namespace boost::lambda2;

// [a,b]
auto [first, last] = x.range(_1 >= a, _1 <= b);

// [a,b)
auto [first, last] = x.range(_1 >= a, _1 < b);

// (a,b]
auto [first, last] = x.range(_1 > a, _1 <= b);

// (a,b)
auto [first, last] = x.range(_1 > a, _1 < b);

The resulting code is much easier to read and to get right in the first place, and is also more efficient than two separate calls to [lower|upper]_bound (because the two internal rb-tree top-to-bottom traversals can be partially joined in the implementation of range). Just as importantly, range handles situations such as this:

int a = 5;
int b = 2; // note a > b

// elements in [a,b]
auto first = x.lower_bound(a);
auto last = x.upper_bound(b);

// undefined behavior
while(first != last) std::cout<< *first++ <<" ";

When a > b, first may be strictly to the right of last, and consequently the while loop will crash or never terminate. range, on the other hand, handles the situation gracefully and returns an empty range.
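
For plain standard containers, the same safety can be approximated with a thin wrapper. This is a minimal sketch, not how Boost.MultiIndex implements range: `closed_range` is a hypothetical helper that only covers the closed interval [a,b] and simply checks for reversed endpoints before calling lower_bound and upper_bound.

```cpp
#include <cassert>
#include <iterator>
#include <set>
#include <utility>

// Hypothetical helper (neither standard nor part of Boost.MultiIndex):
// returns the elements of s in the closed interval [a,b] as an iterator pair.
template<typename T, typename Compare, typename Allocator>
std::pair<
  typename std::set<T, Compare, Allocator>::const_iterator,
  typename std::set<T, Compare, Allocator>::const_iterator>
closed_range(
  const std::set<T, Compare, Allocator>& s, const T& a, const T& b)
{
  // Reversed interval: report an empty range instead of a pair of
  // iterators where first lies past last.
  if(s.key_comp()(b, a)) return {s.end(), s.end()};
  return {s.lower_bound(a), s.upper_bound(b)};
}

// Usage sketch
inline void closed_range_example()
{
  const std::set<int> x{1, 2, 3, 5, 8};

  auto r = closed_range(x, 2, 5); // covers 2, 3 and 5
  assert(std::distance(r.first, r.second) == 3);

  auto e = closed_range(x, 5, 2); // reversed endpoints: empty range
  assert(e.first == e.second);
}
```

Unlike range, this wrapper still performs two independent tree traversals, so it recovers the safety but not the efficiency gain.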

We have seen an example of how API design can help reduce programming errors and increase efficiency by providing higher-level facilities that model and encapsulate scenarios otherwise served by a combination of lower-level operations. It may be interesting to have range-like operations introduced for standard associative containers.

Bulk visitation in boost::concurrent_flat_map

Introduction

boost::concurrent_flat_map and its boost::concurrent_flat_set counterpart are Boost.Unordered's associative containers for high-performance concurrent scenarios. These containers dispense with iterators in favor of a visitation-based interface:

boost::concurrent_flat_map<int, int> m;
...
// find the element with key k and increment its associated value
m.visit(k, [](auto& x) {
  ++x.second;
});

This design choice was made because visitation is not affected by some inherent problems afflicting iterators in multithreaded environments.

Starting in Boost 1.84, code like the following:

std::array<int, N> keys;
...
for(const auto& key: keys) {
  m.visit(key, [](auto& x) { ++x.second; });
}

can be written more succinctly via the so-called bulk visitation API:

m.visit(keys.begin(), keys.end(), [](auto& x) { ++x.second; });

As it happens, bulk visitation is not provided merely for syntactic convenience: this operation is internally optimized so that it performs significantly faster than the original for-loop. We discuss here the key ideas behind bulk visitation's internal design and analyze its performance.

Prior art

In their paper "DRAMHiT: A Hash Table Architected for the Speed of DRAM", Narayanan et al. explore some optimization techniques from the domain of distributed systems as translated to concurrent hash tables running on modern multi-core architectures with hierarchical caches. In particular, they note that cache misses can be avoided by batching requests to the hash table, prefetching the memory positions required by those requests and then completing the operations asynchronously when enough time has passed for the data to be effectively retrieved. Our bulk visitation implementation draws inspiration from this technique, although in our case visitation is fully synchronous and in-order, and it is the responsibility of the user to batch keys before calling the bulk overload of boost::concurrent_flat_map::visit.

Bulk visitation design

As discussed in a previous article, boost::concurrent_flat_map uses an open-addressing data structure comprising:

  • A bucket array split into 2^n groups of N = 15 slots.
  • A metadata array associating a 16-byte metadata word to each slot group, used for SIMD-based reduced-hash matching.
  • An access array with a spinlock (and some additional information) for locked access to each group.

The happy path for successful visitation looks like this:

  1. The hash value for the looked-up key and its mapped group position are calculated.
  2. The metadata for the group is retrieved and matched against the hash value.
  3. If the match is positive (which is the case for the happy path), the group is locked for access and the element indicated by the matching mask is retrieved and compared with the key. Again, in the happy path this comparison is positive (the element is found); in the unhappy path, more elements (within this group or beyond) need be checked.

(Note that happy unsuccessful visitation simply terminates at step 2, so we focus our analysis on successful visitation.) As the diagram shows, the CPU has to wait for memory retrieval between steps 1 and 2 and between steps 2 and 3 (in the latter case, retrievals of mutex and element are parallelized through manual prefetching). A key insight is that, under normal circumstances, these memory accesses will almost always be cache misses: successive visitation operations, unless for the very same key, won't have any cache locality. In bulk visitation, the stages of the algorithm are pipelined as follows (the diagram shows the case of three operations in the bulk batch):

The data required at step N+1 is prefetched at the end of step N. Now, if a sufficiently large number of operations are pipelined, we can effectively eliminate cache miss stalls: all memory addresses will be already cached by the time they are used.

The operation visit(first, last, f) internally splits [first, last) into chunks of bulk_visit_size elements that are then processed as described above. This chunk size has to be sufficiently large to give time for memory to be actually cached at the point of usage. On the upper side, the chunk size is limited by the number of outstanding memory requests that the CPU can handle at a time: in Intel architectures, this is limited by the size of the line fill buffer, typically 10-12. We have empirically confirmed that bulk visitation maxes at around bulk_visit_size = 16, and stabilizes beyond that.
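
The chunk-and-prefetch idea can be illustrated on a toy table. The sketch below is not Boost.Unordered's implementation: `toy_table` and `TOY_PREFETCH` are hypothetical names, the table is single-threaded with plain linear probing (no metadata array, no locking), the pipeline has two stages rather than three, and the visitation function receives only the mapped value. The chunk size of 16 merely echoes the figure quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Prefetch hint; a no-op on compilers without __builtin_prefetch.
#if defined(__GNUC__) || defined(__clang__)
#define TOY_PREFETCH(p) __builtin_prefetch(p)
#else
#define TOY_PREFETCH(p) ((void)(p))
#endif

// Toy single-threaded int -> int table with linear probing.
class toy_table
{
public:
  explicit toy_table(std::size_t log2_capacity):
    slots_(std::size_t(1) << log2_capacity) {}

  void insert(int key, int value)
  {
    assert(size_ + 1 < slots_.size()); // always keep one free slot
    std::size_t pos = position(key);
    while(slots_[pos].used && slots_[pos].key != key) pos = next(pos);
    if(!slots_[pos].used) ++size_;
    slots_[pos].key = key;
    slots_[pos].value = value;
    slots_[pos].used = true;
  }

  // Regular visitation: hash, then touch memory right away.
  template<typename F>
  std::size_t visit(int key, F f)
  {
    return visit_at(position(key), key, f);
  }

  // Bulk visitation over [first, last); needs a forward (multi-pass)
  // iterator because each chunk is traversed twice.
  template<typename Iterator, typename F>
  std::size_t visit(Iterator first, Iterator last, F f)
  {
    constexpr std::size_t chunk_size = 16;
    std::size_t positions[chunk_size];
    std::size_t visited = 0;

    while(first != last) {
      // Stage 1: hash every key of the chunk and prefetch its slot.
      std::size_t n = 0;
      for(Iterator it = first; n < chunk_size && it != last; ++it, ++n) {
        positions[n] = position(*it);
        TOY_PREFETCH(&slots_[positions[n]]);
      }
      // Stage 2: in-order lookups on memory that was already requested.
      for(std::size_t i = 0; i < n; ++i, ++first) {
        visited += visit_at(positions[i], *first, f);
      }
    }
    return visited;
  }

private:
  struct slot { int key = 0; int value = 0; bool used = false; };

  std::size_t next(std::size_t pos) const
  {
    return (pos + 1) & (slots_.size() - 1);
  }

  std::size_t position(int key) const
  {
    // Multiplicative mixing so that consecutive keys are scattered.
    const std::uint64_t h =
      static_cast<std::uint64_t>(key) * 0x9E3779B97F4A7C15ull;
    return static_cast<std::size_t>(h >> 32) & (slots_.size() - 1);
  }

  template<typename F>
  std::size_t visit_at(std::size_t pos, int key, F& f)
  {
    for(; slots_[pos].used; pos = next(pos)) {
      if(slots_[pos].key == key) { f(slots_[pos].value); return 1; }
    }
    return 0;
  }

  std::vector<slot> slots_;
  std::size_t size_ = 0;
};
```

Both overloads return the number of elements visited, so the bulk call gives the same result as the equivalent loop of single-key calls; only the order in which memory is requested changes.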

Performance analysis

For our study of bulk visitation performance, we have used a computer with a Skylake-based Intel Core i5-8265U CPU:


                Size/core   Latency [ns]
L1 data cache   32 KB       3.13
L2 cache        256 KB      6.88
L3 cache        6 MB        25.00
DDR4 RAM                    77.25

We measure the throughput in Mops/sec of single-threaded lookup (50/50 successful/unsuccessful) for both regular and bulk visitation on a boost::concurrent_flat_map<int, int> with sizes N = 3k, 25k, 600k, and 10M: for the first three values, the container fits entirely into L1, L2 and L3, respectively. The test program has been compiled with clang-cl for Visual Studio 2022 in release mode.

As expected, the relative performance of bulk vs. regular visitation grows as data is fetched from a slower cache (or RAM in the last case). The theoretical throughput achievable by bulk visitation has been estimated from regular visitation by subtracting memory retrieval times as calculated with the following model:

  • If the container fits in Ln (L4 = RAM), Ln−1 is entirely occupied by metadata and access objects (and some of this data spills over to Ln).
  • Mutex and element retrieval times (which only apply to successful visitation) are dominated by the latter.

Actual and theoretical figures match quite well, which suggests that the algorithmic overhead imposed by bulk visitation is negligible.

We have also run benchmarks under conditions more similar to real-life for boost::concurrent_flat_map, with and without bulk visitation, and other concurrent containers, using different compilers and architectures. As an example, these are the results for a workload of 50M insert/lookup mixed operations distributed across several concurrent threads for different data distributions with Clang 12 on an ARM64 computer:

[Figure: three panels, each for 5M updates and 45M lookups, with skew = 0.01, 0.5 and 0.99]

Again, bulk visitation increases performance noticeably. Please refer to the benchmark site for further information and results.

Conclusions and next steps

Bulk visitation is an addition to the interface of boost::concurrent_flat_map and boost::concurrent_flat_set that improves lookup performance by pipelining the internal visitation operations for chunked groups of keys. The tradeoff for this increased throughput is higher latency, as keys need to be batched by the user code before issuing the visit operation.

The insights we have gained with bulk visitation for concurrent containers can be leveraged for future Boost.Unordered features:

  • In principle, insertion can also be made to operate in bulk mode, although the resulting pipelined algorithm is likely more complex than in the visitation case, and thus performance increases are expected to be lower.
  • Bulk visitation (and insertion) is directly applicable to non-concurrent containers such as boost::unordered_flat_map: the main problem for this is one of interface design because we are not using visitation here as the default lookup API (classical iterator-based lookup is provided instead). Some possible options are:
    1. Use visitation as in the concurrent case.
    2. Use an iterator-based lookup API that outputs the resulting iterators to some user-provided buffer (probably modelled as an output "meta" iterator taking container iterators).

Bulk visitation will be officially shipping in Boost 1.84 (December 2023) but is already available by checking out the Boost.Unordered repo. If you are interested in this feature, please try it and report your local results and suggestions for improvement. Your feedback on our current and future work is much welcome.

User-defined class qualifiers in C++23

It is generally known that type qualifiers (such as const and volatile in C++) can be regarded as a form of subtyping: for instance, const T is a supertype of T because the interface (available operations) of T is strictly wider than that of const T. Foster et al. call a qualifier q positive if q T is a supertype of T, and negative if it is the other way around. Without real loss of generality, in what follows we only consider negative qualifiers, where q T is a subtype of (extends the interface of) T.

C++23 explicit object parameters (colloquially known as "deducing this") allow for a particularly concise and effective realization of user-defined qualifiers for class types beyond what the language provides natively. For instance, this is a syntactically complete implementation of qualifier mut, the dual/inverse of const (not to be confused with mutable):

template<typename T>
struct mut: T
{
  using T::T;
};

template<typename T>
T& as_const(T& x) { return x;}

template<typename T>
T& as_const(mut<T>& x) { return x;}

struct X
{
  void foo() {}
  void bar(this mut<X>&) {}
};

int main()
{
  mut<X> x;
  x.foo();
  x.bar();

  auto& y = as_const(x);
  y.foo();
  y.bar(); // error: cannot convert argument 1 from 'X' to 'mut<X> &'

  X& z = x;
  z.foo();
  z.bar(); // error: cannot convert argument 1 from 'X' to 'mut<X> &'
}

The class X has a regular (generally accessible) member function foo and then bar, which is only accessible to instances of the form mut<X>. Access checking and implicit and explicit conversion between subtype mut<X> and X work as expected.

With some help from Boost.Mp11, the idiom can be generalized to the case of several qualifiers:

#include <boost/mp11/algorithm.hpp>
#include <boost/mp11/list.hpp>
#include <type_traits>

template<typename T,typename... Qualifiers>
struct access: T
{
  using qualifier_list=boost::mp11::mp_list<Qualifiers...>;

  using T::T;
};

template<typename T, typename... Qualifiers>
concept qualified =
  (boost::mp11::mp_contains<
    typename std::remove_cvref_t<T>::qualifier_list,
    Qualifiers>::value && ...);

// some qualifiers
struct mut;
struct synchronized;

template<typename T>
concept is_mut = qualified<T, mut>;

template<typename T>
concept is_synchronized = qualified<T, synchronized>;

struct X
{
  void foo() {}

  template<is_mut Self>
  void bar(this Self&&) {}

  template<is_synchronized Self>
  void baz(this Self&&) {}

  template<typename Self>
  void qux(this Self&&)
  requires qualified<Self, mut, synchronized>
  {}
};

int main()
{
  access<X, mut> x;

  x.foo();
  x.bar();
  x.baz(); // error: associated constraints are not satisfied
  x.qux(); // error: associated constraints are not satisfied

  X y;
  y.foo();
  y.bar(); // error: associated constraints are not satisfied

  access<X, mut, synchronized> z;
  z.bar();
  z.baz();
  z.qux();
}

One difficulty remains, though:

int main()
{
  access<X, mut, synchronized> z;
  //...
  access<X, mut>& w=z; // error: cannot convert from
                       // 'access<X,mut,synchronized>'
                       // to 'access<X,mut> &'
}

access<T,Qualifiers...>& converts to T&, but not to access<T,Qualifiers2...>& where Qualifiers2 is a subset of Qualifiers (for the mathematically inclined, qualifiers q1, ... , qN over a type T induce a lattice of subtypes Q T, Q ⊆ {q1, ... , qN}, ordered by qualifier inclusion). Incurring undefined behavior, we could do the following:

template<typename T,typename... Qualifiers>
struct access: T
{
  using qualifier_list=boost::mp11::mp_list<Qualifiers...>;

  using T::T;

  template<typename... Qualifiers2>
  operator access<T, Qualifiers2...>&()
  requires qualified<access, Qualifiers2...>
  {
    return reinterpret_cast<access<T, Qualifiers2...>&>(*this);
  }
};

A more interesting challenge is the following: As laid out, this technique implements syntactic qualifier subtyping, but does not do anything towards enforcing the semantics associated with each qualifier: for instance, synchronized should lock a mutex automatically, and a qualifier associated with some particular invariant should assert it after each invocation of a qualifier-constrained member function. I don't know if this functionality can be more or less easily integrated into the presented framework: feedback on the matter is much welcome.
Extensions
Inside boost::concurrent_flat_map

Introduction

Starting in Boost 1.83, Boost.Unordered provides boost::concurrent_flat_map, an associative container suitable for high-load parallel scenarios. boost::concurrent_flat_map leverages much of the work done for boost::unordered_flat_map, but also introduces innovations, particularly in the areas of low-contention operation and API design, that we find worth discussing.

State of the art

The space of C++ concurrent hashmaps spans a diversity of competing techniques, from traditional ones such as lock-based structures or sharding, to very specialized approaches relying on CAS instructions, hazard pointers, Read-Copy-Update (RCU), etc. We list some prominent examples:

  • tbb::concurrent_hash_map uses closed addressing combined with bucket-level read-write locking. The bucket array is split in a number of segments to allow for incremental rehashing without locking the entire table. Concurrent insertion, lookup and erasure are supported, but iterators are not thread safe. Locked access to elements is done via so-called accessors.
  • tbb::concurrent_unordered_map also uses closed addressing, but buckets are organized into lock-free split-ordered lists. Concurrent insertion, lookup, and traversal are supported, whereas erasure is not thread safe. Element access via iterators is not protected against data races.
  • Sharding consists in dividing the hashmap into a fixed number N of submaps indexed by hash (typically, the element x goes into the submap with index hash(x) mod N). Sharding is extremely easy to implement starting from a non-concurrent hashmap and provides incremental rehashing, but the degree of concurrency is limited by N. As an example, gtl::parallel_flat_hash_map uses sharding with submaps essentially derived from absl::flat_hash_map, and inherits the excellent performance of this base container.
  • libcuckoo::cuckoohash_map adds efficient thread safety to classical cuckoo hashing by means of a number of carefully engineered techniques including fine-grained locking of slot groups or "strips" (of size 4 by default), optimistic insertion and data prefetching.
  • Meta's folly::ConcurrentHashMap combines closed addressing, sharding and hazard pointers to elements to achieve lock-free lookup (modifying operations such as insertion and erasure lock the affected shard). Iterators, which internally hold a hazard pointer to the element, can be validly dereferenced even after the element has been erased from the map; access, on the other hand, is constant and elements are basically treated as immutable.
  • folly::AtomicHashMap is a very specialized hashmap that imposes severe usage restrictions in exchange for very high time and space performance. Keys must be trivially copyable and 32 or 64 bits in size so that they can be handled internally by means of atomic instructions; also, some key values must be reserved to mark empty slots, tombstones and locked elements, so that no extra memory is required for bookkeeping information and locks. The internal data structure is based on open addressing with linear probing. Non-modifying operations are lock-free. Rehashing is not provided: instead, extra bucket arrays are appended when the map becomes full, the expectation being that the user provide the estimated final size at construction time to avoid this rather inefficient growth mechanism. Element access is not protected against data races.
  • On a more experimental/academic note, we can mention initiatives such as Junction, Folklore and DRAMHiT. In general, these do not provide industry-grade container implementations but explore interesting ideas that could eventually be adopted by mainstream libraries, such as RCU-based data structures, lock-free algorithms relying on CAS and/or transactional memory, parallel rehashing and operation batching.
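
To make the sharding approach in the list above concrete, here is a minimal sketch built on std::unordered_map and std::mutex. The class name sharded_map, its two member functions and the default N = 16 are choices made for this illustration and do not reproduce the interface of gtl::parallel_flat_hash_map or of any other library:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <unordered_map>

// Illustrative sharded map: N independent submaps, each with its own mutex.
template<typename Key, typename T, std::size_t N = 16>
class sharded_map
{
  struct shard
  {
    std::mutex mtx;
    std::unordered_map<Key, T> map;
  };
  std::array<shard, N> shards_;

  // The element x goes into the submap with index hash(x) mod N.
  shard& shard_for(const Key& k)
  {
    return shards_[std::hash<Key>{}(k) % N];
  }

public:
  bool insert(const Key& k, const T& v)
  {
    shard& s = shard_for(k);
    std::lock_guard<std::mutex> lock(s.mtx);
    return s.map.emplace(k, v).second;
  }

  // Calls f on the mapped value under the shard lock; returns whether k exists.
  template<typename F>
  bool visit(const Key& k, F f)
  {
    shard& s = shard_for(k);
    std::lock_guard<std::mutex> lock(s.mtx);
    auto it = s.map.find(k);
    if (it == s.map.end()) return false;
    f(it->second);
    return true;
  }
};
```

Two threads only contend when their keys land in the same shard, which is why the degree of concurrency is bounded by N. Each submap rehashes on its own, so no operation ever blocks the whole container.
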
Design principles

Unlike non-concurrent C++ containers, where the STL acts as a sort of reference interface, concurrent hashmaps in the market differ wildly in terms of requirements, API and provided functionality. When designing boost::concurrent_flat_map, we have aimed for a general-purpose container

  • with no special restrictions on key and mapped types,
  • providing full thread safety without external synchronization mechanisms,
  • and disrupting as little as possible the conceptual and operational model of "traditional" containers.

These principles rule out some scenarios such as requiring that keys be of an integral type or putting an extra burden on the user in terms of access synchronization or active garbage collection. They also inform concrete design decisions:

  • boost::concurrent_flat_map<Key, T, Hash, Pred, Allocator> must be a valid instantiation in all practical cases where boost::unordered_flat_map<Key, T, Hash, Pred, Allocator> is.
  • Thread-safe value semantics are provided (including copy construction, assignment, swap, etc.)
  • All member functions in boost::unordered_flat_map are provided by boost::concurrent_flat_map except if there's a fundamental reason why they can't work safely or efficiently in a concurrent setting.

The last guideline has the most impact on API design. In particular, we have decided not to provide iterators, either blocking or not: if not-blocking, they're unsafe, and if blocking they increase contention when not properly used, and can very easily lead to deadlocks:

// Thread 1
map_type::iterator it1=map.find(x1), it2=map.find(x2);

// Thread 2
map_type::iterator it2=map.find(x2), it1=map.find(x1);

In place of iterators, boost::concurrent_flat_map offers an access API based on internal visitation, as described in a later section.

Data structure

boost::concurrent_flat_map uses the same open-addressing layout as boost::unordered_flat_map, where the bucket array is split into 2^n groups of N = 15 slots and each group has an associated 16-byte metadata word for SIMD-based reduced-hash matching and insertion overflow control.

On top of this layout, two synchronization levels are added:

  • Container level: A read-write mutex is used to control access from any operation to the container. This access is always requested in read mode (i.e. shared) except for operations that require that the whole bucket array be replaced, like rehashing, swapping, assignment, etc. This means that, in practice, this level of synchronization does not cause any contention at all, even for modifying operations like insertion and erasure. To reduce cache coherence traffic, the mutex is implemented as an array of read-write spinlocks occupying separate cache lines, and each thread is assigned one spinlock in a round-robin fashion at thread_local construction time: read/shared access involves only the assigned spinlock, whereas write/exclusive access, which is comparatively much rarer, requires that all spinlocks be locked.
  • Group level: Each group has a dedicated read-write spinlock to control access to its slots, plus an atomic insertion counter used for transactional optimistic insertion as described below.
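
The container-level mutex described above can be sketched roughly as follows. This is a simplified model and not the library's code: the names rw_spinlock, thread_slot and multi_rw_mutex, the 64-byte cache line size and the default of 8 spinlocks are all assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Minimal read-write spinlock: state >= 0 counts readers, -1 means one writer.
struct alignas(64) rw_spinlock // 64 = assumed cache line size
{
  std::atomic<int> state{0};

  void lock_shared()
  {
    for (;;) {
      int s = state.load(std::memory_order_relaxed);
      if (s >= 0 &&
          state.compare_exchange_weak(s, s + 1, std::memory_order_acquire))
        return;
    }
  }
  void unlock_shared() { state.fetch_sub(1, std::memory_order_release); }

  void lock()
  {
    for (;;) {
      int expected = 0;
      if (state.compare_exchange_weak(expected, -1, std::memory_order_acquire))
        return;
    }
  }
  void unlock() { state.store(0, std::memory_order_release); }
};

// Each thread gets a number, assigned round-robin the first time it asks.
inline std::size_t thread_slot()
{
  static std::atomic<std::size_t> counter{0};
  thread_local std::size_t slot = counter.fetch_add(1);
  return slot;
}

// Shared access touches one spinlock; exclusive access must take all of them.
template<std::size_t N = 8>
class multi_rw_mutex
{
  rw_spinlock locks_[N];

public:
  void lock_shared()   { locks_[thread_slot() % N].lock_shared(); }
  void unlock_shared() { locks_[thread_slot() % N].unlock_shared(); }

  // Always locking in index order keeps concurrent writers from deadlocking.
  void lock()   { for (auto& l : locks_) l.lock(); }
  void unlock() { for (auto& l : locks_) l.unlock(); }
};
```

Readers on different threads mostly write to different cache lines, so the common shared-mode path generates little coherence traffic. The price is a slower exclusive path, which is acceptable because it is only needed for rare whole-table operations.
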
Algorithms

The core algorithms of boost::concurrent_flat_map are variations of those of boost::unordered_flat_map with minimal changes to prevent data races while keeping group-level contention to a minimum.

In the following diagrams, white boxes represent lock-free steps, while gray boxes are executed within the scope of a group lock. Metadata is handled atomically both in locked and lock-free scenarios.

Lookup

Most steps of the lookup algorithm (hash calculation, probing, element pre-checking via SIMD matching with the value's reduced hash) are lock-free and do not synchronize with any operation on the metadata. When SIMD matching detects a potential candidate, double-checking for slot occupancy and the actual comparison with the element are done within the group lock; note that the occupancy double check is necessary precisely because SIMD matching is lock-free and the status of the identified slot may have changed before group locking.

Insertion

The main challenge of any concurrent insertion algorithm is to prevent an element x from being inserted twice by different threads running at the same time. As open-addressing probing starts at a position p0 univocally determined by the hash value of x, a naïve (and flawed) approach is to lock p0 for the entire duration of the insertion procedure: this leads to deadlocking if the probing sequences of two different elements intersect.

We have implemented the following transactional optimistic insertion algorithm: At the beginning of insertion, the value of the insertion counter for the group at position p0 is saved locally and insertion proceeds normally, first checking that an element equivalent to x does not exist and then looking for available slots starting at p0 and locking only one group of the probing sequence at a time; when an available slot is found, the associated metadata is updated, the insertion counter at p0 is incremented, and:

  • If no other thread got in the way (i.e. if the pre-increment value of the counter coincides with the local value stored at the beginning), then the transaction is successful and insertion can be finished by storing the element into the slot before releasing the group lock.
  • Otherwise, metadata changes are rolled back and the entire insertion process is started over.

Our measurements indicate that, even under adversarial situations, the ratio of start-overs to successful insertions ranges in the parts per million.
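
The following toy model shows the shape of the transactional optimistic insertion just described. It is a sketch under heavy simplifications and not the actual implementation: the set holds ints in a fixed number of groups, there is no erasure or rehashing, probing is linear, and slot metadata is a plain bool that is only touched under the group lock (the real container handles metadata atomically because lookup pre-checks it lock-free). All names and sizes are made up for the example.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>

// Toy fixed-capacity concurrent set of ints: no erasure, no rehashing.
class toy_concurrent_set
{
  static constexpr std::size_t G = 64; // number of groups (arbitrary)
  static constexpr std::size_t S = 4;  // slots per group (arbitrary)

  struct group
  {
    std::atomic<bool>     locked{false}; // group-level spinlock
    std::atomic<unsigned> counter{0};    // insertion counter
    std::array<bool, S>   used{};        // metadata: slot occupancy
    std::array<int, S>    value{};
  };
  std::array<group, G> groups_;

  struct spin_guard // RAII lock on one group
  {
    std::atomic<bool>& flag;
    explicit spin_guard(std::atomic<bool>& f) : flag(f)
    { while (flag.exchange(true, std::memory_order_acquire)) {} }
    ~spin_guard() { flag.store(false, std::memory_order_release); }
  };

  static std::size_t start(int x) { return std::hash<int>{}(x) % G; }

public:
  bool contains(int x)
  {
    const std::size_t p0 = start(x);
    for (std::size_t i = 0; i < G; ++i) {
      group& g = groups_[(p0 + i) % G];
      spin_guard lock(g.locked);
      for (std::size_t s = 0; s < S; ++s) {
        if (!g.used[s]) return false; // free slot: x cannot be further along
        if (g.value[s] == x) return true;
      }
    }
    return false;
  }

  // Returns true if x was inserted, false if it was already present.
  bool insert(int x)
  {
    const std::size_t p0 = start(x);
    for (;;) { // start-over loop
      const unsigned saved = groups_[p0].counter.load();
      if (contains(x)) return false;
      bool start_over = false;
      for (std::size_t i = 0; i < G && !start_over; ++i) {
        group& g = groups_[(p0 + i) % G];
        spin_guard lock(g.locked); // only one group is locked at a time
        std::size_t s = 0;
        while (s < S && g.used[s]) ++s;
        if (s == S) continue;      // group full, keep probing
        g.used[s] = true;          // update metadata
        if (groups_[p0].counter.fetch_add(1) == saved) {
          g.value[s] = x;          // commit before releasing the group lock
          return true;
        }
        g.used[s] = false;         // another insertion got in the way
        start_over = true;
      }
      assert(start_over && "toy table is full");
    }
  }
};
```

If two threads insert the same x at once, both read the same counter value, but only the first to increment it sees a pre-increment value equal to what it saved. The other rolls back, starts over and finds x during its second lookup.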

Visitation API

From an operational point of view, container iterators serve two main purposes: combining lookup/insertion with further access to the relevant element:

auto it = m.find(k);
if (it != m.end()) {
  it->second = 0;
}

and container traversal:

// iterators used internally by range-for
for(auto& x: m) {
  x.second = 0;
}

Having decided that boost::concurrent_flat_map not rely on iterators due to their inherent concurrency problems, a design alternative is to move element access into the container operations themselves, where it can be done in a thread-safe manner. This is just a form of the familiar visitation pattern:

m.visit(k, [](auto& x) {
  x.second = 0;
});

m.visit_all([](auto& x) {
  x.second = 0;
});

boost::concurrent_flat_map provides visitation-enabled variations of classical map operations wherever it makes sense:

  • visit, cvisit (in place of find)
  • visit_all, cvisit_all (as a substitute for container traversal)
  • emplace_or_visit, emplace_or_cvisit
  • insert_or_visit, insert_or_cvisit
  • try_emplace_or_visit, try_emplace_or_cvisit

cvisit stands for constant visitation, that is, the visitation function is granted read-only access to the element, which has less contention than write access.
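
To illustrate the semantics of these operations without depending on Boost, the toy class below wraps a std::unordered_map behind a single std::shared_mutex. It is only a model of the visitation pattern: the class name visited_map is invented, the coarse lock stands in for the fine-grained synchronization described earlier, and the signatures are simplified assumptions and should not be read as the library's.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

// Toy visitation-only map guarded by one read-write mutex.
template<typename Key, typename T>
class visited_map
{
  std::unordered_map<Key, T> map_;
  mutable std::shared_mutex mtx_;

public:
  // Read-write visitation: f receives a mutable reference to the element.
  template<typename F>
  std::size_t visit(const Key& k, F f)
  {
    std::unique_lock<std::shared_mutex> lock(mtx_);
    auto it = map_.find(k);
    if (it == map_.end()) return 0;
    f(*it);
    return 1;
  }

  // Constant visitation: shared lock, f receives a const reference.
  template<typename F>
  std::size_t cvisit(const Key& k, F f) const
  {
    std::shared_lock<std::shared_mutex> lock(mtx_);
    auto it = map_.find(k);
    if (it == map_.end()) return 0;
    f(*it);
    return 1;
  }

  // Inserts (k, v) if k is absent and returns true;
  // otherwise visits the existing element and returns false.
  template<typename F>
  bool insert_or_visit(const Key& k, const T& v, F f)
  {
    std::unique_lock<std::shared_mutex> lock(mtx_);
    auto res = map_.emplace(k, v);
    if (!res.second) f(*res.first);
    return res.second;
  }
};
```

A typical use is a counter update such as m.insert_or_visit(word, 1, [](auto& x) { ++x.second; }), which inserts or increments in a single thread-safe step with no window between lookup and modification.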

Traversal functions [c]visit_all and erase_if also have parallel versions:

m.visit_all(std::execution::par, [](auto& x) {
  x.second = 0;
});
Benchmarks

We've tested boost::concurrent_flat_map against tbb::concurrent_hash_map and gtl::parallel_flat_hash_map for the following synthetic scenario: T threads concurrently perform N operations (update, successful lookup and unsuccessful lookup), randomly chosen with probabilities 10%, 45% and 45%, respectively, on a concurrent map of (int, int) pairs. The keys used by all operations are also random, where update and successful lookup follow a Zipf distribution over [1, N/10] with skew exponent s, and unsuccessful lookup follows a Zipf distribution with the same skew s over an interval not overlapping with the former.
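
For readers unfamiliar with Zipf-distributed keys, the sketch below draws integers in [1, n] with probability proportional to 1/k^s using std::discrete_distribution. It only illustrates the distribution; the class name zipf_sampler is made up and the generator in the benchmark repository may well be implemented differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Samples integers in [1, n] with P(k) proportional to 1 / k^s.
class zipf_sampler
{
  std::discrete_distribution<std::size_t> dist_;

  static std::discrete_distribution<std::size_t> make(std::size_t n, double s)
  {
    std::vector<double> w(n);
    for (std::size_t k = 1; k <= n; ++k)
      w[k - 1] = 1.0 / std::pow(static_cast<double>(k), s);
    return std::discrete_distribution<std::size_t>(w.begin(), w.end());
  }

public:
  zipf_sampler(std::size_t n, double s) : dist_(make(n, s)) {}

  template<typename URNG>
  std::size_t operator()(URNG& rng) { return dist_(rng) + 1; }
};
```

With s close to 0 the keys are nearly uniform, whereas with s close to 1 a handful of keys receive a large share of all operations, which is what stresses lock granularity in the high-skew plots.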

We provide the full benchmark code and results for different 64- and 32-bit architectures in a dedicated repository; here, we just show as an example the plots for Visual Studio 2022 in x64 mode on an AMD Ryzen 5 3600 6-Core @ 3.60 GHz without hyperthreading and 64 GB of RAM.

[Figure: benchmark plots for 500k updates, 4.5M lookups and for 5M updates, 45M lookups, each with skew = 0.01, 0.5 and 0.99.]

Note that, for the scenario with 500k updates, boost::concurrent_flat_map continues to improve after the number of threads exceeds the number of cores (6), a phenomenon for which we don't have a ready explanation: we could hypothesize that execution is limited by memory latency, but the behavior does not reproduce in the scenario with 5M updates, where the cache miss ratio is necessarily higher. Note also that gtl::parallel_flat_hash_map performs comparatively worse for high-skew scenarios where the load is concentrated on a very small number of keys: this may be due to gtl::parallel_flat_hash_map having a much coarser lock granularity (256 shards in the configuration used) than the other two containers.

In general, results are very dependent on the particular CPU and memory system used; you are welcome to try out the benchmark in your architecture of interest and report back.

Conclusions and next steps

boost::concurrent_flat_map is a new, general-purpose concurrent hashmap that leverages the very performant open-addressing techniques of boost::unordered_flat_map and provides a fully thread-safe, iterator-free API we hope future users will find flexible and convenient.

We are considering a number of new functionalities for upcoming releases:

  • As boost::concurrent_flat_map and boost::unordered_flat_map basically share the same data layout, it's possible to efficiently implement move construction from one to another by simply transferring the internal structure. There are scenarios where this feature can lead to more performant execution, like, for instance, multithreaded population of a boost::concurrent_flat_map followed by single- or multithreaded read-only lookup on a boost::unordered_flat_map move-constructed from the former.
  • DRAMHiT shows that pipelining/batching several map operations on the same thread in combination with heavy memory prefetching can reduce or eliminate waiting CPU cycles. We have conducted some preliminary experiments using this idea for a feature we dubbed bulk lookup (providing an array of keys to look for at once), with promising results.

We're launching this new container with trepidation: we cannot possibly try the vast array of different CPU architectures and scenarios where concurrent hashmaps are used, and we don't yet have field data on the suitability of the novel API we're proposing for boost::concurrent_flat_map. For these reasons, your feedback and proposals for improvement are most welcome.

Extensions
Inside boost::unordered_flat_map
Introduction

Starting in Boost 1.81 (December 2022), Boost.Unordered provides, in addition to its previous implementations of C++ unordered associative containers, the new containers boost::unordered_flat_map and boost::unordered_flat_set (for the sake of brevity, we will only refer to the former in the remainder of this article). If boost::unordered_map strictly adheres to the C++ specification for std::unordered_map, boost::unordered_flat_map deviates in a number of ways from the standard to offer dramatic performance improvements in exchange; in fact, boost::unordered_flat_map ranks amongst the fastest hash containers currently available to C++ users.

We describe the internal structure of boost::unordered_flat_map and provide theoretical analyses and benchmarking data to help readers gain insights into the key design elements behind this container's excellent performance. Interface and behavioral differences with the standard are also discussed.

The case for open addressing

We have previously discussed why closed addressing was chosen back in 2003 as the implicit layout for std::unordered_map. Twenty years later, open addressing techniques have taken the lead in terms of performance, and the fastest hash containers in the market all rely on some variation of open addressing, even if that means that some deviations have to be introduced from the baseline interface of std::unordered_map.

The defining aspect of open addressing is that elements are stored directly within the bucket array (as opposed to closed addressing, where multiple elements can be held in the same bucket, usually by means of a linked list of nodes). In modern CPU architectures, this layout is extremely cache friendly:

  • There's no indirection needed to go from the bucket position to the element contained.
  • Buckets are stored contiguously in memory, which improves cache locality.

The main technical challenge introduced by open addressing is what to do when elements are mapped into the same bucket, i.e. when a collision happens: in fact, all open-addressing variations are basically characterized by their collision management techniques. We can divide these techniques into two broad classes:

  • Non-relocating: if an element is mapped to an occupied bucket, a probing sequence is started from that position until a vacant bucket is located, and the element is inserted there permanently (except, of course, if the element is deleted or if the bucket array is grown and elements rehashed). Popular probing mechanisms are linear probing (buckets inspected at regular intervals), quadratic probing and double hashing. There is a tradeoff between cache locality, which is better when the buckets probed are close to each other, and average probe length (the expected number of buckets probed until a vacant one is located), which grows larger (worse) precisely when probed buckets are close —elements tend to form clusters instead of spreading uniformly throughout the bucket array.
  • Relocating: as part of the search process for a vacant bucket, elements can be moved from their position to make room for the new element. This is done in order to improve cache locality by keeping elements close to their "natural" location (that indicated by the hash → bucket mapping). Well known relocating algorithms are cuckoo hashing, hopscotch hashing and Robin Hood hashing.

If we take it as an important consideration to stay reasonably close to the original behavior of std::unordered_map, relocating techniques pose the problem that insert may invalidate iterators to other elements (so, they work more like std::vector::insert).

On the other hand, non-relocating open addressing faces issues on deletion: lookup starts at the original hash → bucket position and then keeps probing till the element is found or probing terminates, which is signalled by the presence of a vacant bucket:

So, erasing an element can't just restore its holding bucket as vacant, since that would preclude lookup from reaching elements further down the probe sequence:

A common technique to deal with this problem is to label buckets previously containing an element with a tombstone marker: tombstones are good for inserting new elements but do not stop probing on lookup:

Note that the introduction of tombstones implies that the average lookup probe length of the container won't decrease on deletion —again, special measures can be taken to counter this.
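
The interplay between probing, erasure and tombstones can be seen in a toy open-addressing set with linear probing. This sketch is meant purely to illustrate the classical technique (which boost::unordered_flat_map does not use); the class name and the choice of x mod capacity as the hash → bucket mapping are assumptions made for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy open-addressing set of ints with linear probing and tombstones.
class tombstone_set
{
  enum class state : unsigned char { vacant, occupied, tombstone };
  struct bucket { state st = state::vacant; int value = 0; };
  std::vector<bucket> buckets_;

  std::size_t home(int x) const
  { return static_cast<std::size_t>(x) % buckets_.size(); }
  std::size_t next(std::size_t p) const { return (p + 1) % buckets_.size(); }

public:
  explicit tombstone_set(std::size_t capacity) : buckets_(capacity) {}

  bool contains(int x) const
  {
    std::size_t p = home(x);
    for (std::size_t i = 0; i < buckets_.size(); ++i, p = next(p)) {
      const bucket& b = buckets_[p];
      if (b.st == state::vacant) return false; // only vacant stops probing
      if (b.st == state::occupied && b.value == x) return true;
    }
    return false;
  }

  bool insert(int x)
  {
    if (contains(x)) return false;
    std::size_t p = home(x);
    for (std::size_t i = 0; i < buckets_.size(); ++i, p = next(p)) {
      bucket& b = buckets_[p];
      if (b.st != state::occupied) { // vacant or tombstone: reuse it
        b.st = state::occupied;
        b.value = x;
        return true;
      }
    }
    return false; // table full
  }

  bool erase(int x)
  {
    std::size_t p = home(x);
    for (std::size_t i = 0; i < buckets_.size(); ++i, p = next(p)) {
      bucket& b = buckets_[p];
      if (b.st == state::vacant) return false;
      if (b.st == state::occupied && b.value == x) {
        b.st = state::tombstone; // not vacant, so later elements stay reachable
        return true;
      }
    }
    return false;
  }
};
```

With capacity 8, the values 1, 9 and 17 all map to bucket 1 and end up in buckets 1, 2 and 3. Erasing 9 leaves a tombstone in bucket 2, so a lookup for 17 still walks past it, and a later insertion of 25 can reuse that bucket.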

SIMD-accelerated lookup

SIMD technologies, such as SSE2 and Neon, provide advanced CPU instructions for parallel arithmetic and logical operations on groups of contiguous data values: for instance, SSE2 _mm_cmpeq_epi8 takes two packs of 16 bytes and compares them for equality pointwise, returning the result as another pack of bytes. Although SIMD was originally meant for acceleration of multimedia processing applications, the implementors of some unordered containers, notably Google's Abseil's Swiss tables and Meta's F14, realized they could leverage this technology to improve lookup times in hash tables.

The key idea is to maintain, in addition to the bucket array itself, a separate metadata array holding reduced hash values (usually one byte in size) obtained from the hash values of the elements stored in the corresponding buckets. When looking up an element, SIMD can be used on a pack of contiguous reduced hash values to quickly discard non-matching buckets and move on to full comparison for matching positions. This technique effectively checks a moderate number of buckets (16 for Abseil, 14 for F14) in constant time. Another beneficial effect of this approach is that special bucket markers (vacant, tombstone, etc.) can be moved to the metadata array; otherwise, these markers would take up extra space in the bucket itself, or else some representation values of the elements would have to be restricted from user code and reserved for marking purposes.

boost::unordered_flat_map data structure

boost::unordered_flat_map's bucket array is logically split into 2^n groups of N = 15 buckets, and has a companion metadata array consisting of 2^n 16-byte words. Hash mapping is done at the group level rather than on individual buckets: so, to insert an element with hash value h, the group at position h / 2^(W − n) is selected and its first available bucket used (W is 64 or 32 depending on whether the CPU architecture is 64- or 32-bit, respectively); if the group is full, further groups are checked using a quadratic probing sequence.

The associated metadata is organized as follows (least significant byte depicted rightmost):

h_i holds information about the i-th bucket of the group:

  • 0 if the bucket is empty,
  • 1 to signal a sentinel (a special value at the end of the bucket array used to finish container iteration),
  • otherwise, a reduced hash value in the range [2, 255] obtained from the least significant byte of the element's hash value.

When looking up within a group for an element with hash value h, SIMD operations, if available, are used to match the reduced value of h against the pack of values {h_0, h_1, ... , h_14}. Locating an empty bucket for insertion is equivalent to matching for 0.

ofw is the so-called overflow byte: when inserting an element with hash value h, if the group is full then the (h mod 8)-th bit of ofw is set to 1 before moving to the next group in the probing sequence. Lookup probing can then terminate when the corresponding overflow bit is 0. Note that this procedure removes the need to use tombstones.
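
The metadata operations described in this section can be modelled with plain scalar code, as in the sketch below. It is a logical model only: the real container performs matching with SSE2/Neon instructions or bit interleaving, and the way reduced_hash remaps the reserved values 0 and 1 is an assumption of this example, as are all the names used.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of one 16-byte metadata word:
// 15 reduced hash values plus the overflow byte.
struct group_metadata
{
  std::array<std::uint8_t, 15> h{}; // 0: empty, 1: sentinel, 2..255: reduced hash
  std::uint8_t ofw = 0;             // overflow byte

  // Maps the least significant byte of the hash into [2, 255]
  // (the remapping of 0 and 1 is illustrative).
  static std::uint8_t reduced_hash(std::size_t hash)
  {
    const std::uint8_t b = static_cast<std::uint8_t>(hash & 0xFF);
    return b < 2 ? static_cast<std::uint8_t>(b + 8) : b;
  }

  // Bitmask with bit i set iff h[i] == value; SIMD does this in one step.
  unsigned match(std::uint8_t value) const
  {
    unsigned mask = 0;
    for (std::size_t i = 0; i < h.size(); ++i)
      if (h[i] == value) mask |= 1u << i;
    return mask;
  }
  unsigned match_empty() const { return match(0); }

  // Called when an insertion finds the group full and moves on.
  void mark_overflow(std::size_t hash)
  { ofw = static_cast<std::uint8_t>(ofw | (1u << (hash % 8))); }

  // If false, lookup for this hash can stop probing at this group.
  bool may_have_overflowed(std::size_t hash) const
  { return ((ofw >> (hash % 8)) & 1u) != 0; }
};

// Group position for a W-bit hash and 2^n groups: h / 2^(W - n).
inline std::size_t group_index(std::size_t hash, unsigned n)
{
  const unsigned W = sizeof(std::size_t) * 8;
  assert(n >= 1 && n < W);
  return hash >> (W - n);
}
```

Note that may_have_overflowed can report true for a hash that never overflowed the group, because eight hash classes share the byte. This only costs an extra probe and never produces a wrong result.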

If neither SSE2 nor Neon is available on the target architecture, the logical organization of metadata stays the same, but information is mapped to two physical 64-bit words using bit interleaving as shown in the figure:

Bit interleaving allows for a reasonably fast implementation of matching operations in the absence of SIMD.

Rehashing

The maximum load factor of boost::unordered_flat_map is 0.875 and can't be changed by the user. As discussed previously, non-relocating open addressing has the problem that average probe length doesn't decrease on deletion when the erased elements are in mid-sequence: so, continuously inserting and erasing elements without triggering a rehash will slowly degrade the container's performance; we call this phenomenon drifting. boost::unordered_flat_map introduces the following anti-drift mechanism: rehashing is controlled by the container's maximum load, initially 0.875 times the size of the bucket array; when erasing an element whose associated overflow bit is not zero, the maximum load is decreased by one. Anti-drift guarantees that rehashing will be eventually triggered in a scenario of repeated insertions and deletions.
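
The anti-drift rule boils down to a small piece of bookkeeping, sketched here in isolation. The class load_tracker and its member names are hypothetical; the sketch tracks only the counters and leaves out the bucket array and the rehash itself.

```cpp
#include <cassert>
#include <cstddef>

// Bookkeeping only: element count and maximum load with anti-drift.
class load_tracker
{
  std::size_t size_ = 0;
  std::size_t max_load_;

public:
  // Maximum load starts at 0.875 times the size of the bucket array.
  explicit load_tracker(std::size_t capacity) : max_load_(capacity * 7 / 8) {}

  std::size_t size() const { return size_; }
  std::size_t max_load() const { return max_load_; }

  bool must_rehash_before_insert() const { return size_ >= max_load_; }

  void on_insert()
  {
    assert(!must_rehash_before_insert());
    ++size_;
  }

  // overflow_bit_set: whether the erased element's overflow bit was not zero.
  void on_erase(bool overflow_bit_set)
  {
    assert(size_ > 0);
    --size_;
    if (overflow_bit_set && max_load_ > 0) --max_load_; // anti-drift
  }

  // A rehash rebuilds the table, so the maximum load is reset.
  void on_rehash(std::size_t new_capacity) { max_load_ = new_capacity * 7 / 8; }
};
```

In the worst case, where every erased element has its overflow bit set, a table with 16 buckets that repeatedly inserts and erases a single element reaches the rehash condition after 14 cycles even though its size never exceeds one.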

Hash post-mixing

It is well known that open-addressing containers require that the hash function be of good quality, in the sense that close input values (for some natural notion of closeness) are mapped to distant hash values. In particular, a hash function is said to have the avalanching property if flipping a bit in the physical representation of the input changes each bit of the output value with probability 50%. Note that avalanching hash functions are extremely well behaved, and less stringent behaviors are generally good enough in most open-addressing scenarios.

Being a general-purpose container, boost::unordered_flat_map does not impose any condition on the user-provided hash function beyond what is required by the C++ standard for unordered associative containers. In order to cope with poor-quality hash functions (such as the identity for integral types), an automatic bit-mixing stage is added to hash values:

  • 64-bit architectures: we use the xmx function defined in Jon Maiga's "The construct of a bit mixer".
  • 32-bit architectures: the chosen mixer has been automatically generated by Hash Function Prospector and selected as the best overall performer in internal benchmarks. Score assigned by Hash Prospector: 333.7934929677524.

There's an opt-out mechanism available to end users so that avalanching hash functions can be marked as such and thus be used without post-mixing. In particular, the specializations of boost::hash for string types are marked as avalanching.
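
As an illustration of post-mixing with an opt-out, the sketch below applies an xor-multiply-xor step to the output of a user hash unless the hash is flagged as avalanching. The shift and multiplier shown are one plausible instance of the xmx family and may not match the constants used by the library; the trait is_avalanching_hash and the two example hash types are hypothetical stand-ins for the real opt-out mechanism.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// xor-multiply-xor bit mixer (constants are illustrative).
inline std::uint64_t xmx(std::uint64_t x)
{
  x ^= x >> 23;
  x *= 0xff51afd7ed558ccdull;
  x ^= x >> 23;
  return x;
}

// Hypothetical opt-out trait: false by default, so hashes get mixed.
template<typename Hash>
struct is_avalanching_hash : std::false_type {};

// A poor-quality hash: the identity on 64-bit integers.
struct identity_hash
{
  std::uint64_t operator()(std::uint64_t x) const { return x; }
};

// A hash whose author vouches for its quality (assumed for the example).
struct trusted_hash
{
  std::uint64_t operator()(std::uint64_t x) const { return x; }
};
template<>
struct is_avalanching_hash<trusted_hash> : std::true_type {};

// Hash value as the container would use it internally.
template<typename Hash, typename Key>
std::uint64_t mixed_hash(const Hash& h, const Key& k)
{
  const std::uint64_t v = h(k);
  return is_avalanching_hash<Hash>::value ? v : xmx(v);
}
```

Both steps of the mixer (xor with a right shift, multiplication by an odd constant) are invertible, so distinct hash values stay distinct after mixing; the mixer only redistributes bits so that the high bits used for group selection depend on the whole input.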

Statistical properties of boost::unordered_flat_map

We have written a simulation program to calculate some statistical properties of boost::unordered_flat_map as compared with Abseil's absl::flat_hash_map, which is generally regarded as one of the fastest hash containers available. For the purposes of this analysis, the main design characteristics of absl::flat_hash_map are:

  • Bucket array sizes are of the form 2^n, n ≥ 4.
  • Hash mapping is done at the bucket level (rather than at the group level as in boost::unordered_flat_map).
  • Metadata consists of one byte per bucket, where the most significant bit is set to 1 if the bucket is empty, deleted (tombstone) or a sentinel. The remaining 7 bits hold the reduced hash value for occupied buckets.
  • Lookup/insertion uses SIMD to inspect the 16 contiguous buckets beginning at the hash-mapped position, and then continues with further 16-bucket groups using quadratic probing. Probing ends when a non-full group is found. Note that the start positions of these groups are not aligned modulo 16.

The figure shows:

  • the probability that a randomly selected group is full,
  • the average number of hops (i.e. the average probe length minus one) for successful and unsuccessful lookup

as functions of the load factor, with perfectly random input and without intervening deletions. Solid line is boost::unordered_flat_map, dashed line is absl::flat_hash_map.

Some observations:

  • Pr(group is full) is higher for boost::unordered_flat_map. This follows from the fact that free buckets cluster at the end of 15-aligned groups, whereas for absl::flat_hash_map free buckets are uniformly distributed across the array, which increases the probability that a contiguous 16-bucket chunk contains at least one free position. Consequently, E(num hops) for successful lookup is also higher in boost::unordered_flat_map.
  • By contrast, E(num hops) for unsuccessful lookup is considerably lower in boost::unordered_flat_map: absl::flat_hash_map uses an all-or-nothing condition for probe termination (group is non-full/full), whereas boost::unordered_flat_map uses the 8 bits of information in the overflow byte to allow for more finely-grained termination —effectively, making probe termination ~1.75 times more likely. The overflow byte acts as a sort of Bloom filter to check for probe termination based on reduced hash value.

The next figure shows the average number of actual comparisons (i.e. when the reduced hash value matched) for successful and unsuccessful lookup. Again, solid line is boost::unordered_flat_map and dashed line is absl::flat_hash_map.

E(num cmps) is a function of:

  • E(num hops) (lower better),
  • the size of the group (lower better),
  • the number of bits of the reduced hash value (higher better).

We see then that boost::unordered_flat_map approaches absl::flat_hash_map on E(num cmps) for successful lookup (1% higher or less), despite its poorer E(num hops) figures: this is so because boost::unordered_flat_map uses smaller groups (15 vs. 16) and, most importantly, because its reduced hash values contain log2(254) = 7.99 bits vs. 7 bits in absl::flat_hash_map, and each additional bit in the reduced hash value decreases the number of negative comparisons roughly by half. In the case of E(num cmps) for unsuccessful lookup, boost::unordered_flat_map figures are up to 3.2 times lower under high-load conditions.

Benchmarks

Running-n plots

We have measured the execution times of boost::unordered_flat_map against absl::flat_hash_map and boost::unordered_map for basic operations (insertion, erasure during iteration, successful lookup, unsuccessful lookup) with container size n ranging from 10,000 to 10M. We provide the full benchmark code and results for different 64- and 32-bit architectures in a dedicated repository; here, we just show the plots for GCC 11 in x64 mode on an AMD EPYC Rome 7302P @ 3.0GHz. Please note that each container uses its own default hash function, so a direct comparison of execution times may be slightly biased.

[Figure: four plots showing running insertion, running erasure, successful lookup and unsuccessful lookup]

As predicted by our statistical analysis, boost::unordered_flat_map is considerably faster than absl::flat_hash_map for unsuccessful lookup because the average probe length and number of (negative) comparisons are much lower; this effect translates also to insertion, since insert needs to first check that the element is not present, so it internally performs an unsuccessful lookup. Note how performance is less impacted (stays flatter) when the load factor increases.

As for successful lookup, boost::unordered_flat_map is still faster, which may be due to its better cache locality, particularly for low load factors: in this situation, elements are clustered at the beginning portion of each group, while for absl::flat_hash_map they are uniformly distributed with more empty space in between.

boost::unordered_flat_map is slower than absl::flat_hash_map for running erasure (erasure of some elements during container traversal). The actual culprit here is iteration, which is particularly slow; this is a collateral effect of having SIMD operations work only on 16-aligned metadata words, while absl::flat_hash_map iteration looks ahead 16 metadata bytes beyond the current iterator position.

Aggregate performance

Boost.Unordered provides a series of benchmarks emulating real-life scenarios combining several operations for a number of hash containers and key types (std::string, std::string_view, std::uint32_t, std::uint64_t and a UUID class of size 16). The interested reader can build and run the benchmarks on her environment of choice; as an example, these are the results for GCC 11 in x64 mode on an Intel Xeon E5-2683 @ 2.10GHz:

std::string
               std::unordered_map: 38021 ms, 175723032 bytes in 3999509 allocations
             boost::unordered_map: 30785 ms, 149465712 bytes in 3999510 allocations
        boost::unordered_flat_map: 14486 ms, 134217728 bytes in 1 allocations
                  multi_index_map: 30162 ms, 178316048 bytes in 3999510 allocations
              absl::node_hash_map: 15403 ms, 139489608 bytes in 3999509 allocations
              absl::flat_hash_map: 13018 ms, 142606336 bytes in 1 allocations
       std::unordered_map, FNV-1a: 43893 ms, 175723032 bytes in 3999509 allocations
     boost::unordered_map, FNV-1a: 33730 ms, 149465712 bytes in 3999510 allocations
boost::unordered_flat_map, FNV-1a: 15541 ms, 134217728 bytes in 1 allocations
          multi_index_map, FNV-1a: 33915 ms, 178316048 bytes in 3999510 allocations
      absl::node_hash_map, FNV-1a: 20701 ms, 139489608 bytes in 3999509 allocations
      absl::flat_hash_map, FNV-1a: 18234 ms, 142606336 bytes in 1 allocations

std::string_view
               std::unordered_map: 38481 ms, 207719096 bytes in 3999509 allocations
             boost::unordered_map: 26066 ms, 181461776 bytes in 3999510 allocations
        boost::unordered_flat_map: 14923 ms, 197132280 bytes in 1 allocations
                  multi_index_map: 27582 ms, 210312120 bytes in 3999510 allocations
              absl::node_hash_map: 14670 ms, 171485672 bytes in 3999509 allocations
              absl::flat_hash_map: 12966 ms, 209715192 bytes in 1 allocations
       std::unordered_map, FNV-1a: 45070 ms, 207719096 bytes in 3999509 allocations
     boost::unordered_map, FNV-1a: 29148 ms, 181461776 bytes in 3999510 allocations
boost::unordered_flat_map, FNV-1a: 15397 ms, 197132280 bytes in 1 allocations
          multi_index_map, FNV-1a: 30371 ms, 210312120 bytes in 3999510 allocations
      absl::node_hash_map, FNV-1a: 19251 ms, 171485672 bytes in 3999509 allocations
      absl::flat_hash_map, FNV-1a: 17622 ms, 209715192 bytes in 1 allocations

std::uint32_t
               std::unordered_map: 21297 ms, 192888392 bytes in 5996681 allocations
             boost::unordered_map: 9423 ms, 149424400 bytes in 5996682 allocations
        boost::unordered_flat_map: 4974 ms, 71303176 bytes in 1 allocations
                  multi_index_map: 10543 ms, 194252104 bytes in 5996682 allocations
              absl::node_hash_map: 10653 ms, 123470920 bytes in 5996681 allocations
              absl::flat_hash_map: 6400 ms, 75497480 bytes in 1 allocations

std::uint64_t
               std::unordered_map: 21463 ms, 240941512 bytes in 6000001 allocations
             boost::unordered_map: 10320 ms, 197477520 bytes in 6000002 allocations
        boost::unordered_flat_map: 5447 ms, 134217728 bytes in 1 allocations
                  multi_index_map: 13267 ms, 242331792 bytes in 6000002 allocations
              absl::node_hash_map: 10260 ms, 171497480 bytes in 6000001 allocations
              absl::flat_hash_map: 6530 ms, 142606336 bytes in 1 allocations

uuid
               std::unordered_map: 37338 ms, 288941512 bytes in 6000001 allocations
             boost::unordered_map: 24638 ms, 245477520 bytes in 6000002 allocations
        boost::unordered_flat_map: 9223 ms, 197132280 bytes in 1 allocations
                  multi_index_map: 25062 ms, 290331800 bytes in 6000002 allocations
              absl::node_hash_map: 14005 ms, 219497480 bytes in 6000001 allocations
              absl::flat_hash_map: 10559 ms, 209715192 bytes in 1 allocations

Each container uses its own default hash function, except the entries labeled FNV-1a in std::string and std::string_view, which use the same implementation of Fowler–Noll–Vo hash, version 1a, and uuid, where all containers use the same user-provided function based on boost::hash_combine.

Deviations from the standard

The adoption of open addressing imposes a number of deviations from the C++ standard for unordered associative containers. Users should keep them in mind when migrating to boost::unordered_flat_map from boost::unordered_map (or from any other implementation of std::unordered_map):

  • Both Key and T in boost::unordered_flat_map<Key,T> must be MoveConstructible. This is due to the fact that elements are stored directly into the bucket array and have to be transferred to a new block of memory on rehashing; by contrast, boost::unordered_map is a node-based container and elements are never moved once constructed.
  • For the same reason, pointers and references to elements become invalid after rehashing (boost::unordered_map only invalidates iterators).
  • begin() is not constant-time (the bucket array is traversed till the first non-empty bucket is found).
  • erase(iterator) returns void rather than an iterator to the element after the erased one. This is done to maximize performance, as locating the next element requires traversing the bucket array; if that element is absolutely required, the erase(iterator++) idiom can be used. This performance issue is not exclusive to open addressing, and has been discussed in the context of the C++ standard too. (Update Oct 19, 2024: This limitation has been partially solved.)
  • The maximum load factor can't be changed by the user (max_load_factor(z) is provided for backwards compatibility reasons, but does nothing). Rehashing can occur before the load reaches max_load_factor() * bucket_count() due to the anti-drift mechanism described previously.
  • There is no bucket API (bucket_size, begin(n), etc.) save bucket_count.
  • There are no node handling facilities (extract, etc.) Such functionality makes no sense here as open-addressing containers are precisely not node-based. merge is provided, but the implementation relies on element movement rather than node transferring.
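
As an illustration of the erase(iterator++) idiom mentioned in the list above, here is a minimal sketch of erasing during traversal without relying on the return value of erase. The helper name erase_matching is hypothetical, and the sketch assumes a container whose erase(iterator) invalidates only the erased iterator; std::unordered_map satisfies this, so it can be tried out without Boost.

```cpp
#include <cstddef>
#include <unordered_map>

// Erases every element for which pred returns true and returns how many
// elements were removed. The return value of erase is deliberately ignored.
template<typename Map, typename Pred>
std::size_t erase_matching(Map& m, Pred pred)
{
  std::size_t erased = 0;
  for(auto it = m.begin(); it != m.end(); ){
    if(pred(*it)){
      m.erase(it++); // it is advanced before the element is erased
      ++erased;
    }
    else{
      ++it;
    }
  }
  return erased;
}
```

A call such as erase_matching(m, [](const auto& p){ return p.first % 2 == 0; }) removes all elements with even keys.
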
Conclusions and next steps

boost::unordered_flat_map and boost::unordered_flat_set are the new open-addressing containers in Boost.Unordered providing top speed in exchange for some interface and behavioral deviations from the standards-compliant boost::unordered_map and boost::unordered_set. We have analyzed their internal data structure and provided some theoretical and practical evidence for their excellent performance. As of this writing, we claim boost::unordered_flat_map/boost::unordered_flat_set to rank among the fastest hash containers available to C++ programmers.

With this work, we have reached an important milestone in the ongoing Development Plan for Boost.Unordered. After Boost 1.81, we will continue improving the functionality and performance of existing containers and will possibly augment the available container catalog to offer greater freedom of choice to Boost users. Your feedback on our current and future work is much welcome.

Deferred argument evaluation

Suppose our program deals with heavy entities of some type object which are uniquely identified by an integer ID. The following is a possible implementation of a function that controls ID-constrained creation of such objects:

object* retrieve_or_create(int id)
{
  static std::unordered_map<int, std::unique_ptr<object>> m;

  // see if the object is already in the map
  auto [it,b] = m.emplace(id, nullptr);
  // create it otherwise
  if(b) it->second = std::make_unique<object>(id);
  return it->second.get();
}

Note that the code is careful not to create a spurious object if an equivalent one already exists; but in doing so, we have introduced a potential inconsistency in the internal map if object creation throws:

// fixed version

object* retrieve_or_create(int id)
{
  static std::unordered_map<int, std::unique_ptr<object>> m;

  // see if the object is already in the map
  auto [it,b] = m.emplace(id, nullptr);
  // create it otherwise
  if(b){
    try{
      it->second = std::make_unique<object>(id);
    }
    catch(...){
      // we can get here when running out of memory, for instance
      m.erase(it);
      throw;
    }
  }
  return it->second.get();
}

This fixed version is a little cumbersome, to say the least. Starting in C++17, we can use try_emplace to rewrite retrieve_or_create as follows:

object* retrieve_or_create(int id)
{
  static std::unordered_map<int, std::unique_ptr<object>> m;

  auto [it,b] = m.try_emplace(id, std::make_unique<object>(id));
  return it->second.get();
}

But then we've introduced the problem of spurious object creation we strived to avoid. Ideally, we'd like for try_emplace to not create the object except when really needed. What we're effectively asking for is some sort of technique for deferred argument evaluation. As it happens, it is very easy to devise our own:

template<typename F>
struct deferred_call
{
  using result_type=decltype(std::declval<const F>()());
  operator result_type() const { return f(); }

  F f;
};

object* retrieve_or_create(int id)
{
  static std::unordered_map<int, std::unique_ptr<object>> m;

  auto [it,b] = m.try_emplace(
    id,
    deferred_call([&]{ return std::make_unique<object>(id); }));
  return it->second.get();
}

deferred_call is a small utility that computes a value upon request of conversion to deferred_call::result_type. In the example, such conversion will only happen if try_emplace really needs to create a std::pair<const int, std::unique_ptr<object>>, that is, if no equivalent object was already present in the map.
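
The deferred behavior can be checked with a small sketch that counts how many objects are actually constructed. Two things here are assumptions added for the sake of a self-contained example: the object type with its constructions counter, and the explicit constructor of deferred_call, which lets class template argument deduction work in C++17 (the aggregate shown above relies on newer language rules).

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>
#include <utility>

template<typename F>
struct deferred_call
{
  using result_type = decltype(std::declval<const F>()());

  // Added for illustration: enables deferred_call(f) in C++17.
  explicit deferred_call(F f_) : f(std::move(f_)) {}

  operator result_type() const { return f(); }

  F f;
};

// Hypothetical stand-in for the heavy entity of the text.
struct object
{
  explicit object(int id_) : id(id_) { ++constructions; }

  int id;
  static inline int constructions = 0; // number of objects created so far
};

object* retrieve_or_create(int id)
{
  static std::unordered_map<int, std::unique_ptr<object>> m;

  // the lambda only runs if try_emplace has to construct a new element
  auto [it, inserted] = m.try_emplace(
    id,
    deferred_call([&]{ return std::make_unique<object>(id); }));
  assert(it->second); // the mapped pointer is never null at this point
  (void)inserted;
  return it->second.get();
}
```

Calling retrieve_or_create twice with the same ID should return the same pointer and leave object::constructions at 1.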

In a general setting, for deferred_call to work as expected, that is, to delay producing the value until the point of actual usage, the following conditions must be met:

  1. The deferred_call object is passed to a function/constructor template accepting generic, unconstrained parameters.
  2. All internal intermediate interfaces are also generic.
  3. The final function/constructor where actual usage happens asks exactly for a deferred_call::result_type value or reference.

It is the last condition that can be the most problematic:

void f(std::string);
    
// error: deferred_call not convertible to std::string
f(deferred_call([]{ return "hello"; }));

C++ conversion rules allow at most one user-defined conversion in an implicit conversion sequence, and here we are calling for the sequence deferred_call → const char* → std::string. In this case, however, the fix is trivial:

void f(std::string);

f(deferred_call([]{ return std::string("hello"); })); 
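
The difference between the two calls can also be verified at compile time. In this sketch the lambdas are replaced by named function objects (make_literal and make_string, both hypothetical) so that the resulting deferred_call types can be fed to std::is_convertible_v in C++17.

```cpp
#include <string>
#include <type_traits>
#include <utility>

template<typename F>
struct deferred_call
{
  using result_type = decltype(std::declval<const F>()());
  operator result_type() const { return f(); }

  F f;
};

// Named function objects standing in for the lambdas of the text.
struct make_literal { const char* operator()() const { return "hello"; } };
struct make_string  { std::string operator()() const { return "hello"; } };

// deferred_call -> const char* -> std::string needs two user-defined
// conversions, so the implicit conversion is rejected.
constexpr bool literal_converts =
  std::is_convertible_v<deferred_call<make_literal>, std::string>;

// deferred_call -> std::string needs just one.
constexpr bool string_converts =
  std::is_convertible_v<deferred_call<make_string>, std::string>;

static_assert(!literal_converts, "two user-defined conversions are not chained");
static_assert(string_converts, "a single user-defined conversion is fine");
```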

Update Oct 4

Jessy De Lannoit proposes a variation on deferred_call that solves the problem of producing a value that is one user-defined conversion away from the target type:

template<typename F>
struct deferred_call
{
  using result_type=decltype(std::declval<const F>()());
  operator result_type() const { return f(); }

  template<typename T>
  requires (std::is_constructible_v<T, result_type>)
  constexpr operator T() const { return {f()}; }

  F f;
};

void f(std::string);

// works ok: deferred_call converts to std::string
f(deferred_call([]{ return "hello"; }));

This version of deferred_call has an eager conversion operator producing any requested value as long as it is constructible from deferred_call::result_type. The solution comes with a different set of problems, though:

void f(std::string);
void f(const char*);

// ambiguous call to f
f(deferred_call([]{ return "hello"; }));

There is probably little more we can do without language support. One can imagine some sort of "silent" conversion operator that does not add to the cap on user-defined conversions allowed by the rules of C++:

template<typename F>
struct deferred_call
{
  using result_type=decltype(std::declval<const F>()());
  operator result_type() const { return f(); }

  // "silent" conversion operator marked with ~explicit
  // (not actual C++)
  template<typename T>
  requires (std::is_constructible_v<T, result_type>)
  ~explicit constexpr operator T() const { return {f()}; }

  F f;
};

Advancing the state of the art for std::unordered_map implementations
Introduction

Several Boost authors have embarked on a project to improve the performance of Boost.Unordered's implementation of std::unordered_map (and multimap, set and multiset variants), and to extend its portfolio of available containers to offer faster, non-standard alternatives based on open addressing.

The first goal of the project has been completed in time for Boost 1.80 (due August 2022). We describe here the technical innovations introduced in boost::unordered_map that make it the fastest implementation of std::unordered_map on the market.

Closed vs. open addressing

On a first approximation, hash table implementations fall on either of two general classes:

  • Closed addressing (also known as separate chaining) relies on an array of buckets, each of which points to a list of elements belonging to it. When a new element goes to an already occupied bucket, it is simply linked to the associated element list. The figure depicts what we call the textbook implementation of closed addressing, arguably the simplest layout, and among the fastest, for this type of hash tables.
textbook layout
  • Open addressing (or closed hashing) stores at most one element in each bucket (sometimes called a slot). When an element goes to an already occupied slot, some probing mechanism is used to locate an available slot, preferably close to the original one.

Recent, high-performance hash tables use open addressing and leverage its inherently better cache locality and widely available SIMD operations. Closed addressing provides some functional advantages, though, and remains relevant as the required foundation for the implementation of std::unordered_map.

Restrictions on the implementation of std::unordered_map

The standardization of C++ unordered associative containers is based on Matt Austern's 2003 N1456 paper. Back in the day, open-addressing approaches were not regarded as sufficiently mature, so closed addressing was taken as the safe implementation of choice. Even though the C++ standard does not explicitly require that closed addressing must be used, the assumption that this is the case leaks through the public interface of std::unordered_map:

  • A bucket API is provided.
  • Pointer stability implies that the container is node-based. In C++17, this implication was made explicit with the introduction of extract capabilities.
  • Users can control the container load factor.
  • Requirements on the hash function are very lax (open addressing depends on high-quality hash functions with the ability to spread keys widely across the space of std::size_t values.)

As a result, all standard library implementations use some form of closed addressing for the internal structure of their std::unordered_map (and related containers).

Coming as an additional difficulty, there are two complexity requirements:

  • iterator increment must be (amortized) constant time,
  • erase must be constant time on average,

that rule out the textbook implementation of closed addressing (see N2023 for details). To cope with this problem, standard libraries depart from the textbook layout in ways that introduce speed and memory penalties: this is, for instance, what the libstdc++-v3 and libc++ layouts look like:

libstdc++-v3/libc++ layout

To provide constant iterator increment, all nodes are linked together, which in its turn forces two adjustments to the data structure:

  • Buckets point to the node before the first one in the bucket so as to preserve constant-time erasure.
  • To detect the end of a bucket, the element hash value is added as a data member of the node itself (libstdc++-v3 opts for on-the-fly hash calculation under some circumstances).

Visual Studio standard library (formerly from Dinkumware) uses an entirely different approach to circumvent the problem, but the general outcome is that resulting data structures perform significantly worse than the textbook layout in terms of speed, memory consumption, or both.

Boost.Unordered 1.80 data layout

The new data layout used by Boost.Unordered goes back to the textbook approach:

Boost.Unordered layout

Unlike the rest of standard library implementations, nodes are not linked across the container but only within each bucket. This makes constant-time erase trivially implementable, but leaves unsolved the problem of constant-time iterator increment: to achieve it, we introduce so-called bucket groups (top of the diagram). Each bucket group consists of a 32/64-bit bucket occupancy mask plus next and prev pointers linking non-empty bucket groups together. Iteration across buckets resorts to a combination of bit manipulation operations on the bitmasks plus group traversal through next pointers, which is not only constant time but also very lightweight in terms of execution time and of memory overhead (4 bits per bucket).
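
The bit manipulation involved in moving to the next occupied bucket inside a group can be sketched as follows. This is a simplified illustration, not the library's code: the function names are hypothetical, a 64-bit mask is assumed, and a portable loop is used where a real implementation would rely on a hardware count-trailing-zeros instruction.

```cpp
#include <cstdint>

// Index of the lowest set bit of a nonzero mask.
inline int lowest_set_bit(std::uint64_t mask)
{
  int n = 0;
  while((mask & 1u) == 0){
    mask >>= 1;
    ++n;
  }
  return n;
}

// Position of the next occupied bucket strictly after pos within a group
// whose occupancy is described by mask, or -1 if there is none (in which
// case iteration would follow the group's next pointer).
// Pass pos = -1 to find the first occupied bucket of the group.
inline int next_occupied(std::uint64_t mask, int pos)
{
  // keep only the bits above position pos
  std::uint64_t rest =
    (pos >= 63) ? 0 : (mask & (~std::uint64_t(0) << (pos + 1)));
  return rest ? lowest_set_bit(rest) : -1;
}
```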

Fast modulo

When inserting or looking for an element, hash table implementations need to map the element hash value into the array of buckets (or slots in the open-addressing case). There are two general approaches in common use:

  • Bucket array sizes follow a sequence of prime numbers p, and mapping is of the form h → h mod p.
  • Bucket array sizes follow a power-of-two sequence 2^n, and mapping takes n bits from h. Typically it is the n least significant bits that are used, but in some cases, like when h is postprocessed to improve its uniformity via multiplication by a well-chosen constant m (such as defined by Fibonacci hashing), it is best to take the n most significant bits, that is, h → (h × m) >> (N − n), where N is the bitwidth of std::size_t and >> is the usual C++ right shift operation.
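
The second approach, in its Fibonacci hashing variant, can be sketched as below. The sketch fixes N = 64 rather than using the bitwidth of std::size_t, and the names fibonacci_multiplier and fibonacci_index are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// 2^64 divided by the golden ratio, rounded to an odd integer.
constexpr std::uint64_t fibonacci_multiplier = 0x9E3779B97F4A7C15ull;

// Maps h to one of 2^n buckets by taking the n most significant bits
// of h × m (the product wraps modulo 2^64).
inline std::uint64_t fibonacci_index(std::uint64_t h, unsigned n)
{
  assert(n >= 1 && n <= 63); // a shift by 64 would be undefined
  return (h * fibonacci_multiplier) >> (64 - n);
}
```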

We use the modulo by a prime approach because it produces very good spreading even if hash values are not uniformly distributed. In modern CPUs, however, modulo is an expensive operation involving integer division; compilers, on the other hand, know how to perform modulo by a constant much more efficiently, so one possible optimization is to keep a table of pointers to functions f_p : h → h mod p. This technique replaces expensive modulo calculation with a table jump plus a modulo-by-a-constant operation.

In Boost.Unordered 1.80, we have gone a step further. Daniel Lemire et al. show how to calculate h mod p as an operation involving some shifts and multiplications by p and a pre-computed c value acting as a sort of reciprocal of p. We have used this work to implement hash mapping as h → fastmod(h, p, c) (some details omitted). Note that, even though fastmod is generally faster than modulo by a constant, most performance gains actually come from the fact that we are eliminating the table jump needed to select f_p, which prevented code inlining.
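
For reference, the 32-bit formulation of the fastmod technique can be sketched as follows. This is only an illustration of the general idea and not Boost.Unordered's implementation, which omits no details here only because it is not shown in this article; the name fastmod_reciprocal is hypothetical, and unsigned __int128 is a GCC/Clang extension used to obtain the high 64 bits of a 64×64-bit product.

```cpp
#include <cstdint>

// Pre-computed value c acting as a sort of reciprocal of d (d >= 1).
inline std::uint64_t fastmod_reciprocal(std::uint32_t d)
{
  return ~std::uint64_t(0) / d + 1;
}

// Computes h mod d using two multiplications and a shift.
inline std::uint32_t fastmod(std::uint32_t h, std::uint32_t d, std::uint64_t c)
{
  std::uint64_t lowbits = c * h; // wraps modulo 2^64 by design
  return std::uint32_t(
    (static_cast<unsigned __int128>(lowbits) * d) >> 64);
}
```

The value c is computed once per bucket array size, so the per-lookup cost does not include any division.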

Time and memory performance of Boost 1.80 boost::unordered_map

We are providing some benchmark results of boost::unordered_map against libstdc++-v3, libc++ and Visual Studio standard library for insertion, lookup and erasure scenarios. boost::unordered_map is mostly faster across the board, and in some cases significantly so. There are three factors contributing to this performance advantage:

  • the very reduced memory footprint improves cache utilization,
  • fast modulo is used,
  • the new layout incurs one less pointer indirection than libstdc++-v3 and libc++ to access the elements of a bucket.

As for memory consumption, let N be the number of elements in a container with B buckets: the memory overheads (that is, memory allocated minus memory used strictly for the elements themselves) of the different implementations on 64-bit architectures are:

Implementation               Memory overhead (bytes)
libstdc++-v3                 16 N + 8 B (hash caching)
                             8 N + 8 B (no hash caching)
libc++                       16 N + 8 B
Visual Studio (Dinkumware)   16 N + 16 B
Boost.Unordered              8 N + 8.5 B

Which hash container to choose

Opting for closed-addressing (which, in the realm of C++, is almost synonymous with using an implementation of std::unordered_map) or choosing a speed-oriented, open-addressing container is in practice not a clear-cut decision. Some factors favoring one or the other option are listed:

  • std::unordered_map
    • The code uses some specific parts of its API like node extraction, the bucket interface or the ability to set the maximum load factor, which are generally not available in open-addressing containers.
    • Pointer stability and/or non-moveability of values required (though some open-addressing alternatives support these at the expense of reduced performance).
    • Constant-time iterator increment required.
    • Hash functions used are only mid-quality (open addressing requires that the hash function have very good key-spreading properties).
    • Equivalent key support, i.e. unordered_multimap/unordered_multiset required. We do not know of any open-addressing container supporting equivalent keys.
  • Open-addressing containers
    • Performance is the main concern.
    • Existing code can be adapted to a basically more stringent API and more demanding requirements on the element type (like moveability).
    • Hash functions are of good quality (or the default ones from the container provider are used).

If you decide to use std::unordered_map, Boost.Unordered 1.80 now gives you the fastest, fully-conformant implementation on the market.

Next steps

There are some further areas of improvement to boost::unordered_map that we will investigate post Boost 1.80:

  • Reduce the memory overhead of the new layout from 4 bits to 3 bits per bucket.
  • Speed up performance for equivalent key variants (unordered_multimap/unordered_multiset).

In parallel, we are working on the future boost::unordered_flat_map, our proposal for a top-speed, open-addressing container beyond the limitations imposed by std::unordered_map interface. Your feedback on our current and future work is much welcome.

tag:blogger.com,1999:blog-2715968472735546962.post-5329302977245474327
Extensions
Emulating template named arguments in C++20
Show full content

std::unordered_map is a highly configurable class template with five parameters:

template<
    class Key,
    class Value,
    class Hash = std::hash<Key>,
    class KeyEqual = std::equal_to<Key>,
    class Allocator = std::allocator< std::pair<const Key, Value> >
> class unordered_map;

Typical usage depends on default values for most of these parameters:

using my_map=std::unordered_map<int,std::string>;

but things get cumbersome when we want to specify one of the usually defaulted types:

template<typename T> class my_allocator{ ... };

using my_map=std::unordered_map<
  int, std::string,
  std::hash<int>, std::equal_to<int>,
  my_allocator< std::pair<const int, std::string> >
>;

In the example, we are forced to specify the hash and equality predicate with their default value types just to get to the allocator, which is the parameter we really wanted to specify. Ideally we would like to have a syntax like this:

// this is not actual C++
using my_map = std::unordered_map<
  Key=int, Value=std::string,
  Allocator=my_allocator< std::pair<const int, std::string> >
>;

Turns out we can emulate this by resorting to designated initializers, introduced in C++20:

template<
  typename Key, typename Value,
  typename Hash = std::hash<Key>,
  typename Equal = std::equal_to<Key>,
  typename Allocator = std::allocator< std::pair<const Key,Value> >
>
struct unordered_map_config
{
  Key *key = nullptr;
  Value *value = nullptr;
  Hash *hash = nullptr;
  Equal *equal = nullptr;
  Allocator *allocator = nullptr;

  using type = std::unordered_map<Key,Value,Hash,Equal,Allocator>;
};

template<typename T>
constexpr T *type = nullptr;

template<unordered_map_config Cfg>
using unordered_map = typename decltype(Cfg)::type;

...

using my_map = unordered_map<{
  .key = type<int>, .value = type<std::string>,
  .allocator = type< my_allocator< std::pair<const int, std::string > > >
}>;

The approach taken by the simulation is to use designated initializers to create an aggregate object consisting of dummy null pointers: the values of the pointers do not matter, but their types are captured via CTAD and used to synthesize the associated std::unordered_map instantiation. Two more C++20 features this technique depends on are:

  • Non-type template parameters have been extended to accept literal types (which include aggregate types such as unordered_map_config instantiations).
  • The class template unordered_map_config can be specified as a non-type template parameter of unordered_map. In C++17, we would have had to define unordered_map as
    template<auto Cfg>
    using unordered_map = typename decltype(Cfg)::type;
    which would force the user to explicitly name unordered_map_config in
    using my_map = unordered_map<unordered_map_config{...}>;

There is still the unavoidable noise of having to use the type template alias since, of course, aggregate initialization is about values rather than types.

Another limitation of this simulation is that we cannot mix named and unnamed parameters:

// compiler error: either all initializer clauses should be designated
// or none of them should be
using my_map = unordered_map<{
  type<int>, type<std::string>,
  .allocator = type< my_allocator< std::pair<const int, std::string > > >
}>;

C++20 designated initializers are more restrictive than their C99 counterpart; some of the constraints (initializers cannot be specified out of order) are totally valid in the context of C++, but I personally fail to see why mixing named and unnamed parameters would pose any problem.

tag:blogger.com,1999:blog-2715968472735546962.post-8409773756729875423
Extensions
Start Wordle with TARES
Show full content

There have been some discussions on what the best first guess is for the game Wordle, but none, to the best of my knowledge, has used the following approach. After each guess, the game answers back with a matching result like these:

■■■■■ (all letters wrong), 

■■■■■ (two letters right, one mispositioned),

■■■■■ (all letters right).

There are 3^5 = 243 possible answers. From an information-theoretic point of view, the word we are trying to guess is a random variable (selected from a predefined dictionary), and the information we are obtaining by submitting our query is measured by the entropy formula

H(guess) = − ∑ p_i log2(p_i) bits,

where p_i is the probability that the game returns the i-th answer (i = 1, ... , 243) for our particular guess. So, the best first guess is the one for which we get the most information, that is, the associated entropy is maximum. Intuitively speaking, we are going for the guess that yields the most balanced partition of the dictionary words as grouped by their matching result: entropy is maximum when all p_i are equal (this is impossible for our problem, but gives an upper bound on the attainable entropy of log2(243) = 7.93 bits).
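
The entropy of a guess can be sketched as follows. This is a hypothetical reconstruction for illustration, not the program referred to in this article: the function names are made up, words are assumed to be five lowercase ASCII letters, and repeated letters are handled with the usual Wordle convention (exact matches are assigned first, then misplaced letters are credited from the remaining ones).

```cpp
#include <array>
#include <cmath>
#include <string>
#include <vector>

// Encodes the game's answer as a base-3 number with one digit per letter:
// 0 = wrong letter, 1 = right letter in the wrong position,
// 2 = right letter in the right position. The result lies in [0, 243).
inline int match_result(const std::string& guess, const std::string& target)
{
  std::array<int, 5> digit{};      // all zeros initially
  std::array<int, 26> unmatched{}; // target letters not matched exactly

  for(int i = 0; i < 5; ++i){
    if(guess[i] == target[i]) digit[i] = 2;
    else ++unmatched[target[i] - 'a'];
  }
  for(int i = 0; i < 5; ++i){
    if(digit[i] == 0 && unmatched[guess[i] - 'a'] > 0){
      digit[i] = 1;
      --unmatched[guess[i] - 'a'];
    }
  }

  int code = 0;
  for(int d: digit) code = code * 3 + d;
  return code;
}

// Entropy in bits of the answer distribution produced by guess when the
// target word is drawn uniformly from dictionary.
inline double entropy(
  const std::string& guess, const std::vector<std::string>& dictionary)
{
  std::array<int, 243> count{};
  for(const auto& target: dictionary) ++count[match_result(guess, target)];

  double h = 0.0;
  for(int c: count){
    if(c == 0) continue; // empty classes contribute nothing
    double p = double(c) / double(dictionary.size());
    h -= p * std::log2(p);
  }
  return h;
}
```

Ranking the candidate first guesses then amounts to evaluating entropy(word, dictionary) for every word in the dictionary and sorting by the result.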

Let's compute then the best guesses. Wordle uses a dictionary of 2,315 entries which is unfortunately not disclosed; in its place we will resort to the Stanford GraphBase list. I wrote a trivial C++17 program that goes through each of the 5,757 words of Stanford's list and computes its associated entropy as a first guess (see it running online). The resulting top 10 best words, along with their entropies, are:

TARES    6.20918
RATES    6.11622
TALES    6.09823
TEARS    6.05801
NARES    6.01579
TIRES    6.01493
REALS    6.00117
DARES    5.99343
LORES    5.99031
TRIES    5.98875


tag:blogger.com,1999:blog-2715968472735546962.post-8276673301208999191
Extensions
Global warming as falling into the Sun
Show full content
This summer in Spain has been so particularly hot that people came up with graphical jokes like this:

(Cáceres is my hometown; versions of this picture for many other Spanish populations swarm the net.)

Pursuing this idea half-seriously, one can reason that an increase in global temperatures due to climate change might be journalistically equated with the Earth getting closer to the Sun and thus receiving more radiation, which analogy conjures up doomy visions of our planet falling into the blazing hell of the star: let us do the calculations.

Climate sensitivity, usually denoted by λ, links changes in global surface temperature with variations of received radiative power:

ΔT = λ ΔW.

The mechanism by which radiative power changes (increased albedo, greenhouse effect) results in a different associated λ parameter. For the case of power variations due to changes in solar activity, Tung et al. have calculated λ_s to be in the range of 0.69 to 0.97 K/(W/m^2) using data from observations of 11-year solar cycles, and estimate that the stationary sensitivity (i.e. if the change in power was permanent) would be 1.5 times higher, thus in the range of 1.03 to 1.45 K/(W/m^2).

Now, the Earth is D_0 = 1.496 × 10^8 km away from the Sun, and receives an average radiation of W_0 = 1366 W/m^2. Assuming far-field conditions, the radiative power received at the Earth as a function of the distance D to the Sun is then

W = w / D^2,
w = 3.057 × 10^25 W/sr,

which allows us to calculate ΔT = λ_s ΔW from ΔD = D_0 − D, as shown in the graph for the minimum and maximum estimated values of λ_s. Although this cannot be checked visually, the lines are not straight but include a negligible (in these distance ranges) quadratic component.

So, the estimated increase of 0.75 °C in global temperature during the 20th century is equivalent to pushing the Earth between 30 and 40 thousand kilometers towards the Sun. Each extra °C brings us 38,000-54,000 km closer to the star. For those stuck with USCS, each °F is equivalent to 13,000-18,000 miles.

As an alarmist meme, the figure works poorly since no amount of global warming will translate to anything resembling "falling into the Sun": relative changes in distance measure in the permyriads. And, yes, the joke at the beginning of this article is definitely a gross exaggeration.
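
The calculation can be reproduced with a few lines of code. This is a minimal sketch using only the inverse-square law and the constants quoted in the article; the function name distance_shift_km is hypothetical.

```cpp
#include <cmath>

constexpr double D0 = 1.496e8; // current Earth-Sun distance, km
constexpr double W0 = 1366.0;  // average received radiation, W/m^2

// Reduction in Earth-Sun distance (km) that would raise the received
// radiative power enough to produce a temperature increase delta_t (K),
// for a stationary sensitivity lambda_s in K/(W/m^2).
inline double distance_shift_km(double delta_t, double lambda_s)
{
  // ΔT = λ_s ΔW  =>  ΔW = ΔT / λ_s
  double delta_w = delta_t / lambda_s;
  // W = w / D^2  =>  (W0 + ΔW) / W0 = (D0 / D)^2
  double d = D0 / std::sqrt(1.0 + delta_w / W0);
  return D0 - d;
}
```

Evaluating distance_shift_km(1.0, 1.45) and distance_shift_km(1.0, 1.03) should land near the per-degree range quoted above; small differences are to be expected from rounding of the inputs.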
Compile-time checking the existence of a class template
(Updated after a suggestion from bluescarni.)

I recently had to use C++14's std::is_final but wanted to fall back to boost::is_final if the former was not available. Trusting __cplusplus implies overlooking the fact that compilers never provide 100% support for any version of the language, and Boost.Config is usually helpful with these matters, but, as of this writing, it does not provide any macro to check for the existence of std::is_final. It turns out the matter can be investigated with some compile-time manipulations. We first set up some helping machinery in a namespace of our own:
#include <cstddef>
#include <functional>
#include <type_traits>

namespace std_is_final_exists_detail{
    
template<typename> struct is_final{};

struct helper{};

}
std_is_final_exists_detail::is_final has the same signature as the (possibly existing) std::is_final homonym, but need not implement any of the functionality since it will be used for detection only. The class helper is now used to write code directly into namespace std, as the rules of the language allow (and, in some cases, encourage) us to specialize standard class templates for our own types, like for instance with std::hash:
namespace std{

template<>
struct hash<std_is_final_exists_detail::helper>
{
  std::size_t operator()(
    const std_is_final_exists_detail::helper&)const{return 0;}
      
  static constexpr bool check()
  {
    using helper=std_is_final_exists_detail::helper;
    using namespace std_is_final_exists_detail;
    
    return
      !std::is_same<
        is_final<helper>,
        std_is_final_exists_detail::is_final<helper>>::value;
  }
};

}
operator() is defined to nominally comply with the expected semantics of std::hash specialization; it is in check that the interesting work happens. By a non-obvious but totally sensible C++ rule, the directive
using namespace std_is_final_exists_detail;
makes all the symbols of the namespace (including is_final) visible as if they were declared in the nearest namespace containing both std_is_final_exists_detail and std, that is, at global namespace level. This means that the unqualified use of is_final in
!std::is_same<
  is_final<helper>,...
resolves to std::is_final if it exists (as it is within namespace std, i.e. closer than the global level), and to std_is_final_exists_detail::is_final otherwise. We can wrap everything up in a utility class:
using std_is_final_exists=std::integral_constant<
  bool,
  std::hash<std_is_final_exists_detail::helper>::check()
>;
and check with a program
#include <iostream>

int main()
{
  std::cout<<"std_is_final_exists: "
           <<std_is_final_exists::value<<"\n";
}
that dutifully outputs
std_is_final_exists: 0
with GCC in -std=c++11 mode and
std_is_final_exists: 1
with -std=c++14. Clang and Visual Studio also handle this code properly. (Updated Sep 7, 2016.) The same technique can be used to walk the last mile and implement an is_final type trait class relying on std::is_final but falling back to boost::is_final if the former is not present. I've slightly changed naming and used std::is_void for the specialization trick as it involves a little less typing.
#include <boost/type_traits/is_final.hpp>
#include <type_traits>

namespace my_lib{
namespace is_final_fallback{

template<typename T> using is_final=boost::is_final<T>;

struct hook{};

}}

namespace std{

template<>
struct is_void<::my_lib::is_final_fallback::hook>:
  std::false_type
{      
  template<typename T>
  static constexpr bool is_final_f()
  {
    using namespace ::my_lib::is_final_fallback;
    return is_final<T>::value;
  }
};

} /* namespace std */

namespace my_lib{

template<typename T>
struct is_final:std::integral_constant<
  bool,
  std::is_void<is_final_fallback::hook>::template is_final_f<T>()
>{};

} /* namespace my_lib */
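The name lookup rule the whole technique hinges on can be seen in isolation, without touching namespace std, in the following sketch. All the names here (fallback, lib, widget, gadget, probe) are invented for the illustration: lib plays the role of std, widget that of a class template that exists there, and gadget that of one that does not.

```cpp
#include <type_traits>

namespace fallback{

template<typename> struct widget{}; // stand-in, used for detection only
template<typename> struct gadget{}; // stand-in, used for detection only

}

namespace lib{ // plays the role of namespace std

template<typename> struct widget{}; // "exists" in lib; gadget does not

struct probe
{
  static constexpr bool widget_exists()
  {
    // names of fallback become visible as if declared at global scope
    using namespace fallback;
    // unqualified widget finds lib::widget first (closer than global)
    return !std::is_same<widget<int>,fallback::widget<int>>::value;
  }

  static constexpr bool gadget_exists()
  {
    using namespace fallback;
    // nothing called gadget in lib: lookup reaches fallback::gadget
    return !std::is_same<gadget<int>,fallback::gadget<int>>::value;
  }
};

}

static_assert(lib::probe::widget_exists(),"lib::widget should be detected");
static_assert(!lib::probe::gadget_exists(),"lib::gadget should not be detected");
```

The only difference with the real thing is that, since we are not allowed to add arbitrary declarations to namespace std, the probing function there has to be smuggled in through a specialization of a standard class template.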

Passing capturing C++ lambda functions as function pointers

Suppose we have a function accepting a C-style callback function like this:
void do_something(void (*callback)())
{
  ...
  callback();
}
As captureless C++ lambda functions can be cast to regular function pointers, the following works as expected:
auto callback=[](){std::cout<<"callback called\n";};
do_something(callback);

output: callback called
Unfortunately, if our callback code captures some variable from the context, we are out of luck:
int num_callbacks=0;
···
auto callback=[&](){
  std::cout<<"callback called "<<++num_callbacks<<" times \n";
};
do_something(callback);

error: cannot convert 'main()::<lambda>' to 'void (*)()'
because capturing lambda functions create a closure of the used context that needs to be carried around to the point of invocation. If we are allowed to modify do_something we can easily circumvent the problem by accepting a more powerful std::function-based callback:
void do_something(std::function<void()> callback)
{
  ...
  callback();
}

int num_callbacks=0;
...
auto callback=[&](){
  std::cout<<"callback called "<<++num_callbacks<<" times \n";
};
do_something(callback);

output: callback called 1 times
but we want to explore the challenge when this is not available (maybe because do_something is legacy C code, or because we do not want to incur the runtime penalty associated with std::function's usage of dynamic memory). Typically, C-style callback APIs accept an additional callback argument through a type-erased void*:
void do_something(void(*callback)(void*),void* callback_arg)
{
  ...
  callback(callback_arg);
}
and this is actually the only bit we need to force our capturing lambda function through do_something. The gist of the trick is passing the lambda function as the callback argument and providing a captureless thunk as the callback function pointer:
int num_callbacks=0;
...
auto callback=[&](){
  std::cout<<"callback called "<<++num_callbacks<<" times \n";
};
auto thunk=[](void* arg){ // note thunk is captureless
  (*static_cast<decltype(callback)*>(arg))();
};
do_something(thunk,&callback);

output: callback called 1 times
Note that we are not using dynamic memory nor doing any extra copying of the captured data, since callback is accessed at the point of invocation through a pointer; so, this technique can be advantageous even if modern std::function could be used instead. The caveat is that the user code must make sure that captured data is alive when the callback is invoked (which is not the case when execution happens after scope exit if, for instance, it is carried out in a different thread).
Postscript
Tcbrindle raises the issue of lambda functions casting to function pointers with C++ linkage, where C linkage may be needed. Although this is rarely a problem in practice, it can be solved through another layer of indirection:
extern "C" void do_something(
  void(*callback)(void*),void* callback_arg)
{
  ...
  callback(callback_arg);
}

...

using callback_pair=std::pair<void(*)(void*),void*>;

extern "C" void call_thunk(void * arg)
{
  callback_pair* p=static_cast<callback_pair*>(arg);
  p->first(p->second);
}
...
int num_callbacks=0;
...
auto callback=[&](){
  std::cout<<"callback called "<<++num_callbacks<<" times \n";
};
auto thunk=[](void* arg){ // note thunk is captureless
  (*static_cast<decltype(callback)*>(arg))();
};
callback_pair p{thunk,&callback};
do_something(call_thunk,&p);

output: callback called 1 times
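The thunk trick can be packaged once and for all in a small function template, so that user code does not have to write the cast by hand. The following is a minimal compilable sketch, not part of the original code: do_something here is a toy stand-in that simply invokes the callback twice, and do_something_with and count_callbacks are hypothetical names.

```cpp
#include <cassert>

// Toy stand-in for a C-style API: invokes the callback twice
inline void do_something(void(*callback)(void*),void* callback_arg)
{
  callback(callback_arg);
  callback(callback_arg);
}

// Hypothetical helper: forwards any (non-const) callable object through
// the void* argument and uses a captureless lambda as the thunk
template<typename F>
void do_something_with(F& f)
{
  do_something(
    [](void* arg){(*static_cast<F*>(arg))();}, // captureless: F is a type
    &f);
}

// Usage sketch: count invocations through a captured reference
inline int count_callbacks()
{
  int num_callbacks=0;
  auto callback=[&](){++num_callbacks;};
  do_something_with(callback);
  assert(num_callbacks==2);
  return num_callbacks;
}
```

As before, the callable is passed by address, so it must outlive every invocation made by do_something.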

A formal definition of mutation independence

Louis Dionne poses the problem of move independence in the context of C++, that is, under which conditions a sequence of operations
f(std::move(x));
g(std::move(x));
is sound in the sense that the first does not interfere with the second. We give here a functional definition for this property that can be applied to the case Louis discusses. Let X be some type and functions f: X → T×X and g: X → Q×X. The impurity of a non-functional construct in an imperative language such as C++ is captured in this functional setting by the fact that these functions return, besides the output value itself, a new, possibly changed, value of X. We denote by f_T and f_X the projection of f onto T and X, respectively, and similarly for g. We say that f does not affect g if g_Q(x) = g_Q(f_X(x)) ∀x ∈ X. If we define the equivalence relationship ~_g in X as x ~_g y iff g_Q(x) = g_Q(y), then f does not affect g iff f_X(x) ~_g x ∀x ∈ X or f_X([x]_g) ⊆ [x]_g ∀x ∈ X, where [x]_g is the equivalence class of x under ~_g. We say that f and g are mutation-independent if f does not affect g and g does not affect f, that is, f_X([x]_g) ⊆ [x]_g and g_X([x]_f) ⊆ [x]_f ∀x ∈ X. The following considers the case of f and g acting on separate components of a tuple: suppose that X = X_1×X_2 and f and g depend on and mutate X_1 and X_2 alone, respectively, or put more formally:
f_T(x_1,x_2) = f_T(x_1,x'_2),
f_{X_2}(x_1,x_2) = x_2,
g_Q(x_1,x_2) = g_Q(x'_1,x_2),
g_{X_1}(x_1,x_2) = x_1
for all x_1, x'_1 ∈ X_1, x_2, x'_2 ∈ X_2. Then f and g are mutation-independent (proof trivial). Getting back to C++, given a tuple x, two operations of the form:
f(std::get<i>(std::move(x)));
g(std::get<j>(std::move(x)));
are mutation-independent if i!=j; this can be extended to the case where f and g read from (but not write to) any component of x except the j-th and i-th, respectively.
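As a concrete illustration of the i!=j case, here is a small sketch in which each operation moves from its own component of the tuple and neither reads the other's. The names consume and split are invented for the example.

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <utility>

// Takes ownership of the string it is given (leaving s in a moved-from
// state) and returns its former contents
inline std::string consume(std::string&& s)
{
  return std::string(std::move(s));
}

// Moves both components out of the same tuple; as 0 != 1, each call
// mutates only its own component and the two do not interfere
inline std::pair<std::string,std::string> split(
  std::tuple<std::string,std::string>&& x)
{
  assert(!std::get<0>(x).empty()&&!std::get<1>(x).empty());
  std::string a=consume(std::get<0>(std::move(x)));
  std::string b=consume(std::get<1>(std::move(x)));
  return std::make_pair(std::move(a),std::move(b));
}
```

Had both calls used the same index, the second one would have received a moved-from string, which is precisely the interference the definition rules out.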

(Oil+tax)-free Spanish gas prices 2014-15

We use the data gathered for our hysteresis analysis of Spanish gas prices for 2014 and 2015 to gain further insight on their dynamics. This is a simple breakdown of gasoline (or gasoil) price: Price = oil cost + other costs + taxes + margin. A barrel of crude oil is refined into several final products totalling approximately the same amount of volume, that is, it takes roughly one liter of crude oil to produce one liter of gasoline (or gasoil). The simplest allocation model is to use market Brent prices as the oil cost for fuel production (we will see more realistic models later). If we eliminate taxes and oil cost, what remains in the fuel price is other costs plus margin. We plot this number for 95 octane gasoline and gasoil compared with Brent oil price, all in c€/l, for the period 2014-2015:
[Figure: (Oil+tax)-free fuel price, simple cost allocation model, and Brent oil cost, both in c€/l]
When we factor out crude oil cost, the remaining parts of the price increase moderately (~25% for gasoline, ~15% for gasoil). In a scenario of oil price reduction, oil direct costs as a percentage of tax-free fuel prices have consequently dropped from 70% to 50%:
[Figure: oil direct cost / tax-free fuel price, simple cost allocation model]
Value-based cost allocation
Crude oil is refined into several final products from high-quality fuel to asphalt, plastic etc. The EIA provides typical yield data for US refineries that we can use as a reasonable approximation to the Spanish case. The volume breakdown we are interested in is roughly:
  • Gasoline: 45%
  • Gasoil: 30%
  • Other products: 37%
(Note that the sum is greater than 100% because additional components are mixed in the process.) Now, as these products have very different prices in the market, it is natural to allocate oil costs proportionally to end-user value:
price_total = 45% price_gasoline + 30% price_gasoil + 37% price_other,
cost_gasoline = cost_oil × price_gasoline / price_total,
cost_gasoil = cost_oil × price_gasoil / price_total
(prices without taxes). Since it is difficult to obtain accurate data on prices for the remaining products, we consider two conventional scenarios where these products are valued at 50% and 25% of the average fuel price, respectively:
  • A: price_other = 50% (price_gasoline + price_gasoil)/2
  • B: price_other = 25% (price_gasoline + price_gasoil)/2
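The value-based allocation just described can be sketched as follows. The yields are the ones listed above; the type and function names (fuel_prices, oil_costs, allocate) are made up for the example, and any prices fed to it are the caller's own data, not figures from this article.

```cpp
#include <cassert>

struct fuel_prices{double gasoline,gasoil;}; // tax-free prices, c€/l
struct oil_costs{double gasoline,gasoil;};   // allocated oil cost, c€/l

// Approximate refinery yields per liter of crude, as listed above
constexpr double yield_gasoline=0.45;
constexpr double yield_gasoil=0.30;
constexpr double yield_other=0.37;

// Hypothetical helper: allocates the cost of one liter of crude oil
// proportionally to the market value of the products obtained from it.
// other_ratio is 0.50 for scenario A and 0.25 for scenario B.
inline oil_costs allocate(fuel_prices p,double cost_oil,double other_ratio)
{
  assert(p.gasoline>0&&p.gasoil>0&&other_ratio>=0);
  double price_other=other_ratio*(p.gasoline+p.gasoil)/2;
  double price_total=
    yield_gasoline*p.gasoline+
    yield_gasoil*p.gasoil+
    yield_other*price_other;
  return {cost_oil*p.gasoline/price_total,cost_oil*p.gasoil/price_total};
}
```

Subtracting the allocated cost from the tax-free price of each fuel yields the "other costs plus margin" figure. Note that, by construction, the ratio of oil cost to price comes out the same for both fuels.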
The figure depicts resulting prices without oil costs or taxes (i.e. other costs plus margin):
[Figure: (Oil+tax)-free fuel price, value-based cost allocation, and Brent oil cost, both in c€/l]
Unlike with our previous, naïve allocation model, here we see, both in scenarios A and B, that margins for gasoline and gasoil match very precisely almost all the time: this can be seen as further indication that value-based cost allocation is indeed the model used by gas companies themselves. Visual inspection reveals two insights:
  • Short-term, margin fluctuations are countercyclical to oil price. This might be due to an effort from companies to stabilize prices.
  • In the two-year period studied, margins grow considerably, around 30% for scenario A and 60% for scenario B. This trend has been somewhat corrected in the second half of 2015, though.
The percentage contribution of oil costs to fuel prices (which is by virtue of the cost allocation model exactly the same for gasoline and gasoil) drops in 2014-15 from 75% to 55% (scenario A) and from 85% to 60% (scenario B).
[Figure: oil direct cost / tax-free fuel price, value-based cost allocation]

Gas price hysteresis, Spain 2015

We begin the new year redoing our hysteresis analysis for Spanish gas prices with data from 2015, obtained from the usual sources. The figure shows the weekly evolution during 2015 of prices of Brent oil and average retail prices without taxes of 95 octane gasoline and gasoil in Spain, all in c€ per liter. For gasoline, the corresponding scatter plot of Δ(gasoline price before taxes) against Δ(Brent price) is shown with linear regressions for the entire graph and for both half-planes Δ(Brent price) ≥ 0 and ≤ 0, given by
overall → y = f(x) = b + mx = −0.1210 + 0.2554x,
ΔBrent ≥ 0 → y = f_+(x) = b_+ + m_+x = 0.2866 − 0.0824x,
ΔBrent ≤ 0 → y = f_−(x) = b_− + m_−x = 0.3552 + 0.4040x.
Due to the outlier in the lower right corner (dated August 31), positive variations in oil price don't translate, on average, into positive increments in the price of gasoline. The most worrisome aspect is the fact that b_+ and b_− are positive, which suggests an underlying trend to increase prices when oil is stable. For gasoil we have the regressions
overall → y = f(x) = b + mx = −0.0672 + 0.3538x,
ΔBrent ≥ 0 → y = f_+(x) = b_+ + m_+x = −0.2457 + 0.2013x,
ΔBrent ≤ 0 → y = f_−(x) = b_− + m_−x = 0.2468 + 0.3956x.
Again, no "rocket and feather" effect here (in fact, m_+ is smaller than m_−). Variations around ΔBrent = 0 are fairly symmetrical and, seemingly, fair.
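For reference, the kind of split regression used above can be sketched with ordinary least squares. This is a generic illustration, not the code used to produce the figures: the names sample, line, fit and fit_if are made up, and the data must be supplied by the caller.

```cpp
#include <cassert>
#include <vector>

struct sample{double x,y;}; // x = Δ(Brent price), y = Δ(fuel price)
struct line{double b,m;};   // y = b + m x

// Ordinary least squares fit of a straight line to the samples
// (requires at least two samples with different x values)
inline line fit(const std::vector<sample>& s)
{
  assert(s.size()>=2);
  double n=static_cast<double>(s.size()),sx=0,sy=0,sxx=0,sxy=0;
  for(const auto& p:s){
    sx+=p.x;sy+=p.y;sxx+=p.x*p.x;sxy+=p.x*p.y;
  }
  double m=(n*sxy-sx*sy)/(n*sxx-sx*sx);
  return {(sy-m*sx)/n,m};
}

// Fits only the samples satisfying pred, e.g. those with x >= 0
template<typename Pred>
line fit_if(const std::vector<sample>& s,Pred pred)
{
  std::vector<sample> selected;
  for(const auto& p:s)if(pred(p))selected.push_back(p);
  return fit(selected);
}
```

With this, f_+ corresponds to fit_if(data,[](const sample& p){return p.x>=0;}) and f_− to the same call with p.x<=0; a "rocket and feather" effect would show up as m_+ being noticeably larger than m_−.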

How likely?

Yesterday, the CUP political party held a general assembly to determine whether or not to support Artur Mas's candidacy for President of the Catalonian regional government. The final voting round among 3,030 representatives ended up in an exact 1,515/1,515 tie, leaving the question unresolved for the time being. Such an unexpected result has prompted a flurry of Internet activity about the mathematical probability of its occurrence. The question "how likely was this result to happen?" is of course unanswerable without a specification of the context (i.e. the probability space) we choose to frame the event. A plausible formulation is: If a proportion p of CUP voters are pro-Mas, how likely is it that a random sample based on 3,030 individuals yields a 50/50 tie? The simple answer (assuming the number of CUP voters is much larger than 3,030) is P_p(1,515 | 3,030), where P_p(n | N) is the probability, under the binomial distribution, that N Bernoulli trials with probability p result in exactly n successes. The figure shows this value for 40% ≤ p ≤ 60%. At p = 50%, which without further information is our best estimate of pro-Mas supporters among CUP voters, the probability of a tie is 1.45%. A deviation in p of ±4% would have made this result virtually impossible. A slightly more interesting question is the following: If a proportion p of CUP voters are pro-Mas, how likely is a random sample of 3,030 individuals to misestimate the majority opinion? When p is in the vicinity of 50%, there is a non-negligible probability that the assembly vote comes up with the wrong (i.e. against voters' wishes) result. This probability is
I_p(1,516, 1,515) if p < 50%,
1 − P_p(1,515 | 3,030) if p = 50%,
I_{1−p}(1,516, 1,515) if p > 50%,
where I_p(a,b) is the regularized incomplete beta function. The figure shows the corresponding graph for 3,030 representatives and 40% ≤ p ≤ 60%. The function shows a discontinuity at the singular (and zero-probability) event p = 50%, in which case the assembly will yield the wrong result always except for the previously studied situation that there is an exact tie (so, the probability of misestimation is 1 − 1.45% = 98.55%). Other than this, the likelihood of misestimation approaches a value slightly above 49% as p tends to 50%. We have learnt that CUP voters are almost evenly divided between pro- and anti-Mas: if the difference between both positions is 0.7% or less, an assembly of 3,030 representatives such as held yesterday will fail to reflect the party's global position in more than 1 out of 5 cases.
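The tie probability is easy to check numerically. The sketch below evaluates the binomial probability through logarithms (std::lgamma) to avoid overflowing with numbers as large as 3,030!; the function names log_binomial and binomial_pmf are chosen for the example.

```cpp
#include <cassert>
#include <cmath>

// Natural logarithm of the binomial coefficient "n choose k"
inline double log_binomial(int n,int k)
{
  return std::lgamma(n+1.0)-std::lgamma(k+1.0)-std::lgamma(n-k+1.0);
}

// P_p(k | n): probability of exactly k successes in n Bernoulli trials
// with success probability p, 0 < p < 1
inline double binomial_pmf(int k,int n,double p)
{
  assert(0<=k&&k<=n&&0<p&&p<1);
  return std::exp(
    log_binomial(n,k)+k*std::log(p)+(n-k)*std::log1p(-p));
}
```

Evaluating binomial_pmf(1515,3030,0.5) gives the probability of a tie for evenly divided voters, and the same call with p = 0.54 or p = 0.46 shows how quickly it vanishes away from 50%.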

SOA container for encapsulated C++ DOD

In a previous entry we saw how to decouple the logic of a class from the access to its member data so that the latter can be laid out in a DOD-friendly fashion for faster sequential processing. Instead of having a std::vector of, say, particles, now we can store the different particle members (position, velocity, etc.) in separate containers. This unfortunately results in more cumbersome initialization code: whereas for the traditional, OOP approach particle creation and access is compact and nicely localized:
std::vector<plain_particle> pp_;
...
for(std::size_t i=0;i<n;++i){
  pp_.push_back(plain_particle(...));
}
...
render(pp_.begin(),pp_.end());
when using DOD, in contrast, the equivalent code grows linearly with the number of members, even if most of it is boilerplate:
std::vector<char> color_;
std::vector<int>  x_,y_,dx_,dy_;
...
for(std::size_t i=0;i<n;++i){
  color_.push_back(...);
  x_.push_back(...);
  y_.push_back(...);
  dx_.push_back(...);
  dy_.push_back(...);  
}
...
auto beg_=make_pointer<particle>(
  access(&color_[0],&x_[0],&y_[0],&dx_[0],&dy_[0]));
auto end_=beg_+n;
render(beg_,end_);
We would like to rely on a container using SOA (structure of arrays) for its storage that allows us to retain our original OOP syntax:
using access=dod::access<color,x,y,dx,dy>;
dod::vector<particle<access>> p_;
...
for(std::size_t i=0;i<n;++i){
  p_.emplace_back(...);
}
...
render(p_.begin(),p_.end());
Note that particles are inserted into the container using emplace_back rather than push_back: this is due to the fact that a particle object (which push_back accepts as its argument) cannot be created out of the blue without its constituent members being previously stored somewhere; emplace_back, on the other hand, does not suffer from this chicken-and-egg problem. The implementation of such a container class is fairly straightforward (limited here to the operations required to make the previous code work):
namespace dod{

template<typename Access>
class vector_base;

template<>
class vector_base<access<>>
{
protected:
  access<> data(){return {};}
  void emplace_back(){}
};

template<typename Member0,typename... Members>
class vector_base<access<Member0,Members...>>:
  protected vector_base<access<Members...>>
{
  using super=vector_base<access<Members...>>;
  using type=typename Member0::type;
  using impl=std::vector<type>;
  using size_type=typename impl::size_type;
  impl v;
  
protected:
  access<Member0,Members...> data()
  {
    return {v.data(),super::data()};
  }

  size_type size()const{return v.size();}

  template<typename Arg0,typename... Args>
  void emplace_back(Arg0&& arg0,Args&&... args){
    v.emplace_back(std::forward<Arg0>(arg0));
    try{
      super::emplace_back(std::forward<Args>(args)...);
    }
    catch(...){
      v.pop_back();
      throw;
    }
  }
};
  
template<typename T> class vector;
 
template<template <typename> class Class,typename Access> 
class vector<Class<Access>>:protected vector_base<Access>
{
  using super=vector_base<Access>;
  
public:
  using iterator=pointer<Class<Access>>;
  
  iterator begin(){return super::data();}
  iterator end(){return this->begin()+super::size();}
  using super::emplace_back;
};

} // namespace dod
dod::vector<Class<access<Members...>>> derives from an implementation class that holds a std::vector for each of the Members declared. Inserting elements is just a simple matter of multiplexing to the vectors, and begin and end return dod::pointers to this structure of arrays. From the point of view of the user all the necessary magic is hidden by the framework and DOD processing becomes nearly identical in syntax to OOP. We provide a test program that exercises dod::vector against the classical OOP approach based on a std::vector of plain (i.e., non-DOD) particles. Results are the same as previously discussed when we used DOD with manual initialization, that is, there is no abstraction penalty associated with using dod::vector, so we won't present any additional figures here. The framework we have constructed so far provides the bare minimum needed to test the ideas presented. In order to be fully usable there are various aspects that should be expanded upon:
  • access<Members...> just considers the case where each member is stored separately. Sometimes the most efficient layout will call for mixed scenarios where some of the members are grouped together. This can be modelled, for instance, by having member accept multiple pieces of data in its declaration.
  • dod::pointer does not properly implement const access, that is, pointer<const particle<...>> does not compile.
  • dod::vector should be implemented to provide the full interface of a proper vector class.
All of this can in principle be tackled without serious design difficulties.
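To see the multiplexing idea in isolation, here is a stripped-down sketch that can be compiled on its own, without the rest of the framework. It is a simplification rather than the actual dod::vector: columns are identified by plain types instead of member tags, there are no proxy pointers, and the name soa_columns is invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

template<typename... Ts> class soa_columns;

// End of the recursion: nothing left to store
template<>
class soa_columns<>
{
public:
  void emplace_back(){}
};

// One std::vector for the first type; the rest live in the base class
template<typename T0,typename... Ts>
class soa_columns<T0,Ts...>:public soa_columns<Ts...>
{
  using super=soa_columns<Ts...>;
  std::vector<T0> v;

public:
  std::size_t size()const{return v.size();}
  const std::vector<T0>& head()const{return v;} // first column
  const super& tail()const{return *this;}       // remaining columns

  template<typename Arg0,typename... Args>
  void emplace_back(Arg0&& arg0,Args&&... args)
  {
    assert(sizeof...(Args)==sizeof...(Ts));
    v.emplace_back(std::forward<Arg0>(arg0));
    try{
      super::emplace_back(std::forward<Args>(args)...);
    }
    catch(...){
      v.pop_back(); // keep all columns the same length on failure
      throw;
    }
  }
};
```

A soa_columns<char,int,int> object then accepts emplace_back('a',1,2), and each argument ends up at the back of its own vector, reachable through head() and tail().head() and so on.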

C++ encapsulation for Data-Oriented Design: performance

(Many thanks to Manu Sánchez for his help with running tests and analyzing results.) In a past entry, we implemented a little C++ framework that allows us to do DOD while retaining some of the encapsulation benefits and the general look and feel of traditional object-based programming. We complete here the framework by adding a critical piece from the point of view of usability, namely the ability to process sequences of DOD entities with as terse a syntax as we would have in OOP. To enable DOD for a particular class (like the particle we used in the previous entry), i.e., to distribute its different data members in separate memory locations, we change the class source code to turn it into a class template particle<Access> where Access is a framework-provided entity in charge of granting access to the external data members with a similar syntax as if they were an integral part of the class itself. Now, particle<Access> is no longer a regular class with value semantics, but a mere proxy to the external data without ownership of it. Importantly, it is the members and not the particle objects that are stored: particles are constructed on the fly when their interface is needed to process the data. So, code like
for(const auto& p:particle_)p.render();
cannot possibly work because the application does not have any particle_ container to begin with: instead, the information is stored in separate locations:
std::vector<char> color_;
std::vector<int>  x_,y_,dx_,dy_;
and "traversing" the particles requires that we go through the associated containers in parallel and invoke render on a temporary particle object constructed out of them:
auto itc=&color_[0],ec=itc+color_.size();
auto itx=&x_[0];
auto ity=&y_[0];
auto itdx=&dx_[0];
auto itdy=&dy_[0];
  
for(;itc!=ec;++itc,++itx,++ity,++itdx,++itdy){
  auto p=make_particle(
    access<color,x,y,dx,dy>(itc,itx,ity,itdx,itdy));
  p.render();
}
Fortunately, this boilerplate code can be hidden by the framework by using these auxiliary constructs:
template<typename T> class pointer;

template<template <typename> class Class,typename Access>
class pointer<Class<Access>>
{
  // behaves as Class<Access>*
};

template<template <typename> class Class,typename Access>
pointer<Class<Access>> make_pointer(const Access& a)
{
  return pointer<Class<Access>>(a);
}
We won't delve into the implementation details of pointer (the interested reader can see the actual code in the test program given below). From the point of view of the user, this utility class accepts an access entity, which is a collection of pointers to the data members plus an offset member (this offset has been added to the former version of the framework); it keeps everything in sync when doing pointer arithmetic and dereferences to a temporary particle object. The resulting user code is as simple as it gets:
auto n=color_.size();
auto beg_=make_pointer<particle>(access<color,x,y,dx,dy>(
  &color_[0],&x_[0],&y_[0],&dx_[0],&dy_[0]));
auto end_=beg_+n;
  
for(auto it=beg_;it!=end_;++it)it->render();
Index-based traversal is also possible:
for(std::size_t i=0;i<n;++i)beg_[i].render();
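For readers wondering what such a proxy pointer might look like, the following is a deliberately reduced sketch for a fixed two-member entity. It is not the framework's pointer class: point, point_access, point_pointer and total are all names invented for the illustration, and only the handful of operations used in the loops above are provided.

```cpp
#include <cassert>
#include <cstddef>

struct point_access{int* x;int* y;}; // pointers into two separate columns

class point // proxy built on the fly; owns no data
{
  point_access a;
public:
  explicit point(point_access a):a(a){}
  int sum()const{return *a.x+*a.y;}
  void shift(int d){*a.x+=d;*a.y+=d;}
};

class point_pointer // behaves (roughly) as a point*
{
  point_access   base;
  std::ptrdiff_t off;

  struct arrow_proxy // holds the temporary that operator-> points to
  {
    point p;
    point* operator->(){return &p;}
  };

public:
  explicit point_pointer(point_access base,std::ptrdiff_t off=0):
    base(base),off(off){}

  point operator*()const
  {
    return point(point_access{base.x+off,base.y+off});
  }
  arrow_proxy operator->()const{return arrow_proxy{**this};}
  point operator[](std::ptrdiff_t n)const{return *(*this+n);}

  point_pointer& operator++(){++off;return *this;}
  friend point_pointer operator+(point_pointer p,std::ptrdiff_t n)
  {
    p.off+=n;
    return p;
  }
  friend bool operator==(const point_pointer& l,const point_pointer& r)
  {
    return l.base.x==r.base.x&&l.off==r.off;
  }
  friend bool operator!=(const point_pointer& l,const point_pointer& r)
  {
    return !(l==r);
  }
};

// Usage sketch: traverse two columns as if they were an array of points
inline int total(int* xs,int* ys,std::ptrdiff_t n)
{
  assert(n>=0);
  point_pointer beg(point_access{xs,ys});
  point_pointer end=beg+n;
  int res=0;
  for(auto it=beg;it!=end;++it)res+=it->sum();
  return res;
}
```

The key point is that dereferencing never returns a reference to a stored object: it manufactures a short-lived proxy out of the column pointers and the current offset.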
Once the containers are populated and beg_ and end_ defined, user code can handle particles as if they were stored in [beg_, end_), thus effectively isolated from the fact that the actual data is scattered around different containers for maximum processing performance. Are we paying an abstraction penalty for the convenience this framework affords? There are two sources of concern:
  • Even though traversal code is in principle equivalent to hand-written DOD code, compilers might not be able to optimize all the template scaffolding away.
  • Traversing with access<color,x,y,dx,dy> for rendering when only color, x and y are needed (because render does not access dx or dy) involves iterating over dx_ and dy_ without actually accessing either one: again, the compiler might or might not optimize this extra code.
We provide a test program (Boost required) that measures the performance of this framework against some alternatives. The looped-over render procedure simply updates a global variable so that resulting execution times are basically those of the produced iteration code. The different options compared are:
  • oop: iteration over a traditional object-based structure
  • raw: hand-written data-processing loop
  • dod: DOD framework with access<color,x,y,dx,dy>
  • render_dod: DOD framework with access<color,x,y>
  • oop[i]: index-based access instead of iterator traversal
  • raw[i]: hand-written index-based loop
  • dod[i]: index-based with access<color,x,y,dx,dy>
  • render_dod[i]: index-based with access<color,x,y>
The difference between dod and render_dod (and the same applies to their index-based variants) is that the latter keeps access only to the data members strictly required by render: if the compiler were not able to optimize unnecessary pointer manipulations in dod, render_dod would be expected to be faster; the drawback is that this would require fine tuning the access entity for each member function. Manu Sánchez has set up an extensive testing environment to build and run the program using different compilers and machines. The figures show the release-mode execution times of the eight options described above when traversing sequences of n = 10⁴, 10⁵, 10⁶ and 10⁷ particles.

GCC 5.1, MinGW, Intel Core i7-4790k @4.5GHz
[Figure: execution times / number of elements]
As expected, OOP is the slowest due to cache effects. The rest of options are basically equivalent, which shows that GCC is able to entirely optimize away the syntactic niceties brought in by our DOD framework.

MSVC 14.0, Windows, Intel Core i7-4790k @4.5GHz
[Figure: execution times / number of elements]
Here, again, all DOD options are roughly equivalent, although raw (pointer-based hand-written loop) is slightly slower. Curiously enough, MSVC is much worse at optimizing DOD with respect to OOP than GCC is, with execution times up to 4 times higher for n = 10⁴ and 1.3 times higher for n = 10⁷, the latter scenario being presumably dominated by cache efficiencies.

GCC 5.2, Linux, AMD A6-1450 APU @1.0 GHz
[Figure: execution times / number of elements]
From a qualitative point of view, these results are in line with those obtained for GCC 5.1 under an Intel Core i7, although as the AMD A6 is a much less powerful processor execution times are higher (×8-10 for n = 10⁴, ×4-5.5 for n = 10⁷).

Clang 3.6, Linux, AMD A6-1450 APU @1.0 GHz
[Figure: execution times / number of elements]
As it happens with the rest of compilers, DOD options (both manual and framework-supported) perform equally well. However, the comparison with GCC 5.2 on the same machine shows important differences: iterator-based OOP is faster (×1.1-1.4) in Clang, index-based OOP yields the same results for both compilers, and the DOD options in Clang are consistently slower (×2.3-3.4) than in GCC, to the point that OOP outperforms them for low values of n. A detailed analysis of the assembly code produced would probably gain us more insight into these contrasting behaviors: interested readers can access the resulting assembly listings at the associated GitHub repository.

C++ encapsulation for Data-Oriented Design

Data-Oriented Design, or DOD for short, seeks to maximize efficiency by laying out data in such a way that their processing is as streamlined as possible. This often goes against the usual object-based principles that naturally lead to grouping the information according to the user-domain entities that it models. Consider for instance a game where large quantities of particles are rendered and moved around:
class particle
{  
  char  color;
  int   x;
  int   y;
  int   dx;
  int   dy;
public:

  static const int max_x=200;
  static const int max_y=100;
    
  particle(char color_,int x_,int y_,int dx_,int dy_):
    color(color_),x(x_),y(y_),dx(dx_),dy(dy_)
  {}

  void render()const
  {
    // for explanatory purposes only: dump to std::cout
    std::cout<<"["<<x<<","<<y<<","<<int(color)<<"]\n";
  }

  void move()
  {
    x+=dx;
    if(x<0){
      x*=-1;
      dx*=-1;
    }
    else if(x>max_x){
      x=2*max_x-x;
      dx*=-1;      
    }
    
    y+=dy;
    if(y<0){
      y*=-1;
      dy*=-1;
    }
    else if(y>max_y){
      y=2*max_y-y;
      dy*=-1;      
    }
  }
};
...
// game particles
std::vector<particle> particles;
In the rendering loop, the program might do:
for(const auto& p:particles)p.render();
Trivial as it seems, the execution speed of this approach is nevertheless suboptimal. The memory layout for particles looks like:
[Figure: memory layout of a sequence of particle objects]
which, when traversed in the rendering loop, results in 47% of the data cached by the CPU (the part corresponding to dx and dy, in white) not being used, or even more if padding occurs. A more intelligent layout based on 5 different vectors allows the needed data, and only this, to be cached in three parallel cache lines, thus maximizing occupancy and minimizing misses. For the moving loop, it is a different set of data vectors that must be provided.
DOD is increasingly popular, in particular in very demanding areas such as game programming. Mike Acton's presentation on DOD and C++ is an excellent introduction to the principles of data orientation. The problem with DOD is that encapsulation is lost: rather than being nicely packed in contiguous chunks of memory whose lifetime management is heavily supported by the language rules, "objects" now live as virtual entities with disemboweled, scattered pieces of information floating around in separate data structures. Methods acting on the data need to publish the exact information they require as part of their interface, and it is the responsibility of the user to locate it and provide it.
We want to explore ways to remedy this situation by allowing a modest level of object encapsulation compatible with DOD. Roughly speaking, in C++ an object serves two different purposes:
  • Providing a public interface (a set of member functions) acting on the associated data.
  • Keeping access to the data and managing its lifetime.
Both roles are mediated by the this pointer. In fact, executing a member function on an object
x.f(args...);
is conceptually equivalent to invoking a function with an implicit extra argument
X::f(this,args...);
where the data associated to x, assumed to be contiguous, is pointed to by this. We can break this intermediation by letting objects be supplied with an access entity replacing this for the purpose of reaching out to the information. We begin with a purely syntactic device:
template<typename T,int Tag=0>
struct member
{
  using type=T;
  static const int tag=Tag;
};
member<T,Tag> will be used to specify that a given class has a piece of information with type T. Tag is needed to tell apart different members of the same type (for instance, particle has four different members of type int, namely x, y, dx and dy). Now, the following class:
template<typename Member>
class access
{
  using type=typename Member::type;
  type* p;

public:
  access(type* p):p(p){}
  
  type&       get(Member){return *p;}
  const type& get(Member)const{return *p;}
};
stores a pointer to a piece of data and exposes it through get when called with the specified member label. This is easily extended to accommodate more than one member:
template<typename... Members>class access;

template<typename Member>
class access<Member>
{
  using type=typename Member::type;
  type* p;

public:
  access(type* p):p(p){}
  
  type&       get(Member){return *p;}
  const type& get(Member)const{return *p;}
};

template<typename Member0,typename... Members>
class access<Member0,Members...>:
  public access<Member0>,access<Members...>
{
public:
  template<typename Arg0,typename... Args>
  access(Arg0&& arg0,Args&&... args):
    access<Member0>(std::forward<Arg0>(arg0)),
    access<Members...>(std::forward<Args>(args)...)
  {}
  
  using access<Member0>::get;
  using access<Members...>::get;
};
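As a minimal sketch of the access entity used on its own, the fragment below binds two external variables and reads and writes them through get. The labels width and height and the function area_through_access are hypothetical names introduced only for this illustration; member and access are repeated from above so that the fragment compiles in isolation.

```cpp
#include <cassert>
#include <utility>

// member and access, repeated from the text.
template<typename T,int Tag=0>
struct member
{
  using type=T;
  static const int tag=Tag;
};

template<typename... Members>class access;

template<typename Member>
class access<Member>
{
  using type=typename Member::type;
  type* p;

public:
  access(type* p):p(p){}

  type&       get(Member){return *p;}
  const type& get(Member)const{return *p;}
};

template<typename Member0,typename... Members>
class access<Member0,Members...>:
  public access<Member0>,access<Members...>
{
public:
  template<typename Arg0,typename... Args>
  access(Arg0&& arg0,Args&&... args):
    access<Member0>(std::forward<Arg0>(arg0)),
    access<Members...>(std::forward<Args>(args)...)
  {}

  using access<Member0>::get;
  using access<Members...>::get;
};

// Hypothetical labels: two ints told apart by their tag.
using width=member<int,0>;
using height=member<int,1>;

int area_through_access()
{
  int w=3,h=4;
  access<width,height> a(&w,&h); // holds two pointers, copies no data
  a.get(width())=5;              // writes through to w
  assert(w==5);
  return a.get(width())*a.get(height());
}
```

The label passed to get is an empty object whose only role is to select the right overload, so the call is resolved entirely at compile time.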
To access, say, the data labeled as member<int,0> we need to write get(member<int,0>()). The price we have to pay for having data scattered around memory is that the access entity holds several pointers, one per member; on the other hand, the resulting objects, as we will see, really behave as on-the-fly proxies to their associated information, so access entities will seldom be stored.
particle can be rewritten so that data is accessed through a generic access object:
template<typename Access>
class particle:Access
{
  using Access::get;
  
  using color=member<char,0>;
  using x=member<int,0>;
  using y=member<int,1>;
  using dx=member<int,2>;
  using dy=member<int,3>;

public:

  static const int max_x=200;
  static const int max_y=100;

  particle(const Access& a):Access(a){}

  void render()const
  {
    std::cout<<"["<<get(x())<<","
      <<get(y())<<","<<int(get(color()))<<"]\n";
  }

  void move()
  {
    get(x())+=get(dx());
    if(get(x())<0){
      get(x())*=-1;
      get(dx())*=-1;
    }
    else if(get(x())>max_x){
      get(x())=2*max_x-get(x());
      get(dx())*=-1;      
    }
    
    get(y())+=get(dy());
    if(get(y())<0){
      get(y())*=-1;
      get(dy())*=-1;
    }
    else if(get(y())>max_y){
      get(y())=2*max_y-get(y());
      get(dy())*=-1;      
    }
  }
};

template<typename Access>
particle<Access> make_particle(Access&& a)
{
  return particle<Access>(std::forward<Access>(a));
}
The transformations that need be done on the source code are not many:
  • Turn the class into a class template dependent on an access entity from which it derives.
  • Rather than declaring internal data members, define the corresponding member labels.
  • Delete the former OOP constructors and define just one constructor taking an access object as its only parameter.
  • Replace mentions of data members with the corresponding access member function invocations (in the example, substitute get(color()) for color, get(x()) for x, etc.)
  • For convenience's sake, provide a make template function (in the example make_particle) to simplify object creation.
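The steps above can be sketched on a much smaller class than particle. Everything named counter, make_counter and advance_twice below is hypothetical and exists only for this illustration; member and access are repeated so that the fragment compiles on its own. The use of std::decay in the make function is an addition of this sketch (it lets the function accept lvalue access objects as well) and is not part of the approach as described in the text.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// member and access, repeated from the text.
template<typename T,int Tag=0>
struct member{using type=T;static const int tag=Tag;};

template<typename... Members>class access;

template<typename Member>
class access<Member>
{
  using type=typename Member::type;
  type* p;
public:
  access(type* p):p(p){}
  type&       get(Member){return *p;}
  const type& get(Member)const{return *p;}
};

template<typename Member0,typename... Members>
class access<Member0,Members...>:
  public access<Member0>,access<Members...>
{
public:
  template<typename Arg0,typename... Args>
  access(Arg0&& arg0,Args&&... args):
    access<Member0>(std::forward<Arg0>(arg0)),
    access<Members...>(std::forward<Args>(args)...)
  {}
  using access<Member0>::get;
  using access<Members...>::get;
};

// Step 1: a class template deriving from its access entity.
template<typename Access>
class counter:Access
{
  using Access::get;

  // Step 2: member labels instead of data members.
  using value=member<int,0>;
  using step=member<int,1>;

public:
  // Step 3: a single constructor taking the access object.
  counter(const Access& a):Access(a){}

  // Step 4: data is reached through get(label()).
  void advance(){get(value())+=get(step());}
  int  current()const{return get(value());}
};

// Step 5: a make function (std::decay is this sketch's own addition).
template<typename Access>
counter<typename std::decay<Access>::type> make_counter(Access&& a)
{
  return counter<typename std::decay<Access>::type>(std::forward<Access>(a));
}

int advance_twice()
{
  int value_=10,step_=3; // the data lives outside the object
  auto c=make_counter(access<member<int,0>,member<int,1>>(&value_,&step_));
  c.advance();
  c.advance();
  assert(c.current()==value_); // c is a proxy to value_, not a copy
  return value_;
}
```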
Observe how this works in practice:
using color=member<char,0>;
using x=member<int,0>;
using y=member<int,1>;
using dx=member<int,2>;
using dy=member<int,3>;

char color_=5;
int  x_=20,y_=40,dx_=2,dy_=-1;

auto p=make_particle(access<color,x,y>(&color_,&x_,&y_));
auto q=make_particle(access<x,y,dx,dy>(&x_,&y_,&dx_,&dy_));
p.render();
q.move();
p.render();
The particle data now lives externally as a bunch of separate variables (or, in a more real-life scenario, stored in containers). p and q act as proxies to the same information (i.e., they don't copy data internally) but other than this they provide the same interface as the OOP version of particle, and can be used similarly. Note that the two objects specify different sets of access members, as required by render and move, respectively. So, the following
q.render(); // error
would result in a compile time error as render accesses data that q does not provide. Of course we can do
auto p=make_particle(
         access<color,x,y,dx,dy>(&color_,&x_,&y_,&dx_,&dy_)),
     q=p;
so that the resulting objects can take advantage of the entire particle interface. In later entries we will see how this need not affect performance in traversal algorithms. A nice side effect of this technique is that, when extra data is added to a DOD class, existing code continues to work as long as the new data is only used in new member functions of the class.
Implementing DOD enablement as a template policy also allows us to experiment with alternative access semantics. For instance, the tuple_storage utility
template<typename Tuple,std::size_t Index,typename... Members>
class tuple_storage_base;

template<typename Tuple,std::size_t Index>
class tuple_storage_base<Tuple,Index>:public Tuple
{
  struct inaccessible{};
public:
  using Tuple::Tuple;
  
  void get(inaccessible);
  
  Tuple&       tuple(){return *this;}
  const Tuple& tuple()const{return *this;}
};

template<
  typename Tuple,std::size_t Index,
  typename Member0,typename... Members
>
class tuple_storage_base<Tuple,Index,Member0,Members...>:
  public tuple_storage_base<Tuple,Index+1,Members...>
{
  using super=tuple_storage_base<Tuple,Index+1,Members...>;
  using type=typename Member0::type;

public:
  using super::super;
  using super::get;
  
  type&       get(Member0)
                {return std::get<Index>(this->tuple());}
  const type& get(Member0)const
                {return std::get<Index>(this->tuple());}  
};

template<typename... Members>
class tuple_storage:
  public tuple_storage_base<
    std::tuple<typename Members::type...>,0,Members...
  >
{
  using super=tuple_storage_base<
    std::tuple<typename Members::type...>,0,Members...
  >;
  
public:
  using super::super;
};
can be used to replace the external access policy with an object containing the data proper:
using storage=tuple_storage<color,x,y,dx,dy>;
auto r=make_particle(storage(3,100,10,10,-15));
auto s=r;
r.render();
r.move();
r.render();
s.render(); // different data than r
which effectively brings us back the old OOP class with ownership semantics. (Also, it is easy to implement an access policy on top of tuple_storage that gives proxy semantics for tuple-based storage. This is left as an exercise for the reader.) A C++11 example program is provided that puts to use the ideas we have presented. Traversal is at the core of DOD, as the paradigm is oriented towards handling large numbers of like objects. In a later entry we will extend this framework to provide for easy object traversal and measure the resulting performance as compared with OOP.
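Pending that later entry, here is a rough sketch of what a hand-written traversal could look like using only the pieces introduced so far. It is not the framework the later entry refers to. The particle is cut down to color, x and dx, render writes to a stream rather than std::cout, and the names color_, x_, dx_ and move_and_render_all are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// member and access, repeated from the text.
template<typename T,int Tag=0>
struct member{using type=T;static const int tag=Tag;};

template<typename... Members>class access;

template<typename Member>
class access<Member>
{
  using type=typename Member::type;
  type* p;
public:
  access(type* p):p(p){}
  type&       get(Member){return *p;}
  const type& get(Member)const{return *p;}
};

template<typename Member0,typename... Members>
class access<Member0,Members...>:
  public access<Member0>,access<Members...>
{
public:
  template<typename Arg0,typename... Args>
  access(Arg0&& arg0,Args&&... args):
    access<Member0>(std::forward<Arg0>(arg0)),
    access<Members...>(std::forward<Args>(args)...)
  {}
  using access<Member0>::get;
  using access<Members...>::get;
};

using color=member<char,0>;
using x=member<int,0>;
using dx=member<int,2>;

// Cut-down particle (assumption: y and dy omitted for brevity).
template<typename Access>
class particle:Access
{
  using Access::get;
public:
  particle(const Access& a):Access(a){}
  void render(std::ostream& os)const
  {
    os<<"["<<get(x())<<","<<int(get(color()))<<"]";
  }
  void move(){get(x())+=get(dx());}
};

template<typename Access>
particle<Access> make_particle(Access&& a)
{
  return particle<Access>(std::forward<Access>(a));
}

std::string move_and_render_all()
{
  // One vector per member: the DOD layout.
  std::vector<char> color_={1,2,3};
  std::vector<int>  x_={10,20,30},dx_={1,-2,3};
  assert(color_.size()==x_.size()&&x_.size()==dx_.size());

  // Moving loop: touches only the x and dx vectors.
  for(std::size_t i=0;i<x_.size();++i){
    make_particle(access<x,dx>(&x_[i],&dx_[i])).move();
  }

  // Rendering loop: touches only the color and x vectors.
  std::ostringstream os;
  for(std::size_t i=0;i<x_.size();++i){
    make_particle(access<color,x>(&color_[i],&x_[i])).render(os);
  }
  return os.str();
}
```

Each loop builds a temporary proxy per element and hands it only the vectors that the invoked member function needs, which is the property the text argues for; whether the compiler optimizes the proxies away is something this sketch does not measure.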
Extensions