XNU use of Atomics and Memory Barriers
======================================

Goal
----

This document discusses the use of atomics and memory barriers in XNU. It is
meant as a guide to best practices, and warns against a variety of possible
pitfalls in the handling of atomics in C.

It is assumed that the reader has a decent understanding of
the [C11 memory model](https://en.cppreference.com/w/c/atomic/memory_order)
as this document builds on it, and explains the liberties XNU takes with said
model.

All the interfaces discussed in this document are available through
the `<os/atomic_private.h>` header.

Note: Linux has thorough documentation around memory barriers
(Documentation/memory-barriers.txt), some of which is Linux specific,
but most is not and is a valuable read.


Vocabulary
----------

In the rest of this document we'll refer to the various memory orderings defined
by C11 as relaxed, consume, acquire, release, acq\_rel and seq\_cst.

`os_atomic` also tries to make the distinction between compiler **barriers**
(which limit how much the compiler can reorder code) and memory **fences**.


The dangers and pitfalls of C11's `<stdatomic.h>`
-------------------------------------------------

While the C11 memory model has likely been one of the most important additions
to modern C, in the purest C tradition, it is a sharp tool.

By default, C11 comes with two variants of each atomic "operation":

- an *explicit* variant where memory orderings can be specified,
- a regular variant which is equivalent to the former with the *seq_cst*
  memory ordering.

When an `_Atomic` qualified variable is accessed directly without using
any `atomic_*_explicit()` operation, then the compiler will generate the
matching *seq_cst* atomic operations on your behalf.

The sequentially consistent world is extremely safe from a lot of compiler
and hardware reorderings and optimizations, which is great, but comes with
a huge cost in terms of memory barriers.


It seems very tempting to use `atomic_*_explicit()` functions with explicit
memory orderings; however, the compiler is entitled to perform a number of
optimizations with relaxed atomics that most developers will not expect.
Indeed, the compiler is perfectly allowed to perform the same optimizations it
does with other plain memory accesses, such as coalescing, reordering, hoisting
out of loops, ...

For example, when the compiler can know what `doit` is doing (which due to LTO
is almost always the case for XNU), it is allowed to transform this code:

```c
	void
	perform_with_progress(int steps, long _Atomic *progress)
	{
		for (int i = 0; i < steps; i++) {
			doit(i);
			atomic_store_explicit(progress, i, memory_order_relaxed);
		}
	}
```

Into this, which obviously defeats the entire purpose of `progress`:

```c
	void
	perform_with_progress(int steps, long _Atomic *progress)
	{
		for (int i = 0; i < steps; i++) {
			doit(i);
		}
		atomic_store_explicit(progress, steps, memory_order_relaxed);
	}
```


How `os_atomic_*` tries to address `<stdatomic.h>` pitfalls
-----------------------------------------------------------

1. the memory locations passed to the various `os_atomic_*`
   functions do not need to be marked `_Atomic` or `volatile`
   (or `_Atomic volatile`), which allows for the use of atomic
   operations in code written before C11 was even a thing.

   It is however recommended that new code use the `_Atomic`
   specifier.

2. `os_atomic_*` cannot be coalesced by the compiler:
   all accesses are performed on the specified locations
   as if their type was `_Atomic volatile` qualified
   (see the sketch after this list).

3. `os_atomic_*` only comes with the explicit variants:
   orderings must be provided, and can be either a memory order
   (named as in C11, without the `memory_order_` prefix) or a
   compiler barrier ordering: `compiler_acquire`, `compiler_release`,
   `compiler_acq_rel`.

4. `os_atomic_*` emits the proper compiler barriers that
   correspond to the requested memory ordering (using
   `atomic_signal_fence()`).
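
As an illustration of points 1 and 2, here is a minimal sketch of what the
earlier `perform_with_progress()` example could look like with
`os_atomic_store` (assuming the same hypothetical `doit()` helper as above):
because the access is performed as if the location were `_Atomic volatile`
qualified, the per-iteration stores may not be coalesced or hoisted out of the
loop, and the location itself no longer needs to be marked `_Atomic`:

```c
	void
	perform_with_progress(int steps, long *progress)
	{
		for (int i = 0; i < steps; i++) {
			doit(i);
			/* performed as if `progress` were _Atomic volatile: not coalesced */
			os_atomic_store(progress, i, relaxed);
		}
	}
```
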

Best practices for the use of atomics in XNU
--------------------------------------------

For most generic code, the `os_atomic_*` functions from
`<os/atomic_private.h>` are the preferred interfaces.

`__sync_*`, `__c11_*` and `__atomic_*` compiler builtins should not be used.

`<stdatomic.h>` functions may be used if:

- compiler coalescing / reordering is desired (refcounting
  implementations may desire this for example).

Qualifying atomic variables with `_Atomic` or even
`_Atomic volatile` is encouraged; however, authors must
be aware that a direct access to such a variable will
result in quite heavy memory barriers.

The *consume* memory ordering should not be used
(see the *dependency* memory order later in this document).

**Note**: `<libkern/OSAtomic.h>` provides a bunch of legacy
atomic interfaces, but this header is considered obsolete
and these functions should not be used in new code.


High level overview of `os_atomic_*` interfaces
-----------------------------------------------

### Compiler barriers and memory fences

`os_compiler_barrier(mem_order?)` provides a compiler barrier,
with an optional barrier ordering. It is implemented with C11's
`atomic_signal_fence()`. The barrier ordering argument is optional
and defaults to the `acq_rel` compiler barrier (which prevents the
compiler from reordering code in either direction across this barrier).

`os_atomic_thread_fence(mem_order)` provides a memory barrier
according to the semantics of `atomic_thread_fence()`. It always
implies the equivalent `os_compiler_barrier()`, even on UP systems.

### Init, load and store

`os_atomic_init`, `os_atomic_load` and `os_atomic_store` provide
facilities equivalent to `atomic_init`, `atomic_load_explicit`
and `atomic_store_explicit` respectively.

Note that `os_atomic_load` and `os_atomic_store` promise that they will
compile to a plain load or store. `os_atomic_load_wide` and
`os_atomic_store_wide` can be used to get access to atomic loads and stores
that may involve more costly codegen (such as compare exchange loops).
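
As an illustration, here is a minimal sketch (the `stats` structure and the
helpers are hypothetical) of where the `_wide` variants are meant to be used:
a 64-bit quantity that must be loaded and stored atomically even on platforms
where a plain 64-bit access may not be a single instruction:

```c
	struct stats {
		uint64_t bytes; /* updated and read concurrently */
	};

	static void
	stats_set_bytes(struct stats *s, uint64_t value)
	{
		/* may compile to a compare exchange loop on 32-bit platforms */
		os_atomic_store_wide(&s->bytes, value, relaxed);
	}

	static uint64_t
	stats_get_bytes(struct stats *s)
	{
		return os_atomic_load_wide(&s->bytes, relaxed);
	}
```
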
### Basic RMW (read/modify/write) atomic operations

The following basic atomic RMW operations exist:

- `inc`: atomic increment (equivalent to an atomic add of `1`),
- `dec`: atomic decrement (equivalent to an atomic sub of `1`),
- `add`: atomic add,
- `sub`: atomic sub,
- `or`: atomic bitwise or,
- `xor`: atomic bitwise xor,
- `and`: atomic bitwise and,
- `andnot`: atomic bitwise andnot (equivalent to an atomic and of `~value`),
- `min`: atomic min,
- `max`: atomic max.

For any such operation, two variants exist:

- `os_atomic_${op}_orig` (for example `os_atomic_add_orig`)
  which returns the value stored at the specified location
  *before* the atomic operation took place,

- `os_atomic_${op}` (for example `os_atomic_add`) which
  returns the value stored at the specified location
  *after* the atomic operation took place.

This convention is picked for two reasons:

1. `os_atomic_add(p, value, ...)` is essentially equivalent to the C
   in-place addition `(*p += value)`, which returns the result of the
   operation and not the original value of `*p`.

2. Most subtle atomic algorithms do actually require the original value
   stored at the location, especially for bit manipulations:
   `(os_atomic_or_orig(p, bit, relaxed) & bit)` will atomically perform
   `*p |= bit` but also tell you whether `bit` was set in the original value.

   Making it more explicit that the original value is used is hence
   important for readers and worth the extra five keystrokes.

Typically:

```c
	static int _Atomic i = 0;

	printf("%d\n", os_atomic_inc_orig(&i, relaxed)); // prints 0
	printf("%d\n", os_atomic_inc(&i, relaxed)); // prints 2
```

### Atomic swap / compare and swap

`os_atomic_xchg` is a simple wrapper around `atomic_exchange_explicit`.

There are two variants of `os_atomic_cmpxchg` which are wrappers around
`atomic_compare_exchange_strong_explicit`. Both of these variants will
return false/0 if the compare exchange failed, and true/1 if the expected
value was found at the specified location and the new value was stored.

1. `os_atomic_cmpxchg(address, expected, new_value, mem_order)` which
   will atomically store `new_value` at `address` if the current value
   is equal to `expected`.

2. `os_atomic_cmpxchgv(address, expected, new_value, orig_value, mem_order)`
   which has an extra `orig_value` argument which must be a pointer to a local
   variable and will be filled with the current value at `address` whether the
   compare exchange was successful or not. In case of success, the loaded value
   will always be `expected`; in case of failure, it will be filled with
   the current value, which is helpful to redrive compare exchange loops.

Unlike `atomic_compare_exchange_strong_explicit`, a single ordering is
specified, which only takes effect in case of a successful compare exchange.
In C11 speak, `os_atomic_cmpxchg*` always specifies `memory_order_relaxed`
for the failure case ordering, as it is what is used most of the time.
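
As an illustration, here is a minimal sketch (the `state` variable, its values
and the `try_claim*` helpers are hypothetical) of a one-shot transition using
both variants:

```c
	enum { IDLE, BUSY };
	static int _Atomic state;

	/* Try to claim the resource: true on an IDLE -> BUSY transition. */
	bool
	try_claim(void)
	{
		return os_atomic_cmpxchg(&state, IDLE, BUSY, acquire);
	}

	/* Same, but also report the value that was observed when the claim fails. */
	bool
	try_claim_v(int *observed)
	{
		int cur;

		if (os_atomic_cmpxchgv(&state, IDLE, BUSY, &cur, acquire)) {
			return true;
		}
		*observed = cur;
		return false;
	}
```
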
There is no wrapper around `atomic_compare_exchange_weak_explicit`,
as `os_atomic_rmw_loop` offers a much better alternative for CAS-loops.

### `os_atomic_rmw_loop`

This expressive and versatile construct allows for really terse and
way more readable compare exchange loops. It also uses LL/SC constructs more
efficiently than a compare exchange loop would allow.

Instead of a typical CAS-loop in C11:

```c
	int _Atomic *address;
	int old_value, new_value;
	bool success = false;

	old_value = atomic_load_explicit(address, memory_order_relaxed);
	do {
		if (!validate(old_value)) {
			break;
		}
		new_value = compute_new_value(old_value);
		success = atomic_compare_exchange_weak_explicit(address, &old_value,
		    new_value, memory_order_acquire, memory_order_relaxed);
	} while (__improbable(!success));
```

`os_atomic_rmw_loop` allows this form:

```c
	int _Atomic *address;
	int old_value, new_value;
	bool success;

	success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
		if (!validate(old_value)) {
			os_atomic_rmw_loop_give_up(break);
		}
		new_value = compute_new_value(old_value);
	});
```

Unlike the C11 variant, it lets the reader know in program order that this will
be a CAS loop, and exposes the ordering upfront, whereas for traditional CAS
loops one has to jump to the end of the code to understand what it does.

Any control flow that attempts to exit the scope of the loop needs to be
wrapped with `os_atomic_rmw_loop_give_up` (so that LL/SC architectures can
abort their opened LL/SC transaction).

Because these loops are LL/SC transactions, it is undefined to perform
any store to memory (register operations are fine) within these loops,
as such stores may cause the store-conditional to always fail.
In particular, nesting of `os_atomic_rmw_loop` is invalid.

Use of `continue` within an `os_atomic_rmw_loop` is also invalid; instead,
use `os_atomic_rmw_loop_give_up(goto again)` jumping to an `again:` label
placed before the loop, in this way:

```c
	int _Atomic *address;
	int old_value, new_value;
	bool success;

again:
	success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
		if (needs_some_store_that_can_thwart_the_transaction(old_value)) {
			os_atomic_rmw_loop_give_up({
				// Do whatever you need to do/store to central memory
				// that would cause the loop to always fail
				do_my_rmw_loop_breaking_store();

				// And only then redrive.
				goto again;
			});
		}
		if (!validate(old_value)) {
			os_atomic_rmw_loop_give_up(break);
		}
		new_value = compute_new_value(old_value);
	});
```

### The *dependency* memory order

Because the C11 *consume* memory order is broken in various ways,
most compilers, clang included, implement it as an equivalent
of `memory_order_acquire`. However, its concept is useful
for certain algorithms.

As an attempt to provide a replacement for this, `<os/atomic_private.h>`
implements an entirely new *dependency* memory ordering.

The purpose of this ordering is to provide a relaxed load followed by an
implicit compiler barrier, which can be used as the root of a chain of hardware
dependencies that pairs with store-releases done at this address,
very much as the *consume* memory order is intended to do.
However, unlike the *consume* memory ordering where the compiler has to follow
the dependencies, the *dependency* memory ordering relies on explicit
annotations of where the dependencies are expected:

- loads through a pointer loaded with a *dependency* memory ordering
  will provide a hardware dependency,

- dependencies may be injected into other loads not performed through this
  particular pointer with the `os_atomic_load_with_dependency_on` and
  `os_atomic_inject_dependency` interfaces.

Here is an example of how it is meant to be used:

```c
	struct foo {
		long value;
		long _Atomic flag;
	};

	void
	publish(struct foo *p, long value)
	{
		p->value = value;
		os_atomic_store(&p->flag, 1, release);
	}


	bool
	broken_read(struct foo *p, long *value)
	{
		/*
		 * This isn't safe, as there's absolutely no hardware dependency involved.
		 * Using an acquire barrier would of course fix it but is quite expensive...
		 */
		if (os_atomic_load(&p->flag, relaxed)) {
			*value = p->value;
			return true;
		}
		return false;
	}

	bool
	valid_read(struct foo *p, long *value)
	{
		long flag = os_atomic_load(&p->flag, dependency);
		if (flag) {
			/*
			 * Further the chain of dependency to any loads through `p`
			 * which properly pair with the release barrier in `publish`.
			 */
			*value = os_atomic_load_with_dependency_on(&p->value, flag);
			return true;
		}
		return false;
	}
```

There are four interfaces involved with hardware dependencies:

1. `os_atomic_load(..., dependency)` to initiate roots of hardware dependencies,
   which should pair with a store or RMW operation with release semantics or
   stronger (release, acq\_rel or seq\_cst),

2. `os_atomic_inject_dependency`, which can be used to inject the dependency
   provided by a *dependency* load, or by any other value that has had a
   dependency injected,

3. `os_atomic_load_with_dependency_on` to perform an otherwise relaxed load
   that still prolongs a dependency chain,

4. `os_atomic_make_dependency` to create an opaque token out of a given
   dependency root, to inject into multiple loads.


**Note**: this technique is NOT safe when the compiler can reason about the
pointers that you are manipulating: for example, if the compiler can know that
a pointer can only take a couple of values, it may ditch all these manually
crafted dependency chains. Hopefully there will be a future C2Y standard that
provides a similar construct as a language feature instead.