XNU use of Atomics and Memory Barriers
======================================

How to use atomics and memory barriers in xnu.

Goal
----

This document discusses the use of atomics and memory barriers in XNU. It is
meant as a guide to best practices, and warns against a variety of possible
pitfalls in the handling of atomics in C.

It is assumed that the reader has a decent understanding of
the [C11 memory model](https://en.cppreference.com/w/c/atomic/memory_order),
as this document builds on it and explains the liberties XNU takes with said
model.

All the interfaces discussed in this document are available through
the `<os/atomic_private.h>` header.

Note: Linux has thorough documentation around memory barriers
(Documentation/memory-barriers.txt); some of it is Linux specific,
but most is not and is a valuable read.


Vocabulary
----------

In the rest of this document we'll refer to the various memory orderings
defined by C11 as relaxed, consume, acquire, release, acq\_rel and seq\_cst.

`os_atomic` also tries to make the distinction between compiler **barriers**
(which limit how much the compiler can reorder code) and memory **fences**.


The dangers and pitfalls of C11's `<stdatomic.h>`
-------------------------------------------------

While the C11 memory model has likely been one of the most important additions
to modern C, in the purest C tradition, it is a sharp tool.

By default, C11 comes with two variants of each atomic "operation":

- an *explicit* variant where memory orderings can be specified,
- a regular variant which is equivalent to the former with the *seq_cst*
  memory ordering.

When an `_Atomic` qualified variable is accessed directly without using
any `atomic_*_explicit()` operation, the compiler will generate the
matching *seq_cst* atomic operations on your behalf.

The sequentially consistent world is extremely safe from a lot of compiler
and hardware reorderings and optimizations, which is great, but comes with
a huge cost in terms of memory barriers.


It seems very tempting to use `atomic_*_explicit()` functions with explicit
memory orderings; however, the compiler is entitled to perform a number of
optimizations with relaxed atomics that most developers will not expect.
Indeed, the compiler is perfectly allowed to perform various optimizations it
does with other plain memory accesses, such as coalescing, reordering,
hoisting out of loops, ...

For example, when the compiler can know what `doit` is doing (which due to LTO
is almost always the case for XNU), it is allowed to transform this code:

```c
    void
    perform_with_progress(int steps, long _Atomic *progress)
    {
        for (int i = 0; i < steps; i++) {
            doit(i);
            atomic_store_explicit(progress, i, memory_order_relaxed);
        }
    }
```

Into this, which obviously defeats the entire purpose of `progress`:

```c
    void
    perform_with_progress(int steps, long _Atomic *progress)
    {
        for (int i = 0; i < steps; i++) {
            doit(i);
        }
        atomic_store_explicit(progress, steps, memory_order_relaxed);
    }
```


How `os_atomic_*` tries to address `<stdatomic.h>` pitfalls
-----------------------------------------------------------

1. The memory locations passed to the various `os_atomic_*`
   functions do not need to be marked `_Atomic` or `volatile`
   (or `_Atomic volatile`), which allows for use of atomic
   operations in code written before C11 was even a thing.

   It is however recommended to use the `_Atomic`
   specifier in new code.

2. `os_atomic_*` cannot be coalesced by the compiler:
   all accesses are performed on the specified locations
   as if their type was `_Atomic volatile` qualified.

3. `os_atomic_*` only comes with the explicit variants:
   orderings must be provided, and can be either memory orders
   (named as in C11 without the `memory_order_` prefix) or a
   compiler barrier ordering: `compiler_acquire`, `compiler_release`,
   `compiler_acq_rel`.

4. `os_atomic_*` emits the proper compiler barriers that
   correspond to the requested memory ordering (using
   `atomic_signal_fence()`).


Best practices for the use of atomics in XNU
--------------------------------------------

For most generic code, the `os_atomic_*` functions from
`<os/atomic_private.h>` are the preferred interfaces.

`__sync_*`, `__c11_*` and `__atomic_*` compiler builtins should not be used.

`<stdatomic.h>` functions may be used if:

- compiler coalescing / reordering is desired (refcounting
  implementations may desire this for example).

Qualifying atomic variables with `_Atomic` or even
`_Atomic volatile` is encouraged; however, authors must
be aware that a direct access to such a variable will
result in quite heavy memory barriers.

The *consume* memory ordering should not be used
(see the *dependency* memory order later in this document).

**Note**: `<libkern/OSAtomic.h>` provides a number of legacy
atomic interfaces, but this header is considered obsolete
and these functions should not be used in new code.


High level overview of `os_atomic_*` interfaces
-----------------------------------------------

### Compiler barriers and memory fences

`os_compiler_barrier(mem_order?)` provides a compiler barrier,
with an optional barrier ordering. It is implemented with C11's
`atomic_signal_fence()`. The barrier ordering argument is optional
and defaults to the `acq_rel` compiler barrier (which prevents the
compiler from reordering code in any direction around this barrier).

`os_atomic_thread_fence(mem_order)` provides a memory barrier
according to the semantics of `atomic_thread_fence()`. It always
implies the equivalent `os_compiler_barrier()`, even on UP systems.

### Init, load and store

`os_atomic_init`, `os_atomic_load` and `os_atomic_store` provide
facilities equivalent to `atomic_init`, `atomic_load_explicit`
and `atomic_store_explicit` respectively.

Note that `os_atomic_load` and `os_atomic_store` promise that they will
compile to a plain load or store. `os_atomic_load_wide` and
`os_atomic_store_wide` can be used to get access to atomic loads and stores
that involve more costly codegen (such as compare exchange loops).

### Basic RMW (read/modify/write) atomic operations

The following basic atomic RMW operations exist:

- `inc`: atomic increment (equivalent to an atomic add of `1`),
- `dec`: atomic decrement (equivalent to an atomic sub of `1`),
- `add`: atomic add,
- `sub`: atomic sub,
- `or`: atomic bitwise or,
- `xor`: atomic bitwise xor,
- `and`: atomic bitwise and,
- `andnot`: atomic bitwise andnot (equivalent to an atomic and of `~value`),
- `min`: atomic min,
- `max`: atomic max.

For any such operation, two variants exist:

- `os_atomic_${op}_orig` (for example `os_atomic_add_orig`),
  which returns the value stored at the specified location
  *before* the atomic operation took place,
- `os_atomic_${op}` (for example `os_atomic_add`), which
  returns the value stored at the specified location
  *after* the atomic operation took place.

This convention is picked for two reasons:

1. `os_atomic_add(p, value, ...)` is essentially equivalent to the C
   in-place addition `(*p += value)`, which returns the result of the
   operation and not the original value of `*p`.

2. Most subtle atomic algorithms do actually require the original value
   stored at the location, especially for bit manipulations:
   `(os_atomic_or_orig(p, bit, relaxed) & bit)` will atomically perform
   `*p |= bit` but also tell you whether `bit` was set in the original value.

   Making it more explicit that the original value is used is hence
   important for readers and worth the extra five keystrokes.

Typically:

```c
    static int _Atomic i = 0;

    printf("%d\n", os_atomic_inc_orig(&i)); // prints 0
    printf("%d\n", os_atomic_inc(&i));      // prints 2
```

### Atomic swap / compare and swap

`os_atomic_xchg` is a simple wrapper around `atomic_exchange_explicit`.

There are two variants of `os_atomic_cmpxchg`, which are wrappers around
`atomic_compare_exchange_strong_explicit`. Both of these variants will
return false/0 if the compare exchange failed, and true/1 if the expected
value was found at the specified location and the new value was stored.

1. `os_atomic_cmpxchg(address, expected, new_value, mem_order)` will
   atomically store `new_value` at `address` if the current value
   is equal to `expected`.

2. `os_atomic_cmpxchgv(address, expected, new_value, orig_value, mem_order)`
   has an extra `orig_value` argument, which must be a pointer to a local
   variable and will be filled with the current value at `address` whether the
   compare exchange was successful or not. In case of success, the loaded
   value will always be `expected`; in case of failure it will be filled with
   the current value, which is helpful to redrive compare exchange loops.

Unlike `atomic_compare_exchange_strong_explicit`, a single ordering is
specified, which only takes effect in case of a successful compare exchange.
In C11 speak, `os_atomic_cmpxchg*` always specifies `memory_order_relaxed`
for the failure case ordering, as that is what is used most of the time.

There is no wrapper around `atomic_compare_exchange_weak_explicit`,
as `os_atomic_rmw_loop` offers a much better alternative for CAS loops.

### `os_atomic_rmw_loop`

This expressive and versatile construct allows for terse and much more
readable compare exchange loops. It also uses LL/SC constructs more
efficiently than a compare exchange loop would allow.

Instead of a typical CAS loop in C11:

```c
    int _Atomic *address;
    int old_value, new_value;
    bool success = false;

    old_value = atomic_load_explicit(address, memory_order_relaxed);
    do {
        if (!validate(old_value)) {
            break;
        }
        new_value = compute_new_value(old_value);
        success = atomic_compare_exchange_weak_explicit(address, &old_value,
            new_value, memory_order_acquire, memory_order_relaxed);
    } while (__improbable(!success));
```

`os_atomic_rmw_loop` allows this form:

```c
    int _Atomic *address;
    int old_value, new_value;
    bool success;

    success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
        if (!validate(old_value)) {
            os_atomic_rmw_loop_give_up(break);
        }
        new_value = compute_new_value(old_value);
    });
```

Unlike the C11 variant, it lets the reader know in program order that this
will be a CAS loop, and exposes the ordering upfront, while for traditional
CAS loops one has to jump to the end of the code to understand what it does.

Any control flow that attempts to exit the scope of the loop needs to be
wrapped with `os_atomic_rmw_loop_give_up` (so that LL/SC architectures can
abort their opened LL/SC transaction).

Because these loops are LL/SC transactions, it is undefined to perform
any store to memory (register operations are fine) within these loops,
as these may cause the store-conditional to always fail.
In particular, nesting of `os_atomic_rmw_loop` is invalid.

Use of `continue` within an `os_atomic_rmw_loop` is also invalid; instead,
an `os_atomic_rmw_loop_give_up(goto again)` jumping to an `again:` label
placed before the loop should be used, in this way:

```c
    int _Atomic *address;
    int old_value, new_value;
    bool success;

again:
    success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
        if (needs_some_store_that_can_thwart_the_transaction(old_value)) {
            os_atomic_rmw_loop_give_up({
                // Do whatever you need to do/store to central memory
                // that would cause the loop to always fail
                do_my_rmw_loop_breaking_store();

                // And only then redrive.
                goto again;
            });
        }
        if (!validate(old_value)) {
            os_atomic_rmw_loop_give_up(break);
        }
        new_value = compute_new_value(old_value);
    });
```

### The *dependency* memory order

Because the C11 *consume* memory order is broken in various ways, most
compilers, clang included, implement it as an equivalent of
`memory_order_acquire`. However, its concept is useful for certain
algorithms.

As an attempt to provide a replacement for this, `<os/atomic_private.h>`
implements an entirely new *dependency* memory ordering.

The purpose of this ordering is to provide a relaxed load followed by an
implicit compiler barrier that can be used as a root for a chain of hardware
dependencies, which would otherwise pair with store-releases done at this
address, very much like the *consume* memory order is intended to provide.

However, unlike the *consume* memory ordering, where the compiler had to
follow the dependencies, the *dependency* memory ordering relies on explicit
annotations of where dependencies are expected:

- loads through a pointer loaded with a *dependency* memory ordering
  will provide a hardware dependency,

- dependencies may be injected into other loads not performed through this
  particular pointer with the `os_atomic_load_with_dependency_on` and
  `os_atomic_inject_dependency` interfaces.

Here is an example of how it is meant to be used:

```c
    struct foo {
        long value;
        long _Atomic flag;
    };

    void
    publish(struct foo *p, long value)
    {
        p->value = value;
        os_atomic_store(&p->flag, 1, release);
    }

    bool
    broken_read(struct foo *p, long *value)
    {
        /*
         * This isn't safe, as there's absolutely no hardware dependency
         * involved. Using an acquire barrier would of course fix it,
         * but is quite expensive...
         */
        if (os_atomic_load(&p->flag, relaxed)) {
            *value = p->value;
            return true;
        }
        return false;
    }

    bool
    valid_read(struct foo *p, long *value)
    {
        long flag = os_atomic_load(&p->flag, dependency);
        if (flag) {
            /*
             * Further the chain of dependency to any loads through `p`,
             * which properly pair with the release barrier in `publish`.
             */
            *value = os_atomic_load_with_dependency_on(&p->value, flag);
            return true;
        }
        return false;
    }
```

There are four interfaces involved with hardware dependencies:

1. `os_atomic_load(..., dependency)` to initiate roots of hardware
   dependencies; it should pair with a store or RMW with release semantics
   or stronger (release, acq\_rel or seq\_cst),

2. `os_atomic_inject_dependency`, which can be used to inject the dependency
   provided by a *dependency* load, or by any other value that has had a
   dependency injected,

3. `os_atomic_load_with_dependency_on` to perform an otherwise relaxed load
   that still prolongs a dependency chain,

4. `os_atomic_make_dependency` to create an opaque token out of a given
   dependency root, to inject into multiple loads.

**Note**: this technique is NOT safe when the compiler can reason about the
pointers that you are manipulating: for example, if the compiler can know
that the pointer can only take a couple of values, it may ditch all these
manually crafted dependency chains. Hopefully there will be a future C2Y
standard that provides a similar construct as a language feature instead.