xref: /xnu-11417.140.69/doc/primitives/atomics.md (revision 43a90889846e00bfb5cf1d255cdc0a701a1e05a4)
XNU use of Atomics and Memory Barriers
======================================

How to use atomics and memory barriers in xnu.

Goal
----

This document discusses the use of atomics and memory barriers in XNU. It is
meant as a guide to best practices, and warns against a variety of possible
pitfalls in the handling of atomics in C.

It is assumed that the reader has a decent understanding of
the [C11 memory model](https://en.cppreference.com/w/c/atomic/memory_order),
as this document builds on it and explains the liberties XNU takes with said
model.

All the interfaces discussed in this document are available through
the `<os/atomic_private.h>` header.

Note: Linux has thorough documentation around memory barriers
(Documentation/memory-barriers.txt); some of it is Linux specific,
but most of it is not and it is a valuable read.


Vocabulary
----------

In the rest of this document we'll refer to the various memory orderings
defined by C11 as relaxed, consume, acquire, release, acq\_rel and seq\_cst.

`os_atomic` also tries to make the distinction between compiler **barriers**
(which limit how much the compiler can reorder code), and memory **fences**.


The dangers and pitfalls of C11's `<stdatomic.h>`
-------------------------------------------------

While the C11 memory model has likely been one of the most important additions
to modern C, in the purest C tradition, it is a sharp tool.

By default, C11 comes with two variants of each atomic "operation":

- an *explicit* variant where memory orderings can be specified,
- a regular variant which is equivalent to the former with the *seq_cst*
  memory ordering.

When an `_Atomic` qualified variable is accessed directly without using
any `atomic_*_explicit()` operation, the compiler will generate the
matching *seq_cst* atomic operations on your behalf.
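
For illustration, here is the same increment written both ways in plain C11
(a minimal standalone sketch, not xnu code):

```c
#include <stdatomic.h>

void
bump_both(_Atomic int *counter)
{
    /* Direct access: the compiler emits a seq_cst atomic increment. */
    (*counter)++;

    /* Explicit variant: the programmer picks the (weaker) ordering. */
    atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}
```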
The sequentially consistent world is extremely safe from a lot of compiler
and hardware reorderings and optimizations, which is great, but comes with
a huge cost in terms of memory barriers.


It seems very tempting to use `atomic_*_explicit()` functions with explicit
memory orderings; however, the compiler is entitled to perform a number of
optimizations with relaxed atomics that most developers will not expect.
Indeed, the compiler is perfectly allowed to perform the various optimizations
it does with other plain memory accesses, such as coalescing, reordering, or
hoisting out of loops.

For example, when the compiler can know what `doit` is doing (which, due to
LTO, is almost always the case for XNU), it is allowed to transform this code:

```c
    void
    perform_with_progress(int steps, long _Atomic *progress)
    {
        for (int i = 0; i < steps; i++) {
            doit(i);
            atomic_store_explicit(progress, i, memory_order_relaxed);
        }
    }
```

Into this, which obviously defeats the entire purpose of `progress`:

```c
    void
    perform_with_progress(int steps, long _Atomic *progress)
    {
        for (int i = 0; i < steps; i++) {
            doit(i);
        }
        atomic_store_explicit(progress, steps, memory_order_relaxed);
    }
```


How `os_atomic_*` tries to address `<stdatomic.h>` pitfalls
-----------------------------------------------------------

1. the memory locations passed to the various `os_atomic_*`
   functions do not need to be marked `_Atomic` or `volatile`
   (or `_Atomic volatile`), which allows for the use of atomic
   operations in code written before C11 was even a thing.

   It is however recommended to use the `_Atomic` specifier
   in new code.

2. `os_atomic_*` accesses cannot be coalesced by the compiler:
   all accesses are performed on the specified locations
   as if their type was `_Atomic volatile` qualified.

3. `os_atomic_*` only comes with the explicit variants:
   orderings must always be provided, and can be either a memory order
   named as in C11 without the `memory_order_` prefix,
   or a compiler barrier ordering: `compiler_acquire`, `compiler_release`,
   `compiler_acq_rel`.

4. `os_atomic_*` emits the proper compiler barriers that
   correspond to the requested memory ordering (using
   `atomic_signal_fence()`).
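
Point 2 is what rescues the `perform_with_progress` example above: a
volatile-qualified atomic access cannot be merged away. A minimal sketch in
plain C11 (the `doit` stub is a stand-in for real work, added so the example
is self-contained):

```c
#include <stdatomic.h>

static int doit_calls;

/* Stand-in for real work, so the example is self-contained. */
static void
doit(int i)
{
    (void)i;
    doit_calls++;
}

/* The volatile qualifier forces the compiler to emit every store,
 * mimicking how os_atomic_* treats its target locations. */
void
perform_with_progress(int steps, long volatile _Atomic *progress)
{
    for (int i = 0; i < steps; i++) {
        doit(i);
        atomic_store_explicit(progress, i, memory_order_relaxed);
    }
}
```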


Best practices for the use of atomics in XNU
--------------------------------------------

For most generic code, the `os_atomic_*` functions from
`<os/atomic_private.h>` are the preferred interfaces.

`__sync_*`, `__c11_*` and `__atomic_*` compiler builtins should not be used.

`<stdatomic.h>` functions may be used if:

- compiler coalescing / reordering is desired (refcounting
  implementations may desire this, for example).


Qualifying atomic variables with `_Atomic` or even
`_Atomic volatile` is encouraged; however, authors must
be aware that a direct access to such a variable will
result in quite heavy memory barriers.

The *consume* memory ordering should not be used
(see the *dependency* memory ordering later in this document).

**Note**: `<libkern/OSAtomic.h>` provides a number of legacy
atomic interfaces, but this header is considered obsolete
and these functions should not be used in new code.

High level overview of `os_atomic_*` interfaces
-----------------------------------------------

### Compiler barriers and memory fences

`os_compiler_barrier(mem_order?)` provides a compiler barrier. It is
implemented with C11's `atomic_signal_fence()`. The barrier ordering
argument is optional and defaults to the `acq_rel` compiler barrier
(which prevents the compiler from reordering code in either direction
around this barrier).

`os_atomic_thread_fence(mem_order)` provides a memory barrier
according to the semantics of `atomic_thread_fence()`. It always
implies the equivalent `os_compiler_barrier()`, even on UP systems.
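
In plain C11 terms (a standalone sketch, not xnu code), the two primitives
correspond to:

```c
#include <stdatomic.h>

/* atomic_signal_fence() emits no instructions: it only stops the
 * compiler from moving memory accesses across it (what
 * os_compiler_barrier() provides). atomic_thread_fence() additionally
 * emits a hardware fence on architectures that need one (what
 * os_atomic_thread_fence() provides). */
void
fence_examples(int *a, int *b)
{
    *a = 1;
    atomic_signal_fence(memory_order_acq_rel);   /* compiler barrier */
    *b = 2;
    atomic_thread_fence(memory_order_release);   /* memory fence */
}
```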

### Init, load and store

`os_atomic_init`, `os_atomic_load` and `os_atomic_store` provide
facilities equivalent to `atomic_init`, `atomic_load_explicit`
and `atomic_store_explicit` respectively.

Note that `os_atomic_load` and `os_atomic_store` promise that they will
compile to a plain load or store. `os_atomic_load_wide` and
`os_atomic_store_wide` can be used to get access to atomic loads and stores
that involve more costly codegen (such as compare exchange loops).
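
For word-sized types, the C11 equivalents are a single explicit load and
store (standalone sketch, not xnu code):

```c
#include <stdatomic.h>

/* C11 analogues of os_atomic_store(slot, value, relaxed) and
 * os_atomic_load(slot, relaxed): on word-sized types these compile
 * to one plain store and one plain load instruction. */
long
publish_and_read(long _Atomic *slot, long value)
{
    atomic_store_explicit(slot, value, memory_order_relaxed);
    return atomic_load_explicit(slot, memory_order_relaxed);
}
```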

### Basic RMW (read/modify/write) atomic operations

The following basic atomic RMW operations exist:

- `inc`: atomic increment (equivalent to an atomic add of `1`),
- `dec`: atomic decrement (equivalent to an atomic sub of `1`),
- `add`: atomic add,
- `sub`: atomic sub,
- `or`: atomic bitwise or,
- `xor`: atomic bitwise xor,
- `and`: atomic bitwise and,
- `andnot`: atomic bitwise andnot (equivalent to an atomic and of `~value`),
- `min`: atomic min,
- `max`: atomic max.

For any such operation, two variants exist:

- `os_atomic_${op}_orig` (for example `os_atomic_add_orig`)
  which returns the value stored at the specified location
  *before* the atomic operation took place,
- `os_atomic_${op}` (for example `os_atomic_add`) which
  returns the value stored at the specified location
  *after* the atomic operation took place.

This convention is picked for two reasons:

1. `os_atomic_add(p, value, ...)` is essentially equivalent to the C
   in-place addition `(*p += value)`, which returns the result of the
   operation and not the original value of `*p`.

2. Most subtle atomic algorithms do actually require the original value
   stored at the location, especially for bit manipulations:
   `(os_atomic_or_orig(p, bit, relaxed) & bit)` will atomically perform
   `*p |= bit` but also tell you whether `bit` was set in the original value.

   Making it more explicit that the original value is used is hence
   important for readers and worth the extra five keystrokes.

Typically:

```c
    static int _Atomic i = 0;

    printf("%d\n", os_atomic_inc_orig(&i, relaxed)); // prints 0
    printf("%d\n", os_atomic_inc(&i, relaxed)); // prints 2
```
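
The bit-test idiom from point 2 above maps directly onto C11, where
`atomic_fetch_or_explicit` likewise returns the original value (standalone
sketch, not xnu code):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Atomically set `bit` in *p and report whether it was already set:
 * the C11 analogue of (os_atomic_or_orig(p, bit, relaxed) & bit). */
static bool
test_and_set_bit(unsigned _Atomic *p, unsigned bit)
{
    return (atomic_fetch_or_explicit(p, bit, memory_order_relaxed) & bit) != 0;
}
```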

### Atomic swap / compare and swap

`os_atomic_xchg` is a simple wrapper around `atomic_exchange_explicit`.

There are two variants of `os_atomic_cmpxchg`, which are wrappers around
`atomic_compare_exchange_strong_explicit`. Both of these variants will
return false/0 if the compare exchange failed, and true/1 if the expected
value was found at the specified location and the new value was stored.

1. `os_atomic_cmpxchg(address, expected, new_value, mem_order)`, which
   will atomically store `new_value` at `address` if the current value
   is equal to `expected`.

2. `os_atomic_cmpxchgv(address, expected, new_value, orig_value, mem_order)`,
   which has an extra `orig_value` argument which must be a pointer to a local
   variable and will be filled with the current value at `address` whether the
   compare exchange was successful or not. In case of success, the loaded value
   will always be `expected`; in case of failure, it will be the current value,
   which is helpful to redrive compare exchange loops.

Unlike `atomic_compare_exchange_strong_explicit`, a single ordering is
specified, which only takes effect in case of a successful compare exchange.
In C11 speak, `os_atomic_cmpxchg*` always specifies `memory_order_relaxed`
for the failure case ordering, as that is what is used most of the time.
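
The `cmpxchgv` pattern can be sketched with the C11 primitive it wraps
(standalone sketch, not the xnu implementation):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of the os_atomic_cmpxchgv shape using C11 equivalents:
 * on failure, *orig is refreshed with the current value at `address`,
 * so the caller can redrive a loop without an extra load. */
static bool
cmpxchgv(int _Atomic *address, int expected, int new_value, int *orig)
{
    *orig = expected;
    return atomic_compare_exchange_strong_explicit(address, orig, new_value,
        memory_order_relaxed, memory_order_relaxed);
}
```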

There is no wrapper around `atomic_compare_exchange_weak_explicit`,
as `os_atomic_rmw_loop` offers a much better alternative for CAS-loops.

### `os_atomic_rmw_loop`

This expressive and versatile construct allows for really terse and
much more readable compare exchange loops. It also uses LL/SC constructs more
efficiently than a compare exchange loop would allow.

Instead of a typical CAS-loop in C11:

```c
    int _Atomic *address;
    int old_value, new_value;
    bool success = false;

    old_value = atomic_load_explicit(address, memory_order_relaxed);
    do {
        if (!validate(old_value)) {
            break;
        }
        new_value = compute_new_value(old_value);
        success = atomic_compare_exchange_weak_explicit(address, &old_value,
                new_value, memory_order_acquire, memory_order_relaxed);
    } while (__improbable(!success));
```

`os_atomic_rmw_loop` allows this form:

```c
    int _Atomic *address;
    int old_value, new_value;
    bool success;

    success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
        if (!validate(old_value)) {
            os_atomic_rmw_loop_give_up(break);
        }
        new_value = compute_new_value(old_value);
    });
```

Unlike the C11 variant, it lets the reader know in program order that this will
be a CAS loop, and exposes the ordering upfront, while for traditional CAS loops
one has to jump to the end of the code to understand what it does.

Any control flow that attempts to exit the scope of the loop needs to be
wrapped with `os_atomic_rmw_loop_give_up` (so that LL/SC architectures can
abort their open LL/SC transaction).

Because these loops are LL/SC transactions, it is undefined to perform
any store to memory (register operations are fine) within these loops,
as these may cause the store-conditional to always fail.
In particular, nesting of `os_atomic_rmw_loop` is invalid.

Use of `continue` within an `os_atomic_rmw_loop` is also invalid; instead, an
`os_atomic_rmw_loop_give_up(goto again)` jumping to an `again:` label placed
before the loop should be used, in this way:

```c
    int _Atomic *address;
    int old_value, new_value;
    bool success;

again:
    success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
        if (needs_some_store_that_can_thwart_the_transaction(old_value)) {
            os_atomic_rmw_loop_give_up({
                // Do whatever you need to do/store to central memory
                // that would cause the loop to always fail
                do_my_rmw_loop_breaking_store();

                // And only then redrive.
                goto again;
            });
        }
        if (!validate(old_value)) {
            os_atomic_rmw_loop_give_up(break);
        }
        new_value = compute_new_value(old_value);
    });
```

### The *dependency* memory order

Because the C11 *consume* memory order is broken in various ways,
most compilers, clang included, implement it as an equivalent
of `memory_order_acquire`. However, the concept is useful
for certain algorithms.

As an attempt to provide a replacement for it, `<os/atomic_private.h>`
implements an entirely new *dependency* memory ordering.

The purpose of this ordering is to provide a relaxed load followed by an
implicit compiler barrier, that can be used as a root for a chain of hardware
dependencies that would otherwise pair with store-releases done at this address,
very much like the *consume* memory order is intended to provide.

However, unlike the *consume* memory ordering where the compiler has to follow
the dependencies, the *dependency* memory ordering relies on explicit
annotations of where dependencies are expected:

- loads through a pointer loaded with a *dependency* memory ordering
  will provide a hardware dependency,

- dependencies may be injected into other loads not performed through this
  particular pointer with the `os_atomic_load_with_dependency_on` and
  `os_atomic_inject_dependency` interfaces.

Here is an example of how it is meant to be used:

```c
    struct foo {
        long value;
        long _Atomic flag;
    };

    void
    publish(struct foo *p, long value)
    {
        p->value = value;
        os_atomic_store(&p->flag, 1, release);
    }


    bool
    broken_read(struct foo *p, long *value)
    {
        /*
         * This isn't safe, as there's absolutely no hardware dependency involved.
         * Using an acquire barrier would of course fix it, but is quite expensive...
         */
        if (os_atomic_load(&p->flag, relaxed)) {
            *value = p->value;
            return true;
        }
        return false;
    }

    bool
    valid_read(struct foo *p, long *value)
    {
        long flag = os_atomic_load(&p->flag, dependency);
        if (flag) {
            /*
             * Further the chain of dependencies to any loads through `p`,
             * which properly pair with the release barrier in `publish`.
             */
            *value = os_atomic_load_with_dependency_on(&p->value, flag);
            return true;
        }
        return false;
    }
```

There are 4 interfaces involved with hardware dependencies:

1. `os_atomic_load(..., dependency)` to initiate roots of hardware dependencies,
   which should pair with a store or RMW with release semantics or stronger
   (release, acq\_rel or seq\_cst),

2. `os_atomic_inject_dependency`, which can be used to inject the dependency
   provided by a *dependency* load, or by any other value that has had a
   dependency injected,

3. `os_atomic_load_with_dependency_on` to perform a relaxed load that
   still prolongs a dependency chain,

4. `os_atomic_make_dependency` to create an opaque token out of a given
   dependency root to inject into multiple loads.


**Note**: this technique is NOT safe when the compiler can reason about the
pointers that you are manipulating, for example if the compiler can know that
the pointer can only take a couple of values, and ditch all these manually
crafted dependency chains. Hopefully there will be a future C2Y standard that
provides a similar construct as a language feature instead.
420