xref: /xnu-12377.41.6/doc/vm/pageout_scan.md (revision bbb1b6f9e71b8cdde6e5cd6f4841f207dee3d828)
# Pageout Scan

The design of Mach VM's paging algorithm (implemented in `vm_pageout_scan()`).

## Start/Stop Conditions

When a thread needs a free page, it calls `vm_page_grab[_options]()`. If the
system is running low on free pages (i.e.
`vm_page_free_count < vm_page_free_reserved`), the faulting thread will block
in `vm_page_wait()`. A subset of privileged (`TH_OPT_VMPRIV`) VM threads may
continue grabbing "reserved" pages without blocking.

Whenever a page is grabbed and the free page count is nearing its floor
(`vm_page_free_count < vm_page_free_min`), the pageout thread
(`vm_pageout_scan`) is immediately woken. `vm_pageout_scan` is responsible for
freeing clean pages and choosing dirty pages to evict so that incoming page
demand can be satisfied.

The pageout thread will continue scanning for pages to evict until all of the
following conditions are met:

1. The free page count has reached its target
   (`vm_page_free_count >= vm_page_free_target`)\*
2. There are no privileged threads waiting for pages (indicated by
   `vm_page_free_wanted_privileged`)
3. There are no unprivileged threads waiting for pages (indicated by
   `vm_page_free_wanted`)

\*Invariant: `vm_page_free_target > vm_page_free_min > vm_page_free_reserved`
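
The start/stop conditions can be condensed into a pair of predicates. A
minimal sketch in C, assuming a simplified `page_pool` struct standing in for
the `vm_page_*` globals (this is not the actual xnu code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the vm_page_* globals; illustrative only.
 * Invariant: free_target > free_min > free_reserved. */
struct page_pool {
    uint32_t free_count;             /* vm_page_free_count */
    uint32_t free_target;            /* vm_page_free_target */
    uint32_t free_min;               /* vm_page_free_min */
    uint32_t free_wanted;            /* unprivileged waiters */
    uint32_t free_wanted_privileged; /* privileged waiters */
};

/* Should a page grab wake the pageout thread? */
static bool should_wake_pageout(const struct page_pool *p)
{
    return p->free_count < p->free_min;
}

/* vm_pageout_scan() keeps evicting until all three stop conditions hold. */
static bool pageout_scan_done(const struct page_pool *p)
{
    return p->free_count >= p->free_target &&
           p->free_wanted_privileged == 0 &&
           p->free_wanted == 0;
}
```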

## A Note on Complexity

The state machine is complex and can be difficult to predict. This document
serves as a high-level overview of the algorithm. Even seemingly minor changes
to tuning can result in drastic behavioral differences when the system is
pushed to the extreme.

## Contribution Guidelines (Internal)

1. The `return_from_scan` label is the only spot where `vm_pageout_scan()`
   will stop. A single exit path aids readability and understandability. Try
   to keep it that way.
2. Try to reduce the use of backwards `goto`s. Great care has been taken to
   remove these patterns. Don't regress readability! A to-be-completed
   [refactor](https://stashweb.sd.apple.com/projects/COREOS/repos/xnu/pull-requests/21219/overview)
   removes the remaining backwards `goto`s.
3. Be wary of second-order effects. For example:
   - How might a bias towards paging anonymous memory affect jetsam? Too many
     file-backed pages may preclude jetsam from running and leave the system
     unresponsive because of constant pageout/compressor activity.
   - How will varying compression ratios change the effectiveness of the
     pageout algorithm? A bias towards anonymous pages may result in quicker
     exhaustion of the compressor pool and increased memory pressure from the
     resident compressed pages.

It is critical that the pageout thread not block except as dictated by its
state machine (e.g. to yield VM locks, or to wait until the free page pool is
depleted). Be very wary of introducing any new synchronization dependencies
outside of the VM.

## The Pageout Algorithm

This section documents xnu's page eviction algorithm (`vm_pageout_scan`). It
is broken into 5 "phases."

### Phase 1 - Initialization & Rapid Reclamation
* Initialize the relevant page targets that will guide the algorithm
  (`vps_init_page_targets()`). This determines how much anonymous memory and
  speculative memory to keep around. See the refactor linked in guideline 2 of
  the contribution guidelines for a more cohesive collection of all the target
  page threshold calculations.
* Initialize the Flow Control machine to its default state (`FCS_IDLE`).
* Reclaim "cheap" memory from any other subsystems. These must be fast and
  non-blocking.
  - `pmap_release_pages_fast()`

**Note**: Phases 2-5 comprise the "FOR" loop in `vm_pageout_scan`. The PageQ
lock (`vm_page_queue_lock`) is held for most of this loop.

### Phase 2
Check to see if we need to drop the PageQ lock:
- We have been holding it for quite some time, and the compressor/compactor
  may need it.
  - Drop the lock and free any pages we might have accumulated (usually
    after a few iterations through the loop).
  - Wake up the compactor and try to retake the lock. If the compactor
    needed it, it would have grabbed it and we might block.
- We need a vm-object lock but another thread is holding it. That thread
  may also need the PageQ lock.
  - Drop the PageQ lock for 10µs and try again.
- Another thread (usually the NVMe driver) is waiting for the PageQ lock so
  it can free some pages back to the VM.
  - Yield the PageQ lock and see if that helps.
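
The drop/retake dance can be sketched with POSIX primitives. This is
illustrative only: xnu uses its own `lck_mtx_t` locks, and
`free_accumulated_local_pages()`/`wakeup_compactor()` are hypothetical stubs
for the batch free and the compactor wakeup:

```c
#include <pthread.h>
#include <unistd.h>

/* Hypothetical stand-in for vm_page_queue_lock. */
static pthread_mutex_t page_queue_lock = PTHREAD_MUTEX_INITIALIZER;

static void free_accumulated_local_pages(void) { /* batch-free stub */ }
static void wakeup_compactor(void)              { /* wakeup stub */ }

/* Yield the PageQ lock so the compressor/compactor (or a driver waiting
 * to return pages to the VM) gets a chance to take it, then re-acquire. */
static void yield_page_queue_lock(void)
{
    pthread_mutex_unlock(&page_queue_lock);
    free_accumulated_local_pages(); /* flush the thread-local free queue */
    wakeup_compactor();
    usleep(10);                     /* give waiters a brief window (~10µs) */
    pthread_mutex_lock(&page_queue_lock);
}
```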

General PageQ management:
1. Check for overflow secluded pages (secluded count > secluded target) to
   push to the active queue.
2. Deactivate a single page. This deactivated page should "balance" the
   reactivated or reclaimed page that we remove from one of the
   inactive/anonymous queues below.
3. Are we done (`return_from_scan`)?
4. Check for:
   - a "ripe" purgeable vm-object
   - a speculative queue to age
   - a vm-object in the object cache to evict
5. If we found any actions to take in step 4, repeat Phase 2. Else, continue
   to Phase 3.
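
Steps 1 and 4-5 boil down to two small decisions. A sketch with invented names
(`scan_state` and friends are not xnu identifiers):

```c
#include <stdbool.h>

/* Illustrative per-iteration snapshot; field names are invented. */
struct scan_state {
    unsigned secluded_count;       /* pages on the secluded queue */
    unsigned secluded_target;
    bool have_ripe_purgeable;      /* a "ripe" purgeable vm-object exists */
    bool have_speculative_to_age;  /* a speculative queue is due for aging */
    bool have_cacheable_object;    /* a vm-object in the object cache */
};

/* Step 1: how many secluded pages overflow onto the active queue. */
static unsigned secluded_overflow(const struct scan_state *s)
{
    return (s->secluded_count > s->secluded_target)
        ? s->secluded_count - s->secluded_target : 0;
}

/* Steps 4-5: if any action is available, repeat Phase 2; otherwise
 * fall through to Phase 3. */
static bool phase2_should_repeat(const struct scan_state *s)
{
    return s->have_ripe_purgeable ||
           s->have_speculative_to_age ||
           s->have_cacheable_object;
}
```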

### Phase 3
The following page queues are eligible to be reclaimed from:
- Inactive Queue: deactivated file-backed pages
- Speculative Queue: file-backed pages which have never been activated. These
  are generally generated by read-ahead.
- Anonymous Queue: deactivated anonymous pages
- Cleaned Queue: file-backed pages that have been "cleaned" by writing their
  contents back to disk and are now reclaimable. This queue is no longer used.

1. Update the file cache targets. (TODO: how?)
2. Check the Flow Control state machine to evaluate if we should block to
   allow the rest of the system to make forward progress.
   - If the queues of interest are all empty, block for 50ms. There is nothing
     `pageout_scan` can do, but the other VM threads may be able to make
     progress.
   - If we have evaluated a significant number of pages without making *any*
     progress (reactivations or frees), block for 1ms.
   - If the compressor queues are full ("throttled"):
     - `FCS_IDLE`: There are plenty of file-backed pages; bias the loop
       towards reclaiming these.
     - `FCS_DELAYED`: If the deadlock-detection period has elapsed, then wake
       up the garbage collector, increase the reclamation target by 100, and
       change state to `FCS_DEADLOCK_DETECTED`. Else, block.
     - `FCS_DEADLOCK_DETECTED`: If the reclamation target is met, change state
       back to `FCS_DELAYED`. Else, restart from Phase 2.
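
The compressor-throttled transitions can be sketched as a small state machine.
This is a hedged condensation of the description above; the struct fields and
function name are invented, and the real `vm_pageout_scan()` tracks
considerably more state:

```c
#include <stdbool.h>

/* Flow Control states named in this document. */
typedef enum { FCS_IDLE, FCS_DELAYED, FCS_DEADLOCK_DETECTED } fcs_state_t;

/* Illustrative bookkeeping; field names are invented for this sketch. */
struct flow_control {
    fcs_state_t state;
    bool deadlock_period_elapsed; /* deadlock-detection timer fired */
    bool reclaim_target_met;
    unsigned reclaim_target;      /* pages we are trying to reclaim */
};

/* One "compressor queues are full" evaluation.  Returns true if the
 * caller should block; *bias_file_backed asks the loop to prefer
 * reclaiming file-backed pages. */
static bool fcs_on_compressor_throttled(struct flow_control *fc,
                                        bool *bias_file_backed)
{
    *bias_file_backed = false;
    switch (fc->state) {
    case FCS_IDLE:
        /* Plenty of file-backed pages: bias toward reclaiming these. */
        *bias_file_backed = true;
        return false;
    case FCS_DELAYED:
        if (fc->deadlock_period_elapsed) {
            /* Wake the garbage collector (elided) and widen the goal. */
            fc->reclaim_target += 100;
            fc->state = FCS_DEADLOCK_DETECTED;
            return false;
        }
        return true;  /* block */
    case FCS_DEADLOCK_DETECTED:
        if (fc->reclaim_target_met)
            fc->state = FCS_DELAYED;
        return false; /* restart from Phase 2 */
    }
    return false;
}
```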

### Phase 4
We must now choose a "victim" page to attempt to reclaim. If a candidate page
has been referenced since deactivation, it will be reactivated (barring
certain "force-reclaim" conditions).

1. Look for clean or speculative pages (unless we specifically want an
   anonymous one).
2. On non-app-swap systems (macOS), look for a "self-donated" page.
3. Look for a background page. On Intel systems, we heavily bias towards
   background pages during dark-wake mode to ensure background tasks (e.g.
   Software Update) do not disrupt the user's normal working set.
4. Look for 2 anonymous pages for every 1 file-backed page.\* This ratio comes
   from the days of spinning disks and software compression, where re-faulting
   a file-backed page was roughly twice as costly as re-faulting an anonymous
   one.
5. If steps 1-4 could not find an unreferenced page, restart from Phase 2.

\* Certain extreme conditions may cause the 2:1 ratio to be ignored:
  - The file cache has fallen below its minimum size -> choose anonymous
  - The number of inactive file-backed pages is less than 50% of all
    file-backed pages -> choose anonymous
  - The free page count is dangerously low (compression may require free pages
    to compress into) -> choose file-backed
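
The queue-selection heuristic, including the 2:1 ratio and its overrides,
might look like the following sketch (names are invented for illustration):

```c
#include <stdbool.h>

typedef enum { Q_FILE_BACKED, Q_ANONYMOUS } victim_queue_t;

/* Illustrative inputs for the queue-choice heuristics. */
struct balance_state {
    unsigned pick_counter;         /* increments once per victim chosen */
    bool filecache_below_min;      /* file cache under its minimum size */
    bool inactive_file_below_half; /* inactive < 50% of file-backed pages */
    bool free_dangerously_low;     /* compression may need free pages */
};

/* Choose which queue the next victim comes from.  Normally 2 anonymous
 * picks for every file-backed pick; extreme conditions override this. */
static victim_queue_t choose_victim_queue(struct balance_state *b)
{
    if (b->free_dangerously_low)
        return Q_FILE_BACKED; /* compression needs free pages to run */
    if (b->filecache_below_min || b->inactive_file_below_half)
        return Q_ANONYMOUS;   /* protect the file cache */
    /* Default 2:1 anonymous:file ratio. */
    return (b->pick_counter++ % 3 == 2) ? Q_FILE_BACKED : Q_ANONYMOUS;
}
```

The 2:1 constant encodes the historical cost model described above; the point
of the overrides is that file-cache balance and free-page pressure trump the
ratio.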

### Phase 5
We have found a victim page, and will now attempt to reclaim it. "Freed" pages
are placed on a thread-local free queue and freed to the global free queue in
batches during Phase 2.

1. Pull the page off of its current queue.
2. *Try* to take the vm-object lock corresponding to the victim page. Note
   that this is an inversion of the typical lock ordering (vm-object ->
   page-queues). As such, `pageout_scan` cannot block if the lock is currently
   held by another thread. If it cannot take the vm-object lock, it will move
   on to another potential victim (via Phase 4): it tells the system that a
   "privileged" thread wants this vm-object lock (precluding other threads
   from taking the lock until the privileged thread has had an opportunity to
   take it), drops the PageQ lock, pauses for 10µs, and restarts from Phase 2.
3. Evaluate the page's current state:
   - `busy`: this page is being transiently operated on by another thread.
     Place it back on its queue and restart from Phase 2.
   - `free_when_done`/`cleaning`: this page is about to be freed by another
     thread. Skip it and restart from Phase 2.
   - `error`/`absent`/`pager==NULL`/`object==NULL`: this page can be freed
     without any cleaning. Free the page.
   - `purgeable(empty)`: the object has already been purged; free the page.
   - `purgeable(volatile)`: we'll purge this object wholesale once it is ripe,
     so compressing it now isn't worth the work. Skip this page and restart
     from Phase 2.
4. Check (with the pmap) if the page has been modified or referenced.
5. If the page has been referenced since we identified it as a victim,
   consider reactivating it. If we have consecutively re-activated a
   sufficient number of pages, then reclaim the page anyway to ensure forward
   progress is made.\* On embedded systems, a sufficient number of these
   forced reclamations will trigger jetsams. Pages which were first faulted by
   real-time threads are exempted from these forced reclamations to prevent
   audio glitches.
6. Disconnect the page from all page-table and virtual mappings. If it is
   anonymous, leave a breadcrumb in the page table entry for memory accounting
   purposes.
7. If the page is clean, free it.
8. Otherwise, the page is dirty and needs to be "cleaned" before it can be
   reclaimed. Place it on the relevant pageout queue (i.e. the compressor
   queue for anonymous pages and the external queue for file-backed pages)
   and wake up the relevant VM thread.
9. Restart from Phase 2.

\* This can happen when the working set turns over rapidly or the system is
seriously overcommitted. In such cases, we can't rely on the LRU approximation
to identify "good" victims and need to reclaim whatever we can find.
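
Step 3's disposition logic can be summarized as a check over page state. A
sketch with a pared-down stand-in for `vm_page_t` (field and enum names are
invented):

```c
#include <stdbool.h>
#include <stddef.h>

/* Possible dispositions for a victim page per step 3 above. */
typedef enum {
    DISP_REQUEUE, /* put back on its queue, restart from Phase 2 */
    DISP_SKIP,    /* skip it, restart from Phase 2 */
    DISP_FREE,    /* free immediately, no cleaning needed */
    DISP_PAGEOUT  /* dirty: send to a pageout queue (steps 7-8) */
} disposition_t;

/* Pared-down, illustrative stand-in for vm_page_t. */
struct page {
    bool busy, cleaning, free_when_done, error, absent, dirty;
    bool purgeable_empty, purgeable_volatile;
    void *object, *pager;
};

/* Condensed version of the state evaluation in step 3. */
static disposition_t evaluate_victim(const struct page *m)
{
    if (m->busy)
        return DISP_REQUEUE; /* transiently owned by another thread */
    if (m->free_when_done || m->cleaning)
        return DISP_SKIP;    /* about to be freed elsewhere */
    if (m->error || m->absent || m->pager == NULL || m->object == NULL)
        return DISP_FREE;    /* nothing to write back */
    if (m->purgeable_empty)
        return DISP_FREE;    /* object already purged */
    if (m->purgeable_volatile)
        return DISP_SKIP;    /* will be purged wholesale when ripe */
    return m->dirty ? DISP_PAGEOUT : DISP_FREE;
}
```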

## Historical Experiments

### Latency-based Jetsam
By placing a "fake" page in the active page queue with an associated timestamp,
we can track the rate of paging by measuring how long it takes for the page to
be identified as a victim by `pageout_scan`. A rapid paging rate indicates
that the system cannot keep up with memory demand via paging alone. In such
cases, jetsams would be invoked directly by `pageout_scan` to free larger
amounts of memory and reduce demand.

Experiments with this implementation highlighted that many iterations of
`pageout_scan` are required before the latency-detection mechanism triggers.
The delay imposed by these low-pass-filter characteristics was often larger
than that of the existing page-shortage mechanism and regressed use cases like
Camera launch. Further, performing kills directly on the pageout thread added
significant latency.

Re-introducing the paging-rate measurement without the jetsam trigger may be
worthwhile for diagnosing system health.

### Dynamic Scheduling Priority
In theory, a misbehaving low-priority thread can generate lots of page demand,
causing `pageout_scan` to run at a very high priority (91). Thus, the
low-priority thread can effectively preempt higher-priority user threads and
starve them of the core(s) used by the VM thread(s). This can be mitigated by
propagating the priority of threads waiting on free pages to `pageout_scan`,
allowing `pageout_scan` to run only at a priority as high as its highest
waiter.

This approach was enabled on low core-count devices (i.e. watches) for 1-2
years. However, it eventually appeared to contribute to audio glitches and had
to be disabled.

In general, *any* page-wait (even short ones) can be catastrophic for
latency-sensitive/real-time threads, especially if those threads will also
have to wait for an I/O to complete after the page-wait. Because the
preemptive paging done without any waiters now runs at `pageout_scan`'s low
base priority, it is slower, and the likelihood of page-waits increases.