# Pageout Scan

The design of Mach VM's paging algorithm (implemented in `vm_pageout_scan()`).

## Start/Stop Conditions

When a thread needs a free page it calls `vm_page_grab[_options]()`. If the
system is running low on free pages (i.e.
`vm_page_free_count < vm_page_free_reserved`), the faulting thread will block in
`vm_page_wait()`. A subset of privileged (`TH_OPT_VMPRIV`) VM threads may
continue grabbing "reserved" pages without blocking.

Whenever a page is grabbed and the free page count is nearing its floor
(`vm_page_free_count < vm_page_free_min`), the pageout thread
(`vm_pageout_scan`) is immediately woken. `vm_pageout_scan` is responsible
for freeing clean pages and choosing dirty pages to evict so that incoming
page demand can be satisfied.

The pageout thread will continue scanning for pages to evict until all of the
following conditions are met:

1. The free page count has reached its target
   (`vm_page_free_count >= vm_page_free_target`)\*
2. There are no privileged threads waiting for pages (indicated by
   `vm_page_free_wanted_privileged`)
3. There are no unprivileged threads waiting for pages (indicated by
   `vm_page_free_wanted`)

\*Invariant: `vm_page_free_target > vm_page_free_min > vm_page_free_reserved`

## A Note on Complexity

The state machine is complex and can be difficult to predict. This document
serves as a high-level overview of the algorithm. Even seemingly minor tuning
changes can result in drastic behavioral differences when the system is
pushed to the extreme.

## Contribution Guidelines (Internal)

1. The `return_from_scan` label is the only spot where `vm_pageout_scan()`
   will stop. A single exit path keeps the function readable and
   understandable. Try to keep it that way.
2. Try to reduce the use of backwards `goto`s. Great care has been taken to
   remove these patterns. Don't regress readability!
   A to-be-completed
   [refactor](https://stashweb.sd.apple.com/projects/COREOS/repos/xnu/pull-requests/21219/overview)
   removes the remaining backwards `goto`s.
3. Be wary of 2nd-order effects. For example:
   - How might a bias towards paging anonymous memory affect jetsam? Too many
     file-backed pages may preclude jetsam from running and leave the system
     unresponsive because of constant pageout/compressor activity.
   - How will varying compression ratios change the effectiveness of the
     pageout algorithm? A bias towards anonymous pages may result in quicker
     exhaustion of the compressor pool and increased memory pressure from the
     resident compressed pages.

It is critical that the pageout thread not block except as dictated by its
state machine (e.g. to yield VM locks, to wait until the free page pool is
depleted). Be very wary of introducing any new synchronization dependencies
outside of the VM.

## The Pageout Algorithm

This section documents xnu's page eviction algorithm (`vm_pageout_scan`). It
is broken into 5 "phases."

### Phase 1 - Initialization & Rapid Reclamation

* Initialize the relevant page targets that will guide the algorithm
  (`vps_init_page_targets()`). This determines how much anonymous memory and
  speculative memory to keep around. See the refactor linked in contribution
  guideline 2 for a more cohesive collection of all the target page threshold
  calculations.
* Initialize the Flow Control state machine to its default state (`FCS_IDLE`).
* Reclaim "cheap" memory from any other subsystems. These must be fast and
  non-blocking.
  - `pmap_release_pages_fast()`

**Note**: Phases 2-5 comprise the "for" loop in `vm_pageout_scan()`. The
PageQ lock (`vm_page_queue_lock`) is held for most of this loop.

### Phase 2

Check to see if we need to drop the PageQ lock:

- We have been holding it for quite some time. The compressor/compactor
  may need it.
  - Drop the lock, free any pages we might have accumulated (usually
    after a few iterations through the loop).
  - Wake up the compactor and try to retake the lock. If the compactor
    needed it, it would have grabbed it and we might block.
- We need a vm-object lock but another thread is holding it. That thread
  may also need the PageQ lock.
  - Drop the PageQ lock for 10µs and try again.
- Another thread (usually the NVMe driver) is waiting for the PageQ lock so
  it can free some pages back to the VM.
  - Yield the PageQ lock and see if that helps.

General PageQ management:

1. Check for overflow secluded pages (secluded count > secluded target) to
   push to the active queue.
2. Deactivate a single page. This deactivated page should "balance" the
   reactivated or reclaimed page that we remove from one of the
   inactive/anonymous queues below.
3. Are we done (`return_from_scan`)?
4. Check for:
   - a "ripe" purgeable vm-object
   - a speculative queue to age
   - a vm-object in the object cache to evict
5. If we found any actions to take in step 4, repeat Phase 2. Else, continue
   to Phase 3.

### Phase 3

The following page queues are eligible to be reclaimed from:

- Inactive Queue: deactivated file-backed pages
- Speculative Queue: file-backed pages which have never been activated. These
  are generally generated by read-ahead.
- Anonymous Queue: deactivated anonymous pages
- Cleaned Queue: file-backed pages that have been "cleaned" by writing their
  contents back to disk and are now reclaimable. This queue is no longer used.

1. Update the file cache targets. (TODO: how?)
2. Check the Flow Control state machine to evaluate whether we should block to
   allow the rest of the system to make forward progress.
   - If the queues of interest are all empty, block for 50ms. There is nothing
     `pageout_scan` can do, but the other VM threads may be able to make
     progress.
   - If we have evaluated a significant number of pages without making *any*
     progress (reactivations or frees), block for 1ms.
   - If the compressor queues are full ("throttled"):
     - `FCS_IDLE`: There are plenty of file-backed pages; bias the loop
       towards reclaiming these.
     - `FCS_DELAYED`: If the deadlock-detection period has elapsed, wake up
       the garbage collector, increase the reclamation target by 100, and
       change state to `FCS_DEADLOCK_DETECTED`. Else, block.
     - `FCS_DEADLOCK_DETECTED`: If the reclamation target is met, change state
       back to `FCS_DELAYED`. Else, restart from Phase 2.

### Phase 4

We must now choose a "victim" page to attempt to reclaim. If a candidate page
has been referenced since deactivation, it will be reactivated (barring
certain "force-reclaim" conditions).

1. Look for clean or speculative pages (unless we specifically want an
   anonymous one).
2. On non-app-swap systems (macOS), look for a "self-donated" page.
3. Look for a background page. On Intel systems, we heavily bias towards
   background pages during dark-wake mode to ensure background tasks (e.g.
   Software Update) do not disrupt the user's normal working set.
4. Look for 2 anonymous pages for every 1 file-backed page.\* This ratio
   comes from the days of spinning disks and software compression, where
   re-faulting a file-backed page was roughly twice as costly as an anonymous
   one.
5. If steps 1-4 could not find an unreferenced page, restart from Phase 2.
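As a rough sketch, the queue choice in step 4, together with the footnoted
exceptions to the 2:1 ratio, might look like the following. All names, fields,
and the flat structure here are hypothetical; the actual xnu implementation
folds in many more conditions:

```c
/* Hypothetical counters; not the actual xnu globals. */
typedef struct {
    unsigned filecache_pages;     /* resident file-backed pages */
    unsigned filecache_min;       /* floor below which we spare the file cache */
    unsigned inactive_file_pages; /* deactivated file-backed pages */
    unsigned free_pages;
    unsigned free_reserved;       /* compression needs free pages to compress into */
} vm_counters_t;

typedef enum { Q_FILE_BACKED, Q_ANONYMOUS } victim_queue_t;

/*
 * anon_run counts consecutive anonymous picks; the caller keeps it across
 * iterations of the scan loop.
 */
victim_queue_t
choose_victim_queue(const vm_counters_t *c, unsigned *anon_run)
{
    /* Extreme conditions override the 2:1 ratio entirely. */
    if (c->free_pages < c->free_reserved) {
        return Q_FILE_BACKED;   /* too few free pages to compress into */
    }
    if (c->filecache_pages < c->filecache_min ||
        c->inactive_file_pages * 2 < c->filecache_pages) {
        return Q_ANONYMOUS;     /* spare the small (or mostly-active) file cache */
    }

    /* Default: 2 anonymous victims for every 1 file-backed victim. */
    if (*anon_run < 2) {
        (*anon_run)++;
        return Q_ANONYMOUS;
    }
    *anon_run = 0;
    return Q_FILE_BACKED;
}
```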
\* Certain extreme conditions may cause the 2:1 ratio to be ignored:

- The file cache has fallen below its minimum size -> choose anonymous
- The number of inactive file-backed pages is less than 50% of all
  file-backed pages -> choose anonymous
- The free page count is dangerously low (compression may require free pages
  to compress into) -> choose file-backed

### Phase 5

We have found a victim page, and will now attempt to reclaim it. "Freed" pages
are placed on a thread-local free queue to be freed to the global free queue
in batches during Phase 2.

1. Pull the page off of its current queue.
2. *Try* to take the vm-object lock corresponding to the victim page. Note
   that this is an inversion of the typical lock ordering (vm-object ->
   page-queues). As such, `pageout_scan` cannot block if the lock is currently
   held by another thread. If it cannot take the vm-object lock, identify
   another potential victim page via Phase 4, tell the system that a
   "privileged" thread wants its vm-object lock (precluding other threads
   from taking the lock until the privileged thread has had an opportunity
   to take it), drop the PageQ lock, pause for 10µs, and restart from Phase 2.
3. Evaluate the page's current state:
   - `busy`: this page is being transiently operated on by another thread;
     place it back on its queue and restart from Phase 2.
   - `free_when_done`/`cleaning`: this page is about to be freed by another
     thread. Skip it and restart from Phase 2.
   - `error`/`absent`/`pager==NULL`/`object==NULL`: this page can be freed
     without any cleaning. Free the page.
   - `purgeable(empty)`: the object has already been purged; free the page.
   - `purgeable(volatile)`: we'll purge this object wholesale once it is ripe,
     so compressing it now isn't worth the work. Skip this page and restart
     from Phase 2.
4. Check (with the pmap) whether the page has been modified or referenced.
5. If the page has been referenced since we identified it as a victim,
   consider reactivating it. If we have consecutively reactivated a
   sufficient number of pages, reclaim the page anyway to ensure forward
   progress is made.\* On embedded systems, a sufficient number of these
   forced reclamations will trigger jetsams. Pages which were first faulted
   by real-time threads are exempted from these forced reclamations to
   prevent audio glitches.
6. Disconnect the page from all page-table and virtual mappings. If it is
   anonymous, leave a breadcrumb in the page table entry for memory
   accounting purposes.
7. If the page is clean, free it.
8. Otherwise, the page is dirty and needs to be "cleaned" before it can be
   reclaimed. Place it on the relevant pageout queue (i.e. the compressor
   queue for anonymous pages and the external queue for file-backed pages)
   and wake the relevant VM thread.
9. Restart from Phase 2.

\* This can happen when the working set turns over rapidly or the system is
seriously overcommitted. In such cases, we can't rely on the LRU approximation
to identify "good" victims and need to reclaim whatever we can find.

## Historical Experiments

### Latency-based Jetsam

By placing a "fake" page in the active page queue with an associated
timestamp, we can track the rate of paging by measuring how long it takes for
the page to be identified as a victim by `pageout_scan`. A rapid paging rate
indicates that the system cannot keep up with memory demand via paging alone.
In such cases, jetsams would be invoked directly by `pageout_scan` to free
larger amounts of memory and reduce demand.

Experiments with this implementation highlighted that many iterations of
`pageout_scan` are required before the latency-detection mechanism triggers.
The delay imposed by these LPF characteristics was often larger than that of
the existing page-shortage mechanism, and regressed use cases like Camera
launch.
Further, performing kills directly on the pageout thread added significant
latency.

Re-introducing the paging-rate measurement without the jetsam trigger may be
worthwhile for diagnosing system health.

### Dynamic Scheduling Priority

In theory, a misbehaving low-priority thread can generate lots of page demand,
causing `pageout_scan` to run at a very high priority (91). Thus, the
low-priority thread can effectively preempt higher-priority user threads and
starve them of the core(s) used by the VM thread(s). This can be mitigated by
propagating the priority of threads waiting on free pages to `pageout_scan`,
allowing `pageout_scan` to run only at a priority as high as its highest
waiter.

This approach was enabled on low core-count devices (i.e. watches) for 1-2
years. However, it eventually appeared to contribute to audio glitches and had
to be disabled.

In general, *any* page-wait (even short ones) can be catastrophic for
latency-sensitive/real-time threads, especially if those threads will also
have to wait for an I/O to complete after the page-wait. By slowing the
preemptive paging done without any waiters (at `pageout_scan`'s now-low base
priority), the likelihood of page-waits increases.
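The waiter-priority propagation described above amounts to a priority-ceiling
computation: the pageout thread's effective priority is the maximum of its
(low) base priority and the priorities of all threads currently blocked in
`vm_page_wait()`. A minimal sketch, with a hypothetical base priority and
names that are not the actual xnu scheduler interface:

```c
#define BASEPRI_PAGEOUT 20   /* hypothetical low base priority */

typedef struct {
    int priority;            /* scheduling priority of a thread in vm_page_wait() */
} waiter_t;

/*
 * Compute the effective priority of the pageout thread: with no waiters it
 * crawls at its base priority; otherwise it inherits the priority of its
 * highest-priority waiter.
 */
int
pageout_scan_priority(const waiter_t *waiters, int nwaiters)
{
    int pri = BASEPRI_PAGEOUT;
    for (int i = 0; i < nwaiters; i++) {
        if (waiters[i].priority > pri) {
            pri = waiters[i].priority;
        }
    }
    return pri;
}
```

This keeps a low-priority page hog from borrowing the VM thread's fixed high
priority, at the cost described above: preemptive paging slows down when no
one is waiting, which makes page-waits themselves more likely.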