XNU hibernation
===============

Suspending the entire system state to disk.

Goal
----

This document discusses the design and implementation of XNU hibernation. The
reader is assumed to generally understand how standard suspend to RAM (S2R)
works in XNU; a detailed discussion of S2R is beyond the scope of this
document.

Vocabulary
----------

* Polled I/O : a mode of operation supported by I/O drivers (primarily storage
  devices) where operations may be conducted from a single-threaded
  context with interrupts disabled
* S2R : Suspend to RAM (aka sleep)
* WKdm : Wilson-Kaplan direct mapped compression

Background
----------

In order to prolong battery life, XNU supports suspending/powering off various
devices and preserving the state of those devices in memory. This feature is
referred to as suspend to RAM (S2R). In this mode, IOKit delivers a number of
notifications to IOServices to allow them to participate in S2R.

What is hibernation?
--------------------

Hibernation is a feature built on the foundation of S2R. However, while S2R
preserves state in memory (which must therefore remain powered), hibernation
preserves contents to persistent storage (the disk) and then completely powers
the system off.

Hibernation entry
-----------------

During hibernation, XNU invokes essentially the normal S2R machinery, but with a
few hibernation-specific differences:

* PMRootDomain calls `IOHibernateSystemSleep()` before system sleep (devices
  awake, normal execution context).
* `IOHibernateSystemSleep()` opens the hibernation file (or partition) at the
  BSD level, grabs its extents and searches for a polling driver willing to work
  with that IOMedia.
* The BSD code makes an ioctl to the storage driver to get the partition base
  offset to the disk, and other ioctls to get the transfer constraints.
* If successful, the file is written to make sure it's initially not bootable
  (in case of later failure) and the `boot-image` nvram variable is set to point
  to the first block of the file. This has to be done here because writing to
  nvram may block, so it must happen before preemption is disabled.
* `hibernate_page_list_allocate()` is called to allocate page bitmaps for all
  DRAM.
  - The hibernation code represents every page of physical memory in page
    bitmaps of type `hibernate_bitmap_t`. There is one page bitmap per range of
    memory, with a bit to represent each page in that range; these page bitmaps
    are in turn stored in a `hibernate_page_list_t`. The page bitmaps are used
    to represent, for each page, whether preservation of that page is necessary.
  - On ARM64, `secure_hmac_get_io_ranges()` is called to get a list of the I/O
    regions that need to be included in the hibernation image (for example, the
    GPU UAT handoff region). These I/O regions are typically DRAM regions carved
    out by iBoot that exist outside of kernel-managed memory. A page bitmap is
    allocated for each one of these ranges (as well as a single bitmap for the
    kernel-managed DRAM memory).
* `hibernate_processor_setup()` is called to set up some platform-specific state
  needed in the hibernation image header. `hibernate_processor_setup()` also
  sets a flag in the boot processor's `cpu_data_t` to indicate that hibernation
  is in progress on this CPU.
* At this point, `gIOHibernateState` is set to the value
  `kIOHibernateStateHibernating`.
* Regular sleep progresses; some drivers may inspect the root domain property
  `kIOHibernateStateKey` to modify behavior. The platform drivers save state to
  memory as usual, but any drivers required for hibernation I/O are left in a
  state such that polled I/Os can be issued.
* By the time regular sleep has completed, all CPUs but the boot CPU have been
  halted, and we are running on the boot CPU's idle thread in the shutdown
  context, with preemption disabled.
* Eventually the platform calls `hibernate_write_image()` in the shutdown
  context on the last CPU, at which point memory is ready to be saved. This call
  is made from `acpi_hibernate()` on Intel and from `ml_arm_sleep()` on ARM64.
* `hibernate_write_image()` runs in the shutdown context, where no blocking is
  permitted because preemption is disabled. `hibernate_write_image()` calls
  `hibernate_page_list_setall()` to get the page bitmaps of DRAM that need to be
  saved.
* All pages are assumed to be saved (as part of the wired image) unless
  explicitly subtracted by `hibernate_page_list_setall()`.
  `hibernate_page_list_setall()` calls `hibernate_page_list_setall_machine()` to
  make platform-specific amendments to the page bitmaps.
* `hibernate_write_image()` writes the image header and extents list. The header
  includes the second file extent so that only the header block is needed to
  read the file, regardless of the underlying filesystem.
  - The extents list describes the file's layout on disk. This block list makes
    it possible for the platform booter to read the hibernation file from disk
    without having to understand the underlying filesystem.
* Some sections of memory are written directly (and uncompressed) to the image.
  These are the portions of XNU itself that are required during hibernation
  resume, as well as some other data that is required by the platform booter.
  - On Intel, the `__HIB` segment is written to the hibernation image.
  - On ARM64, because of CTRR/KTRR, a single `__HIB` segment isn't possible.
    Instead, a number of sections of the kernel are written:
    `__TEXT_EXEC,__hib_text`, `__DATA,__hib_data`, and
    `__DATA_CONST,__hib_const`.
    The `__PPL` segment is also stored to the image
    so that the PPL hmac driver can be used during hibernation resume. Certain
    other pieces of memory must also be written unmodified to the hibernation
    image for use by iBoot. Those pieces are described in the device tree so
    that XNU doesn't need to know the details.
    `secure_hmac_fetch_hibseg_and_info()` is used to determine the set of memory
    regions to be stored in this phase. This routine also calculates an HMAC
    that can be used by the booter to validate this content.
* The portions of XNU (code and data) that are stored directly to the
  hibernation image should be entirely self-contained; these are the only
  portions of XNU that are available during resume to decompress the image.
* Some additional pages are removed from the page bitmaps; these include various
  temporary buffers used for hibernation.
* The page bitmaps are written to the image.
* More areas are removed from the page bitmaps (after they have been written to
  the image); these include the pages already stored directly to the image, as
  well as the stack that hibernation resume will run on.
  `hibernate_page_list_set_volatile()` is invoked to make platform-specific
  amendments to the page bitmaps.
* Each wired page is compressed and written, followed by each non-wired page.
  Compression and disk writes can occur in parallel if the polled mode I/O
  driver supports this.
  - On ARM64, `secure_hmac_update_and_compress_page()` is called for each page
    included in the image so that the PPL can compute an HMAC of the hibernation
    payload.
* The image header records the values of `mach_absolute_time()` and
  `mach_continuous_time()` close to the end of `hibernate_write_image()`. These
  values can be used to fix up the offsets applied to the hardware clock after
  hibernation exit.
* The image header is finalized.
  - On ARM64, `secure_hmac_final()` is called to compute the HMAC of the
    hibernation payload. There are actually two separate HMACs computed, one for
    the wired pages and one for the non-wired pages. These HMACs are stored in
    the image header.
  - On ARM64, `secure_hmac_fetch_rorgn_sha()` and `secure_hmac_fetch_rorgn_hmac()` are
    called to obtain the SHA256 and HMAC of the read-only region. They were
    calculated on cold boot. They are stored in the image header.
    This is described in more detail in the "Security details" section of this
    document.
  - On ARM64, `secure_hmac_finalize_image()` is called to compute the HMAC of the
    header of the image. This is described in more detail in the "Security
    details" section of this document.
* The image header is written to the start of the file and the polling driver
  is closed.
* The machine powers down.
  - On Intel, depending on power settings, the system could sleep instead at
    this point. This allows for "safe sleep" where RAM remains powered until the
    user wakes the system or the battery dies.
  - On ARM64, we do not support this mode because hibernation is intended to
    only be invoked on a critical battery event.

Hibernation exit
----------------

* The platform booter sees the `boot-image` nvram variable containing the device
  and block number of the image, reads the header, and if the signature is
  correct proceeds. The `boot-image` variable is cleared.
  - On ARM64, iBoot takes the read-only region SHA256 value from the image
    header and calculates an HMAC. It then compares the HMAC against the
    value stored in the image header. If they do not match, iBoot panics.
* The platform booter reads the portion of the image used for wired pages into
  memory. It is assumed that this will fit in memory in its entirety. The image
  is decrypted (either transparently by ANS or in software, depending on
  platform support).
  The platform booter is not expected to decompress any of the
  payload; that is the kernel's responsibility.
* The platform booter copies the portions of XNU that were previously saved to
  the image back to their original physical addresses in memory.
* The platform booter invokes `hibernate_machine_entrypoint()`, passing in the
  location of the image in memory. Translation is off. Only code and data that
  were mapped by the booter are safe to use, since all the other wired pages are
  still compressed in the image.
  - On Intel, `hibernate_machine_entrypoint()` sets up a simple temporary page
    table; this page table will later be modified as necessary while pages are
    being restored.
  - On ARM64, `hibernate_machine_entrypoint()` sets up a temporary page table
    such that all of the required XNU code pages are executable, all data pages
    are readable/writable as necessary, and all of the rest of memory is mapped
    such that it can be written to during restore. Some device registers also
    have to be mapped to support serial logging and using the hmac block.
* Any pages occupied by the raw image are removed from the page bitmaps.
  - On Intel, this is done in `hibernate_kernel_entrypoint()`.
  - On ARM64, we have to do this from `hibernate_machine_entrypoint()` because
    we borrow free pages (as indicated by the page bitmaps) to store the
    temporary page table.
* `hibernate_machine_entrypoint()` calls `hibernate_kernel_entrypoint()`.
* `hibernate_kernel_entrypoint()` uses the page bitmaps to determine which pages
  can be uncompressed from the wired image directly to their final location. Any
  pages that conflict with the image itself are copied to interim scratch space.
* After all of the image has been parsed, the pages that were temporarily copied
  to scratch are uncompressed to their final location, overwriting pages in the
  wired image.
  - `hibernate_restore_phys_page()` is used to actually copy pages to their
    final location.
* At this point, `gIOHibernateState` is set to
  `kIOHibernateStateWakingFromHibernate`.
* `pal_hib_patchup()` is called to perform platform-specific post-resume fixups.
  - On Intel, `pal_hib_patchup()` is a no-op.
  - On ARM64, `pal_hib_patchup()` is responsible for validating the HMAC of the
    wired pages. `pal_hib_patchup()` also fixes up other state (such as some
    PPL-related context).
* After all of the wired pages have been restored, a wake from sleep is
  simulated.
  - On Intel, `hibernate_kernel_entrypoint()` calls `acpi_wake_prot_entry()`.
  - On ARM64, `hibernate_kernel_entrypoint()` returns to
    `hibernate_machine_entrypoint()`, which then jumps to `reset_vector`.
* The kernel proceeds on essentially a normal S2R wake, with some
  hibernation-specific changes.
  - On ARM64, an important difference is that a normal S2R wake on some
    platforms will run through the reconfig engine, whereas a hibernate wake
    cannot invoke the reconfig engine and must emulate some of the reconfig
    sequence on the AP.
  - On ARM64, some further fixup is done in `arm_init_cpu()`.
    + `wake_abstime` needs to be restored to the last absolute time captured
      during hibernation entry. This is necessary because during normal S2R,
      `wake_abstime` is captured too early; later calls to
      `mach_absolute_time()` in the hibernation entry path cause the
      `s_last_absolute_time` test to fail if we don't do this fixup.
    + `hwclock_conttime_offset` is set to the `hwClockOffset` value that iBoot
      computed. This is necessary since `ml_get_hwclock()` does not tick across
      hibernation but `mach_continuous_time()` is expected to.
    + The boot CPU's idle thread `preemption_count` also has to be fixed up.
      This
      is necessary because the page containing `preemption_count` is captured
      when the count is set to 1 (since the page is captured from within the
      PPL).
* After the platform CPU init code is called, `hibernate_machine_init()` is
  called to restore the rest of memory, using the polled mode driver, before
  other threads can run or any devices are turned on. This split of wired vs.
  non-wired pages reduces the memory usage for the platform booter, and allows
  decompression in parallel with disk reads for the non-wired pages.
* The polling driver is closed down and regular wake proceeds.
* When the kernel calls IOKit to wake (normal execution context),
  `hibernate_teardown()` is called to release any memory.
* The hibernation file is closed via BSD.

Hibernation file management
---------------------------

powerd in userspace is responsible for managing the lifecycle of the hibernation
file. The details of this lifecycle are beyond the scope of this document, but
essentially, it gets created and its space is preallocated by powerd the first
time the system hibernates. powerd can also grow the file as necessary.

Security details
----------------

### Intel:

* The hibernation image is encrypted with a key obtained from the APFS
  `APFSMEDIA_GETHIBERKEY` platform function.

### ARM64:

* The hibernation image is encrypted with a key obtained from the SEP. The
  details for how this key is derived and used are beyond the scope of this
  document, but are documented in detail in the AppleSEPOS project
  (doc/SecureHibernation).
* Various portions of the hibernation image have HMACs calculated over them. All
  HMACs are calculated by the PPL. The exact scheme for computing these HMACs is
  documented in more detail in `ppl_hib.c`, but the HMACs that are calculated are:
  - `imageHeaderHMAC` is an HMAC of the header of the image, up to
    `imageHeaderHMACSize`.
    However, because of the order that data is written
    (the header is the last thing actually written), the HMAC is actually
    calculated as `HMAC(SHA([data after header up to imageHeaderHMACSize],
    [header]))`.
  - `handoffHMAC` is an HMAC of the `IOHibernateHandoff` data passed from iBoot
    to XNU.
  - `image1PagesHMAC` is an HMAC of the wired pages that were stored to the
    hibernation image.
  - `image2PagesHMAC` is an HMAC of the non-wired pages that were stored to the
    hibernation image.
* The PPL hibernation driver also keeps track of every PPL-owned page being
  hashed (both kernel-managed memory and I/O memory owned by the PPL). This is
  double-checked in `secure_hmac_finalize_image()` to ensure that all PPL-owned
  memory is included in the hibernation image. Any missing pages will panic the
  system, as the absence of PPL pages in the image could be a security risk (and
  is surely a bug).
* During early boot, `secure_hmac_compute_rorgn_hmac()` is used to measure the
  entirety of the rorgn. On hibernation resume, the same function is invoked to
  verify that the rorgn matches its original contents.
  - Only the SHA256 of the rorgn is compared on resume. The SIO HMAC key1, used
    to compute this HMAC, is invalidated by iBoot on the resume path after it
    verifies the HMAC. See rdar://75750348 (xnu should store the SHA of the
    read-only region along with the hash in memory for iBoot to validate on
    hibernate resume).
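
The PPL-owned page coverage check described in this section can be illustrated
with a small, self-contained C sketch. This is not the PPL implementation: every
name (`sketch_page_tracker_t`, `sketch_mark_owned()`, `sketch_mark_hashed()`,
`sketch_finalize_check()`) and the fixed page count are hypothetical, and where
the real driver would panic, the sketch only reports the first missing page. It
reuses the one-bit-per-page bitmap idea from the hibernation entry path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes for this toy example only. */
#define SKETCH_PAGE_COUNT 256u
#define SKETCH_BITS_PER_WORD 32u
#define SKETCH_WORDS (SKETCH_PAGE_COUNT / SKETCH_BITS_PER_WORD)

/* One bit per page: which pages must be covered, and which were hashed. */
typedef struct {
    uint32_t owned[SKETCH_WORDS];
    uint32_t hashed[SKETCH_WORDS];
} sketch_page_tracker_t;

static void sketch_set_bit(uint32_t *bitmap, uint32_t page)
{
    assert(page < SKETCH_PAGE_COUNT);
    bitmap[page / SKETCH_BITS_PER_WORD] |=
        (uint32_t)1 << (page % SKETCH_BITS_PER_WORD);
}

static void sketch_tracker_init(sketch_page_tracker_t *t)
{
    memset(t, 0, sizeof(*t));
}

/* Record that a page is owned and therefore must appear in the image. */
static void sketch_mark_owned(sketch_page_tracker_t *t, uint32_t page)
{
    sketch_set_bit(t->owned, page);
}

/* Record that a page was fed to the hash while the image was written. */
static void sketch_mark_hashed(sketch_page_tracker_t *t, uint32_t page)
{
    sketch_set_bit(t->hashed, page);
}

/*
 * Returns true if every owned page was hashed. Otherwise returns false and,
 * if `missing` is non-NULL, stores the lowest-numbered uncovered page there.
 */
static bool sketch_finalize_check(const sketch_page_tracker_t *t,
                                  uint32_t *missing)
{
    for (uint32_t w = 0; w < SKETCH_WORDS; w++) {
        uint32_t unhashed = t->owned[w] & ~t->hashed[w];
        if (unhashed != 0) {
            uint32_t bit = 0;
            while ((unhashed & 1u) == 0) {
                unhashed >>= 1;
                bit++;
            }
            if (missing != NULL) {
                *missing = w * SKETCH_BITS_PER_WORD + bit;
            }
            return false;
        }
    }
    return true;
}
```

A caller would mark pages as owned when ownership is established, mark them as
hashed as each page is written, and run the final check once, just before the
image header is sealed.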