xref: /xnu-11215.41.3/doc/lifecycle/hibernation.md (revision 33de042d024d46de5ff4e89f2471de6608e37fa4)
XNU hibernation
===============

Suspending the entire system state to persistent storage.

Goal
----

This document discusses the design and implementation of XNU hibernation. The
reader is assumed to generally understand how standard suspend to RAM (S2R)
works in XNU; a detailed discussion of S2R is beyond the scope of this
document.

Vocabulary
----------

* Polled I/O : a mode of operation supported by I/O drivers (primarily storage
               devices) where operations may be conducted from a single-threaded
               context with interrupts disabled
* S2R        : Suspend to RAM (aka sleep)
* WKdm       : Wilson-Kaplan direct mapped compression

Background
----------

In order to prolong battery life, XNU supports suspending/powering off various
devices and preserving the state of those devices in memory. This feature is
referred to as suspend to RAM (S2R). In this mode, IOKit delivers a number of
notifications to IOServices to allow them to participate in S2R.

What is hibernation?
--------------------

Hibernation is a feature built on the foundation of S2R. However, while S2R
preserves state in memory (which must therefore remain powered), hibernation
preserves contents to persistent storage (the disk) and then completely powers
the system off.

Hibernation entry
-----------------

During hibernation entry, XNU invokes essentially the normal S2R machinery, but
with a few hibernation-specific differences:

* PMRootDomain calls `IOHibernateSystemSleep()` before system sleep (devices
  awake, normal execution context).
* `IOHibernateSystemSleep()` opens the hibernation file (or partition) at the
  BSD level, grabs its extents and searches for a polling driver willing to work
  with that IOMedia.
* The BSD code issues an ioctl to the storage driver to get the partition base
  offset to the disk, and other ioctls to get the transfer constraints.
* If successful, the file is written to make sure it's initially not bootable
  (in case of later failure) and the `boot-image` nvram variable is set to point
  to the first block of the file. This has to be done here because writing to
  nvram may block, so it must happen before preemption is disabled.
* `hibernate_page_list_allocate()` is called to allocate page bitmaps for all
  DRAM.
  - The hibernation code represents every page of physical memory in page
    bitmaps of type `hibernate_bitmap_t`. There is one page bitmap per range of
    memory, with a bit to represent each page in that range; these page bitmaps
    are in turn stored in a `hibernate_page_list_t`. The page bitmaps are used
    to represent, for each page, whether preservation of that page is necessary.
  - On ARM64, `secure_hmac_get_io_ranges()` is called to get a list of the I/O
    regions that need to be included in the hibernation image (for example, the
    GPU UAT handoff region). These I/O regions are typically DRAM regions carved
    out by iBoot that exist outside of kernel-managed memory. A page bitmap is
    allocated for each one of these ranges (as well as a single bitmap for the
    kernel-managed DRAM memory).
* `hibernate_processor_setup()` is called to set up some platform-specific state
  needed in the hibernation image header. `hibernate_processor_setup()` also
  sets a flag in the boot processor's `cpu_data_t` to indicate that hibernation
  is in progress on this CPU.
* At this point, `gIOHibernateState` is set to the value
  `kIOHibernateStateHibernating`.
* Regular sleep progresses; some drivers may inspect the root domain property
  `kIOHibernateStateKey` to modify behavior. The platform drivers save state to
  memory as usual, but any drivers required for hibernation I/O are left in a
  state such that polled I/Os can be issued.
* By the time regular sleep has completed, all CPUs but the boot CPU have been
  halted, and we are running on the boot CPU's idle thread in the shutdown
  context, with preemption disabled.
* Eventually the platform calls `hibernate_write_image()` in the shutdown
  context on the last CPU, at which point memory is ready to be saved. This call
  is made from `acpi_hibernate()` on Intel and from `ml_arm_sleep()` on ARM64.
* `hibernate_write_image()` runs in the shutdown context, where no blocking is
  permitted because preemption is disabled. `hibernate_write_image()` calls
  `hibernate_page_list_setall()` to get the page bitmaps of DRAM that need to be
  saved.
* All pages are assumed to be saved (as part of the wired image) unless
  explicitly subtracted by `hibernate_page_list_setall()`.
  `hibernate_page_list_setall()` calls `hibernate_page_list_setall_machine()` to
  make platform-specific amendments to the page bitmaps.
* `hibernate_write_image()` writes the image header and extents list. The header
  includes the second file extent so that only the header block is needed to
  read the file, regardless of the underlying filesystem.
  - The extents list describes the file's layout on disk. This block list makes
    it possible for the platform booter to read the hibernation file from disk
    without having to understand the underlying filesystem.
* Some sections of memory are written directly (and uncompressed) to the image.
  These are the portions of XNU itself that are required during hibernation
  resume, as well as some other data that is required by the platform booter.
  - On Intel, the `__HIB` segment is written to the hibernation image.
  - On ARM64, because of CTRR/KTRR, a single `__HIB` segment isn't possible.
    Instead, a number of sections of the kernel are written:
    `__TEXT_EXEC,__hib_text`, `__DATA,__hib_data`, and
    `__DATA_CONST,__hib_const`. The `__PPL` segment is also stored to the image
    so that the PPL HMAC driver can be used during hibernation resume. Certain
    other pieces of memory must also be written unmodified to the hibernation
    image for use by iBoot. Those pieces are described in the device tree so
    that XNU doesn't need to know the details.
    `secure_hmac_fetch_hibseg_and_info()` is used to determine the set of memory
    regions to be stored in this phase. This routine also calculates an HMAC
    that can be used by the booter to validate this content.
* The portions of XNU (code and data) that are stored directly to the
  hibernation image should be entirely self-contained; these are the only
  portions of XNU that are available during resume to decompress the image.
* Some additional pages are removed from the page bitmaps; these include various
  temporary buffers used for hibernation.
* The page bitmaps are written to the image.
* More areas are removed from the page bitmaps (after they have been written to
  the image); these include the pages already stored directly to the image, as
  well as the stack that hibernation resume will run on.
  `hibernate_page_list_set_volatile()` is invoked to make platform-specific
  amendments to the page bitmaps.
* Each wired page is compressed and written, followed by each non-wired page.
  Compression and disk writes can occur in parallel if the polled mode I/O
  driver supports this.
  - On ARM64, `secure_hmac_update_and_compress_page()` is called for each page
    included in the image so that the PPL can compute an HMAC of the hibernation
    payload.
* The image header records the values of `mach_absolute_time()` and
  `mach_continuous_time()` close to the end of `hibernate_write_image()`. These
  values can be used to fix up the offsets applied to the hardware clock after
  hibernation exit.
* The image header is finalized.
  - On ARM64, `secure_hmac_final()` is called to compute the HMAC of the
    hibernation payload. There are actually two separate HMACs computed, one for
    the wired pages and one for the non-wired pages. These HMACs are stored in
    the image header.
  - On ARM64, `secure_hmac_fetch_rorgn_sha()` and
    `secure_hmac_fetch_rorgn_hmac()` are called to obtain the SHA256 and HMAC of
    the read-only region. These values were calculated on cold boot and are
    stored in the image header. This is described in more detail in the
    "Security details" section of this document.
  - On ARM64, `secure_hmac_finalize_image()` is called to compute the HMAC of
    the header of the image. This is described in more detail in the "Security
    details" section of this document.
* The image header is written to the start of the file and the polling driver
  is closed.
* The machine powers down.
  - On Intel, depending on power settings, the system could sleep instead at
    this point. This allows for "safe sleep" where RAM remains powered until the
    user wakes the system or the battery dies.
  - On ARM64, we do not support this mode because hibernation is intended to
    only be invoked on a critical battery event.
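
The page bitmap bookkeeping used throughout hibernation entry can be sketched
in user-space C. This is a minimal model and not XNU's code: the type and
function names (`page_range_bitmap`, `page_bitmap_list`, `range_init`,
`page_set`, `page_needs_save`) are hypothetical stand-ins for
`hibernate_bitmap_t` and `hibernate_page_list_t`, and the bit polarity (1 means
the page must be preserved) is an assumption chosen for readability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one per-range bitmap: one bit per page. */
typedef struct {
    uint32_t first_page;  /* first physical page number in the range */
    uint32_t last_page;   /* last physical page number (inclusive) */
    uint32_t *words;      /* assumed polarity: 1 = page must be preserved */
} page_range_bitmap;

/* Hypothetical stand-in for the list that holds every per-range bitmap. */
typedef struct {
    size_t count;
    page_range_bitmap *ranges;
} page_bitmap_list;

/* Allocate the bitmap for [first, last]; every page starts as "preserve". */
static bool range_init(page_range_bitmap *r, uint32_t first, uint32_t last)
{
    assert(first <= last);
    size_t nwords = ((size_t)(last - first) + 1 + 31) / 32;

    r->first_page = first;
    r->last_page = last;
    r->words = malloc(nwords * sizeof(uint32_t));
    if (r->words == NULL) {
        return false;
    }
    for (size_t i = 0; i < nwords; i++) {
        r->words[i] = UINT32_MAX;
    }
    return true;
}

/* Return the range covering `page`, or NULL if no range covers it. */
static page_range_bitmap *find_range(const page_bitmap_list *list, uint32_t page)
{
    for (size_t i = 0; i < list->count; i++) {
        page_range_bitmap *r = &list->ranges[i];
        if (page >= r->first_page && page <= r->last_page) {
            return r;
        }
    }
    return NULL;
}

/* Set or clear the "preserve" bit; returns false for an uncovered page. */
static bool page_set(page_bitmap_list *list, uint32_t page, bool preserve)
{
    page_range_bitmap *r = find_range(list, page);
    if (r == NULL) {
        return false;
    }
    uint32_t idx = page - r->first_page;
    uint32_t mask = UINT32_C(1) << (idx % 32);
    if (preserve) {
        r->words[idx / 32] |= mask;
    } else {
        r->words[idx / 32] &= ~mask;
    }
    return true;
}

/* A page outside every range is never saved. */
static bool page_needs_save(const page_bitmap_list *list, uint32_t page)
{
    const page_range_bitmap *r = find_range(list, page);
    if (r == NULL) {
        return false;
    }
    uint32_t idx = page - r->first_page;
    return (r->words[idx / 32] >> (idx % 32)) & 1u;
}
```

Starting from "every page preserved" and then clearing bits mirrors the flow
above, where pages are assumed to be saved unless explicitly subtracted.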

Hibernation exit
----------------

* The platform booter sees the `boot-image` nvram variable containing the device
  and block number of the image, reads the header, and if the signature is
  correct proceeds. The `boot-image` variable is cleared.
  - On ARM64, iBoot takes the read-only region SHA256 value from the image
    header and calculates an HMAC. It then compares the HMAC against the
    value stored in the image header. If they do not match, iBoot panics.
* The platform booter reads the portion of the image used for wired pages into
  memory. It is assumed this will fit in memory in its entirety. The image is
  decrypted (either transparently by ANS or in software, depending on platform
  support). The platform booter is not expected to decompress any of the
  payload; that is the kernel's responsibility.
* The platform booter copies the portions of XNU that were previously saved to
  the image back to their original physical addresses in memory.
* The platform booter invokes `hibernate_machine_entrypoint()`, passing in the
  location of the image in memory. Translation is off. Only code and data that
  were mapped by the booter are safe to use, since all the other wired pages are
  still compressed in the image.
  - On Intel, `hibernate_machine_entrypoint()` sets up a simple temporary page
    table; this page table will later be modified as necessary while pages are
    being restored.
  - On ARM64, `hibernate_machine_entrypoint()` sets up a temporary page table
    such that all of the required XNU code pages are executable, all data pages
    are readable/writable as necessary, and all of the rest of memory is mapped
    such that it can be written to during restore. Some device registers also
    have to be mapped to support serial logging and using the HMAC block.
* Any pages occupied by the raw image are removed from the page bitmaps.
  - On Intel, this is done in `hibernate_kernel_entrypoint()`.
  - On ARM64, we have to do this from `hibernate_machine_entrypoint()` because
    we borrow free pages (as indicated by the page bitmaps) to store the
    temporary page table.
* `hibernate_machine_entrypoint()` calls `hibernate_kernel_entrypoint()`.
* `hibernate_kernel_entrypoint()` uses the page bitmaps to determine which pages
  can be uncompressed from the wired image directly to their final location. Any
  pages that conflict with the image itself are copied to interim scratch space.
* After all of the image has been parsed, the pages that were temporarily copied
  to scratch are uncompressed to their final location, overwriting pages in the
  wired image.
  - `hibernate_restore_phys_page()` is used to actually copy pages to their
    final location.
* At this point, `gIOHibernateState` is set to
  `kIOHibernateStateWakingFromHibernate`.
* `pal_hib_patchup()` is called to perform platform-specific post-resume fixups.
  - On Intel, `pal_hib_patchup()` is a no-op.
  - On ARM64, `pal_hib_patchup()` is responsible for validating the HMAC of the
    wired pages. `pal_hib_patchup()` also fixes up other state (such as some
    PPL-related context).
* After all of the wired pages have been restored, a wake from sleep is
  simulated.
  - On Intel, `hibernate_kernel_entrypoint()` calls `acpi_wake_prot_entry()`.
  - On ARM64, `hibernate_kernel_entrypoint()` returns to
    `hibernate_machine_entrypoint()`, which then jumps to `reset_vector`.
* The kernel proceeds on essentially a normal S2R wake, with some
  hibernation-specific changes.
  - On ARM64, an important difference is that a normal S2R wake on some
    platforms will run through the reconfig engine, whereas a hibernate wake
    cannot invoke the reconfig engine and must emulate some of the reconfig
    sequence on the AP.
  - On ARM64, some further fixup is done in `arm_init_cpu()`.
     + `wake_abstime` needs to be restored to the last absolute time captured
       during hibernation entry. This is necessary because during normal S2R,
       `wake_abstime` is captured too early; later calls to
       `mach_absolute_time()` in the hibernation entry path cause the
       `s_last_absolute_time` test to fail if we don't do this fixup.
     + `hwclock_conttime_offset` is set to the `hwClockOffset` value that iBoot
       computed. This is necessary since `ml_get_hwclock()` does not tick across
       hibernation but `mach_continuous_time()` is expected to.
     + The boot CPU's idle thread `preemption_count` also has to be fixed up.
       This is necessary because the page containing `preemption_count` is
       captured when the count is set to 1 (since the page is captured from
       within the PPL).
* After the platform CPU init code is called, `hibernate_machine_init()` is
  called to restore the rest of memory, using the polled mode driver, before
  other threads can run or any devices are turned on. This split of wired vs.
  non-wired pages reduces the memory usage for the platform booter, and allows
  decompression in parallel with disk reads for the non-wired pages.
* The polling driver is closed down and regular wake proceeds.
* When the kernel calls IOKit to wake (normal execution context),
  `hibernate_teardown()` is called to release any memory.
* The hibernation file is closed via BSD.
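
The two-pass restore of wired pages (restore directly where possible, defer
pages whose destination overlaps the raw image) can be illustrated with a toy
model. This is a sketch under stated assumptions, not the logic of
`hibernate_kernel_entrypoint()`: `image_record`, `restore_pages`, and
`SCRATCH_MAX` are hypothetical names, each "page" is a single byte, payloads are
not compressed, and the records live outside `mem`, so the conflict with the
image is only simulated by a range check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCRATCH_MAX 64  /* hypothetical scratch capacity, in records */

/* One record of the toy image: a payload byte and its destination page. */
typedef struct {
    uint32_t dst_page;
    uint8_t payload;
} image_record;

/*
 * Toy model: `mem` holds one byte per physical page and the raw image is
 * assumed to occupy pages [img_first, img_last].
 * Returns the number of deferred records, or -1 if scratch space runs out.
 */
static int restore_pages(uint8_t *mem, const image_record *recs, size_t nrecs,
                         uint32_t img_first, uint32_t img_last)
{
    image_record scratch[SCRATCH_MAX];
    size_t deferred = 0;

    assert(img_first <= img_last);

    /* Pass 1: walk the image; restore directly unless the destination
     * lies inside the image itself, in which case stash the record. */
    for (size_t i = 0; i < nrecs; i++) {
        uint32_t dst = recs[i].dst_page;
        if (dst < img_first || dst > img_last) {
            mem[dst] = recs[i].payload;
        } else {
            if (deferred == SCRATCH_MAX) {
                return -1;
            }
            scratch[deferred++] = recs[i];
        }
    }

    /* Pass 2: the image has been fully parsed, so it is now safe to
     * overwrite the pages it occupied. */
    for (size_t i = 0; i < deferred; i++) {
        mem[scratch[i].dst_page] = scratch[i].payload;
    }
    return (int)deferred;
}
```

Deferring the conflicting pages until the whole image has been read is what
makes it safe to overwrite the image's own pages at the end.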

Hibernation file management
---------------------------

powerd in userspace is responsible for managing the lifecycle of the hibernation
file. The details of this lifecycle are beyond the scope of this document, but
essentially, powerd creates the file and preallocates its space the first time
the system hibernates. powerd can also grow the file as necessary.

Security details
----------------

### Intel:

* The hibernation image is encrypted with a key obtained from the APFS
  `APFSMEDIA_GETHIBERKEY` platform function.

### ARM64:

* The hibernation image is encrypted with a key obtained from the SEP. The
  details for how this key is derived and used are beyond the scope of this
  document, but are documented in detail in the AppleSEPOS project
  (doc/SecureHibernation).
* Various portions of the hibernation image have HMACs calculated over them. All
  HMACs are calculated by the PPL. The exact scheme for computing these HMACs is
  documented in more detail in ppl_hib.c, but the HMACs that are calculated are:
  - `imageHeaderHMAC` is an HMAC of the header of the image, up to
    `imageHeaderHMACSize`. However, because of the order in which data is
    written (the header is the last thing actually written), the HMAC is
    actually calculated as
    `HMAC(SHA([data after header up to imageHeaderHMACSize], [header]))`.
  - `handoffHMAC` is an HMAC of the `IOHibernateHandoff` data passed from iBoot
    to XNU.
  - `image1PagesHMAC` is an HMAC of the wired pages that were stored to the
    hibernation image.
  - `image2PagesHMAC` is an HMAC of the non-wired pages that were stored to the
    hibernation image.
* The PPL hibernation driver also keeps track of every PPL-owned page being
  hashed (both kernel-managed memory and I/O memory owned by the PPL). This is
  double-checked in `secure_hmac_finalize_image()` to ensure that all PPL-owned
  memory is included in the hibernation image. Any missing page panics the
  system, as the absence of PPL pages in the image could be a security risk (and
  is surely a bug).
* During early boot, `secure_hmac_compute_rorgn_hmac()` is used to measure the
  entirety of the rorgn. On hibernation resume, the same function is invoked to
  verify that the rorgn matches its original contents.
  - Only the SHA256 of the rorgn is compared on resume. The SIO HMAC key1, used
    to compute this HMAC, is invalidated by iBoot on the resume path after it
    verifies the HMAC. See rdar://75750348 (xnu should store the SHA of the
    read-only region along with the hash in memory for iBoot to validate on
    hibernate resume).
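
The "every PPL-owned page must have been hashed" check can be pictured with a
small coverage tracker. This is a minimal sketch, not the PPL driver: the
`coverage_tracker` type and the `tracker_*` functions are hypothetical, the
model is limited to 64 pages so that each set fits in one `uint64_t`, and a
failed check is reported with a boolean where the real system would panic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGES 64u  /* toy limit: one bit per page in a uint64_t */

typedef struct {
    uint64_t owned;   /* bit i set: page i belongs to the protected layer */
    uint64_t hashed;  /* bit i set: page i has been fed to the HMAC */
} coverage_tracker;

/* Record that a page is owned by the protected layer. */
static void tracker_mark_owned(coverage_tracker *t, unsigned page)
{
    assert(page < MODEL_PAGES);
    t->owned |= UINT64_C(1) << page;
}

/* Record that a page (owned or not) was included in the hashed payload. */
static void tracker_note_hashed(coverage_tracker *t, unsigned page)
{
    assert(page < MODEL_PAGES);
    t->hashed |= UINT64_C(1) << page;
}

/* Number of owned pages that were never hashed. */
static unsigned tracker_missing(const coverage_tracker *t)
{
    uint64_t missing = t->owned & ~t->hashed;
    unsigned n = 0;

    while (missing != 0) {
        missing &= missing - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Finalization check: true only when every owned page was hashed. */
static bool tracker_finalize_ok(const coverage_tracker *t)
{
    return tracker_missing(t) == 0;
}
```

Hashing extra, non-owned pages is harmless in this model; only an owned page
that was never hashed causes the finalization check to fail.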