xref: /xnu-12377.41.6/doc/arm/sme.md (revision bbb1b6f9e71b8cdde6e5cd6f4841f207dee3d828)

ARM Scalable Matrix Extension
=============================

Managing hardware resources related to SME state.

Introduction
------------

This document describes how xnu manages the hardware resources associated with
ARM's Scalable Matrix Extension (SME).

SME is an ARMv9 extension intended to accelerate matrix math operations.  SME
builds on top of ARM's previous Scalable Vector Extension (SVE), which extends
the length of the FPSIMD register files and adds new 1D vector-math
instructions.  SME extends SVE by adding a matrix register file and associated
2D matrix-math instructions.  SME2 further extends SME with additional
instructions and register state.

This document summarizes SVE, SME, and SME2 hardware features that are relevant
to xnu.  It is not intended as a full programming guide for SVE or SME: readers
may find a full description of these ISAs in the
[SVE supplement to the ARM ARM](https://developer.arm.com/documentation/ddi0584/latest/)
and [SME supplement to the ARM ARM](https://developer.arm.com/documentation/ddi0616/latest/),
respectively.



Hardware overview
-----------------

### EL0-accessible state

SVE, SME, and SME2 introduce four new EL0-accessible register
files<sup>[1](#feat_sve_footnote)</sup>:

- vector registers `Z0`-`Z31`
- predicate registers `P0`-`P15`
- matrix data `ZA` (SME/SME2 only)
- look-up table `ZT0` (SME2 only)

These register files are unbanked, i.e., their contents are shared across all
exception levels.  Data can be copied between these registers and system memory
using specialized `ldr` and `str` variants.  SME also adds `mov` variants that
can directly copy data between the vector and matrix register files.

Most of these register files supplement, rather than replace, the existing ARM
register files.  However the `Z` register file effectively extends the length of
the existing FPSIMD `V` register file.  Instructions targeting the `V` register
file will now access the lower 128 bits of the corresponding `Z` register.

The size of most of these files is defined by the *streaming vector length*
(SVL), a power-of-two between 128 and 2048 inclusive.  Each `Z` register is SVL
bits in size; each `P` register is SVL / 8 bits in size; and `ZA` is
(SVL / 8) x (SVL / 8) bytes in size.  The value of SVL is determined by both
hardware and software.  Hardware places an implementation-defined cap on SVL,
and privileged software can further reduce SVL for itself and lower exception
levels.

In contrast, `ZT0` is fixed at 512 bits, independent of SVL.
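
The size rules above can be sketched as a small helper.  This is an
illustration rather than xnu code: `struct sme_sizes` and `sme_sizes_for_svl()`
are hypothetical names, and the `ZA` size follows the `svl_b * svl_b` byte count
used by the sample code in the appendix.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes in bytes of the register files discussed above (illustrative names). */
struct sme_sizes {
    size_t z_reg;   /* one Z register: SVL bits */
    size_t p_reg;   /* one P register: SVL / 8 bits */
    size_t za;      /* ZA array: (SVL / 8) x (SVL / 8) bytes */
    size_t zt0;     /* ZT0: fixed at 512 bits */
};

/* svl_bits must be a power of two between 128 and 2048 inclusive. */
static struct sme_sizes
sme_sizes_for_svl(unsigned svl_bits)
{
    assert(svl_bits >= 128 && svl_bits <= 2048);
    assert((svl_bits & (svl_bits - 1)) == 0);

    size_t svl_bytes = svl_bits / 8;
    struct sme_sizes s = {
        .z_reg = svl_bytes,
        .p_reg = svl_bytes / 8,
        .za    = svl_bytes * svl_bytes,
        .zt0   = 512 / 8,
    };
    return s;
}
```

For example, an SVL of 512 bits gives 64-byte `Z` registers, 8-byte `P`
registers, and a 4-kilobyte `ZA` array.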

SME also adds a single EL0-accessible system register `TPIDR2_EL0`.  Like
`TPIDR_EL0`, `TPIDR2_EL0` is officially reserved for ABI use, but its contents
have no particular meaning to hardware.

### `PSTATE` changes

SME adds two orthogonal states to `PSTATE`.

`PSTATE.SM` moves the CPU in and out of a special execution mode called
*streaming SVE mode*.  Software must enter streaming SVE mode to execute most
SME instructions.  However software must then exit streaming SVE mode to execute
many legacy SIMD instructions<sup>[2](#feat_sme_fa64_footnote)</sup>.  To make
things even more complicated, these transitions cause the CPU to zero out the
`V`/`Z` and `P` register files, and to set all `FPSR` flags.  When software
needs to retain this state across `PSTATE.SM` transitions, it must manually
stash the state in memory.

`PSTATE.ZA` independently controls whether the contents of `ZA` and `ZT0` are
valid.  Setting `PSTATE.ZA` zeroes out both register files, and enables
instructions that access them.  Clearing `PSTATE.ZA` causes `ZA` and `ZT0`
accesses to trap.

Most SME instructions require both `PSTATE.SM` and `PSTATE.ZA` to be
set, so software usually toggles both bits at the same time.  However setting
these bits independently can be useful when software needs to interleave SME and
FPSIMD instructions.  If software needs to temporarily exit streaming SVE mode
to execute FPSIMD instructions, setting `PSTATE.{SM,ZA} = {0,1}` will do so
without clobbering the `ZA` or `ZT0` array.

`PSTATE.{SM,ZA} = {0,0}` acts as a hint to the CPU that it may power down
SME-related hardware.  Hence software should clear these bits as soon as
SME state can be discarded.

These `PSTATE` bits are accessible to software in several ways:

- Reads or writes to the `SVCR` system register, which packs both bits into
  a single register
- Writes to the `SVCRSM`, `SVCRZA`, or `SVCRSMZA` system registers with the
  immediate values `0` or `1`, which directly modify the specified bit(s)
- `sm{start,stop} (sm|za)` pseudo-instructions, which are assembler aliases for
  the above `msr` instructions
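
As a rough illustration of how the two bits are packed, the following sketch
decodes an `SVCR` value in plain C.  The bit positions match the `SVCR_SM` and
`SVCR_ZA` constants used by the sample code in the appendix; the helper names
are hypothetical, and `svcr_write_bits()` only models the effect of an
immediate write on the bits themselves, not the register zeroing that hardware
performs alongside it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SVCR_SM  (UINT64_C(1) << 0)   /* PSTATE.SM: streaming SVE mode */
#define SVCR_ZA  (UINT64_C(1) << 1)   /* PSTATE.ZA: ZA and ZT0 are valid */

static bool
svcr_streaming(uint64_t svcr)
{
    return (svcr & SVCR_SM) != 0;
}

static bool
svcr_za_valid(uint64_t svcr)
{
    return (svcr & SVCR_ZA) != 0;
}

/* PSTATE.{SM,ZA} = {0,0} is the state that hints SME hardware may power down. */
static bool
svcr_is_power_down_hint(uint64_t svcr)
{
    return (svcr & (SVCR_SM | SVCR_ZA)) == 0;
}

/* Model of an immediate write that sets or clears the selected bit(s). */
static uint64_t
svcr_write_bits(uint64_t svcr, uint64_t mask, bool enable)
{
    assert((mask & ~(SVCR_SM | SVCR_ZA)) == 0);
    return enable ? (svcr | mask) : (svcr & ~mask);
}
```

Clearing only `SVCR_SM` from a value with both bits set leaves `SVCR_ZA`
intact, which corresponds to the `PSTATE.{SM,ZA} = {0,1}` case described above.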

Regardless of which method is used to access these bits, software generally does
not need explicit barriers.  Specifically, ARM guarantees that all direct and
indirect reads from these bits will appear in program order relative to any
direct writes.

### Other hardware resources

An implementation may share SME compute resources across multiple CPUs.  In this
case, the per-CPU `SMPRI_EL1` controls the relative priority of the SME
instructions issued by that CPU.  ARM guarantees that higher `SMPRI_EL1` values
indicate higher priorities, and that setting `SMPRI_EL1 = 0` on all CPUs is a
safe way to disable SME prioritization.  Otherwise the exact meaning of
`SMPRI_EL1` is implementation-defined.

EL2 may trap guest reads and writes to `SMPRI_EL1` using the fine-grained trap
controls `HFGRTR_EL2.nSMPRI_EL1` and `HFGWTR_EL2.nSMPRI_EL1`, respectively.
Alternatively, EL2 may adjust the effective SME priority at EL0 and EL1 without
trapping, by populating the lookup table register `SMPRIMAP_EL2` and setting the
control bit `HCRX_EL2.SMPME`.  When `HCRX_EL2.SMPME` is set, SME instructions
executed at EL0 and EL1 will interpret `SMPRI_EL1` as an index into
`SMPRIMAP_EL2` rather than as a raw priority value.
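
The priority-mapping lookup can be modelled in a few lines.  This sketch
assumes that `SMPRI_EL1` carries a 4-bit priority in its low bits and that
`SMPRIMAP_EL2` packs sixteen 4-bit entries, with entry N in bits
[4N+3:4N]; check those layouts against the architecture manual before relying
on them.  `effective_sme_priority()` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Effective SME priority for code running at EL0 or EL1.
 * With SMPME clear the raw SMPRI_EL1 priority is used directly; with SMPME
 * set it selects one of the sixteen 4-bit entries in SMPRIMAP_EL2.
 */
static unsigned
effective_sme_priority(uint64_t smpri_el1, uint64_t smprimap_el2, bool smpme)
{
    unsigned raw = (unsigned)(smpri_el1 & 0xf);
    unsigned result = raw;

    if (smpme) {
        result = (unsigned)((smprimap_el2 >> (4 * raw)) & 0xf);
    }
    assert(result <= 0xf);
    return result;
}
```

With every map entry set to `0`, any guest-visible `SMPRI_EL1` value resolves
to priority `0` once `SMPME` is set.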

`SMIDR_EL1` advertises hardware properties about the SME implementation,
including whether SME execution priority is implemented.

`CPACR_EL1` and `CPTR_ELx` have controls that can trap SVE and SME operations.
Two of these are relevant to Apple's SME
implementation<sup>[3](#cpacr_zen_footnote)</sup>:

- `SMEN`: trap SME instructions and register accesses, including SVE
  instructions executed during streaming SVE mode.
- `FPEN`: trap FPSIMD, SME, and SVE instructions and most register accesses, but
  *not* `SVCR` accesses.  Lower priority than `SMEN`.

Several SME registers aren't affected by these controls, since they have their
own trapping mechanisms.  `SMPRI_EL1` has fine-grained hypervisor trap controls
as described above.  `SMIDR_EL1` accesses can trap to the hypervisor using the
existing `HCR_EL2.TID1` control bit.  Finally `TPIDR2_EL0` has a dedicated
control bit `SCTLR_ELx.EnTP2` along with fine-grained trap controls
`HFG{R,W}TR_EL2.nTPIDR2_EL0`.


Software usage
--------------

### SME `PSTATE` management

xnu has in-kernel SIMD instructions<sup>[4](#xnu_simd_footnote)</sup> which
become illegal while the CPU is in streaming SVE mode.  This poses a problem if
xnu interrupts EL0 while it is in the middle of executing SME-accelerated code.

Hence, anytime xnu enters the kernel with `PSTATE.SM` set, it saves the current
`Z`, `P`, and `SVCR` values and then clears `PSTATE.SM`.  xnu later restores
these values during kernel exit.  These operations occur in an assembly-only
module (`locore.s`) where we have strict control over code generation, and can
guarantee that no problematic SIMD instructions are executed while `PSTATE.SM`
is set.

Since the kernel may interrupt *itself*, kernel code is forbidden from entering
streaming SVE mode.  This policy means that xnu does not need to preserve
`TPIDR2_EL0`, `ZA`, or `ZT0` during kernel entry and exit, since there are no
in-kernel SME operations that could clobber them.
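
The entry and exit policy can be illustrated with a toy software model.  The
real logic lives in assembly and operates on hardware registers; everything
below (`struct model_cpu`, `model_kernel_entry()`, the fixed `MODEL_SVL_B`) is
a hypothetical stand-in, and `FPSR` handling is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SVCR_SM (UINT64_C(1) << 0)
#define SVCR_ZA (UINT64_C(1) << 1)
#define MODEL_SVL_B 64   /* hypothetical streaming vector length in bytes */

/* Toy CPU: only the state that matters at kernel entry and exit. */
struct model_cpu {
    uint64_t svcr;
    uint8_t  z[32][MODEL_SVL_B];
    uint8_t  p[16][MODEL_SVL_B / 8];
};

/* Per-thread save area for streaming state (illustrative layout). */
struct model_saved {
    uint64_t svcr;
    uint8_t  z[32][MODEL_SVL_B];
    uint8_t  p[16][MODEL_SVL_B / 8];
};

/* Changing PSTATE.SM zeroes Z and P, mirroring the hardware behavior. */
static void
model_set_sm(struct model_cpu *cpu, bool sm)
{
    bool was = (cpu->svcr & SVCR_SM) != 0;
    if (was == sm) {
        return;
    }
    cpu->svcr = sm ? (cpu->svcr | SVCR_SM) : (cpu->svcr & ~SVCR_SM);
    memset(cpu->z, 0, sizeof(cpu->z));
    memset(cpu->p, 0, sizeof(cpu->p));
}

/* Kernel entry: stash SVCR, and Z/P if streaming, then leave streaming mode. */
static void
model_kernel_entry(struct model_cpu *cpu, struct model_saved *save)
{
    save->svcr = cpu->svcr;
    if (cpu->svcr & SVCR_SM) {
        memcpy(save->z, cpu->z, sizeof(cpu->z));
        memcpy(save->p, cpu->p, sizeof(cpu->p));
        model_set_sm(cpu, false);
    }
}

/* Kernel exit: re-enter streaming mode first, then reload Z and P. */
static void
model_kernel_exit(struct model_cpu *cpu, const struct model_saved *save)
{
    assert((cpu->svcr & SVCR_SM) == 0);   /* kernel never runs in streaming mode */
    if (save->svcr & SVCR_SM) {
        model_set_sm(cpu, true);
        memcpy(cpu->z, save->z, sizeof(cpu->z));
        memcpy(cpu->p, save->p, sizeof(cpu->p));
    }
}
```

Note that the model restores `Z` and `P` only after setting `PSTATE.SM`, since
the mode change itself zeroes both register files.  `PSTATE.ZA` is never
touched on this path.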

### Context switching

xnu saves and restores `TPIDR2_EL0`, `ZA`, and `ZT0` inside the ARM64
implementation of `machine_switch_context()`, specifically as the routines
`machine_{save,restore}_sme_context()` in `osfmk/arm64/pcb.c`.  These in turn
build on lower-level routines to save and load SME register state, located in
`osfmk/arm64/sme.c`.  The low-level routines are built on top of the SME `str`
and `ldr` instructions, which can be executed outside of streaming SVE mode.

`machine_{save,restore}_sme_context()` unconditionally save and restore
`TPIDR2_EL0`, since its contents are valid even when EL0 isn't actually using
SME.  However `ZA`'s and `ZT0`'s contents are often invalid and hence do not
require context-switching.  `machine_save_sme_context()` reads `SVCR.ZA`
to determine if the `ZA` and `ZT0` arrays were actually valid at context-switch
time.  If not, it skips saving the invalid `ZA` and `ZT0` contents.

Likewise, when context-switching back to a thread where the saved-state
`SVCR.ZA` is cleared, `machine_restore_sme_context()` simply ensures that the
CPU's `PSTATE.ZA` bit is cleared (executing `smstop za` if necessary).  xnu does
not need to manually invalidate any `ZA` or `ZT0` state left by a previous
thread: the next time `PSTATE.ZA` is enabled, the CPU is architecturally
guaranteed to zero out both register files.
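
A simplified model of the save and restore decisions is shown below.  It is
not the code in `osfmk/arm64/pcb.c`: the structures, the `model_` functions,
and the fixed `MODEL_ZA_BYTES` are hypothetical, and `ZT0` is omitted because
it follows the same validity rule as `ZA`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SVCR_ZA (UINT64_C(1) << 1)
#define MODEL_ZA_BYTES (64 * 64)   /* hypothetical: SVL of 512 bits */

struct model_cpu {
    uint64_t svcr;
    uint64_t tpidr2_el0;
    uint8_t  za[MODEL_ZA_BYTES];
};

struct model_thread_sme {
    bool     za_valid;     /* saved copy of SVCR.ZA */
    uint64_t tpidr2_el0;
    uint8_t  za[MODEL_ZA_BYTES];
};

/* Enabling PSTATE.ZA zeroes ZA; disabling it just marks the contents invalid. */
static void
model_set_za(struct model_cpu *cpu, bool za)
{
    bool was = (cpu->svcr & SVCR_ZA) != 0;
    if (za && !was) {
        memset(cpu->za, 0, sizeof(cpu->za));
    }
    cpu->svcr = za ? (cpu->svcr | SVCR_ZA) : (cpu->svcr & ~SVCR_ZA);
}

static void
model_save_sme_context(const struct model_cpu *cpu, struct model_thread_sme *t)
{
    t->tpidr2_el0 = cpu->tpidr2_el0;           /* always valid, always saved */
    t->za_valid = (cpu->svcr & SVCR_ZA) != 0;
    if (t->za_valid) {
        memcpy(t->za, cpu->za, sizeof(t->za)); /* skipped when ZA is invalid */
    }
}

static void
model_restore_sme_context(struct model_cpu *cpu, const struct model_thread_sme *t)
{
    assert(cpu != NULL && t != NULL);
    cpu->tpidr2_el0 = t->tpidr2_el0;
    if (t->za_valid) {
        model_set_za(cpu, true);
        memcpy(cpu->za, t->za, sizeof(cpu->za));
    } else {
        /* No scrubbing needed: stale ZA is zeroed on the next enable. */
        model_set_za(cpu, false);
    }
}
```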

As noted above, xnu saves `SVCR` on kernel entry and uses it to restore
`PSTATE.SM` on kernel exit.  Hence `machine_restore_sme_context()` updates
`PSTATE.ZA` to match the new thread's saved state, but doesn't update
`PSTATE.SM`.  Likewise `machine_restore_sme_context()` doesn't manipulate the `Z`
or `P` register files, since these will be updated on kernel exit.

Since SME thread state (`thread->machine.usme`) is large, and won't be used by
most threads, xnu lazily allocates the backing memory the first time a thread
encounters an SME instruction.  This is implemented by clearing `CPACR_EL1.SMEN`
inside `machine_restore_sme_context()`, then performing the allocation during
the subsequent SME trap.
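
The lazy-allocation pattern can be sketched as follows.  This is a user-space
model under stated assumptions, not kernel code: `smen_enabled` stands in for
the `SMEN` trap control, `calloc()` stands in for the kernel allocator, and the
rule that threads with existing state run with SME enabled is an assumption
made for the sake of the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct model_sme_state {
    unsigned char za[64 * 64];   /* hypothetical size */
};

struct model_thread {
    struct model_sme_state *usme;   /* NULL until the thread first uses SME */
};

struct model_cpu {
    bool smen_enabled;   /* stands in for the SMEN trap control */
};

/* Context-switch path: leave SME trapping unless the thread already has state. */
static void
model_restore_sme_enable(struct model_cpu *cpu, const struct model_thread *t)
{
    cpu->smen_enabled = (t->usme != NULL);
}

/* SME trap path: allocate zeroed state on first use, then stop trapping. */
static bool
model_handle_sme_trap(struct model_cpu *cpu, struct model_thread *t)
{
    assert(!cpu->smen_enabled);   /* a trap implies SME was disabled */
    if (t->usme == NULL) {
        t->usme = calloc(1, sizeof(*t->usme));
        if (t->usme == NULL) {
            return false;
        }
    }
    cpu->smen_enabled = true;
    return true;
}
```

Threads that never execute an SME instruction never take the trap, so they
never pay for the allocation.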

### Execution priority

xnu does not currently have an API for changing SME execution priority.
Accordingly xnu resets `SMPRI_EL1` to `0` during CPU initialization, and
otherwise does not modify it at runtime.

### Power management

xnu updates `PSTATE.ZA` during `machine_restore_sme_context()` using the `SVCR`
value stashed in the new thread's SME state.  If the new thread has never used
SME, and hence doesn't have saved `ZA` state, xnu unconditionally clears
`PSTATE.ZA`.  This policy means that xnu issues the power-down hint
`PSTATE.{SM,ZA} = {0,0}` on every context-switch, unless the new thread has live
`ZA` state.  (Recall that `PSTATE.SM` was previously cleared on kernel entry.)

By extension, xnu will always issue this hint before entering WFI.  In order to
reach `arm64_retention_wfi()`, xnu must first context-switch to the idle thread,
which never has `ZA` state.

### Virtualizing SME

SME introduces a number of new registers that the hypervisor needs to manage.
`SMCR_ELx` is the only one of these that's banked between EL1 and EL2.  The
`SVCR`, `SMPRI_EL1`, and `TPIDR2_EL0` system registers are all shared between
the host and guest, and must be managed by the host hypervisor accordingly.

More critically, the `Z`, `P`, `ZA`, and `ZT0` register files are also shared
across all exception levels.  To minimize the cost of managing this unbanked SME
register state, xnu tries to keep the guest matrix state resident in the CPU as
long as possible, even when the guest traps to EL2.  xnu will only spill the `ZA`
and `ZT0` state back to memory when one of two things happens:

(1) The `hv_vcpu_run` trap handler returns control all the way back to the VMM
    thread at host EL0

(2) xnu needs to context-switch the host VMM thread that owns the vCPU

In these cases xnu will spill the guest `ZA` and `ZT0` state back to memory,
then replace them with the VMM thread's or new thread's state (respectively).

Unfortunately since xnu has to disable streaming SVE mode to handle traps, it's
still forced to spill `Z` and `P` state to memory anytime the guest traps to EL2
with `PSTATE.SM` set.
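
The spill policy can be summarized as a decision function.  The enum, struct,
and function names below are hypothetical.  The sketch also assumes that, as
with host threads, there is nothing to spill for `ZA` and `ZT0` when the guest's
`PSTATE.ZA` is clear; the text above does not state this explicitly.

```c
#include <assert.h>
#include <stdbool.h>

enum model_vcpu_event {
    GUEST_TRAP_HANDLED_IN_KERNEL,   /* trap to EL2, guest resumed without leaving the kernel */
    RETURN_TO_VMM_EL0,              /* hv_vcpu_run returns to the VMM thread */
    HOST_CONTEXT_SWITCH,            /* VMM thread that owns the vCPU is switched out */
};

struct model_spill {
    bool z_and_p;      /* streaming vector state */
    bool za_and_zt0;   /* matrix state */
};

static struct model_spill
model_guest_spill(enum model_vcpu_event ev, bool guest_sm, bool guest_za)
{
    /* Z and P are spilled on every trap taken with PSTATE.SM set. */
    struct model_spill s = { .z_and_p = guest_sm, .za_and_zt0 = false };

    switch (ev) {
    case GUEST_TRAP_HANDLED_IN_KERNEL:
        break;                       /* matrix state stays resident */
    case RETURN_TO_VMM_EL0:
    case HOST_CONTEXT_SWITCH:
        s.za_and_zt0 = guest_za;     /* assumption: only live state is spilled */
        break;
    default:
        assert(!"unknown event");
        break;
    }
    return s;
}
```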

Since xnu doesn't currently support SME prioritization, it sets `HCRX_EL2.SMPME`
and populates all `SMPRIMAP_EL2` entries with a value of `0`.  Guest OSes are
still allowed to write to `SMPRI_EL1`, but currently this has no effect on
the actual hardware priority.


Appendix: Mach thread-state APIs
--------------------------------

Low-level tools (e.g., debuggers) may access thread SVE and SME state through
the standard Mach APIs `thread_{get,set}_state`.  But because SVE and SME
register state are large and have implementation-defined size, accessing this
state can be more complicated than other thread state flavors.

xnu splits the SVE and SME thread state into several flavors:

| Flavor                                       | C thread-state type   | Description               |
|----------------------------------------------|-----------------------|---------------------------|
| `ARM_SME_STATE`                              | `arm_sme_state_t`     | SVCR, TPIDR2_EL0, and SVL |
| `ARM_SVE_Z_STATE1`, `ARM_SVE_Z_STATE2`       | `arm_sve_z_state_t`   | Z register file           |
| `ARM_SVE_P_STATE`                            | `arm_sve_p_state_t`   | P register file           |
| `ARM_SME_ZA_STATE1` ... `ARM_SME_ZA_STATE16` | `arm_sme_za_state_t`  | ZA register file          |
| `ARM_SME2_STATE`                             | `arm_sme2_state_t`    | ZT0 register file         |

`arm_sve_z_state_t`, `arm_sve_p_state_t`, and `arm_sme_za_state_t` are
statically sized for a vector length of 2048 bits, the largest vector length
allowed by the ARM architecture.  In practice, all Apple CPUs currently use a
smaller vector length.  In this case `thread_get_state` will pad the unused
upper bits of each `z`, `p`, and `za` field with zeroes.  Likewise,
`thread_set_state` will ignore any unused upper bits.
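
The padding rule can be illustrated for a single `Z` register.  The helpers
below are hypothetical and only model the copy semantics described above; they
are not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_SVL_B 256   /* 2048 bits, the architectural maximum */

/* thread_get_state direction: copy the live bytes, zero the unused upper bytes. */
static void
pad_register(uint8_t dst[MAX_SVL_B], const uint8_t *live, size_t svl_b)
{
    assert(svl_b <= MAX_SVL_B);
    memcpy(dst, live, svl_b);
    memset(dst + svl_b, 0, MAX_SVL_B - svl_b);
}

/* thread_set_state direction: take the live bytes, ignore everything above them. */
static void
unpad_register(uint8_t *live, const uint8_t src[MAX_SVL_B], size_t svl_b)
{
    assert(svl_b <= MAX_SVL_B);
    memcpy(live, src, svl_b);
}
```

A caller that wants only the meaningful bytes should therefore copy `svl_b`
bytes per register, as the sample code later in this appendix does.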

`Z` can architecturally be up to 8 kilobytes in size.  Since this is too large
to fit in a single Mach message, xnu's Mach thread-state APIs divide the `Z`
register space into two different thread-state flavors.  Thread-state flavor
`ARM_SVE_Z_STATE1` accesses Z0-Z15, and thread-state flavor `ARM_SVE_Z_STATE2`
accesses Z16-Z31.

xnu likewise divides `ZA` into 4-kilobyte windows.  Thread-state flavor
`ARM_SME_ZA_STATE1` accesses the first 4 kilobytes of ZA space,
`ARM_SME_ZA_STATE2` accesses the next 4 kilobytes of ZA space, and so on up to
`ARM_SME_ZA_STATE16`.  When `ZA` is smaller than 4 kilobytes, `thread_get_state`
will pad the unused upper bytes of `arm_sme_za_state_t::za` with zeroes, and
`thread_set_state` will ignore any unused upper bytes.
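
The windowing arithmetic can be sketched with three small helpers.  The names
are hypothetical; the sketch assumes `svl_b` is the streaming vector length in
bytes and that `ZA` occupies `svl_b * svl_b` bytes, as in the sample code below.

```c
#include <assert.h>
#include <stddef.h>

#define ZA_WINDOW_BYTES 4096
#define ZA_MAX_WINDOWS  16

/* Number of ARM_SME_ZA_STATEn flavors that carry live data for this SVL. */
static size_t
za_window_count(size_t svl_b)
{
    size_t za_bytes = svl_b * svl_b;
    size_t n = (za_bytes + ZA_WINDOW_BYTES - 1) / ZA_WINDOW_BYTES;
    assert(n >= 1 && n <= ZA_MAX_WINDOWS);
    return n;
}

/* Zero-based window index (0 means ARM_SME_ZA_STATE1) holding a ZA byte offset. */
static size_t
za_window_index(size_t za_byte_offset)
{
    return za_byte_offset / ZA_WINDOW_BYTES;
}

/* Live bytes in each window; less than a full window when ZA is under 4 KB. */
static size_t
za_live_bytes_per_window(size_t svl_b)
{
    size_t za_bytes = svl_b * svl_b;
    return za_bytes < ZA_WINDOW_BYTES ? za_bytes : ZA_WINDOW_BYTES;
}
```

At the architectural maximum (`svl_b` of 256) `ZA` is 64 kilobytes, which is
why sixteen flavors are defined.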

`thread_{get,set}_state` will return `KERN_INVALID_ARGUMENT` if software tries
to do any of the following:

* Access SME or SME2 state on a CPU that doesn't implement FEAT_SME or FEAT_SME2
  (respectively)
* Access `Z` or `P` state when the target thread's `SVCR.SM` bit is cleared
* Access `ZA` or `ZT0` state when the target thread's `SVCR.ZA` bit is cleared
* Change the current `svl` value while setting `ARM_SME_STATE`
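
These rejection rules can be expressed as a single check.  The model below
uses hypothetical stand-in enums rather than the real Mach flavor constants and
`kern_return_t` values, and it assumes that the `Z` and `P` flavors also require
FEAT_SME, since Apple's implementation exposes them only in streaming SVE mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SVCR_SM (UINT64_C(1) << 0)
#define SVCR_ZA (UINT64_C(1) << 1)

/* Illustrative stand-ins for kern_return_t values and thread-state flavors. */
enum model_result { MODEL_SUCCESS, MODEL_INVALID_ARGUMENT };
enum model_flavor {
    MODEL_SME_STATE, MODEL_Z_STATE, MODEL_P_STATE, MODEL_ZA_STATE, MODEL_SME2_STATE
};

struct model_target {
    bool     has_sme, has_sme2;   /* FEAT_SME / FEAT_SME2 */
    uint64_t svcr;                /* target thread's saved SVCR */
    uint16_t svl_b;               /* current streaming vector length in bytes */
};

/* new_svl_b is only consulted when setting MODEL_SME_STATE. */
static enum model_result
model_check_access(const struct model_target *t, enum model_flavor f,
    bool is_set, uint16_t new_svl_b)
{
    if (!t->has_sme || (f == MODEL_SME2_STATE && !t->has_sme2)) {
        return MODEL_INVALID_ARGUMENT;
    }
    switch (f) {
    case MODEL_Z_STATE:
    case MODEL_P_STATE:
        return (t->svcr & SVCR_SM) ? MODEL_SUCCESS : MODEL_INVALID_ARGUMENT;
    case MODEL_ZA_STATE:
    case MODEL_SME2_STATE:
        return (t->svcr & SVCR_ZA) ? MODEL_SUCCESS : MODEL_INVALID_ARGUMENT;
    case MODEL_SME_STATE:
        return (is_set && new_svl_b != t->svl_b) ? MODEL_INVALID_ARGUMENT : MODEL_SUCCESS;
    }
    assert(!"unknown flavor");
    return MODEL_INVALID_ARGUMENT;
}
```

A debugger would typically read `ARM_SME_STATE` first and use the returned
`SVCR` to decide which of the remaining flavors are worth requesting.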

xnu does not currently support sending SME or SVE thread state with Mach
exception messages.  Mach APIs that set exception ports, such as
`thread_set_exception_ports`, will return `KERN_INVALID_ARGUMENT` if the
requested `flavor` is one of the values described in this appendix.

### Sample code

The following C code illustrates how to interpret SME and SME2 state returned by
`thread_get_state`.  (To keep the code as simple as possible, it ignores all of
the possible error cases listed above.)

```c
#include <mach/mach.h>
#include <stdint.h>
#include <string.h>

// Fragment: assumes `thread` is a valid thread port for the target thread.
arm_sme_state_t sme_state; mach_msg_type_number_t sme_state_count = ARM_SME_STATE_COUNT;
// Read SVL_B and SVCR
thread_get_state(thread, ARM_SME_STATE, (thread_state_t)&sme_state, &sme_state_count);

const uint64_t SVCR_SM = (1 << 0);
// Are Z and P valid?
if (sme_state.__svcr & SVCR_SM) {
    size_t z_element_size = sme_state.__svl_b;
    char z[32][z_element_size];
    size_t p_element_size = sme_state.__svl_b / 8;
    char p[16][p_element_size];

    arm_sve_z_state_t z_state; mach_msg_type_number_t z_state_count = ARM_SVE_Z_STATE_COUNT;
    // Read Z0-Z15 and copy active bits
    thread_get_state(thread, ARM_SVE_Z_STATE1, (thread_state_t)&z_state, &z_state_count);
    for (int i = 0; i < 16; i++) {
       memcpy(z[i], z_state.__z[i], z_element_size);
    }
    // Read Z16-Z31 and copy active bits
    thread_get_state(thread, ARM_SVE_Z_STATE2, (thread_state_t)&z_state, &z_state_count);
    for (int i = 0; i < 16; i++) {
       memcpy(z[i + 16], z_state.__z[i], z_element_size);
    }

    arm_sve_p_state_t p_state; mach_msg_type_number_t p_state_count = ARM_SVE_P_STATE_COUNT;
    // Read P0-P15 and copy active bits
    thread_get_state(thread, ARM_SVE_P_STATE, (thread_state_t)&p_state, &p_state_count);
    for (int i = 0; i < 16; i++) {
       memcpy(p[i], p_state.__p[i], p_element_size);
    }
}

const uint64_t SVCR_ZA = (1 << 1);
// Are ZA and ZT0 valid?
if (sme_state.__svcr & SVCR_ZA) {
    size_t za_size = sme_state.__svl_b * sme_state.__svl_b;
    char za[za_size];
    const size_t zt0_size = 64;
    char zt0[zt0_size];

    // ZA is exposed in windows of at most 4096 bytes, one per flavor
    const size_t max_chunk_size = 4096;
    int n_chunks; size_t chunk_size;
    if (za_size <= max_chunk_size) {
        n_chunks = 1;
        chunk_size = za_size;
    } else {
        n_chunks = za_size / max_chunk_size;
        chunk_size = max_chunk_size;
    }

    for (int i = 0; i < n_chunks; i++) {
        arm_sme_za_state_t za_state; mach_msg_type_number_t za_state_count = ARM_SME_ZA_STATE_COUNT;
        // Read next chunk of ZA
        thread_get_state(thread, ARM_SME_ZA_STATE1 + i, (thread_state_t)&za_state, &za_state_count);
        memcpy(&za[chunk_size * i], za_state.__za, chunk_size);
    }

    // Read ZT0 (always 64 bytes, independent of SVL)
    arm_sme2_state_t sme2_state; mach_msg_type_number_t sme2_state_count = ARM_SME2_STATE_COUNT;
    thread_get_state(thread, ARM_SME2_STATE, (thread_state_t)&sme2_state, &sme2_state_count);
    memcpy(zt0, sme2_state.__zt0, zt0_size);
}
```


Footnotes
---------

<a name="feat_sve_footnote"></a>1. For simplicity, this section describes the
behavior on Apple CPUs.  Details like register length and accessibility may
depend on whether the CPU is in streaming SVE mode (described later in the
document).  Apple's current SME implementation simply makes SVE features
inaccessible outside this mode.

<a name="feat_sme_fa64_footnote"></a>2. The optional CPU feature FEAT_SME_FA64
allows full use of the SIMD instruction set inside streaming SVE mode.
However xnu does not currently support any CPUs which implement FEAT_SME_FA64.

<a name="cpacr_zen_footnote"></a>3. `CPACR_EL1` and `CPTR_ELx` also have a
discrete trap control `ZEN` for SVE instruction and register accesses performed
outside streaming SVE mode.  This trap control isn't currently relevant to Apple
CPUs, since Apple's current SME implementation only allows SVE accesses inside
streaming SVE mode.

<a name="xnu_simd_footnote"></a>4. LLVM is surprisingly aggressive about
emitting SIMD instructions unless explicitly inhibited by compiler flags.  Even
if the xnu build started inhibiting these instructions for targets that support
SME, they could still appear in existing kext binaries.
398