ARM Scalable Matrix Extension
=============================

Managing hardware resources related to SME state.

Introduction
------------

This document describes how xnu manages the hardware resources associated with
ARM's Scalable Matrix Extension (SME).

SME is an ARMv9 extension intended to accelerate matrix math operations. SME
builds on top of ARM's previous Scalable Vector Extension (SVE), which extends
the length of the FPSIMD register files and adds new 1D vector-math
instructions. SME extends SVE by adding a matrix register file and associated
2D matrix-math instructions. SME2 further extends SME with additional
instructions and register state.

This document summarizes SVE, SME, and SME2 hardware features that are relevant
to xnu. It is not intended as a full programming guide for SVE or SME: readers
may find a full description of these ISAs in the
[SVE supplement to the ARM ARM](https://developer.arm.com/documentation/ddi0584/latest/)
and [SME supplement to the ARM ARM](https://developer.arm.com/documentation/ddi0616/latest/),
respectively.



Hardware overview
-----------------

### EL0-accessible state

SVE, SME, and SME2 introduce four new EL0-accessible register
files<sup>[1](#feat_sve_footnote)</sup>:

- vector registers `Z0`-`Z31`
- predicate registers `P0`-`P15`
- matrix data `ZA` (SME/SME2 only)
- look-up table `ZT0` (SME2 only)

These register files are unbanked, i.e., their contents are shared across all
exception levels. Data can be copied between these registers and system memory
using specialized `ldr` and `str` variants. SME also adds `mov` variants that
can directly copy data between the vector and matrix register files.

Most of these register files supplement, rather than replace, the existing ARM
register files. However the `Z` register file effectively extends the length of
the existing FPSIMD `V` register file.
Instructions targeting the `V` register
file will now access the lower 128 bits of the corresponding `Z` register.

The size of most of these files is defined by the *streaming vector length*
(SVL), a power-of-two between 128 and 2048 inclusive. Each `Z` register is SVL
bits in size; each `P` register is SVL / 8 bits in size; and `ZA` is
SVL x SVL / 8 bits in size (SVL / 8 rows of SVL bits each). The value of SVL is
determined by both hardware and software. Hardware places an
implementation-defined cap on SVL, and privileged software can further reduce
SVL for itself and lower exception levels.

In contrast, `ZT0` is fixed at 512 bits, independent of SVL.

SME also adds a single EL0-accessible system register `TPIDR2_EL0`. Like
`TPIDR_EL0`, `TPIDR2_EL0` is officially reserved for ABI use, but its contents
have no particular meaning to hardware.

### `PSTATE` changes

SME adds two orthogonal states to `PSTATE`.

`PSTATE.SM` moves the CPU in and out of a special execution mode called
*streaming SVE mode*. Software must enter streaming SVE mode to execute most
SME instructions. However software must then exit streaming SVE mode to execute
many legacy SIMD instructions<sup>[2](#feat_sme_fa64_footnote)</sup>. To make
things even more complicated, these transitions cause the CPU to zero out the
`V`/`Z` and `P` register files, and to set all `FPSR` flags. When software
needs to retain this state across `PSTATE.SM` transitions, it must manually
stash the state in memory.

`PSTATE.ZA` independently controls whether the contents of `ZA` and `ZT0` are
valid. Setting `PSTATE.ZA` zeroes out both register files, and enables
instructions that access them. Clearing `PSTATE.ZA` causes `ZA` and `ZT0`
accesses to trap.

Most SME instructions require both `PSTATE.SM` and `PSTATE.ZA` to be
set, so software usually toggles both bits at the same time.
However setting
these bits independently can be useful when software needs to interleave SME and
FPSIMD instructions. If software needs to temporarily exit streaming SVE mode
to execute FPSIMD instructions, setting `PSTATE.{SM,ZA} = {0,1}` will do so
without clobbering the `ZA` or `ZT0` array.

`PSTATE.{SM,ZA} = {0,0}` acts as a hint to the CPU that it may power down
SME-related hardware. Hence software should clear these bits as soon as
SME state can be discarded.

These `PSTATE` bits are accessible to software in several ways:

- Reads or writes to the `SVCR` system register, which packs both bits into
  a single register
- Writes to the `SVCRSM`, `SVCRZA`, or `SVCRSMZA` system registers with the
  immediate values `0` or `1`, which directly modify the specified bit(s)
- `sm{start,stop} (sm|za)` pseudo-instructions, which are assembler aliases for
  the above `msr` instructions

Regardless of which method is used to access these bits, software generally does
not need explicit barriers. Specifically, ARM guarantees that all direct and
indirect reads from these bits will appear in program order relative to any
direct writes.

### Other hardware resources

An implementation may share SME compute resources across multiple CPUs. In this
case, the per-CPU `SMPRI_EL1` controls the relative priority of the SME
instructions issued by that CPU. ARM guarantees that higher `SMPRI_EL1` values
indicate higher priorities, and that setting `SMPRI_EL1 = 0` on all CPUs is a
safe way to disable SME prioritization. Otherwise the exact meaning of
`SMPRI_EL1` is implementation-defined.

EL2 may trap guest reads and writes to `SMPRI_EL1` using the fine-grained trap
controls `HFGRTR_EL2.nSMPRI_EL1` and `HFGWTR_EL2.nSMPRI_EL1`, respectively.
Alternatively, EL2 may adjust the effective SME priority at EL0 and EL1 without
trapping, by populating the lookup table register `SMPRIMAP_EL2` and setting the
control bit `HCRX_EL2.SMPME`. When `HCRX_EL2.SMPME` is set, SME instructions
executed at EL0 and EL1 will interpret `SMPRI_EL1` as an index into
`SMPRIMAP_EL2` rather than as a raw priority value.

`SMIDR_EL1` advertises hardware properties about the SME implementation,
including whether SME execution priority is implemented.

`CPACR_EL1` and `CPTR_ELx` have controls that can trap SVE and SME operations.
Two of these are relevant to Apple's SME
implementation<sup>[3](#cpacr_zen_footnote)</sup>:

- `SMEN`: trap SME instructions and register accesses, including SVE
  instructions executed during streaming SVE mode.
- `FPEN`: trap FPSIMD, SME, and SVE instructions and most register accesses, but
  *not* `SVCR` accesses. Lower priority than `SMEN`.

Several SME registers aren't affected by these controls, since they have their
own trapping mechanisms. `SMPRI_EL1` has fine-grained hypervisor trap controls
as described above. `SMIDR_EL1` accesses can trap to the hypervisor using the
existing `HCR_EL2.TID1` control bit. Finally `TPIDR2_EL0` has a dedicated
control bit `SCTLR_ELx.EnTP2` along with fine-grained trap controls
`HFG{R,W}TR_EL2.nTPIDR2_EL0`.


Software usage
--------------

### SME `PSTATE` management

xnu has in-kernel SIMD instructions<sup>[4](#xnu_simd_footnote)</sup> which
become illegal while the CPU is in streaming SVE mode. This poses a problem if
xnu interrupts EL0 while it is in the middle of executing SME-accelerated code.

Hence, anytime xnu enters the kernel with `PSTATE.SM` set, it saves the current
`Z`, `P`, and `SVCR` values and then clears `PSTATE.SM`. xnu later restores
these values during kernel exit.
These operations occur in an assembly-only
module (`locore.s`) where we have strict control over code generation, and can
guarantee that no problematic SIMD instructions are executed while `PSTATE.SM`
is set.

Since the kernel may interrupt *itself*, kernel code is forbidden from entering
streaming SVE mode. This policy means that xnu does not need to preserve
`TPIDR2_EL0`, `ZA`, or `ZT0` during kernel entry and exit, since there are no
in-kernel SME operations that could clobber them.

### Context switching

xnu saves and restores `TPIDR2_EL0`, `ZA`, and `ZT0` inside the ARM64
implementation of `machine_switch_context()`, specifically as the routines
`machine_{save,restore}_sme_context()` in `osfmk/arm64/pcb.c`. These in turn
build on lower-level routines to save and load SME register state, located in
`osfmk/arm64/sme.c`. The low-level routines are built on top of the SME `str`
and `ldr` instructions, which can be executed outside of streaming SVE mode.

`machine_{save,restore}_sme_context()` unconditionally save and restore
`TPIDR2_EL0`, since its contents are valid even when EL0 isn't actually using
SME. However `ZA`'s and `ZT0`'s contents are often invalid and hence do not
require context-switching. `machine_save_sme_context()` reads `SVCR.ZA`
to determine if the `ZA` and `ZT0` arrays were actually valid at context-switch
time. If not, it skips saving the invalid `ZA` and `ZT0` contents.

Likewise, when context-switching back to a thread where the saved-state
`SVCR.ZA` is cleared, `machine_restore_sme_context()` simply ensures that the
CPU's `PSTATE.ZA` bit is cleared (executing `smstop za` if necessary). xnu does
not need to manually invalidate any `ZA` or `ZT0` state left by a previous
thread: the next time `PSTATE.ZA` is enabled, the CPU is architecturally
guaranteed to zero out both register files.
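
The save and restore policy described above can be sketched as a small
host-side simulation. This is not xnu's code: `sim_cpu`, `sim_sme_state`, and
all `sim_*` functions are hypothetical stand-ins that model the register files
as plain byte arrays, and `ZA_BYTES` is an illustrative buffer size rather than
a value derived from any particular SVL. Only the `SVCR` bit positions (SM in
bit 0, ZA in bit 1) and the 512-bit size of `ZT0` come from the architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* SVCR bit layout: SM is bit 0, ZA is bit 1. */
#define SVCR_SM (1ULL << 0)
#define SVCR_ZA (1ULL << 1)

#define ZA_BYTES  4096 /* illustrative size only (assumption) */
#define ZT0_BYTES 64   /* ZT0 is fixed at 512 bits */

/* Hypothetical model of one CPU's live SME-related state. */
struct sim_cpu {
    uint64_t svcr;
    uint64_t tpidr2;
    uint8_t  za[ZA_BYTES];
    uint8_t  zt0[ZT0_BYTES];
};

/* Hypothetical model of a thread's saved SME state. */
struct sim_sme_state {
    uint64_t svcr;
    uint64_t tpidr2;
    uint8_t  za[ZA_BYTES];
    uint8_t  zt0[ZT0_BYTES];
};

/* Models a PSTATE.ZA write: a 0 -> 1 transition zeroes ZA and ZT0. */
static void sim_set_pstate_za(struct sim_cpu *cpu, bool enable)
{
    assert(cpu != NULL);
    bool was_enabled = (cpu->svcr & SVCR_ZA) != 0;
    if (enable && !was_enabled) {
        memset(cpu->za, 0, sizeof(cpu->za));
        memset(cpu->zt0, 0, sizeof(cpu->zt0));
        cpu->svcr |= SVCR_ZA;
    } else if (!enable) {
        cpu->svcr &= ~SVCR_ZA; /* contents become inaccessible, not scrubbed */
    }
}

/* Saves outgoing state; returns true only if ZA and ZT0 were copied out. */
static bool sim_save_sme_context(const struct sim_cpu *cpu,
                                 struct sim_sme_state *ss)
{
    ss->tpidr2 = cpu->tpidr2; /* always valid, so always saved */
    /* Record the live ZA bit; leave the saved SM bit untouched. */
    ss->svcr = (ss->svcr & ~SVCR_ZA) | (cpu->svcr & SVCR_ZA);
    if ((cpu->svcr & SVCR_ZA) == 0) {
        return false; /* ZA and ZT0 are invalid: skip the copy */
    }
    memcpy(ss->za, cpu->za, sizeof(ss->za));
    memcpy(ss->zt0, cpu->zt0, sizeof(ss->zt0));
    return true;
}

/* Restores incoming state; never touches PSTATE.SM, Z, or P. */
static void sim_restore_sme_context(struct sim_cpu *cpu,
                                    const struct sim_sme_state *ss)
{
    cpu->tpidr2 = ss->tpidr2;
    if (ss->svcr & SVCR_ZA) {
        sim_set_pstate_za(cpu, true);
        memcpy(cpu->za, ss->za, sizeof(cpu->za));
        memcpy(cpu->zt0, ss->zt0, sizeof(cpu->zt0));
    } else {
        /* Equivalent of `smstop za`: stale data is left in place because
         * the next 0 -> 1 transition is guaranteed to zero it. */
        sim_set_pstate_za(cpu, false);
    }
}
```

In this sketch the expensive copies happen only when the ZA bit is set, which
mirrors why a thread that never enables `PSTATE.ZA` pays only for the
`TPIDR2_EL0` save and restore.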

As noted above, xnu saves `SVCR` on kernel entry and uses it to restore
`PSTATE.SM` on kernel exit. Hence `machine_restore_sme_context()` updates
`PSTATE.ZA` to match the new thread's saved state, but doesn't update
`PSTATE.SM`. Likewise `machine_restore_sme_context()` doesn't manipulate the `Z`
or `P` register files, since these will be updated on kernel exit.

Since SME thread state (`thread->machine.usme`) is large, and won't be used by
most threads, xnu lazily allocates the backing memory the first time a thread
encounters an SME instruction. This is implemented by clearing `CPACR_EL1.SMEN`
inside `machine_restore_sme_context()`, then performing the allocation during
the subsequent SME trap.

### Execution priority

xnu does not currently have an API for changing SME execution priority.
Accordingly xnu resets `SMPRI_EL1` to `0` during CPU initialization, and
otherwise does not modify it at runtime.

### Power management

xnu updates `PSTATE.ZA` during `machine_switch_sme_context()` using the `SVCR`
value stashed in the new thread's SME state. If the new thread has never used
SME, and hence doesn't have saved `ZA` state, xnu unconditionally clears
`PSTATE.ZA`. This policy means that xnu issues the power-down hint
`PSTATE.{SM,ZA} = {0,0}` on every context-switch, unless the new thread has live
`ZA` state. (Recall that `PSTATE.SM` was previously cleared on kernel entry.)

By extension, xnu will always issue this hint before entering WFI. In order to
reach `arm64_retention_wfi()`, xnu must first context-switch to the idle thread,
which never has `ZA` state.

### Virtualizing SME

SME introduces a number of new registers that the hypervisor needs to manage.
`SMCR_ELx` is the only one of these that's banked between EL1 and EL2.
The
`SVCR`, `SMPRI_EL1`, and `TPIDR2_EL0` system registers are all shared between
the host and guest, and must be managed by the host hypervisor accordingly.

More critically, the `Z`, `P`, `ZA`, and `ZT0` register files are also shared
across all exception levels. To minimize the cost of managing this unbanked SME
register state, xnu tries to keep the guest matrix state resident in the CPU as
long as possible, even when the guest traps to EL2. xnu will only spill the `ZA`
and `ZT0` state back to memory when one of two things happens:

(1) The `hv_vcpu_run` trap handler returns control all the way back to the VMM
    thread at host EL0

(2) xnu needs to context-switch the host VMM thread that owns the vCPU

In these cases xnu will spill the guest `ZA` and `ZT0` state back to memory,
then replace them with the VMM thread's or new thread's state (respectively).

Unfortunately since xnu has to disable streaming SVE mode to handle traps, it's
still forced to spill `Z` and `P` state to memory anytime the guest traps to EL2
with `PSTATE.SM` set.


Since xnu doesn't currently support SME prioritization, it sets `HCRX_EL2.SMPME`
and populates all `SMPRIMAP_EL2` entries with a value of `0`. Guest OSes are
still allowed to write to `SMPRI_EL1`, but currently this has no effect on
the actual hardware priority.



Footnotes
---------

<a name="feat_sve_footnote"></a>1. For simplicity, this section describes the
behavior on Apple CPUs. Details like register length and accessibility may
depend on whether the CPU is in streaming SVE mode (described later in the
document). Apple's current SME implementation simply makes SVE features
inaccessible outside this mode.

<a name="feat_sme_fa64_footnote"></a>2. The optional CPU feature FEAT_SME_FA64
allows full use of the SIMD instruction set inside streaming SVE mode.
However xnu does not currently support any CPUs which implement FEAT_SME_FA64.

<a name="cpacr_zen_footnote"></a>3. `CPACR_EL1` and `CPTR_ELx` also have a
discrete trap control `ZEN` for SVE instruction and register accesses performed
outside streaming SVE mode. This trap control isn't currently relevant to Apple
CPUs, since Apple's current SME implementation only allows SVE accesses inside
streaming SVE mode.

<a name="xnu_simd_footnote"></a>4. LLVM is surprisingly aggressive about
emitting SIMD instructions unless explicitly inhibited by compiler flags. Even
if the xnu build started inhibiting these instructions for targets that support
SME, they could still appear in existing kext binaries.