ARM Scalable Matrix Extension
=============================

Managing hardware resources related to SME state.

Introduction
------------

This document describes how xnu manages the hardware resources associated with
ARM's Scalable Matrix Extension (SME).

SME is an ARMv9 extension intended to accelerate matrix math operations. SME
builds on top of ARM's previous Scalable Vector Extension (SVE), which extends
the length of the FPSIMD register files and adds new 1D vector-math
instructions. SME extends SVE by adding a matrix register file and associated
2D matrix-math instructions. SME2 further extends SME with additional
instructions and register state.

This document summarizes SVE, SME, and SME2 hardware features that are relevant
to xnu. It is not intended as a full programming guide for SVE or SME: readers
may find a full description of these ISAs in the
[SVE supplement to the ARM ARM](https://developer.arm.com/documentation/ddi0584/latest/)
and [SME supplement to the ARM ARM](https://developer.arm.com/documentation/ddi0616/latest/),
respectively.



Hardware overview
-----------------

### EL0-accessible state

SVE, SME, and SME2 introduce four new EL0-accessible register
files<sup>[1](#feat_sve_footnote)</sup>:

- vector registers `Z0`-`Z31`
- predicate registers `P0`-`P15`
- matrix data `ZA` (SME/SME2 only)
- look-up table `ZT0` (SME2 only)

These register files are unbanked, i.e., their contents are shared across all
exception levels. Data can be copied between these registers and system memory
using specialized `ldr` and `str` variants. SME also adds `mov` variants that
can directly copy data between the vector and matrix register files.

Most of these register files supplement, rather than replace, the existing ARM
register files. However the `Z` register file effectively extends the length of
the existing FPSIMD `V` register file. Instructions targeting the `V` register
file will now access the lower 128 bits of the corresponding `Z` register.

The size of most of these files is defined by the *streaming vector length*
(SVL), a power-of-two between 128 and 2048 inclusive. Each `Z` register is SVL
bits in size; each `P` register is SVL / 8 bits in size; and `ZA` is
(SVL / 8) x (SVL / 8) bytes in size. The value of SVL is determined by both
hardware and software. Hardware places an implementation-defined cap on SVL,
and privileged software can further reduce SVL for itself and lower exception
levels.

In contrast, `ZT0` is fixed at 512 bits, independent of SVL.

SME also adds a single EL0-accessible system register `TPIDR2_EL0`. Like
`TPIDR_EL0`, `TPIDR2_EL0` is officially reserved for ABI use, but its contents
have no particular meaning to hardware.

### `PSTATE` changes

SME adds two orthogonal states to `PSTATE`.

`PSTATE.SM` moves the CPU in and out of a special execution mode called
*streaming SVE mode*. Software must enter streaming SVE mode to execute most
SME instructions. However software must then exit streaming SVE mode to execute
many legacy SIMD instructions<sup>[2](#feat_sme_fa64_footnote)</sup>. To make
things even more complicated, these transitions cause the CPU to zero out the
`V`/`Z` and `P` register files, and to set all `FPSR` flags. When software
needs to retain this state across `PSTATE.SM` transitions, it must manually
stash the state in memory.

`PSTATE.ZA` independently controls whether the contents of `ZA` and `ZT0` are
valid. Setting `PSTATE.ZA` zeroes out both register files, and enables
instructions that access them. Clearing `PSTATE.ZA` causes `ZA` and `ZT0`
accesses to trap.

Most SME instructions require both `PSTATE.SM` and `PSTATE.ZA` to be
set, so software usually toggles both bits at the same time. However setting
these bits independently can be useful when software needs to interleave SME and
FPSIMD instructions. If software needs to temporarily exit streaming SVE mode
to execute FPSIMD instructions, setting `PSTATE.{SM,ZA} = {0,1}` will do so
without clobbering the `ZA` or `ZT0` array.

`PSTATE.{SM,ZA} = {0,0}` acts as a hint to the CPU that it may power down
SME-related hardware. Hence software should clear these bits as soon as
SME state can be discarded.
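
The `PSTATE.SM` and `PSTATE.ZA` rules above can be summarized as a small
software model. The following sketch is not xnu code: `sme_model_t`,
`model_write_svcr()`, and `model_may_power_down()` are hypothetical names, the
register files are shrunk to a few bytes, and the `FPSR` side effect is
omitted. It only illustrates which state is lost on each transition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SVCR_SM (UINT64_C(1) << 0)
#define SVCR_ZA (UINT64_C(1) << 1)

/* Hypothetical toy model; real register files are far larger. */
typedef struct {
    uint64_t svcr;        /* PSTATE.SM in bit 0, PSTATE.ZA in bit 1 */
    uint8_t  z_p[8];      /* stand-in for the V/Z and P register files */
    uint8_t  za_zt0[8];   /* stand-in for ZA and ZT0 */
} sme_model_t;

static void
model_write_svcr(sme_model_t *cpu, uint64_t new_svcr)
{
    /* The model only knows about the SM and ZA bits. */
    assert((new_svcr & ~(SVCR_SM | SVCR_ZA)) == 0);

    uint64_t changed = cpu->svcr ^ new_svcr;

    /* Any PSTATE.SM transition, in either direction, zeroes Z and P. */
    if (changed & SVCR_SM) {
        memset(cpu->z_p, 0, sizeof(cpu->z_p));
    }
    /* Setting PSTATE.ZA (0 -> 1) zeroes ZA and ZT0. */
    if ((changed & SVCR_ZA) && (new_svcr & SVCR_ZA)) {
        memset(cpu->za_zt0, 0, sizeof(cpu->za_zt0));
    }
    cpu->svcr = new_svcr;
}

/* PSTATE.{SM,ZA} = {0,0} is the power-down hint. */
static bool
model_may_power_down(const sme_model_t *cpu)
{
    return (cpu->svcr & (SVCR_SM | SVCR_ZA)) == 0;
}
```

In this model, writing `SVCR_ZA` while both bits are set corresponds to
`PSTATE.{SM,ZA} = {0,1}`: the `Z`/`P` stand-in is cleared, while the `ZA`/`ZT0`
stand-in survives.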

These `PSTATE` bits are accessible to software in several ways:

- Reads or writes to the `SVCR` system register, which packs both bits into
  a single register
- Writes to the `SVCRSM`, `SVCRZA`, or `SVCRSMZA` system registers with the
  immediate values `0` or `1`, which directly modify the specified bit(s)
- `sm{start,stop} (sm|za)` pseudo-instructions, which are assembler aliases for
  the above `msr` instructions

Regardless of which method is used to access these bits, software generally does
not need explicit barriers. Specifically, ARM guarantees that all direct and
indirect reads from these bits will appear in program order relative to any
direct writes.

### Other hardware resources

An implementation may share SME compute resources across multiple CPUs. In this
case, the per-CPU `SMPRI_EL1` controls the relative priority of the SME
instructions issued by that CPU. ARM guarantees that higher `SMPRI_EL1` values
indicate higher priorities, and that setting `SMPRI_EL1 = 0` on all CPUs is a
safe way to disable SME prioritization. Otherwise the exact meaning of
`SMPRI_EL1` is implementation-defined.

EL2 may trap guest reads and writes to `SMPRI_EL1` using the fine-grained trap
controls `HFGRTR_EL2.nSMPRI_EL1` and `HFGWTR_EL2.nSMPRI_EL1`, respectively.
Alternatively, EL2 may adjust the effective SME priority at EL0 and EL1 without
trapping, by populating the lookup table register `SMPRIMAP_EL2` and setting the
control bit `HCRX_EL2.SMPME`. When `HCRX_EL2.SMPME` is set, SME instructions
executed at EL0 and EL1 will interpret `SMPRI_EL1` as an index into
`SMPRIMAP_EL2` rather than as a raw priority value.

`SMIDR_EL1` advertises hardware properties about the SME implementation,
including whether SME execution priority is implemented.

`CPACR_EL1` and `CPTR_ELx` have controls that can trap SVE and SME operations.
Two of these are relevant to Apple's SME
implementation<sup>[3](#cpacr_zen_footnote)</sup>:

- `SMEN`: trap SME instructions and register accesses, including SVE
  instructions executed during streaming SVE mode.
- `FPEN`: trap FPSIMD, SME, and SVE instructions and most register accesses, but
  *not* `SVCR` accesses. Lower priority than `SMEN`.
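
The relative priority of the two trap controls above can be sketched as a
simple decision function. This is an illustrative model only: `sketch_op_t`,
`sketch_trap_t`, and `sketch_classify_trap()` are hypothetical names, the
multi-bit hardware fields are reduced to "traps enabled" booleans, and the
model assumes that a direct `SVCR` access counts as an SME register access for
the purposes of `SMEN`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical operation classes, for illustration only. */
typedef enum {
    OP_FPSIMD,        /* legacy FPSIMD instruction */
    OP_SME_OR_SSVE,   /* SME instruction, or SVE instruction in streaming SVE mode */
    OP_SVCR_ACCESS,   /* direct read or write of SVCR (assumed covered by SMEN) */
} sketch_op_t;

typedef enum { TRAP_NONE, TRAP_SMEN, TRAP_FPEN } sketch_trap_t;

static sketch_trap_t
sketch_classify_trap(sketch_op_t op, bool smen_traps, bool fpen_traps)
{
    bool sme_class = (op == OP_SME_OR_SSVE) || (op == OP_SVCR_ACCESS);

    /* Every operation in this model is either FPSIMD or SME-class. */
    assert(op == OP_FPSIMD || sme_class);

    /* SMEN is checked first because it has the higher priority. */
    if (sme_class && smen_traps) {
        return TRAP_SMEN;
    }
    /* FPEN covers every class in this model except SVCR accesses. */
    if (fpen_traps && op != OP_SVCR_ACCESS) {
        return TRAP_FPEN;
    }
    return TRAP_NONE;
}
```

For example, with both controls configured to trap, an SME instruction reports
the `SMEN` trap rather than the `FPEN` trap, while an `SVCR` access with only
`FPEN` trapping proceeds untrapped.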

Several SME registers aren't affected by these controls, since they have their
own trapping mechanisms. `SMPRI_EL1` has fine-grained hypervisor trap controls
as described above. `SMIDR_EL1` accesses can trap to the hypervisor using the
existing `HCR_EL2.TID1` control bit. Finally `TPIDR2_EL0` has a dedicated
control bit `SCTLR_ELx.EnTP2` along with fine-grained trap controls
`HFG{R,W}TR_EL2.nTPIDR2_EL0`.


Software usage
--------------

### SME `PSTATE` management

xnu has in-kernel SIMD instructions<sup>[4](#xnu_simd_footnote)</sup> which
become illegal while the CPU is in streaming SVE mode. This poses a problem if
xnu interrupts EL0 while it is in the middle of executing SME-accelerated code.

Hence, anytime xnu enters the kernel with `PSTATE.SM` set, it saves the current
`Z`, `P`, and `SVCR` values and then clears `PSTATE.SM`. xnu later restores
these values during kernel exit. These operations occur in an assembly-only
module (`locore.s`) where we have strict control over code generation, and can
guarantee that no problematic SIMD instructions are executed while `PSTATE.SM`
is set.

Since the kernel may interrupt *itself*, kernel code is forbidden from entering
streaming SVE mode. This policy means that xnu does not need to preserve
`TPIDR2_EL0`, `ZA`, or `ZT0` during kernel entry and exit, since there are no
in-kernel SME operations that could clobber them.

### Context switching

xnu saves and restores `TPIDR2_EL0`, `ZA`, and `ZT0` inside the ARM64
implementation of `machine_switch_context()`, specifically as the routines
`machine_{save,restore}_sme_context()` in `osfmk/arm64/pcb.c`. These in turn
build on lower-level routines to save and load SME register state, located in
`osfmk/arm64/sme.c`. The low-level routines are built on top of the SME `str`
and `ldr` instructions, which can be executed outside of streaming SVE mode.

`machine_{save,restore}_sme_context()` unconditionally save and restore
`TPIDR2_EL0`, since its contents are valid even when EL0 isn't actually using
SME. However `ZA`'s and `ZT0`'s contents are often invalid and hence do not
require context-switching. `machine_save_sme_context()` reads `SVCR.ZA`
to determine if the `ZA` and `ZT0` arrays were actually valid at context-switch
time. If not, it skips saving the invalid `ZA` and `ZT0` contents.
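
The save-side policy described above can be sketched in a few lines of C. This
is not the code in `osfmk/arm64/pcb.c`: the `sketch_*` types and function are
hypothetical, the live CPU state is passed in as a plain struct instead of
being read with system-register and `str` instructions, and the sketch assumes
that only the `ZA` bit of the saved `SVCR` is updated at this point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SVCR_ZA   (UINT64_C(1) << 1)
#define ZT0_BYTES 64    /* ZT0 is fixed at 512 bits */

/* Hypothetical saved-state layout, not xnu's actual structure. */
typedef struct {
    uint64_t svcr;
    uint64_t tpidr2_el0;
    uint8_t  zt0[ZT0_BYTES];
    size_t   za_bytes;
    uint8_t *za;
} sketch_sme_state_t;

/* Hypothetical snapshot of the live CPU state at context-switch time. */
typedef struct {
    uint64_t       svcr;
    uint64_t       tpidr2_el0;
    const uint8_t *za;
    const uint8_t *zt0;
} sketch_cpu_t;

/* Returns true if ZA and ZT0 were live and therefore copied. */
static bool
sketch_save_sme_context(sketch_sme_state_t *saved, const sketch_cpu_t *cpu)
{
    /* TPIDR2_EL0 is always valid, so it is always saved. */
    saved->tpidr2_el0 = cpu->tpidr2_el0;

    /* Assumption: record only the ZA bit; the SM bit was saved on kernel entry. */
    saved->svcr = (saved->svcr & ~SVCR_ZA) | (cpu->svcr & SVCR_ZA);

    if (!(cpu->svcr & SVCR_ZA)) {
        return false;   /* ZA and ZT0 are invalid: nothing to copy */
    }

    assert(saved->za != NULL);
    memcpy(saved->za, cpu->za, saved->za_bytes);
    memcpy(saved->zt0, cpu->zt0, ZT0_BYTES);
    return true;
}
```

The early return is the point of the sketch: a thread that is not using `ZA`
pays only for the `TPIDR2_EL0` copy.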

Likewise, when context-switching back to a thread where the saved-state
`SVCR.ZA` is cleared, `machine_restore_sme_context()` simply ensures that the
CPU's `PSTATE.ZA` bit is cleared (executing `smstop za` if necessary). xnu does
not need to manually invalidate any `ZA` or `ZT0` state left by a previous
thread: the next time `PSTATE.ZA` is enabled, the CPU is architecturally
guaranteed to zero out both register files.

As noted above, xnu saves `SVCR` on kernel entry and uses it to restore
`PSTATE.SM` on kernel exit. Hence `machine_restore_sme_context()` updates
`PSTATE.ZA` to match the new process's saved state, but doesn't update
`PSTATE.SM`. Likewise `machine_restore_sme_context()` doesn't manipulate the `Z`
or `P` register files, since these will be updated on kernel exit.

Since SME thread state (`thread->machine.usme`) is large, and won't be used by
most threads, xnu lazily allocates the backing memory the first time a thread
encounters an SME instruction. This is implemented by clearing `CPACR_EL1.SMEN`
inside `machine_restore_sme_context()`, then performing the allocation during
the subsequent SME trap.
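
The lazy-allocation scheme can be illustrated with a minimal user-space sketch.
Everything here is hypothetical: `sketch_thread_t` stands in for the thread's
machine state, `calloc` stands in for the kernel allocator, and the sketch
assumes that the `SMEN` trap is armed only for threads that have no SME state
yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-thread SME state pointer. */
typedef struct {
    void  *usme;        /* NULL until the thread first uses SME */
    size_t usme_bytes;
} sketch_thread_t;

/* Context-switch side: should SME accesses trap for this thread? */
static bool
sketch_sme_should_trap(const sketch_thread_t *thread)
{
    return thread->usme == NULL;
}

/*
 * Trap side: allocate the backing memory on first use.
 * Returns true if the thread may retry the faulting instruction.
 */
static bool
sketch_handle_sme_trap(sketch_thread_t *thread, size_t state_bytes)
{
    assert(state_bytes > 0);

    if (thread->usme == NULL) {
        thread->usme = calloc(1, state_bytes);
        if (thread->usme == NULL) {
            return false;   /* allocation failed */
        }
        thread->usme_bytes = state_bytes;
    }
    return true;
}
```

Threads that never execute an SME instruction never reach the trap handler, so
they never pay for the allocation.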

### Execution priority

xnu does not currently have an API for changing SME execution priority.
Accordingly xnu resets `SMPRI_EL1` to `0` during CPU initialization, and
otherwise does not modify it at runtime.

### Power management

xnu updates `PSTATE.ZA` during `machine_switch_sme_context()` using the `SVCR`
value stashed in the new thread's SME state. If the new process has never used
SME, and hence doesn't have saved `ZA` state, xnu unconditionally clears
`PSTATE.ZA`. This policy means that xnu issues the power-down hint
`PSTATE.{SM,ZA} = {0,0}` on every context-switch, unless the new thread has live
`ZA` state. (Recall that `PSTATE.SM` was previously cleared on kernel entry.)

By extension, xnu will always issue this hint before entering WFI. In order to
reach `arm64_retention_wfi()`, xnu must first context-switch to the idle thread,
which never has `ZA` state.

### Virtualizing SME

SME introduces a number of new registers that the hypervisor needs to manage.
`SMCR_ELx` is the only one of these that's banked between EL1 and EL2. The
`SVCR`, `SMPRI_EL1`, and `TPIDR2_EL0` system registers are all shared between
the host and guest, and must be managed by the host hypervisor accordingly.

More critically, the `Z`, `P`, `ZA`, and `ZT0` register files are also shared
across all exception levels. To minimize the cost of managing this unbanked SME
register state, xnu tries to keep the guest matrix state resident in the CPU as
long as possible, even when the guest traps to EL2. xnu will only spill the `ZA`
and `ZT0` state back to memory when one of two things happens:

(1) The `hv_vcpu_run` trap handler returns control all the way back to the VMM
    thread at host EL0

(2) xnu needs to context-switch the host VMM thread that owns the vCPU

In these cases xnu will spill the guest `ZA` and `ZT0` state back to memory,
then replace them with the VMM thread's or new thread's state (respectively).

Unfortunately since xnu has to disable streaming SVE mode to handle traps, it's
still forced to spill `Z` and `P` state to memory anytime the guest traps to EL2
with `PSTATE.SM` set.


Since xnu doesn't currently support SME prioritization, it sets `HCRX_EL2.SMPME`
and populates all `SMPRIMAP_EL2` entries with a value of `0`. Guest OSes are
still allowed to write to `SMPRI_EL1`, but currently this has no effect on
the actual hardware priority.


Appendix: Mach thread-state APIs
--------------------------------

Low-level tools (e.g., debuggers) may access thread SVE and SME state through
the standard Mach APIs `thread_{get,set}_state`. But because SVE and SME
register state are large and have implementation-defined size, accessing this
state can be more complicated than other thread state flavors.

xnu splits the SVE and SME thread state into several flavors:

| Flavor                                       | C thread-state type   | Description               |
|----------------------------------------------|-----------------------|---------------------------|
| `ARM_SME_STATE`                              | `arm_sme_state_t`     | SVCR, TPIDR2_EL0, and SVL |
| `ARM_SVE_Z_STATE1`, `ARM_SVE_Z_STATE2`       | `arm_sve_z_state_t`   | Z register file           |
| `ARM_SVE_P_STATE`                            | `arm_sve_p_state_t`   | P register file           |
| `ARM_SME_ZA_STATE1` ... `ARM_SME_ZA_STATE16` | `arm_sme_za_state_t`  | ZA register file          |
| `ARM_SME2_STATE`                             | `arm_sme2_state_t`    | ZT0 register file         |

`arm_sve_z_state_t`, `arm_sve_p_state_t`, and `arm_sme_za_state_t` are
statically sized for a vector length of 2048 bits, the largest vector length
allowed by the ARM architecture. In practice, all Apple CPUs currently use a
smaller vector length. In this case `thread_get_state` will pad the unused
upper bits of each `z`, `p`, and `za` field with zeroes. Likewise,
`thread_set_state` will ignore any unused upper bits.

`Z` can architecturally be up to 8 kilobytes in size. Since this is too large
to fit in a single Mach message, xnu's Mach thread-state APIs divide the `Z`
register space into two different thread-state flavors. Thread-state flavor
`ARM_SVE_Z_STATE1` accesses Z0-Z15, and thread-state flavor `ARM_SVE_Z_STATE2`
accesses Z16-Z31.

xnu likewise divides `ZA` into 4-kilobyte windows. Thread-state flavor
`ARM_SME_ZA_STATE1` accesses the first 4 kilobytes of ZA space,
`ARM_SME_ZA_STATE2` accesses the next 4 kilobytes of ZA space, and so on up to
`ARM_SME_ZA_STATE16`. When `ZA` is smaller than 4 kilobytes, `thread_get_state`
will pad the unused upper bytes of `arm_sme_za_state_t::za` with zeroes, and
`thread_set_state` will ignore any unused upper bytes.

`thread_{get,set}_state` will return `KERN_INVALID_ARGUMENT` if software tries
to do any of the following:

* Access SME or SME2 state on a CPU that doesn't implement FEAT_SME or FEAT_SME2
  (respectively)
* Access `Z` or `P` state when the target thread's `SVCR.SM` bit is cleared
* Access `ZA` or `ZT0` state when the target thread's `SVCR.ZA` bit is cleared
* Change the current `svl` value while setting `ARM_SME_STATE`

xnu does not currently support sending SME or SVE thread state with Mach
exception messages. Mach APIs that set exception ports, such as
`thread_set_exception_ports`, will return `KERN_INVALID_ARGUMENT` if the
requested `flavor` is one of the values described in this appendix.

### Sample code

The following C code illustrates how to interpret SME and SME2 state returned by
`thread_get_state`. (To keep the code as simple as possible, it ignores all of
the possible error cases listed above.)

```c
// `thread` is the Mach port of the target thread.
arm_sme_state_t sme_state; mach_msg_type_number_t sme_state_count = ARM_SME_STATE_COUNT;
// Read SVL_B and SVCR
thread_get_state(thread, ARM_SME_STATE, (thread_state_t)&sme_state, &sme_state_count);

const uint64_t SVCR_SM = (1 << 0);
// Are Z and P valid?
if (sme_state.__svcr & SVCR_SM) {
    size_t z_element_size = sme_state.__svl_b;
    char z[32][z_element_size];
    size_t p_element_size = sme_state.__svl_b / 8;
    char p[16][p_element_size];

    arm_sve_z_state_t z_state; mach_msg_type_number_t z_state_count = ARM_SVE_Z_STATE_COUNT;
    // Read Z0-Z15 and copy active bits
    thread_get_state(thread, ARM_SVE_Z_STATE1, (thread_state_t)&z_state, &z_state_count);
    for (int i = 0; i < 16; i++) {
        memcpy(z[i], z_state.__z[i], z_element_size);
    }
    // Read Z16-Z31 and copy active bits
    thread_get_state(thread, ARM_SVE_Z_STATE2, (thread_state_t)&z_state, &z_state_count);
    for (int i = 0; i < 16; i++) {
        memcpy(z[i + 16], z_state.__z[i], z_element_size);
    }

    arm_sve_p_state_t p_state; mach_msg_type_number_t p_state_count = ARM_SVE_P_STATE_COUNT;
    // Read P0-P15 and copy active bits
    thread_get_state(thread, ARM_SVE_P_STATE, (thread_state_t)&p_state, &p_state_count);
    for (int i = 0; i < 16; i++) {
        memcpy(p[i], p_state.__p[i], p_element_size);
    }
}

const uint64_t SVCR_ZA = (1 << 1);
// Are ZA and ZT0 valid?
if (sme_state.__svcr & SVCR_ZA) {
    size_t za_size = sme_state.__svl_b * sme_state.__svl_b;
    char za[za_size];
    const size_t zt0_size = 64;
    char zt0[zt0_size];

    const size_t max_chunk_size = 4096;
    int n_chunks; size_t chunk_size;
    if (za_size <= max_chunk_size) {
        n_chunks = 1;
        chunk_size = za_size;
    } else {
        n_chunks = za_size / max_chunk_size;
        chunk_size = max_chunk_size;
    }

    for (int i = 0; i < n_chunks; i++) {
        arm_sme_za_state_t za_state; mach_msg_type_number_t za_state_count = ARM_SME_ZA_STATE_COUNT;
        // Read next chunk of ZA
        thread_get_state(thread, ARM_SME_ZA_STATE1 + i, (thread_state_t)&za_state, &za_state_count);
        memcpy(&za[chunk_size * i], za_state.__za, chunk_size);
    }

    arm_sme2_state_t sme2_state; mach_msg_type_number_t sme2_state_count = ARM_SME2_STATE_COUNT;
    // Read ZT0
    thread_get_state(thread, ARM_SME2_STATE, (thread_state_t)&sme2_state, &sme2_state_count);
    memcpy(zt0, sme2_state.__zt0, zt0_size);
}
```


Footnotes
---------

<a name="feat_sve_footnote"></a>1. For simplicity, this section describes the
behavior on Apple CPUs. Details like register length and accessibility may
depend on whether the CPU is in streaming SVE mode (described later in the
document). Apple's current SME implementation simply makes SVE features
inaccessible outside this mode.

<a name="feat_sme_fa64_footnote"></a>2. The optional CPU feature FEAT_SME_FA64
allows full use of the SIMD instruction set inside streaming SVE mode.
However xnu does not currently support any CPUs which implement FEAT_SME_FA64.

<a name="cpacr_zen_footnote"></a>3. `CPACR_EL1` and `CPTR_ELx` also have a
discrete trap control `ZEN` for SVE instruction and register accesses performed
outside streaming SVE mode. This trap control isn't currently relevant to Apple
CPUs, since Apple's current SME implementation only allows SVE accesses inside
streaming SVE mode.

<a name="xnu_simd_footnote"></a>4. LLVM is surprisingly aggressive about
emitting SIMD instructions unless explicitly inhibited by compiler flags. Even
if the xnu build started inhibiting these instructions for targets that support
SME, they could still appear in existing kext binaries.