ARM Scalable Matrix Extension
=============================

Managing hardware resources related to SME state.

Introduction
------------

This document describes how xnu manages the hardware resources associated with
ARM's Scalable Matrix Extension (SME).

SME is an ARMv9 extension intended to accelerate matrix math operations.  SME
builds on top of ARM's previous Scalable Vector Extension (SVE), which extends
the length of the FPSIMD register files and adds new 1D vector-math
instructions.  SME extends SVE by adding a matrix register file and associated
2D matrix-math instructions.  SME2 further extends SME with additional
instructions and register state.

This document summarizes SVE, SME, and SME2 hardware features that are relevant
to xnu.  It is not intended as a full programming guide for SVE or SME: readers
may find a full description of these ISAs in the
[SVE supplement to the ARM ARM](https://developer.arm.com/documentation/ddi0584/latest/)
and [SME supplement to the ARM ARM](https://developer.arm.com/documentation/ddi0616/latest/),
respectively.



Hardware overview
-----------------

### EL0-accessible state

SVE, SME, and SME2 introduce four new EL0-accessible register
files<sup>[1](#feat_sve_footnote)</sup>:

- vector registers `Z0`-`Z31`
- predicate registers `P0`-`P15`
- matrix data `ZA` (SME/SME2 only)
- look-up table `ZT0` (SME2 only)

These register files are unbanked, i.e., their contents are shared across all
exception levels.  Data can be copied between these registers and system memory
using specialized `ldr` and `str` variants.  SME also adds `mov` variants that
can directly copy data between the vector and matrix register files.

Most of these register files supplement, rather than replace, the existing ARM
register files.  However, the `Z` register file effectively extends the length
of the existing FPSIMD `V` register file.  Instructions targeting the `V`
register file will now access the lower 128 bits of the corresponding `Z`
register.

The size of most of these files is defined by the *streaming vector length*
(SVL), a power-of-two between 128 and 2048 inclusive.  Each `Z` register is SVL
bits in size; each `P` register is SVL / 8 bits in size; and `ZA` is a square
array of (SVL / 8) x (SVL / 8) bytes.  The value of SVL is determined by both
hardware and software.  Hardware places an implementation-defined cap on SVL,
and privileged software can further reduce SVL for itself and lower exception
levels.

In contrast, `ZT0` is fixed at 512 bits, independent of SVL.
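
The size rules above reduce to simple arithmetic.  The following is a minimal
sketch rather than xnu code: `struct sme_sizes`, `svl_is_valid()`, and
`sme_sizes_for_svl()` are hypothetical names introduced only for illustration.
`ZA` is taken to be (SVL / 8) x (SVL / 8) bytes, which matches the
`__svl_b * __svl_b` computation in the appendix sample code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper type (not part of xnu): register-file sizes in bytes. */
struct sme_sizes {
    size_t z_reg_bytes;   /* one Z register: SVL bits */
    size_t p_reg_bytes;   /* one P register: SVL / 8 bits */
    size_t za_bytes;      /* ZA array: (SVL / 8) x (SVL / 8) bytes */
    size_t zt0_bytes;     /* ZT0: fixed 512 bits, independent of SVL */
};

/* SVL must be a power of two between 128 and 2048 bits inclusive. */
static bool
svl_is_valid(unsigned svl_bits)
{
    return svl_bits >= 128 && svl_bits <= 2048 &&
           (svl_bits & (svl_bits - 1)) == 0;
}

static struct sme_sizes
sme_sizes_for_svl(unsigned svl_bits)
{
    assert(svl_is_valid(svl_bits));
    size_t svl_bytes = svl_bits / 8;
    struct sme_sizes s = {
        .z_reg_bytes = svl_bytes,
        .p_reg_bytes = svl_bytes / 8,
        .za_bytes    = svl_bytes * svl_bytes,
        .zt0_bytes   = 512 / 8,
    };
    return s;
}
```

For example, an SVL of 512 bits gives 64-byte `Z` registers, 8-byte `P`
registers, and a 4096-byte `ZA` array, while the architectural maximum of 2048
bits gives a 64-kilobyte `ZA` array.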

SME also adds a single EL0-accessible system register `TPIDR2_EL0`.  Like
`TPIDR_EL0`, `TPIDR2_EL0` is officially reserved for ABI use, but its contents
have no particular meaning to hardware.

### `PSTATE` changes

SME adds two orthogonal states to `PSTATE`.

`PSTATE.SM` moves the CPU in and out of a special execution mode called
*streaming SVE mode*.  Software must enter streaming SVE mode to execute most
SME instructions.  However software must then exit streaming SVE mode to execute
many legacy SIMD instructions<sup>[2](#feat_sme_fa64_footnote)</sup>.  To make
things even more complicated, these transitions cause the CPU to zero out the
`V`/`Z` and `P` register files, and to set all `FPSR` flags.  When software
needs to retain this state across `PSTATE.SM` transitions, it must manually
stash the state in memory.

`PSTATE.ZA` independently controls whether the contents of `ZA` and `ZT0` are
valid.  Setting `PSTATE.ZA` zeroes out both register files, and enables
instructions that access them.  Clearing `PSTATE.ZA` causes `ZA` and `ZT0`
accesses to trap.

Most SME instructions require both `PSTATE.SM` and `PSTATE.ZA` to be
set, so software usually toggles both bits at the same time.  However setting
these bits independently can be useful when software needs to interleave SME and
FPSIMD instructions.  If software needs to temporarily exit streaming SVE mode
to execute FPSIMD instructions, setting `PSTATE.{SM,ZA} = {0,1}` will do so
without clobbering the `ZA` or `ZT0` array.

`PSTATE.{SM,ZA} = {0,0}` acts as a hint to the CPU that it may power down
SME-related hardware.  Hence software should clear these bits as soon as
SME state can be discarded.

These `PSTATE` bits are accessible to software in several ways:

- Reads or writes to the `SVCR` system register, which packs both bits into
  a single register
- Writes to the `SVCRSM`, `SVCRZA`, or `SVCRSMZA` system registers with the
  immediate values `0` or `1`, which directly modify the specified bit(s)
- `sm{start,stop} (sm|za)` pseudo-instructions, which are assembler aliases for
  the above `msr` instructions

Regardless of which method is used to access these bits, software generally does
not need explicit barriers.  Specifically, ARM guarantees that all direct and
indirect reads from these bits will appear in program order relative to any
direct writes.
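
As an illustration of how the two bits combine, the sketch below decodes a
64-bit `SVCR` value in plain C.  The bit positions (`SM` in bit 0, `ZA` in
bit 1) are the ones used by the sample code in the appendix; the helper names
are hypothetical and do not come from xnu.  The code only models the register
value and does not touch hardware state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SVCR_SM (UINT64_C(1) << 0)   /* streaming SVE mode enabled */
#define SVCR_ZA (UINT64_C(1) << 1)   /* ZA and ZT0 contents valid */

/* Hypothetical helpers operating on a copy of SVCR. */
static bool
svcr_in_streaming_mode(uint64_t svcr)
{
    return (svcr & SVCR_SM) != 0;
}

static bool
svcr_za_enabled(uint64_t svcr)
{
    return (svcr & SVCR_ZA) != 0;
}

/* PSTATE.{SM,ZA} = {0,0} is the power-down hint described above. */
static bool
svcr_is_power_down_hint(uint64_t svcr)
{
    return (svcr & (SVCR_SM | SVCR_ZA)) == 0;
}

/*
 * Models leaving streaming SVE mode to run FPSIMD code: only the SM bit is
 * cleared, so a set ZA bit (and therefore ZA/ZT0 contents) is preserved.
 */
static uint64_t
svcr_exit_streaming_mode(uint64_t svcr)
{
    return svcr & ~SVCR_SM;
}
```

Starting from `SVCR_SM | SVCR_ZA`, `svcr_exit_streaming_mode()` yields a value
with only `SVCR_ZA` set, which corresponds to the `PSTATE.{SM,ZA} = {0,1}`
configuration discussed earlier.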

### Other hardware resources

An implementation may share SME compute resources across multiple CPUs.  In this
case, the per-CPU `SMPRI_EL1` controls the relative priority of the SME
instructions issued by that CPU.  ARM guarantees that higher `SMPRI_EL1` values
indicate higher priorities, and that setting `SMPRI_EL1 = 0` on all CPUs is a
safe way to disable SME prioritization.  Otherwise the exact meaning of
`SMPRI_EL1` is implementation-defined.

EL2 may trap guest reads and writes to `SMPRI_EL1` using the fine-grained trap
controls `HFGRTR_EL2.nSMPRI_EL1` and `HFGWTR_EL2.nSMPRI_EL1`, respectively.
Alternatively, EL2 may adjust the effective SME priority at EL0 and EL1 without
trapping, by populating the lookup table register `SMPRIMAP_EL2` and setting the
control bit `HCRX_EL2.SMPME`.  When `HCRX_EL2.SMPME` is set, SME instructions
executed at EL0 and EL1 will interpret `SMPRI_EL1` as an index into
`SMPRIMAP_EL2` rather than as a raw priority value.
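
The priority-mapping behavior can be modeled as a table lookup.  The sketch
below is a software model only.  It assumes that the priority occupies the low
four bits of `SMPRI_EL1` and that `SMPRIMAP_EL2` packs sixteen 4-bit entries
with entry 0 in the least-significant bits; neither layout is stated in this
document, so check them against the SME supplement before relying on them.
The function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the effective SME priority seen at EL0/EL1.
 * Assumed layouts (not taken from this document):
 *   - SMPRI_EL1 priority field: bits [3:0]
 *   - SMPRIMAP_EL2: sixteen 4-bit entries, entry N at bits [4N+3:4N]
 */
static unsigned
effective_sme_priority(uint64_t smpri_el1, uint64_t smprimap_el2, bool smpme)
{
    unsigned raw = (unsigned)(smpri_el1 & 0xF);

    if (!smpme) {
        /* Mapping disabled: SMPRI_EL1 is used as a raw priority value. */
        return raw;
    }
    /* Mapping enabled: SMPRI_EL1 selects an entry in SMPRIMAP_EL2. */
    return (unsigned)((smprimap_el2 >> (4 * raw)) & 0xF);
}
```

With every `SMPRIMAP_EL2` entry set to `0` and mapping enabled, the model
returns `0` regardless of the `SMPRI_EL1` value.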

`SMIDR_EL1` advertises hardware properties about the SME implementation,
including whether SME execution priority is implemented.

`CPACR_EL1` and `CPTR_ELx` have controls that can trap SVE and SME operations.
Two of these are relevant to Apple's SME
implementation<sup>[3](#cpacr_zen_footnote)</sup>:

- `SMEN`: trap SME instructions and register accesses, including SVE
  instructions executed during streaming SVE mode.
- `FPEN`: trap FPSIMD, SME, and SVE instructions and most register accesses, but
  *not* `SVCR` accesses.  Lower priority than `SMEN`.

Several SME registers aren't affected by these controls, since they have their
own trapping mechanisms.  `SMPRI_EL1` has fine-grained hypervisor trap controls
as described above.  `SMIDR_EL1` accesses can trap to the hypervisor using the
existing `HCR_EL2.TID1` control bit.  Finally `TPIDR2_EL0` has a dedicated
control bit `SCTLR_ELx.EnTP2` along with fine-grained trap controls
`HFG{R,W}TR_EL2.TPIDR2_EL0`.


Software usage
--------------

### SME `PSTATE` management

xnu has in-kernel SIMD instructions<sup>[4](#xnu_simd_footnote)</sup> which
become illegal while the CPU is in streaming SVE mode.  This poses a problem if
xnu interrupts EL0 while it is in the middle of executing SME-accelerated code.

Hence, anytime xnu enters the kernel with `PSTATE.SM` set, it saves the current
`Z`, `P`, and `SVCR` values and then clears `PSTATE.SM`.  xnu later restores
these values during kernel exit.  These operations occur in an assembly-only
module (`locore.s`) where we have strict control over code generation, and can
guarantee that no problematic SIMD instructions are executed while `PSTATE.SM`
is set.

Since the kernel may interrupt *itself*, kernel code is forbidden from entering
streaming SVE mode.  This policy means that xnu does not need to preserve
`TPIDR2_EL0`, `ZA`, or `ZT0` during kernel entry and exit, since there are no
in-kernel SME operations that could clobber them.

### Context switching

xnu saves and restores `TPIDR2_EL0`, `ZA`, and `ZT0` inside the ARM64
implementation of `machine_switch_context()`, specifically as the routines
`machine_{save,restore}_sme_context()` in `osfmk/arm64/pcb.c`.  These in turn
build on lower-level routines to save and load SME register state, located in
`osfmk/arm64/sme.c`.  The low-level routines are built on top of the SME `str`
and `ldr` instructions, which can be executed outside of streaming SVE mode.

`machine_{save,restore}_sme_context()` unconditionally save and restore
`TPIDR2_EL0`, since its contents are valid even when EL0 isn't actually using
SME.  However `ZA`'s and `ZT0`'s contents are often invalid and hence do not
require context-switching.  `machine_save_sme_context()` reads `SVCR.ZA`
to determine if the `ZA` and `ZT0` arrays were actually valid at context-switch
time.  If not, it skips saving the invalid `ZA` and `ZT0` contents.

Likewise, when context-switching back to a thread where the saved-state
`SVCR.ZA` is cleared, `machine_restore_sme_context()` simply ensures that the
CPU's `PSTATE.ZA` bit is cleared (executing `smstop za` if necessary).  xnu does
not need to manually invalidate any `ZA` or `ZT0` state left by a previous
thread: the next time `PSTATE.ZA` is enabled, the CPU is architecturally
guaranteed to zero out both register files.

As noted above, xnu saves `SVCR` on kernel entry and uses it to restore
`PSTATE.SM` on kernel exit.  Hence `machine_restore_sme_context()` updates
`PSTATE.ZA` to match the new thread's saved state, but doesn't update
`PSTATE.SM`.  Likewise `machine_restore_sme_context()` doesn't manipulate the `Z`
or `P` register files, since these will be updated on kernel exit.
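
The save and restore policy described above can be summarized with a small
software model.  This is a sketch under stated assumptions, not the code in
`osfmk/arm64/pcb.c`: the `model_*` types and functions are hypothetical, the
CPU is represented by an ordinary struct, `ZA` is shrunk to a 16-byte
stand-in, and `ZT0` is omitted for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SVCR_SM (UINT64_C(1) << 0)
#define SVCR_ZA (UINT64_C(1) << 1)
#define MODEL_ZA_BYTES 16            /* tiny stand-in for the real ZA array */

/* Hypothetical stand-in for the live CPU registers. */
struct model_cpu {
    uint64_t svcr;
    uint64_t tpidr2_el0;
    uint8_t  za[MODEL_ZA_BYTES];
};

/* Hypothetical stand-in for a thread's saved SME state. */
struct model_sme_state {
    uint64_t tpidr2_el0;
    bool     za_valid;               /* mirrors SVCR.ZA at save time */
    uint8_t  za[MODEL_ZA_BYTES];
};

static void
model_save_sme_context(const struct model_cpu *cpu, struct model_sme_state *st)
{
    st->tpidr2_el0 = cpu->tpidr2_el0;        /* always valid, always saved */
    st->za_valid = (cpu->svcr & SVCR_ZA) != 0;
    if (st->za_valid) {
        memcpy(st->za, cpu->za, sizeof(st->za));
    }                                         /* else: skip invalid ZA */
}

static void
model_restore_sme_context(struct model_cpu *cpu, const struct model_sme_state *st)
{
    cpu->tpidr2_el0 = st->tpidr2_el0;
    if (st->za_valid) {
        cpu->svcr |= SVCR_ZA;
        memcpy(cpu->za, st->za, sizeof(cpu->za));
    } else {
        /* Models `smstop za`; stale ZA bytes are deliberately left alone. */
        cpu->svcr &= ~SVCR_ZA;
    }
    /* PSTATE.SM is not touched here; it is restored on kernel exit. */
}
```

The model never scrubs `ZA` when `za_valid` is false, because real hardware
zeroes `ZA` and `ZT0` the next time `PSTATE.ZA` is set.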

Since SME thread state (`thread->machine.usme`) is large, and won't be used by
most threads, xnu lazily allocates the backing memory the first time a thread
encounters an SME instruction.  This is implemented by clearing `CPACR_EL1.SMEN`
inside `machine_restore_sme_context()`, then performing the allocation during
the subsequent SME trap.
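
A minimal model of this lazy-allocation pattern follows.  It is a sketch, not
xnu's trap handler: the `model_*` names are hypothetical, `calloc()` stands in
for whatever kernel allocator xnu actually uses, and the handling of
allocation failure is an assumption, since this document does not describe it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-thread SME bookkeeping. */
struct model_thread {
    void *usme;              /* NULL until the thread first uses SME */
    bool  sme_trap_enabled;  /* models the SME trap control being armed */
};

/* On restore, arm the SME trap only for threads without SME state. */
static void
model_restore_arm_trap(struct model_thread *t)
{
    t->sme_trap_enabled = (t->usme == NULL);
}

/*
 * Called when the thread traps on its first SME instruction.  Returns true
 * if the instruction can be retried, false if the allocation failed.
 */
static bool
model_handle_sme_trap(struct model_thread *t, size_t state_bytes)
{
    if (t->usme == NULL) {
        t->usme = calloc(1, state_bytes);
        if (t->usme == NULL) {
            return false;
        }
    }
    t->sme_trap_enabled = false;   /* later SME instructions run untrapped */
    return true;
}
```

Threads that never execute an SME instruction never reach
`model_handle_sme_trap()`, so they never pay for the allocation.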

### Execution priority

xnu does not currently have an API for changing SME execution priority.
Accordingly xnu resets `SMPRI_EL1` to `0` during CPU initialization, and
otherwise does not modify it at runtime.

### Power management

xnu updates `PSTATE.ZA` during `machine_switch_sme_context()` using the `SVCR`
value stashed in the new thread's SME state.  If the new thread has never used
SME, and hence doesn't have saved `ZA` state, xnu unconditionally clears
`PSTATE.ZA`.  This policy means that xnu issues the power-down hint
`PSTATE.{SM,ZA} = {0,0}` on every context-switch, unless the new thread has live
`ZA` state.  (Recall that `PSTATE.SM` was previously cleared on kernel entry.)

By extension, xnu will always issue this hint before entering WFI.  In order to
reach `arm64_retention_wfi()`, xnu must first context-switch to the idle thread,
which never has `ZA` state.

### Virtualizing SME

SME introduces a number of new registers that the hypervisor needs to manage.
`SMCR_ELx` is the only one of these that's banked between EL1 and EL2.  The
`SVCR`, `SMPRI_EL1`, and `TPIDR2_EL0` system registers are all shared between
the host and guest, and must be managed by the host hypervisor accordingly.

More critically, the `Z`, `P`, `ZA`, and `ZT0` register files are also shared
across all exception levels.  To minimize the cost of managing this unbanked SME
register state, xnu tries to keep the guest matrix state resident in the CPU as
long as possible, even when the guest traps to EL2.  xnu will only spill the `ZA`
and `ZT0` state back to memory when one of two things happens:

(1) The `hv_vcpu_run` trap handler returns control all the way back to the VMM
    thread at host EL0

(2) xnu needs to context-switch the host VMM thread that owns the vCPU

In these cases xnu will spill the guest `ZA` and `ZT0` state back to memory,
then replace them with the VMM thread's or new thread's state (respectively).

Unfortunately since xnu has to disable streaming SVE mode to handle traps, it's
still forced to spill `Z` and `P` state to memory anytime the guest traps to EL2
with `PSTATE.SM` set.


Since xnu doesn't currently support SME prioritization, it sets `HCRX_EL2.SMPME`
and populates all `SMPRIMAP_EL2` entries with a value of `0`.  Guest OSes are
still allowed to write to `SMPRI_EL1`, but currently this has no effect on
the actual hardware priority.


Appendix: Mach thread-state APIs
--------------------------------

Low-level tools (e.g., debuggers) may access thread SVE and SME state through
the standard Mach APIs `thread_{get,set}_state`.  But because SVE and SME
register state are large and have implementation-defined size, accessing this
state can be more complicated than other thread state flavors.

xnu splits the SVE and SME thread state into several flavors:

| Flavor                                       | C thread-state type   | Description               |
|----------------------------------------------|-----------------------|---------------------------|
| `ARM_SME_STATE`                              | `arm_sme_state_t`     | SVCR, TPIDR2_EL0, and SVL |
| `ARM_SVE_Z_STATE1`, `ARM_SVE_Z_STATE2`       | `arm_sve_z_state_t`   | Z register file           |
| `ARM_SVE_P_STATE`                            | `arm_sve_p_state_t`   | P register file           |
| `ARM_SME_ZA_STATE1` ... `ARM_SME_ZA_STATE16` | `arm_sme_za_state_t`  | ZA register file          |
| `ARM_SME2_STATE`                             | `arm_sme2_state_t`    | ZT0 register file         |

`arm_sve_z_state_t`, `arm_sve_p_state_t`, and `arm_sme_za_state_t` are
statically sized for a vector length of 2048 bits, the largest vector length
allowed by the ARM architecture.  In practice, all Apple CPUs currently use a
smaller vector length.  In this case `thread_get_state` will pad the unused
upper bits of each `z`, `p`, and `za` field with zeroes.  Likewise,
`thread_set_state` will ignore any unused upper bits.

`Z` can architecturally be up to 8 kilobytes in size.  Since this is too large
to fit in a single Mach message, xnu's Mach thread-state APIs divide the `Z`
register space into two different thread-state flavors.  Thread-state flavor
`ARM_SVE_Z_STATE1` accesses Z0-Z15, and thread-state flavor `ARM_SVE_Z_STATE2`
accesses Z16-Z31.

xnu likewise divides `ZA` into 4-kilobyte windows.  Thread-state flavor
`ARM_SME_ZA_STATE1` accesses the first 4 kilobytes of ZA space,
`ARM_SME_ZA_STATE2` accesses the next 4 kilobytes of ZA space, and so on up to
`ARM_SME_ZA_STATE16`.  When `ZA` is smaller than 4 kilobytes, `thread_get_state`
will pad the unused upper bytes of `arm_sme_za_state_t::za` with zeroes, and
`thread_set_state` will ignore any unused upper bytes.
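
The windowing arithmetic can be sketched in portable C.  The helpers below are
hypothetical and use no Mach headers; `flavor_index` is zero-based, so index 0
corresponds to `ARM_SME_ZA_STATE1`.  The sketch works with flat byte offsets
into `ZA` and makes no claim about how rows and columns map onto those
offsets.

```c
#include <assert.h>
#include <stddef.h>

#define ZA_WINDOW_BYTES 4096u   /* size of one ZA thread-state window */

/* Hypothetical result type: which window holds a byte, and where. */
struct za_location {
    unsigned flavor_index;      /* 0 => ARM_SME_ZA_STATE1, 1 => ..._STATE2 */
    size_t   offset;            /* byte offset inside that window */
};

/* Number of ZA windows needed for a given SVL in bytes (svl_b). */
static unsigned
za_window_count(size_t svl_b)
{
    size_t za_bytes = svl_b * svl_b;
    return (unsigned)((za_bytes + ZA_WINDOW_BYTES - 1) / ZA_WINDOW_BYTES);
}

/* Map a flat byte offset within ZA to its window and in-window offset. */
static struct za_location
za_locate(size_t svl_b, size_t byte_offset)
{
    assert(byte_offset < svl_b * svl_b);
    struct za_location loc = {
        .flavor_index = (unsigned)(byte_offset / ZA_WINDOW_BYTES),
        .offset       = byte_offset % ZA_WINDOW_BYTES,
    };
    return loc;
}
```

For an SVL of 64 bytes the whole 4096-byte array fits in a single window; the
architectural maximum of 256 bytes needs all 16 windows.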

`thread_{get,set}_state` will return `KERN_INVALID_ARGUMENT` if software tries
to do any of the following:

* Access SME or SME2 state on a CPU that doesn't implement FEAT_SME or FEAT_SME2
  (respectively)
* Access `Z` or `P` state when the target thread's `SVCR.SM` bit is cleared
* Access `ZA` or `ZT0` state when the target thread's `SVCR.ZA` bit is cleared
* Change the current `svl` value while setting `ARM_SME_STATE`

xnu does not currently support sending SME or SVE thread state with Mach
exception messages.  Mach APIs that set exception ports, such as
`thread_set_exception_ports`, will return `KERN_INVALID_ARGUMENT` if the
requested `flavor` is one of the values described in this appendix.

### Sample code

The following C code illustrates how to interpret SME and SME2 state returned by
`thread_get_state`.  (To keep the code as simple as possible, it ignores all of
the possible error cases listed above.)

```c
#include <mach/mach.h>
#include <stdint.h>
#include <string.h>

// `read_sme_state` is an illustrative wrapper name; `thread` is the target
// thread's port.
static void
read_sme_state(thread_act_t thread)
{
    arm_sme_state_t sme_state;
    mach_msg_type_number_t sme_state_count = ARM_SME_STATE_COUNT;
    // Read SVL_B and SVCR
    thread_get_state(thread, ARM_SME_STATE, (thread_state_t)&sme_state, &sme_state_count);

    const uint64_t SVCR_SM = (1 << 0);
    // Are Z and P valid?
    if (sme_state.__svcr & SVCR_SM) {
        size_t z_element_size = sme_state.__svl_b;
        char z[32][z_element_size];
        size_t p_element_size = sme_state.__svl_b / 8;
        char p[16][p_element_size];

        arm_sve_z_state_t z_state;
        mach_msg_type_number_t z_state_count = ARM_SVE_Z_STATE_COUNT;
        // Read Z0-Z15 and copy active bits
        thread_get_state(thread, ARM_SVE_Z_STATE1, (thread_state_t)&z_state, &z_state_count);
        for (int i = 0; i < 16; i++) {
            memcpy(z[i], z_state.__z[i], z_element_size);
        }
        // Read Z16-Z31 and copy active bits
        thread_get_state(thread, ARM_SVE_Z_STATE2, (thread_state_t)&z_state, &z_state_count);
        for (int i = 0; i < 16; i++) {
            memcpy(z[i + 16], z_state.__z[i], z_element_size);
        }

        arm_sve_p_state_t p_state;
        mach_msg_type_number_t p_state_count = ARM_SVE_P_STATE_COUNT;
        // Read P0-P15 and copy active bits
        thread_get_state(thread, ARM_SVE_P_STATE, (thread_state_t)&p_state, &p_state_count);
        for (int i = 0; i < 16; i++) {
            memcpy(p[i], p_state.__p[i], p_element_size);
        }
    }

    const uint64_t SVCR_ZA = (1 << 1);
    // Are ZA and ZT0 valid?
    if (sme_state.__svcr & SVCR_ZA) {
        size_t za_size = sme_state.__svl_b * sme_state.__svl_b;
        char za[za_size];
        const size_t zt0_size = 64;
        char zt0[zt0_size];

        // ZA is exposed in windows of at most 4096 bytes, one per flavor
        const size_t max_chunk_size = 4096;
        int n_chunks;
        size_t chunk_size;
        if (za_size <= max_chunk_size) {
            n_chunks = 1;
            chunk_size = za_size;
        } else {
            n_chunks = za_size / max_chunk_size;
            chunk_size = max_chunk_size;
        }

        for (int i = 0; i < n_chunks; i++) {
            arm_sme_za_state_t za_state;
            mach_msg_type_number_t za_state_count = ARM_SME_ZA_STATE_COUNT;
            // Read next chunk of ZA
            thread_get_state(thread, ARM_SME_ZA_STATE1 + i, (thread_state_t)&za_state, &za_state_count);
            memcpy(&za[chunk_size * i], za_state.__za, chunk_size);
        }

        arm_sme2_state_t sme2_state;
        mach_msg_type_number_t sme2_state_count = ARM_SME2_STATE_COUNT;
        // Read ZT0 (fixed at 64 bytes, independent of SVL)
        thread_get_state(thread, ARM_SME2_STATE, (thread_state_t)&sme2_state, &sme2_state_count);
        memcpy(zt0, sme2_state.__zt0, zt0_size);
    }
}
```


Footnotes
---------

<a name="feat_sve_footnote"></a>1. For simplicity, this section describes the
behavior on Apple CPUs.  Details like register length and accessibility may
depend on whether the CPU is in streaming SVE mode (described later in the
document).  Apple's current SME implementation simply makes SVE features
inaccessible outside this mode.

<a name="feat_sme_fa64_footnote"></a>2. The optional CPU feature FEAT_SME_FA64
allows full use of the SIMD instruction set inside streaming SVE mode.
However xnu does not currently support any CPUs which implement FEAT_SME_FA64.

<a name="cpacr_zen_footnote"></a>3. `CPACR_EL1` and `CPTR_ELx` also have a
discrete trap control `ZEN` for SVE instruction and register accesses performed
outside streaming SVE mode.  This trap control isn't currently relevant to Apple
CPUs, since Apple's current SME implementation only allows SVE accesses inside
streaming SVE mode.

<a name="xnu_simd_footnote"></a>4. LLVM is surprisingly aggressive about
emitting SIMD instructions unless explicitly inhibited by compiler flags.  Even
if the xnu build started inhibiting these instructions for targets that support
SME, they could still appear in existing kext binaries.