xref: /xnu-11417.101.15/doc/observability/mt_stackshot.md (revision e3723e1f17661b24996789d8afc084c0c3303b26)
# Multithreaded Stackshot

Stackshot has been retrofitted to take advantage of multiple CPUs. This document
details the design of multithreaded stackshot.

## Terminology

- **Initiating / Calling CPU**: The CPU which stackshot was called from.
- **Main CPU**: The CPU which populates workqueues and collects global state.
- **Auxiliary CPU**: A CPU which is not the main CPU.
- **KCData**: The containerized data structure that stackshot outputs. See
  `osfmk/kern/kcdata.h` for more information.

## Overview

When a stackshot is taken, the initiating CPU (the CPU from which stackshot was
called) sets up state. Then, it enters the debugger trap, and IPIs the other
cores into the debugger trap as well. The other CPUs call into stackshot from
the debugger trap instead of spinning, and determine whether they are eligible
to work based on perfcontrol's recommendation. (We need to do this because even
if a CPU is derecommended due to thermal limits or otherwise, it will still be
IPI'd into the debugger trap, and we want to avoid overheating the CPU.)

On AMP systems, a suitable P-core is chosen to be the “main” CPU; it begins
populating queues of tasks to be put into the stackshot and collecting bits of
global state. (On SMP systems, the initiating CPU is always assigned to be the
main CPU.)

The other CPUs begin chipping away at the queues, and the main CPU joins
in once it is done populating them. Once all CPUs are finished, they exit the
debugger trap, interrupts are re-enabled, and the kcdata from all of the CPUs
is collated together by the calling CPU. The output is identical to that of
single-threaded stackshot.

It is important to note that since stackshot happens outside of the context of
the scheduler and with interrupts disabled, it does not use "actual" threads to
do its work - each CPU has its own execution context and no context switching
occurs. Nothing else runs on the system while a stackshot is happening; this
allows stackshot to grab an atomic snapshot of the entire system's state.

## Work Queues

In order to split up work between CPUs, each task is put into a workqueue for
CPUs to pull from. On SMP systems, there is only one queue. On AMP systems,
there are two, and tasks are sorted between the queues based on their
"difficulty" (i.e. the number of threads they have). E-cores will work on the
easier queue first, and P-cores will work on the harder queue first. Once a CPU
finishes with its first queue, it will move on to the other.

If latency collection is enabled, each CPU will record information about its run
in a `stackshot_latency_cpu` structure in the KCData. This includes information
such as the amount of time spent waiting for the queue and the number of tasks /
threads processed by the CPU during its run.

## Buffers and Memory

Stackshot is given a fixed-size buffer upfront since it cannot allocate any
memory for itself. The size-estimation logic in multithreaded stackshot is
improved over that of single-threaded stackshot: it uses heuristics such as the
number of tasks and threads on the system, the flags passed, the sizes of data
structures, and a fudge factor to produce a reasonable buffer-size estimate.
Should the buffer be too small, stackshot will try again with a bigger one. The
number of tries is recorded in the `stackshot_latency_collection_v2` struct if
latency collection is enabled.

### Bump Allocator

Stackshot uses a basic per-cluster bump allocator to allocate space within the
buffer. Each cluster gets its own bump allocator to mitigate cache contention,
with space split evenly between the clusters. If a cluster runs out of buffer
space, it can reach into other clusters for more.

Memory that is freed is put into a per-cluster freelist. Even if the data was
originally allocated from a different cluster's buffer, it will be put into the
current cluster's freelist (again, to reduce cache effects). The freelist is a
last resort, and is only used once the current cluster's buffer space fills.

Each CPU will report information about its buffers in its
`stackshot_latency_cpu` struct. This includes the total amount of buffer space
used and the amount of buffer space allocated from other clusters.

### Linked-List kcdata

Each CPU needs its own kcdata descriptor, but we don't know exactly how big each
one should be ahead of time. Because of this, we allocate kcdata buffers in
reasonably-sized chunks as we need them. We also want the output to have each
task in order (to keep the output identical to single-threaded stackshot), so we
maintain a linked list of these kcdata chunks for each task in the queue.

The chunks are sized such that only one is needed for the average task. If we
have any extra room at the end of the current chunk once we finish with a task,
we can add it to the freelist - but this is not ideal. So, stackshot uses
various heuristics, including the flags and current task / thread counts, to
estimate a good chunk size. The amount of memory added to the freelist is
reported as a named uint64 in the KCData (`stackshot_buf_overhead`).

```
 Workqueue

⎡ Task #1 ⎤
⎢  CPU 0  ⎥
⎣ kcdata* ⎦-->[ KCData A ]--[ KCData B ]
⎡ Task #2 ⎤
⎢  CPU 1  ⎥
⎣ kcdata* ⎦-->[ KCData C ]
⎡ Task #3 ⎤
⎢  CPU 2  ⎥
⎣ kcdata* ⎦-->[ KCData D ]--[ KCData E ]--[ KCData F ]
    ...
```

Once the stackshot is finished and interrupts are re-enabled, this data is woven
back together into a single KCData buffer by the initiating thread, such that it
is indistinguishable from the output of a single-threaded stackshot (essentially,
we memcpy the contents of each kcdata chunk into a single buffer, stripping off
the headers and footers).

## “Tracing”

In debug and development builds, stackshot takes a "trace" of itself during
execution. There are circular per-CPU buffers containing a list of tracepoints,
each consisting of a timestamp, a line number, and an arbitrary uintptr_t-sized
piece of extra data. This allows for basic tracing of stackshot's execution on
each CPU, which can be inspected from a debugger.

By default, tracepoints are only emitted when stackshot runs into an error (with
the error number as the data), but it's trivial to add more with the
`STACKSHOT_TRACE(data)` macro.

An lldb macro is in the works which will allow this data to be examined more
easily, but for now, it can be inspected in lldb with `showpcpu -V
stackshot_trace_buffer`.

## Panics

During a panic stackshot, stackshot behaves essentially as it did before (with a
single CPU/thread), with the only difference being that we can now take a
stackshot if the system panicked during a stackshot, since state has been
compartmentalized. If the system panics during a panic stackshot, another
stackshot will not be taken.

Since stackshot takes place entirely from within the debugger trap, if an
auxiliary CPU (i.e. a CPU other than the one which initiated the stackshot)
panics, it will not be able to acquire the debugger lock, since that lock is
already held by the initiating CPU. To mitigate this, when a CPU panics during a
stackshot, it sets a flag in stackshot's state to indicate there was a panic by
calling into `stackshot_cpu_signal_panic`.

There are checks for this flag at various points in stackshot, and once a CPU
notices it is set, it will spin in place. Before the initiating CPU spins in
place, it will release the debugger lock. Once all CPUs are spinning, the panic
will continue.

## Future Work

- It might be more elegant to give stackshot its own IPI flavor instead of
  piggybacking on the debugger trap.
- The tracing buffer isn't easily inspected - an LLDB macro to walk the circular
  buffer and print a trace would be helpful.
- Chunk size is currently static for the entire stackshot - instead of
  estimating it once, we could estimate it for every task to further eliminate
  overhead.