# Multithreaded Stackshot

Stackshot has been retrofitted to take advantage of multiple CPUs. This document
details the design of multithreaded stackshot.

## Terminology

- **Initiating / Calling CPU**: The CPU which stackshot was called from.
- **Main CPU**: The CPU which populates workqueues and collects global state.
- **Auxiliary CPU**: A CPU which is not the main CPU.
- **KCData**: The containerized data structure that stackshot outputs. See
  `osfmk/kern/kcdata.h` for more information.

## Overview

When a stackshot is taken, the initiating CPU (the CPU from which stackshot was
called) sets up state. Then, it enters the debugger trap, and IPIs the other
cores into the debugger trap as well. The other CPUs call into stackshot from
the debugger trap instead of spinning, and determine if they are eligible to
work based on perfcontrol's recommendation. (We need to do this because even if
a CPU is derecommended due to thermal limits or otherwise, it will still be
IPI'd into the debugger trap, and we want to avoid overheating the CPU).

On AMP systems, a suitable P-core is chosen to be the “main” CPU, and begins
populating queues of tasks to be put into the stackshot and collecting bits of
global state. (On SMP systems, the initiating CPU is always the main CPU.)

The other CPUs begin chipping away at the queues, and the main CPU joins
in once it is done populating them. Once all CPUs are finished, they exit the
debugger trap, interrupts are re-enabled, and the kcdata from all of the CPUs
is collated by the caller CPU. The output is identical to that of
single-threaded stackshot.

Since stackshot happens outside of the context of the scheduler and with
interrupts disabled, it does not use "actual" threads to do its work - each CPU
has its own execution context and no context switching occurs. Nothing else
runs on the system while a stackshot is happening; this allows stackshot to
grab an atomic snapshot of the entire system's state.

## Work Queues

In order to split up work between CPUs, each task is put into a workqueue for
CPUs to pull from. On SMP systems, there is only one queue. On AMP systems,
there are two, and tasks are sorted between the queues based on their
"difficulty" (i.e. the number of threads they have). E-cores will work on the
easier queue first, and P-cores will work on the harder queue first. Once a CPU
finishes with its first queue, it will move on to the other.

If latency collection is enabled, each CPU will record information about its run
in a `stackshot_latency_cpu` structure in the KCData. This includes information
such as the amount of time spent waiting for the queue and the number of tasks /
threads processed by the CPU during its run.

## Buffers and Memory

Stackshot is given a fixed-size buffer upfront since it cannot allocate any
memory for itself. The size estimation logic in multithreaded stackshot
improves on that of singlethreaded stackshot - it uses various heuristics such
as the number of tasks and threads on the system, the flags passed, sizes of
data structures, and a fudge factor to give a reasonable estimate for a buffer
size. Should the buffer be too small, stackshot will try again with a bigger
one. The number of tries is recorded in the `stackshot_latency_collection_v2`
struct if latency collection is enabled.
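
The estimate-then-retry behaviour described above can be sketched roughly as
follows. This is an illustration only, not the kernel's code: every name here
(`estimate_inputs`, `estimate_size`, `try_collect`, `collect_with_retry`) is
hypothetical, the 25% fudge factor is made up, and doubling the buffer on each
retry is an assumption - the document only says a "bigger" buffer is used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inputs to the estimate; the real heuristics are more involved. */
struct estimate_inputs {
    uint32_t task_count;
    uint32_t thread_count;
    size_t   per_task_bytes;    /* assumed average cost of one task record */
    size_t   per_thread_bytes;  /* assumed average cost of one thread record */
};

/* Estimate a buffer size, then pad it by an assumed 25% fudge factor. */
static size_t
estimate_size(const struct estimate_inputs *in)
{
    size_t base = (size_t)in->task_count * in->per_task_bytes +
        (size_t)in->thread_count * in->per_thread_bytes;
    return base + base / 4;
}

/* Stand-in for one collection pass: it fits only if the buffer is big enough. */
static bool
try_collect(size_t buffer_size, size_t needed)
{
    return buffer_size >= needed;
}

/*
 * Retry with a doubled buffer (an assumption) until the pass fits or
 * max_tries is exhausted. Returns the number of tries used, or 0 on failure.
 */
static uint32_t
collect_with_retry(const struct estimate_inputs *in, size_t needed,
    uint32_t max_tries, size_t *final_size)
{
    assert(final_size != NULL);
    size_t size = estimate_size(in);
    for (uint32_t tries = 1; tries <= max_tries; tries++) {
        if (try_collect(size, needed)) {
            *final_size = size;
            return tries;
        }
        size *= 2;
    }
    return 0;
}
```

The returned try count plays the role of the value that would be reported
through latency collection.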

### Bump Allocator

Stackshot uses a basic per-cluster bump allocator to allocate space within the
buffer. Each cluster gets its own bump allocator to mitigate cache contention,
with space split evenly between each cluster. If a cluster runs out of buffer
space, it can reach into other clusters for more.

Memory that is freed is put into a per-cluster freelist. Even if the data was
originally allocated from a different cluster's buffer, it will be put into the
current cluster's freelist (again, to reduce cache effects). The freelist is a
last resort, and is only used if the current cluster's buffer space fills.

Each CPU will report information about its buffers in its
`stackshot_latency_cpu` struct. This includes the total amount of buffer space
used and the amount of buffer space allocated from other clusters.

### Linked-List kcdata

Each CPU needs its own kcdata descriptor, but we don't know exactly how big each
one should be ahead of time. Because of this, we allocate kcdata buffers in
reasonably-sized chunks as we need them. We also want the output to have each
task in order (to keep the output identical to singlethreaded stackshot), so we
maintain a linked list of these kcdata chunks for each task in the queue.

The chunks are sized such that only one is needed for the average task. If we
have any extra room at the end of the current chunk once we finish with a task,
we can add it to the freelist - but this is not ideal. So, stackshot uses
various heuristics including flags and current task / thread counts to estimate
a good chunk size. The amount of memory added to the freelist is reported by
a named uint64 in the KCData (`stackshot_buf_overhead`).

```
 Workqueue

⎡ Task #1 ⎤
⎢  CPU 0  ⎥
⎣ kcdata* ⎦-->[ KCData A ]--[ KCData B ]
⎡ Task #2 ⎤
⎢  CPU 1  ⎥
⎣ kcdata* ⎦-->[ KCData C ]
⎡ Task #3 ⎤
⎢  CPU 2  ⎥
⎣ kcdata* ⎦-->[ KCData D ]--[ KCData E ]--[ KCData F ]
    ...
```

Once the stackshot is finished and interrupts are reenabled, this data is woven
back together into a single KCData buffer by the initiating thread, such that it
is indistinguishable from the output of a singlethreaded stackshot (essentially,
we memcpy the contents of each kcdata chunk into a single buffer, stripping off
the headers and footers).

## “Tracing”

In debug and development builds, Stackshot takes a "trace" of itself during
execution. There are circular per-cpu buffers containing a list of tracepoints,
which consist of a timestamp, line number, and an arbitrary uintptr_t-sized piece
of extra data. This allows for basic tracing of stackshot's execution on each
CPU which can be seen from a debugger.

By default, tracepoints are only emitted when stackshot runs into an error (with
the error number as the data), but it's trivial to add more with the
`STACKSHOT_TRACE(data)` macro.

An lldb macro is in the works which will allow this data to be examined more
easily, but for now, it can be examined in lldb with `showpcpu -V
stackshot_trace_buffer`.
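
To make the shape of such a buffer concrete, here is a minimal sketch of a
circular tracepoint buffer holding the three fields named above. It is not the
kernel's definition: the struct layout, the capacity, and all names
(`trace_buffer`, `trace_record`, `SKETCH_TRACE`, `trace_count`, `trace_get`)
are hypothetical, and the timestamp is passed in by the caller rather than read
from a hardware clock. In the per-CPU arrangement described above, there would
be one such buffer per CPU, so no locking is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_ENTRIES 8u  /* hypothetical capacity of one ring */

struct tracepoint {
    uint64_t  timestamp;
    uint32_t  line;
    uintptr_t data;
};

struct trace_buffer {
    struct tracepoint entries[TRACE_ENTRIES];
    uint64_t          next;  /* total number of tracepoints ever written */
};

/* Record one tracepoint, overwriting the oldest entry once the ring is full. */
static void
trace_record(struct trace_buffer *tb, uint64_t now, uint32_t line, uintptr_t data)
{
    struct tracepoint *tp = &tb->entries[tb->next % TRACE_ENTRIES];
    tp->timestamp = now;
    tp->line = line;
    tp->data = data;
    tb->next++;
}

/* Hypothetical convenience macro that captures the call site's line number. */
#define SKETCH_TRACE(tb, now, data) \
    trace_record((tb), (now), (uint32_t)__LINE__, (uintptr_t)(data))

/* Number of valid entries currently held in the ring. */
static size_t
trace_count(const struct trace_buffer *tb)
{
    return tb->next < TRACE_ENTRIES ? (size_t)tb->next : (size_t)TRACE_ENTRIES;
}

/* Fetch the i-th oldest valid entry (i = 0 is the oldest). */
static const struct tracepoint *
trace_get(const struct trace_buffer *tb, size_t i)
{
    assert(i < trace_count(tb));
    uint64_t start = tb->next < TRACE_ENTRIES ? 0 : tb->next - TRACE_ENTRIES;
    return &tb->entries[(start + i) % TRACE_ENTRIES];
}
```

Walking the ring from `trace_get(tb, 0)` up to `trace_count(tb) - 1` yields the
surviving tracepoints in the order they were written, which is what a debugger
helper would need to print a readable trace.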

## Panics

During a panic stackshot, stackshot behaves essentially as it did before (with
a single CPU/thread). The only difference is that we can now take a stackshot
if the system panicked during a stackshot, since state has been
compartmentalized. If the system panics during a panic stackshot, another
stackshot will not be taken.

Since stackshot takes place entirely from within the debugger trap, if an
auxiliary CPU (i.e. a CPU other than the one which initiated the stackshot)
panics, it will not be able to acquire the debugger lock since it is already
being held by the initiating CPU. To mitigate this, when a CPU panics during a
stackshot, it sets a flag in stackshot's state to indicate there was a panic by
calling into `stackshot_cpu_signal_panic`.

There are checks for this flag at various points in stackshot, and once a CPU
notices it is set, it will spin in place. Before the initiating CPU spins in
place, it will release the debugger lock. Once all CPUs are spinning, the panic
will continue.
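
The handshake above might look something like the following sketch, written
with C11 atomics. It is a simplified model under several assumptions, not the
kernel's implementation: all names (`sketch_state`, `sketch_signal_panic`,
`sketch_check_panic`, `sketch_panic_may_proceed`) are hypothetical, the debugger
lock is modelled as a plain boolean, and "spinning in place" is represented by a
counter of parked CPUs rather than an actual spin loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_state {
    atomic_bool panicked;         /* set by whichever CPU panics */
    atomic_int  cpus_parked;      /* CPUs that noticed the flag and stopped */
    atomic_bool debugger_locked;  /* stand-in for the debugger lock */
};

/* Called on the panicking CPU to tell the others to stop working. */
static void
sketch_signal_panic(struct sketch_state *s)
{
    atomic_store_explicit(&s->panicked, true, memory_order_release);
}

/*
 * Called between units of work. Returns true if this CPU must stop; the
 * caller would then spin in place. The initiating CPU drops the lock first
 * so that the panicking CPU is able to take it.
 */
static bool
sketch_check_panic(struct sketch_state *s, bool is_initiating_cpu)
{
    if (!atomic_load_explicit(&s->panicked, memory_order_acquire)) {
        return false;
    }
    if (is_initiating_cpu) {
        atomic_store_explicit(&s->debugger_locked, false, memory_order_release);
    }
    atomic_fetch_add_explicit(&s->cpus_parked, 1, memory_order_acq_rel);
    return true;
}

/* The panic may continue once every other CPU is parked and the lock is free. */
static bool
sketch_panic_may_proceed(struct sketch_state *s, int expected_parked)
{
    assert(expected_parked >= 0);
    return atomic_load(&s->cpus_parked) == expected_parked &&
        !atomic_load(&s->debugger_locked);
}
```

Polling a flag at well-defined points, instead of interrupting the other CPUs,
fits the constraint that everything is already running inside the debugger trap
with interrupts disabled.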

## Future Work

- It might be more elegant to give stackshot its own IPI flavor instead of
  piggybacking on the debugger trap.
- The tracing buffer isn't easily inspected - an LLDB macro to walk the circular
  buffer and print a trace would be helpful.
- Chunk size is currently static for the entire stackshot - instead of
  estimating it once, we could estimate it for every task to further eliminate
  overhead.