xref: /xnu-10002.81.5/doc/sched_clutch_edge.md (revision 5e3eaea39dcf651e66cb99ba7d70e32cc4a99587)
# Clutch Scheduler

## Background

The XNU kernel runs on a variety of platforms with strong requirements for being dynamic and efficient. It needs to deliver on a wide range of requirements, from quick access to the CPU for latency-sensitive workloads (e.g. UI interactions, multimedia recording/playback) to starvation avoidance for lower priority batch workloads (e.g. photos sync, source compilation). The traditional Mach scheduler attempts to achieve these goals by expecting all threads in the system to be tagged with a priority number, treating high priority threads as interactive threads and low priority threads as batch threads. It then uses a timesharing model based on priority decay to penalize threads as they use CPU, achieving fair sharing and starvation avoidance. This approach, however, loses the relationship between threads and higher level user workloads, making it impossible for the scheduler to reason about the workload as a whole, which is what the end user cares about. One artifact of this thread-based timesharing approach is that threads at the same priority level are treated similarly irrespective of which user workload they are servicing, which often leads to suboptimal decisions. It ultimately leads to priority inflation across the platform, with individual subsystems raising their priority to avoid starvation and timesharing with other unrelated threads. The traditional thread-level scheduling model also suffers from the following issues:

* **Inaccurate accounting**: CPU accounting at the thread level incentivizes creating more threads on the system. Also, in the world of GCD and workqueues, where threads are created and destroyed rapidly, thread-level accounting is inaccurate and allows excessive CPU usage.
* **Poor isolation**: In the Mach scheduler, timesharing is achieved by decaying the priority of threads depending on global system load. A burst of activity at the same or lower priority band can therefore cause decay for the App/UI thread, leading to poor performance and responsiveness. The scheduler offers very limited isolation between threads working on latency-sensitive UI workloads and threads performing bulk non-latency-sensitive operations.

The clutch scheduler is the timesharing algorithm for threads on a single cluster. The **Edge scheduler** extends the clutch scheduler design to support multiple clusters with different performance and efficiency characteristics. The Edge scheduler uses clutch timesharing per cluster and adds multi-cluster features such as thread placement, migration, and round-robin rotation.

## Clutch Scheduler Design

In order to reason about higher level user workloads, the clutch scheduler schedules groups of threads instead of individual threads. Breaking away from the traditional single-tier scheduling model, it implements a hierarchical scheduler which makes optimal decisions at various thread grouping levels. The hierarchical scheduler, as it is implemented today, has 3 levels:

* Scheduling Bucket Level
* Thread Group Level
* Thread Level

### Scheduling Bucket Level

The highest level is the scheduling bucket level, which decides which class of threads should be picked for execution. The kernel maintains a scheduling bucket per thread, defined based on the base/scheduling priority of the thread. These scheduling buckets roughly map to the QoS classes used by the OS runtime to define performance expectations for various pieces of work. All runnable threads with the same scheduling bucket are represented by a single entry at this level. These entries are known as *root buckets* throughout the implementation. The goal of this level is to provide low latency access to the CPU for high QoS classes while ensuring starvation avoidance for the low QoS classes. This level also maintains separate root buckets for threads bound to this hierarchy and cluster, which allows the scheduler to timeshare effectively between bound and unbound threads at various QoS levels.

**Implementation**

The scheduling bucket level uses an Earliest Deadline First (EDF) algorithm to decide which root bucket should be selected next for execution. Each root bucket with runnable threads is represented as an entry in a priority queue which is ordered by the bucket's deadline. The bucket selection algorithm simply selects the root bucket with the earliest deadline in the priority queue. The deadline for a root bucket is calculated based on its first-runnable timestamp and its **Worst Case Execution Latency (WCEL)** value, which is pre-defined for each bucket. The WCEL values are picked based on the decay curve followed by the Mach timesharing algorithm, allowing the system to behave similarly to the existing scheduler from a high-level perspective.

```
static uint32_t sched_clutch_root_bucket_wcel_us[TH_BUCKET_SCHED_MAX] = {
        SCHED_CLUTCH_INVALID_TIME_32,                   /* FIXPRI */
        0,                                              /* FG */
        37500,                                          /* IN (37.5ms) */
        75000,                                          /* DF (75ms) */
        150000,                                         /* UT (150ms) */
        250000                                          /* BG (250ms) */
};
```

Whenever a root bucket transitions from non-runnable to runnable, its deadline is set to `(now + WCEL[bucket])`. This ensures that the bucket will be scheduled within WCEL[bucket] even in a heavily loaded system. Once the root bucket is picked for execution, its deadline is pushed WCEL[bucket] into the future. This basic implementation of EDF suffers from one major issue: in a heavily loaded system, it is possible that a higher bucket has used enough CPU in the recent past that it is behind the lower buckets in deadline order. Now, if a small burst of user-critical workload shows up, the high bucket has to wait for the lower buckets to run before it can get the CPU, which might lead to performance issues. In order to address that, the bucket level scheduler implements a root bucket warp mechanism. Each bucket is provided a warp value which is refreshed whenever the bucket is selected due to its deadline expiring.

```
static uint32_t sched_clutch_root_bucket_warp_us[TH_BUCKET_SCHED_MAX] = {
        SCHED_CLUTCH_INVALID_TIME_32,                   /* FIXPRI */
        8000,                                           /* FG (8ms) */
        4000,                                           /* IN (4ms) */
        2000,                                           /* DF (2ms) */
        1000,                                           /* UT (1ms) */
        0                                               /* BG (0ms) */
};
```

The root bucket selection logic finds the earliest-deadline bucket and then checks whether any higher (in natural priority order) buckets have warp remaining. If there is such a higher bucket, it selects that bucket and effectively opens a warp window. During this warp window the scheduler continues to select the warping bucket over lower priority buckets. Once the warping bucket is drained or the warp window expires, the scheduler goes back to scheduling buckets in deadline order. This mechanism provides a bounded advantage to higher level buckets, allowing them to remain responsive in the presence of bursty workloads.
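The selection rule above can be sketched as follows. The structures and helper are simplified illustrations, not the actual xnu implementation (the real root buckets live in a deadline-ordered priority queue and warp windows are tracked explicitly); only the ordering logic is shown:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified root bucket state; the actual xnu structures differ. */
struct root_bucket {
        bool     runnable;
        uint64_t deadline;        /* absolute deadline (us): set to now + WCEL on wakeup */
        uint64_t warp_remaining;  /* warp budget left (us) */
};

/*
 * Pick the earliest-deadline runnable bucket, then allow any higher bucket
 * (lower index = higher natural priority) with warp budget remaining to be
 * chosen ahead of it, opening a warp window.
 */
static int
root_bucket_select(struct root_bucket *buckets, int n)
{
        int edf = -1;
        for (int i = 0; i < n; i++) {
                if (buckets[i].runnable &&
                    (edf == -1 || buckets[i].deadline < buckets[edf].deadline)) {
                        edf = i;
                }
        }
        for (int i = 0; i < edf; i++) {
                if (buckets[i].runnable && buckets[i].warp_remaining > 0) {
                        return i; /* warp ahead of the EDF choice */
                }
        }
        return edf;
}
```

In this sketch, a runnable FG bucket with warp budget is selected even when a lower bucket holds the earlier deadline; once its warp is exhausted, selection falls back to pure deadline order.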

The `FIXPRI` bucket is special-cased since it contains extremely latency sensitive threads. Since the priority ranges for the `FIXPRI (aka AboveUI)` and `FG Timeshare` buckets overlap, it is important to maintain some natural priority order between those buckets. The policy implemented here is to compare the highest clutch buckets of the two root buckets; if the AboveUI clutch bucket is higher, schedule it immediately, otherwise fall through to the deadline based scheduling described above. This allows extremely low latency CPU access for AboveUI threads while supporting the use case of high priority timeshare threads contending with lower fixed-priority threads, which is observed in some media workloads. Since the timeshare bucket will eventually drop in priority as it consumes CPU, this model provides the desired behavior for timeshare threads above UI.

When the EDF algorithm selects a low QoS root bucket due to deadline ordering even though a higher QoS root bucket is runnable, the root bucket selection algorithm marks the root bucket as being in starvation avoidance mode and opens a starvation avoidance window of `thread_quantum[bucket] / cpu` (the bucket's thread quantum divided across the CPUs). During this window all root bucket selections will pick the starved root bucket. This starvation avoidance window ensures that starved root buckets get a fair shot at draining the lower QoS threads even during heavy contention from higher priority threads.

The EDF algorithm is the best choice for this level for the following reasons:

* Deadline based scheduling allows the scheduler to define strict bounds on worst case execution latencies for all scheduling buckets.
* The EDF algorithm adapts dynamically to bucket runnability and selection. Since all deadline updates are computationally cheap, the algorithm can maintain up-to-date information without measurable overhead.
* It achieves the goals of maintaining low scheduling latency for high buckets and starvation avoidance for low buckets efficiently.
* Since the bucket level scheduler deals with a small fixed number of runnable buckets in the worst case, it is easy to configure in terms of defining deadlines, warps, etc.

### Thread Group Level

The second level is the “thread group” level, which decides which thread group within a QoS bucket should be selected next for execution. Thread groups represent a collection of threads working on behalf of a specific workload. The goal of this level is to share the CPU among various user workloads, with preference given to interactive applications over compute-intensive batch workloads.

**Implementation**

The thread group level implements a variation of the FreeBSD ULE scheduler to decide which thread group should be selected next for execution. Each thread group with runnable threads within a QoS bucket is represented by a `struct sched_clutch_bucket_group`. For multi-cluster platforms, the `sched_clutch_bucket_group` represents threads enqueued on all clusters on the platform. The clutch bucket group maintains CPU utilization history, runnable history and some timesharing information for the next level scheduler.

The clutch bucket group has one entry per cluster on the platform to represent the runnable threads of the thread group on that cluster. This entry is the `sched_clutch_bucket`, and this level of the algorithm tries to find the best clutch bucket to schedule within each root hierarchy. Each clutch bucket with runnable threads is represented as an entry in a runqueue which is ordered by clutch bucket priorities. The clutch bucket selection algorithm simply selects the clutch bucket with the highest priority in the clutch bucket runqueue. The priority calculation for the clutch buckets is based on the following factors:

* **Highest runnable thread in the clutch bucket**: The clutch bucket maintains a priority queue which contains threads ordered by their promoted or base priority (whichever property made the thread eligible to be part of that clutch bucket). It uses the highest of these threads to calculate the base priority of the clutch bucket. The use of both base and sched priority allows the scheduler to honor priority differences specified from userspace via SPIs, priority boosts due to priority inheritance mechanisms like turnstiles, and other priority-affecting mechanisms outside the core scheduler.
* **Interactivity score**: The scheduler calculates an interactivity score based on the ratio of voluntary blocking time and CPU usage time for the clutch bucket group as a whole. This score allows the scheduler to prefer highly interactive thread groups over batch processing compute intensive thread groups.

The clutch bucket group maintains a few metrics to allow calculation of the interactivity score for the thread group:

* **Clutch Bucket Group Blocked Time**: Maintains the amount of time no threads were runnable for the clutch bucket group.
* **Clutch Bucket Group Pending Time**: Maintains the amount of time threads were pending execution for this clutch bucket group. This value is reset as soon as one of the threads of the clutch bucket group is executed.
* **Clutch Bucket Group CPU Time**: Maintains the CPU time used by all threads of this clutch bucket group.

The interactivity score based algorithm is well suited for this level due to the following reasons:

* It allows for a fair sharing of CPU among thread groups based on their recent behavior. Since the algorithm only looks at recent CPU usage history, it also adapts to changing behavior quickly.
* Since the priority calculation is fairly cheap, the scheduler is able to maintain up-to-date information about all thread groups, which leads to better decisions.
* Thread groups provide a convenient abstraction for groups of threads working together for a user workload. Basing scheduling decisions on this abstraction allows the system to make interesting choices such as preferring Apps over daemons, which is typically better for system responsiveness.

The clutch bucket runqueue data structure allows clutch buckets to be inserted at the head of the queue when threads from that clutch bucket are pre-empted. The runqueue also rotates a clutch bucket to the end of its priority level when a thread is selected for execution from it. This allows the system to round-robin efficiently among thread groups at the same priority value, especially on highly contended systems with few CPUs.
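The head insertion and rotation behavior can be sketched with a toy fixed-size queue of clutch bucket identifiers; the array representation is illustrative (the real runqueues are linked lists per priority level):

```c
#define CB_QUEUE_CAP 8

/* Hypothetical FIFO of clutch buckets at one priority level. */
struct cb_queue {
        int ids[CB_QUEUE_CAP]; /* clutch bucket identifiers, front at index 0 */
        int count;
};

/* A pre-empted clutch bucket goes back to the head so it runs next. */
static void
cb_queue_insert_head(struct cb_queue *q, int id)
{
        for (int i = q->count; i > 0; i--) {
                q->ids[i] = q->ids[i - 1];
        }
        q->ids[0] = id;
        q->count++;
}

/*
 * Selecting a thread from the front clutch bucket rotates that bucket to the
 * tail, round-robining among same-priority thread groups.
 */
static int
cb_queue_select_and_rotate(struct cb_queue *q)
{
        int id = q->ids[0];
        for (int i = 0; i < q->count - 1; i++) {
                q->ids[i] = q->ids[i + 1];
        }
        q->ids[q->count - 1] = id;
        return id;
}
```

Repeated selection cycles through the buckets in order, while a pre-empted bucket re-inserted at the head is picked again before its peers.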

### Thread Level

At the lowest level the scheduler decides which thread within a clutch bucket should be selected next for execution. Each runnable thread in the clutch bucket is represented as an entry in a runqueue which is organized based on the `sched_pri` of the threads. The thread selection algorithm simply selects the highest priority thread in the runqueue. The `sched_pri` calculation for the threads is based on the traditional Mach scheduling algorithm, which uses load and CPU usage to decay the priority of a thread. The thread decay model is better suited at this level than in a global scheduler because the load calculation only accounts for threads in the same clutch bucket group. Since all threads in the same clutch bucket group belong to the same thread group and scheduling bucket, this algorithm provides quick CPU access for latency sensitive threads within the clutch bucket group without impacting other non-related threads in the system.

**Implementation**

The thread level scheduler implements the Mach timesharing algorithm to decide which thread within the clutch bucket should be selected next for execution. All runnable threads in a clutch bucket are inserted into the runqueue based on the `sched_pri`. The scheduler calculates the `sched_pri` of the threads in a clutch bucket group based on the number of runnable threads in the clutch bucket group and the CPU usage of individual threads. The load information is updated every scheduler tick and the threads use this information for priority decay calculation as they use CPU. The priority decay algorithm attempts to reward bursty interactive threads and penalize CPU intensive threads. Once a thread is selected for running, it is assigned a quantum which is based on the scheduling bucket it belongs to. The quanta for various buckets are defined statically as:

```
static uint32_t sched_clutch_thread_quantum_us[TH_BUCKET_SCHED_MAX] = {
        10000,                                          /* FIXPRI (10ms) */
        10000,                                          /* FG (10ms) */
        8000,                                           /* IN (8ms) */
        6000,                                           /* DF (6ms) */
        4000,                                           /* UT (4ms) */
        2000                                            /* BG (2ms) */
};
```

The per-bucket thread quantum allows the scheduler to bound the worst case execution latency for a low priority thread which has been starved by higher priority threads.

## Scheduler Priority Calculations

### Root Priority Calculation

The scheduler maintains a root level priority for the hierarchy in order to make decisions regarding pre-emptions and thread selection. The root priority is updated as threads are inserted/removed from the hierarchy. The root level also maintains the urgency bits to help with pre-emption decisions. Since the root level priority/urgency is used for pre-emption decisions, it is based on the threads in the hierarchy and is calculated as follows:

```
Root Priority Calculation:
* If AboveUI bucket is runnable,
*     Compare priority of AboveUI highest clutch bucket (CBUI) with Timeshare FG highest clutch bucket (CBFG)
*     If pri(CBUI) >= pri(CBFG), select CBUI
* Otherwise find the (non-AboveUI) highest priority root bucket that is runnable and select its highest clutch bucket
* Find the highest priority (promoted or base pri) thread within that clutch bucket and assign that as root priority

Root Urgency Calculation:
* On thread insertion into the hierarchy, increment the root level urgency based on thread's sched_pri
* On thread removal from the hierarchy, decrement the root level urgency based on thread's sched_pri
```

### Root Bucket Priority Calculation

The root bucket priority is simply the deadline of the root bucket, which is calculated by adding the WCEL of the bucket to the timestamp of the root bucket becoming runnable.

```
root-bucket priority = now + WCEL[bucket]
```

### Clutch Bucket Priority Calculation

As mentioned earlier, the priority value of a clutch bucket is calculated based on the highest runnable thread and the interactivity score. The actual calculation algorithm is as follows:

```
* Find the highest runnable thread (promoted or basepri) in the clutch bucket (maxpri)
* Calculate the ratio of CPU blocked and CPU used for the clutch bucket.
*      If blocked > used, assign a score (interactivity_score) in the higher range.
*      Else, assign a score (interactivity_score) in the lower range.
* clutch-bucket priority = maxpri + interactivity_score
```
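A minimal sketch of this calculation, assuming an illustrative score range of 0–8 (the actual xnu constants and scaling differ):

```c
#include <stdint.h>

/* Illustrative score range; the actual xnu constants differ. */
#define INTERACTIVITY_SCORE_MAX 8

/*
 * Sketch of the clutch bucket priority rule above: a mostly-blocked
 * (interactive) group earns a score in the upper half of the range, a
 * mostly-running (batch) group a score in the lower half.
 */
static int
clutch_bucket_pri(int maxpri, uint64_t blocked_time, uint64_t cpu_used)
{
        int half = INTERACTIVITY_SCORE_MAX / 2;
        int score;
        if (blocked_time > cpu_used) {
                /* Scale the used/blocked ratio into the upper half. */
                score = half + (int)(half - (half * cpu_used) / blocked_time);
        } else {
                /* Scale the blocked/used ratio into the lower half. */
                score = (int)((half * blocked_time) / (cpu_used ? cpu_used : 1));
        }
        return maxpri + score;
}
```

A fully blocked group gets the maximum boost over `maxpri`, a fully CPU-bound group gets none, matching the pseudocode's higher/lower ranges.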

### Thread Priority Calculation

The thread priority calculation is based on the Mach timesharing algorithm. It is calculated in the following manner:

```
* Every scheduler tick, snapshot the load for the clutch bucket
* Use the load value to calculate the priority shift values for all threads in the clutch bucket
* thread priority = base priority - (thread CPU usage >> priority shift)
```
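As a sketch of the decay rule (values are illustrative; in the real scheduler the shift is derived from the clutch bucket load snapshot each tick rather than passed in):

```c
#include <stdint.h>

/*
 * Thread priority = base priority - (CPU usage >> priority shift).
 * A heavier-loaded clutch bucket yields a smaller shift, so the same
 * CPU usage decays priority more steeply.
 */
static int
thread_sched_pri(int base_pri, uint32_t cpu_usage, uint32_t pri_shift)
{
        return base_pri - (int)(cpu_usage >> pri_shift);
}
```

A thread that has not used CPU keeps its base priority, while accumulated usage (or increased load, via a smaller shift) pushes it down the runqueue.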

## Edge Scheduler Design

The Edge scheduler implements all the necessary features needed for scheduling on multi-cluster asymmetric platforms. On each cluster, it uses the clutch scheduler timesharing design described above. In terms of thread placement and load balancing, the Edge scheduler represents the machine as a graph where each node is a compute cluster and the directional edges describe the likelihood of migrating threads from one cluster to another. The Edge scheduler works closely with the performance controller, which configures these migration edges dynamically.

### Edge Scheduler System Goals

* The system should be **compact**. Limit small-width workload threads to a single cluster as much as possible. A few reasons why this property is essential:
	* Better LLC usage.
	* Improved performance from avoiding expensive inter-cluster, inter-die cache fills.
	* Ability to power gate or power down unused ACCs.
* **Rapidly open up clusters** if the workload can use them (e.g., for parallelism, to mitigate passenger effects) for the following reasons:
	* No dark silicon.
	* Efficiency cores offer meaningful performance uplift for benchmarks as well as throughput-oriented workloads.
* Allow **low latency access to CPU** for high QoS work. This property ensures that high QoS threads experience low scheduling latency even under heavy CPU contention.
* **Migrate "down" only**. Threads should migrate to clusters where execution efficiency won’t be significantly worse. This might be a reason to open up new clusters (in contrast to the first system goal of keeping the system compact).
* Manage **passenger effects**. When the desired performance of workloads sharing a cluster begins to diverge, some will pay a “passenger tax”. In those cases it is desirable to split up the workloads to ensure the most efficient execution on the platform.
* Adapt to **rapidly changing workload widths**.

When the scheduler and performance controller for Skye (the first AMP platform) were being designed, much of the emphasis was placed on delineating work across E-cores and P-cores. Concepts such as thread groups and asymmetric spill & steal were invented to efficiently exploit this performance and energy-efficiency heterogeneity in the hardware. However, the same concepts largely apply to platforms with several homogeneous clusters, as well as heterogeneous clusters with independent DVFM domains.

### Edge Scheduler Thread Placement Strategy

The Edge scheduler uses **per-thread group recommendations** from the performance controller and **per-cluster runqueues** in the scheduler. The design aims to provide the performance controller with the ability to influence the width of the system (in terms of number of clusters), while retaining the scheduler's ability to go wide in the event of a thread storm.

#### Thread Group Cluster Recommendations

The scheduler expects the performance controller to specify a cluster recommendation for each thread group. To allow finer grained thread placement, the Edge scheduler allows the performance controller to specify a preferred cluster per QoS within the thread group, i.e. per sched\_clutch\_bucket\_group. The ability to prefer specific clusters rather than cluster types allows the performance controller to implement passenger tax reduction policies and co-locate workloads which expect similar performance characteristics.

When a thread becomes runnable, the scheduler looks at the preferred cluster recommendation of the sched\_clutch\_bucket_group it belongs to and uses that as the starting point for thread placement. If the preferred cluster is idle or running lower QoS workloads, the scheduler simply selects the preferred cluster for enqueueing the thread. Otherwise, the scheduler uses the thread migration strategy described in the next section.

When the performance controller changes the preferred cluster for a sched\_clutch\_bucket\_group, the Edge scheduler also provides an option to migrate the running and runnable threads for that group immediately (as opposed to at the next scheduling point, where the preferred cluster will be re-evaluated). The performance controller can use this feature to change recommendations for latency sensitive workloads from efficient to performance clusters and ensure that the workload threads get placed on the newly preferred cluster immediately.

#### Thread Migration Strategy

In order to choose a cluster and processor for a runnable thread, the Edge scheduler uses the preferred cluster for the thread's sched\_clutch\_bucket_group. If the preferred cluster is idle or running lower QoS workloads, the scheduler simply selects the preferred cluster for enqueueing the thread. Otherwise, the scheduler evaluates the outgoing edges from the preferred cluster to make migration decisions.

**Edge Scheduler Edge Matrix**

The Edge scheduler maintains a thread migration graph where each node represents a cluster and each directional edge represents the likelihood of migrating threads across that edge. Each graph edge encodes the following attributes:

```
typedef union sched_clutch_edge {
        struct {
                uint32_t
                /* boolean_t */ sce_migration_allowed : 1,
                /* boolean_t */ sce_steal_allowed     : 1,
                                _reserved             : 30;
                uint32_t        sce_migration_weight;
        };
        uint64_t sce_edge_packed;
} sched_clutch_edge;
```

The `sce_migration_allowed` and `sce_steal_allowed` flags indicate whether threads may be migrated or stolen across the edge. The `sce_migration_weight` is a measure of the scheduling latency delta that should exist between the source and destination nodes (i.e. clusters) for a thread to be migrated. The per-cluster scheduling latency metric is described in the next section.
215*5e3eaea3SApple OSS Distributions
216*5e3eaea3SApple OSS DistributionsThe performance controller can dynamically update the weights and properties of the edge matrix dynamically to change the width of the system for performance and efficiency reasons.
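
As a rough illustration of how this packed representation behaves, the sketch below mirrors the union definition above; the `edge_make()` helper is hypothetical and exists only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of the sched_clutch_edge union shown above. */
typedef union sched_clutch_edge {
        struct {
                uint32_t
                sce_migration_allowed : 1,
                sce_steal_allowed     : 1,
                _reserved             : 30;
                uint32_t sce_migration_weight;
        };
        uint64_t sce_edge_packed;
} sched_clutch_edge;

/* Hypothetical helper: build an edge with the given properties. */
static sched_clutch_edge
edge_make(int migration_allowed, int steal_allowed, uint32_t weight)
{
        sched_clutch_edge e = { .sce_edge_packed = 0 };
        e.sce_migration_allowed = migration_allowed ? 1 : 0;
        e.sce_steal_allowed = steal_allowed ? 1 : 0;
        e.sce_migration_weight = weight;
        return e;
}
```

Because the flags and weight alias a single `uint64_t`, the performance controller can publish a whole edge atomically by storing `sce_edge_packed`.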

**Edge Scheduler Cluster Scheduling Latency Metric**

The Edge scheduler maintains a per-cluster scheduling latency metric which indicates the latency for a thread of a given QoS to get on core on the cluster. The metric has its roots in queueing delay algorithms and calculates the amount of time it would take for a newly runnable thread to get picked by a core on the cluster. The scheduling latency metric is calculated using the following formula:

```
Scheduling-Latency(QoS) = Cumulative Higher QoS Load(QoS) * Avg. Execution Latency(QoS)
```
* Cumulative Higher QoS Load: The cumulative higher QoS load metric counts the number of runnable and running threads of a higher or equal QoS that are enqueued or running on the cluster. This measures the number of threads that are ahead of the newly runnable thread in terms of getting a chance to execute on the cluster.
* Avg. Execution Latency: The average execution latency metric tracks the average execution latency of threads at a particular QoS, i.e. the amount of work threads at this QoS typically do before blocking or context switching.
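
The formula can be sketched as follows; the function names and the moving-average weight are illustrative, not the kernel's actual implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch: the scheduling latency for a QoS is the product of
 * the cumulative higher-or-equal QoS load and the average execution
 * latency at that QoS (both inputs hypothetical here).
 */
static uint64_t
sched_latency_for_qos(uint64_t higher_qos_load, uint64_t avg_execution_latency)
{
        return higher_qos_load * avg_execution_latency;
}

/*
 * Both inputs are maintained as exponentially weighted moving averages;
 * shown here with an illustrative weight of 1/4 on the new sample.
 */
static uint64_t
ewma_update(uint64_t old_avg, uint64_t sample)
{
        return (3 * old_avg + sample) / 4;
}
```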

Both metrics are maintained as exponentially weighted moving averages to ensure they capture the recent behavior of the threads on the system. The per-cluster scheduling latency metric is used to decide thread placement based on the following algorithm:

```
* On thread becoming runnable, get the scheduling latency metric for the thread's QoS and preferred cluster (as specified by CLPC)
* If preferred cluster scheduling latency is 0, return preferred cluster
* Otherwise, for each cluster which is not the preferred cluster,
*    Calculate the scheduling latency metric for the cluster and the thread's QoS
*    If scheduling latency metric is 0, return cluster
*    Otherwise, calculate the scheduling latency delta between the cluster and the preferred cluster
*    If delta is less than the edge weight between preferred cluster and cluster, continue
*    Otherwise, if delta is greater than largest delta, store delta as largest delta
* Return cluster with largest scheduling latency delta
```

The order of cluster iteration in the algorithm above deliberately visits homogeneous clusters before asymmetric clusters, so that threads migrate to idle clusters with similar performance characteristics before considering asymmetric idle clusters.
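
The placement algorithm above can be sketched as follows; all names are hypothetical, and the cluster arrays are assumed to be ordered so that homogeneous peers of the preferred cluster come first.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch of the placement algorithm: given the per-cluster
 * scheduling latency for the thread's QoS and the edge weights out of the
 * preferred cluster, pick the cluster with the largest latency improvement
 * over the preferred cluster that exceeds the edge weight.
 */
static int
choose_cluster(int preferred, int ncpu_clusters,
    const uint64_t latency[], const uint64_t edge_weight[])
{
        if (latency[preferred] == 0) {
                return preferred;
        }
        int chosen = preferred;
        uint64_t largest_delta = 0;
        for (int c = 0; c < ncpu_clusters; c++) {
                if (c == preferred) {
                        continue;
                }
                if (latency[c] == 0) {
                        return c;       /* idle cluster: take it immediately */
                }
                if (latency[c] >= latency[preferred]) {
                        continue;       /* no improvement over preferred */
                }
                uint64_t delta = latency[preferred] - latency[c];
                if (delta < edge_weight[c]) {
                        continue;       /* delta too small to justify migration */
                }
                if (delta > largest_delta) {
                        largest_delta = delta;
                        chosen = c;
                }
        }
        return chosen;
}
```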

#### Thread Stealing/Rebalancing Strategy

The `SCHED(steal_thread)` scheduler callout is invoked when a processor does not find any thread for execution in its runqueue. The aim of the steal operation is to find threads runnable or running on other clusters which should be executed here instead. If the steal callout does not return a thread, the `thread_select()` logic calls the `SCHED(processor_balance)` callout, which IPIs other CPUs to rebalance threads and idles the current CPU, waiting for the IPI'ed CPU to reschedule one of its threads onto this CPU.

**Edge Scheduler Foreign Threads**

The Edge scheduler identifies clutch buckets (and correspondingly the threads in those clutch buckets) as foreign when they are enqueued on a cluster which is asymmetric with respect to the preferred cluster of the thread group. The foreign clutch buckets are part of the regular hierarchy of the clutch root but are also linked in a special "foreign" priority queue maintained at the root. This foreign priority queue allows other clusters to easily rebalance threads from asymmetric clusters when they run out of threads in their local hierarchy runqueue.

**Edge scheduler steal implementation**

The edge scheduler implements the steal operation via `sched_edge_processor_idle()`. This routine attempts the following operations in order:

```
* (1) Find foreign runnable threads in non-native cluster runqueues (sched_edge_foreign_runnable_thread_remove())
* (2) Steal a runnable thread from a native cluster runqueue (sched_edge_steal_thread())
* (3) Check if foreign threads are running on the non-native clusters (sched_edge_foreign_running_thread_available())
*     If available, return THREAD_NULL for the steal callout and perform rebalancing as part of SCHED(processor_balance) i.e. sched_edge_balance()
* (4) Steal a thread from another cluster based on sce_steal_allowed & cluster loads (sched_edge_steal_thread())
```
Performing these operations in this specific order ensures, first and foremost, that threads are not left runnable or executing on cluster types different from their preferred cluster type. If no such thread is found, the scheduler aims to reduce the load on other clusters by stealing threads from them.
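
The ordering can be sketched as a simple fallback chain; the signature below stubs out each step's outcome as a parameter, so all plumbing and names besides the step order are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct thread *thread_t;
#define THREAD_NULL ((thread_t)NULL)

/*
 * Illustrative sketch of the fallback order in sched_edge_processor_idle():
 * (1) foreign runnable threads, (2) native steal, (3) foreign running
 * threads (rebalance instead of steal), (4) load-based steal. Each step's
 * result is passed in as a parameter for illustration only.
 */
static thread_t
processor_idle_sketch(thread_t foreign_runnable, thread_t native_steal,
    int foreign_running_available, thread_t load_based_steal,
    int *want_rebalance)
{
        *want_rebalance = 0;
        if (foreign_runnable != THREAD_NULL) {
                return foreign_runnable;        /* step (1) */
        }
        if (native_steal != THREAD_NULL) {
                return native_steal;            /* step (2) */
        }
        if (foreign_running_available) {
                /* Step (3): defer to SCHED(processor_balance). */
                *want_rebalance = 1;
                return THREAD_NULL;
        }
        return load_based_steal;                /* step (4) */
}
```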

**Edge scheduler rebalance operation**

If `SCHED(steal_thread)` did not return a thread for the processor, it indicates that the processor found a thread running on a "foreign" cluster and would like to rebalance it onto itself. The implementation (`sched_edge_balance()`) sends an IPI to the foreign CPU, idles itself, and waits for the foreign CPU to rebalance the thread onto this idle CPU.

#### Cluster Shared Resource Thread Management

The Edge scheduler attempts to load balance cluster shared resource intensive threads across clusters in order to reduce contention on the shared resources. It achieves this by maintaining the runnable and running shared resource load on each cluster and balancing the load across multiple clusters. The current implementation looks at the per-cluster shared resource load at thread runnable time to enqueue the thread on the appropriate cluster.

**Cluster shared resource thread scheduling policy**

Threads using shared resources can be scheduled using one of two policies:

* EDGE\_SHARED\_RSRC\_SCHED\_POLICY\_RR
This policy distributes the threads so that they spread across all available clusters irrespective of type. The idea is that this scheduling policy will put a shared resource thread on each cluster on the platform before it starts doubling up on clusters.
* EDGE\_SHARED\_RSRC\_SCHED\_POLICY\_NATIVE\_FIRST
This policy distributes threads so that they first fill up all the capacity on the preferred cluster and its homogeneous peers before spilling over to a different core type. The current implementation defines capacity based on the number of CPUs in the cluster; a cluster's shared resource capacity is considered full once there are "n" runnable + running shared resource threads on a cluster with n CPUs. This policy differs from the default scheduling policy of the edge scheduler in that it always tries to fill the native clusters to capacity even when non-native clusters might be idle.
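
The capacity check that distinguishes the two policies can be sketched as follows; the helper names are hypothetical, though the policy enum names match those above.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
        EDGE_SHARED_RSRC_SCHED_POLICY_RR,
        EDGE_SHARED_RSRC_SCHED_POLICY_NATIVE_FIRST,
} edge_shared_rsrc_policy_t;

/*
 * Illustrative capacity check: a cluster's shared resource capacity is
 * considered full once it has as many runnable + running shared resource
 * threads as it has CPUs.
 */
static int
shared_rsrc_cluster_full(uint32_t shared_rsrc_load, uint32_t cluster_ncpus)
{
        return shared_rsrc_load >= cluster_ncpus;
}

/* Hypothetical decision point: spill to a non-native cluster? */
static int
should_spill_to_nonnative(edge_shared_rsrc_policy_t policy,
    uint32_t native_load, uint32_t native_ncpus)
{
        if (policy == EDGE_SHARED_RSRC_SCHED_POLICY_NATIVE_FIRST) {
                /* Only spill once native clusters are at capacity. */
                return shared_rsrc_cluster_full(native_load, native_ncpus);
        }
        /* The RR policy spreads across all clusters irrespective of type. */
        return 1;
}
```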

#### Long Running Workload AMP Round Robining

The Edge scheduler implements a policy to round-robin long running workload threads across clusters of various types to ensure that all threads of the workload make equal progress, aka "stir-the-pot". This is essential for performance of workloads that statically partition work among ncpu threads. The scheduler invokes this mechanism when a thread expires its quantum on a non-preferred cluster (most likely due to migration/spilling from the preferred cluster). The scheduler recognizes this (via `AST_QUANTUM` and `AST_REBALANCE` being set) and enqueues the thread on a cluster native to the preferred cluster. On the next scheduling event for that cluster, a CPU will pick up this thread and spill/migrate the previously running thread onto the non-preferred cluster. In order to make sure all clusters native to the preferred cluster are equally subject to this round-robining, the scheduler maintains a `scbg_amp_rebalance_last_chosen` value per `sched_clutch_bucket_group` (which represents all threads of a workload at the same QoS level).
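
A minimal sketch of the rotation, assuming a hypothetical per-bucket-group last-chosen index over the clusters native to the preferred cluster:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch: rotate a quantum-expired thread across the clusters
 * native to its preferred cluster, advancing a per-bucket-group
 * "last chosen" index so all native clusters are used equally.
 * All names here are hypothetical.
 */
static int
amp_rebalance_next_cluster(const int native_clusters[], int native_count,
    int *last_chosen_idx)
{
        *last_chosen_idx = (*last_chosen_idx + 1) % native_count;
        return native_clusters[*last_chosen_idx];
}
```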