A Casual Talk About CPU Timing and Modern Operating Systems

January 20, 2023 Observability CPU, Linux, Operating System 3497 words 17 min read

🔊

First, let’s review time-sharing systems. The time-sharing system is a very important operating system concept that maximizes computer utilization and is a crucial means of implementing multi-program concurrency.

The Linux kernel we use daily also adopts the time-sharing system philosophy, mainly reflected in the following aspects:

Time Slice:

Linux uses a time slice mechanism to divide CPU time. Each process can only execute for one time slice before yielding the CPU to other processes. This achieves CPU time sharing and fair allocation.

Context Switching:

When a time slice expires or a process voluntarily gives up the CPU, context switching occurs, saving the current process context and restoring the next process context. This allows the CPU to efficiently switch between different processes.

Process Scheduling:

Linux uses the CFS scheduler to select the most suitable process to run based on each process’s time slice, which embodies the time-sharing system philosophy. Different scheduling strategies can achieve different time-sharing effects.

Interrupt Mechanism:

Linux uses an interrupt mechanism for time management and scheduling. Clock interrupts can notify the kernel to perform context switching and scheduling when a time slice expires. This provides the power basis for the time-sharing system.

Clock Events:

Linux manages various time events based on the time-sharing system, such as timers, sleep/wakeup, etc. This requires the kernel to manage and schedule based on clock interrupts.

In addition, the time-sharing system philosophy is also reflected in Linux in:
Multi-program design Linux supports concurrent execution of multiple programs, which also relies on the CPU time-sharing mechanism implemented by the time-sharing system.
Real-time capability By setting real-time scheduling policies and optimizing interrupt handling, Linux can provide good real-time response performance. This also requires time-sharing system support.
Sleep and wakeup Processes can voluntarily sleep and release the CPU, which requires the time-sharing system to reschedule CPU time when they wake up.
Synchronization mechanisms Linux provides various synchronization mechanisms, all of which require the time-sharing system for coordination and scheduling between processes.

The time-sharing system is the foundation for Linux to implement multi-programming, concurrent execution, real-time response, and time management. It enables Linux to fully utilize CPU resources, achieving efficient and fair scheduling. The time-sharing system philosophy permeates every aspect of the Linux kernel and is an important concept for understanding Linux scheduling and concurrency.

Let’s Talk About Linux Timing

Before we proceed, we need to understand a few concepts:

Some Words

Jiffies

If you’re interested, you can check out Kernel Clock Issues.

jiffies is a very important Linux kernel variable that represents the timestamp since system boot, expressed as the count of clock interrupts. It has the following characteristics:

jiffies measures time by the number of clock interrupts, so its precision is determined by the clock interrupt frequency. At 1000HZ clock interrupt, jiffies precision is 1ms.
jiffies is an unsigned long integer value that continuously accumulates with each clock interrupt. Its value range determines the maximum continuous runtime of Linux.
jiffies wraps around to 0 after overflow. So it can only be used to measure short time intervals from a certain point, not to represent absolute system time.
Many time-related parameters in the kernel use jiffies as a unit, such as time slice size, clock interrupt interval, etc. This allows Linux to migrate across different clock sources.
jiffies can be used to compare the time difference between two timestamps, determine if an event has timed out, etc. But special attention must be paid to jiffies overflow when comparing absolute time.
The rate of change of jiffies can be used to roughly judge Linux kernel load. The faster jiffies changes, the more frequent clock interrupts, likely indicating heavier system load.
In user space, the system boot time and current jiffies value in seconds can be obtained through the /proc/uptime file. This can be used to calculate absolute system time.

Therefore, as the timestamp variable in the Linux kernel, jiffies has the following important meanings:

It decouples Linux’s time unit from specific clock sources, allowing migration across different clocks.
It is used to measure short time intervals, determine timeouts, etc., but due to possible overflow, it is not suitable for representing absolute time.
Its rate of change can reflect system load, used for rough system performance assessment.
Through it, different time units can be translated between kernel and user space, such as HZ and seconds.
Many kernel timer and timing-related parameters are based on jiffies, making it the foundation of Linux time management.

Understanding the meaning and role of jiffies helps us gain a deeper understanding of the Linux kernel’s time management mechanism. It decouples Linux’s time unit from hardware clock sources and is the cornerstone of Linux’s flexible time management.

CPU Time Slice

The CPU time slice is a mechanism introduced by the operating system to implement CPU scheduling. Its main functions are:

Achieve fair sharing of CPU resources by periodically interrupting the currently running process, allowing other processes to run, preventing high-priority processes from monopolizing the CPU.
Prevent any single process from occupying the CPU for too long, affecting other processes’ operation and system response speed.
Provide opportunities for the CPU scheduler to periodically re-evaluate process priorities and select the most suitable process to run.

The principle of CPU time slice is:

The operating system isolates and limits CPU usage based on clock interrupts. When the time slice expires, the currently running process is paused, and the CPU scheduler selects another process to run.
The CFS scheduler allocates a time slice to each process based on its priority and other factors, determining how long it can run.
Processes consume their time slice while running; when the time slice expires, they are scheduled off the CPU, making time for other processes.
If a process voluntarily gives up the CPU before its time slice expires (e.g., for I/O blocking), its remaining time slice is saved and continues to be used when it next gets CPU time.
Real-time processes can set fixed time slices not managed by the CFS scheduler to ensure their real-time response requirements.

The size of the CPU time slice (clock interrupt frequency) directly affects scheduling frequency and fairness. Smaller time slices enhance fairness and response speed but also increase context switching overhead. Time slice settings need to balance fairness and efficiency.

In the Linux kernel, the clock interrupt frequency is 1000HZ, and the default time slice is 1ms. The CFS scheduler dynamically adjusts time slice size based on each process’s vruntime, but doesn’t exceed the default maximum time slice. This achieves a good balance between scheduling fairness and overhead.

CPU Queues

CPU queues are data structures in the Linux kernel used to manage runnable processes (TASK_RUNNING state). They mainly serve the following purposes:

Temporarily save runnable processes not yet selected by the current CPU, waiting for the next scheduling.
Sort processes according to their priority and scheduling policy, providing candidate processes for CPU scheduling.
Achieve fair CPU resource sharing, preventing high-priority processes from monopolizing the CPU.

The Linux kernel’s CPU queues mainly include the following types:

Runqueue: One per CPU, saves processes currently running or ready to run on that CPU. The CFS scheduler directly selects processes from this queue for scheduling.
Backup queue: Global singleton, saves processes pushed out from other CPUs, waiting to be rescheduled. Primarily prevents high-priority processes from being starved.
Expired queue: Global singleton, saves processes whose time slices have expired but with high priority that haven’t been immediately scheduled off the CPU. Awaiting next scheduling consideration.
Wakeup queue: Global singleton, saves processes woken from sleep state, awaiting next scheduling.

The data flow between these CPU queues is as follows:

Newly created process → Wakeup queue → Backup queue → Expired queue → Runqueue

Processes in RUNNING state are added to their respective CPU runqueues awaiting scheduling; if scheduled off the CPU, they enter the backup queue or expired queue; when woken from sleep, they enter the wakeup queue. These queues manage and sort processes around RUNNABLE and RUNNING states.

The CFS scheduler selects processes from the runqueue and backup queue for scheduling. Processes in the expired queue and wakeup queue can also be scheduled to run early under appropriate circumstances. The use of these queues and the flow of processes between them ensure that each CPU’s resources can be fully and fairly utilized by processes that need them, implementing Linux’s time-sharing scheduling policy.

So?

In the Linux kernel, the CPU time slice is used to implement time sharing in CPU scheduling. Its processing logic is as follows:

The kernel periodically isolates the execution of processes or threads from the CPU based on clock interrupts. This is CPU context switching, aimed at reselecting other processes or threads to run.
The clock interrupt frequency determines the time slice size. Faster clocks mean smaller time slices and more frequent CPU scheduling. Early systems used PIT providing 100HZ (10ms) clock interrupts; modern kernels mainly use HPET providing 1000HZ (1ms) interrupts.
Time slice allocation is determined by the CFS scheduler. In principle, each process’s time slice size should be proportional to its priority and vruntime. CFS calculates each process’s vruntime in real-time to determine its time slice.
The kernel manages process context to record the process’s time slice usage, including: time slice size, used time slice, whether running, etc. This information is stored in the task_struct structure.
When a process gets CPU execution resources, the used time slice is subtracted from time_slice. If time_slice runs out, it means the time slice is exhausted, and scheduling should occur to select another process to run.
If a process voluntarily gives up the CPU before its time slice expires (e.g., sleep, I/O), the remaining time slice is preserved and continues to be used when it next gets CPU execution. However, after exceeding a certain number of switches, the original time slice is invalidated and reallocated.
The kernel periodically checks the time slice usage of the currently running process based on the clock interrupt cycle. Once the time slice is exhausted, the process is marked as schedulable, and another process is selected to run.
For real-time processes, the kernel can set static time slices not managed by the CFS scheduler, ensuring they get regular CPU execution time. This meets the needs of real-time processes.

Therefore, in Linux kernel CPU scheduling, the time slice plays a very important role. It achieves fair CPU sharing and meets different processes’ scheduling requirements. The kernel uses clock interrupts, the CFS scheduler, and process context to implement the allocation, management, and usage checking of time slices, controlling the execution time each process gets within each CPU computation cycle, thus achieving CPU resource allocation and scheduling.

What’s the Relationship and Difference Between Clock Interrupt Frequency and Time Slice?

Clock interrupt frequency and time slice are two related but different concepts:

Clock interrupt frequency determines the number of clock interrupts, i.e., the frequency of CPU context switching. The more frequent the clock interrupts, the more frequent the context switching, and the smaller the time slice.
The time slice determines the maximum time a process can occupy the CPU after getting CPU time. The time slice size is determined by the operating system based on clock interrupt frequency and scheduling policy.
The clock interrupt is an event that notifies the operating system to isolate the currently running process and select another process to run. Its frequency reflects the number of these isolation actions.
The time slice is a resource allocation mechanism that determines how long a process can exclusively use the CPU after getting CPU time.

Their relationship is:

The higher the clock interrupt frequency, the smaller the time slice. For example, 100HZ corresponds to a 10ms time slice, 1000HZ corresponds to a 1ms time slice.
The time slice size limits the maximum time a process can exclusively run on the CPU. The clock interrupt is used to forcibly isolate the current process when the time slice expires, allowing other processes to run.
The clock interrupt must be greater than or equal to the time slice size. If the clock interrupt frequency is 10HZ but the time slice is 5ms, this would be unreasonable, causing the timer to be ineffective within the time slice.
Smaller time slices enhance scheduling fairness and response speed but increase context switching overhead. The clock interrupt frequency and time slice setting need to balance these two aspects.

So, to summarize:

Clock interrupt frequency determines the frequency of CPU context switching, affecting the operating system’s scheduling frequency.
The time slice determines each process’s CPU occupation time, affecting scheduling fairness and process waiting time.
The higher the clock interrupt frequency, the smaller the time slice; but it also increases context switching overhead.
They need to be set with appropriate parameters based on system requirements.

Correctly understanding the relationship and difference between clock interrupts and time slices is very important for learning the operating system’s scheduling mechanism. So, in general, clock interrupts provide the power for scheduling (determining the rhythm of scheduling), and time slices implement the fairness policy of scheduling (determining the time allocation for each process). The two work together to complete the efficient and reasonable CPU scheduling of the operating system.

What Use Is There in Learning These Concepts?

Our developed programs and products run on modern operating systems, including Kubernetes. Famous experts are studying time-related issues. If interested, click on the links below: https://github.com/kubernetes/kubernetes/pull/111520 https://github.com/zalando-incubator/kubernetes-on-aws/pull/923 Including the well-known theory of Queuing Theory

Queuing Theory

Queuing theory is an important branch of operations research, mainly studying the performance analysis and optimization of queuing systems. It creates mathematical models to analyze factors such as customer arrival patterns, service patterns, and queue discipline on system performance, helping us understand and optimize complex queuing systems.

Queuing theory can be effectively used to describe and analyze CPU timing issues. We can view the CPU as a server, tasks or processes as customers, and establish mathematical models to analyze the impact of different scheduling strategies on CPU performance.

Specifically, we can model it as follows:

Customers: Tasks or processes on the CPU that continuously arrive and wait for CPU service. The arrival pattern of these tasks can be considered random.
Servers: CPU cores that provide computing services to tasks according to certain rules. CPU service time can also be considered a random variable.
Queue area: Ready queue, which holds tasks that are ready and waiting to run. Different scheduling rules correspond to different queuing disciplines.
Service mechanism: CPU scheduler that selects tasks from the ready queue and schedules them to run on the CPU. Different scheduling algorithms correspond to different service mechanisms.

Based on this model, we can analyze the impact of different scheduling strategies on CPU performance:

FCFS: The ready queue effectively operates as FIFO, which is unfair to short tasks, and long tasks experience long response times.
SJF: Can optimize average response time but may cause starvation for long tasks.
Round-robin: Indirectly limits each task’s CPU time, improving fairness but potentially reducing CPU utilization.
Priority: Provides faster service for high-priority tasks, improving efficiency, but problems like priority inversion can occur.
Feedback queue: Dynamically adjusts task priority based on CPU usage, achieving a balance between fairness and efficiency.
Increasing CPU count: Improves system throughput but also increases hardware cost and management complexity.
Affinity: Binding tasks to specific CPUs can improve cache utilization and performance, requiring a trade-off with task migration costs.

So, we can see that many issues in CPU scheduling, such as fairness, throughput, responsiveness, and priority, can be studied using queuing theory models and analysis methods. By adjusting service modes, queuing rules, CPU count, and other parameters, different scheduling strategies can be implemented to optimize CPU performance.

Queuing theory provides us with very useful tools and approaches for understanding and analyzing complex CPU scheduling mechanisms. It can mathematize and model scheduling problems, helping us think about and optimize problems from a higher perspective. Therefore, queuing theory is one of the important foundational theories for studying operating systems and learning CPU scheduling mechanisms.

Let’s Also Chat About Some Side Topics

Observing CPU Throttling

The principle of CPU throttling is to control process access to the CPU to ensure system stability and fairness. Its main idea is: when CPU usage exceeds a certain threshold, limit process access to the CPU, preventing excessive CPU utilization.

CPU Throttling is Commonly Used in the Following Scenarios:

Preventing CPU overload. When system CPU utilization is too high, CPU throttling can limit individual processes’ CPU usage, avoiding CPU overload leading to system crashes or slow response.
Ensuring service quality. CPU throttling can guarantee that critical business processes get sufficient CPU resources, preventing them from being crowded out by other processes. This provides stable service quality for critical businesses.
Preventing process CPU starvation. CPU throttling can prevent individual processes from being unable to get CPU time slices for extended periods, ensuring every process gets a chance to run.

CPU Throttling Typically Has Two Main Approaches:

Time slice throttling. Limit the CPU time slice a process can use within each clock interrupt cycle. For example, limit it to 80% of the original.
I/O rate limiting. Limit a process’s I/O throughput rate, indirectly limiting its CPU usage. Because CPU usage is usually accompanied by I/O operations, limiting I/O can reduce CPU utilization.

In Linux systems, tools like ulimit, cpulimit can be used to set CPU throttling for processes. Additionally, cgroups provides CPU limiting functionality, allowing more fine-grained control over container/process CPU usage.

CPU throttling is a very useful technique that helps system administrators better manage CPU resources, providing more stable and reliable services. However, if used improperly, it can also affect system and business performance, so CPU limits for each process need to be set carefully.

The Calculation Logic of CPU Throttling Mainly Involves Two Aspects:

CPU utilization calculation. Real-time calculation of system and process CPU utilization as a criterion for CPU throttling decisions.
The formula for CPU utilization is: CPU usage time / measurement interval. CPU usage time can be obtained from /proc/stat, and the measurement interval is typically 1 second.
- System CPU utilization = (user + nice + system + idle) / total time
- Process CPU utilization = process CPU time / total time
CPU throttling decision and execution. Determining whether CPU throttling is needed based on CPU utilization, and if so, calculating each process’s CPU limit and applying it.
The decision logic could be:
- If system CPU utilization > threshold (e.g., 80%), trigger system-level CPU throttling
- If critical process CPU utilization < minimum threshold (e.g., 50%), trigger CPU throttling for that process
- If normal process CPU utilization > maximum threshold (e.g., 20%), trigger CPU throttling for that process
CPU limit calculation can be based on the process’s original CPU share:
- System level: Each process CPU limit = Process original CPU share * system threshold (e.g., 80%)
- Process level: This process’s CPU limit = Process original CPU share * process threshold (e.g., 50% or 20%)
- Manually set CPU limits for certain critical processes, with remaining processes allocated proportionally

A Short Story About Prometheus and CPU Throttling

General Host Scenario

Prometheus can conveniently implement CPU throttling monitoring and alerting, mainly through the following:

CPU utilization metrics:
- Calculate overall system CPU utilization: rate(node_cpu_seconds_total{mode=“system”}[1m]) / 60
- Calculate each process’s CPU utilization: rate(process_cpu_seconds_total{name=“process_name”}[1m]) / 60

Determine if CPU utilization exceeds the threshold; if so, trigger CPU throttling or alerts:

bash
1
2
alert: CPUUtilizationTooHigh
  expr: rate(node_cpu_seconds_total{mode="system"}[1m]) / 60 > 0.8

bash
1
2
alert: ProcessCPUOveruse
  expr: rate(process_cpu_seconds_total{name="process_name"}[1m]) / 60 > 0.5

If CPU throttling is triggered, calculate each process’s CPU limit and apply it:
Process CPU limit = Process original CPU usage * (1 - system CPU usage ratio exceeding the threshold)
Example:
- System CPU utilization threshold: 80%
- Current system CPU utilization: 90%
- A process’s original CPU usage: 20%
Then: Process CPU limit = 20% * (1 - (90% - 80%) / 90%) = 10%
This means the process’s CPU utilization limit is reduced from 20% to 10%, achieving CPU throttling.
Apply the process’s CPU limit using cgroup or other mechanisms:
bash
1
cgroup_cpu_limit{process_name="process_name", cpu_limit="10%"}
At this point, Prometheus calculates CPU utilization in step 1, determines if throttling is needed in step 2, calculates each process’s CPU limit in step 3, and sets the cgroup CPU limit for the process in step 4, completing the entire CPU throttling workflow.

Kubernetes Cluster

In Kubernetes, Prometheus can monitor Pod CPU utilization and perform CPU throttling. The main steps are:

Monitor Pod CPU utilization metrics:

bash
1
rate(container_cpu_usage_seconds_total{pod=~"pod_name", container=~"container_name"}[1m])

Set a CPU utilization threshold; trigger an alert if exceeded (indicating throttling is needed for this Pod):

bash
1
2
alert: PodCPUOveruse
  expr: rate(container_cpu_usage_seconds_total{pod=~"pod_name", container=~"container_name"}[1m]) > 0.8

Calculate Pod CPU limit: Pod CPU limit = Total CPU quota * (1 - CPU usage ratio exceeding the threshold)
For example, if the Pod’s total CPU quota is 2 CPUs, current CPU utilization is 90%, and the threshold is 80%: Pod CPU limit = 2 * (1 - (90% - 80%) / 90%) = 1 CPU

Set the Pod’s CPU limit, with two main approaches:

bash
1
kubectl patch pod pod-name -p '{"spec":{"containers":[{"name":"container-name","resources":{"limits":{"cpu": "1"}}}]} }'