Wednesday, May 24, 2017

Oracle Linux - capture context switching in Linux

Before we dive into the subject: context switching is normal. The Linux operating system needs context switching to function, so there is no need to worry. The question is, if it is normal, why would you want to monitor it? Because, as with everything, normal behavior is acceptable, but behavior that gets out of bounds will cause issues. You can expect a certain number of context switches at any given moment; however, when the number of context switches gets out of hand, processes execute more slowly for the users of the system.

Definition of context switching
The definition of context switching given by the Linux Information Project is as follows: “A context switch (also sometimes referred to as a process switch or a task switch) is the switching of the CPU (central processing unit) from one process or thread to another. A process (also sometimes referred to as a task) is an executing (i.e., running) instance of a program. In Linux, threads are lightweight processes that can run in parallel and share an address space (i.e., a range of memory locations) and other resources with their parent processes (i.e., the processes that created them).”

A context switch comes at a cost: it takes time and CPU capacity to perform. This means that avoiding a context switch where you can helps the overall performance of the system. Context switches come in two types: voluntary context switches and non-voluntary context switches.

Voluntary context switches
A running process can decide to initiate a context switch itself. If the decision is made by the code, we talk about a voluntary context switch (voluntary_ctxt_switches). For example, a process can voluntarily give up its execution time by calling sched_yield, or it can go to sleep while waiting for some event to happen.

Additionally, a voluntary context switch happens when your computation completes before the allocated timeslice expires.

All of this is acceptable when used in the right manner and when you are aware of the costs of a context switch.

Non-voluntary context switches
Next to voluntary context switches we have non-voluntary context switches (nonvoluntary_ctxt_switches). A non-voluntary context switch happens when the kernel scheduler takes the CPU away from a task that would otherwise keep running, for example because a higher-priority task became runnable or because the task did not complete within its timeslice. In that case the state of the task is saved and a non-voluntary context switch takes place.

Prevent context switching
When developing high performance computing solutions you should, at the very least, be aware of context switching and take it into account. Even better, try to minimize the number of voluntary context switches and try to find the cause of every non-voluntary context switch.
As context switching comes with a cost, you want to minimize it as much as possible. When a non-voluntary context switch happens, the state needs to be saved and the task is placed back in the scheduler queue, where it has to wait for another execution timeslice. This slows down the overall performance of your system and slows down your own code even more.

Check proc context switches
When working on Linux (we are using Oracle Linux in this example, however this applies to most distributions), you can check information on context switches by looking at the status file located at /proc/{PID}/status. In the example below we check the voluntary and non-voluntary context switches of PID 25334.

[root@ce /]#
[root@ce /]# cat /proc/25334/status | grep _ctxt_
voluntary_ctxt_switches: 687
nonvoluntary_ctxt_switches: 208
[root@ce /]#

As you can see, the number of voluntary context switches is (at this moment) 687 and the number of non-voluntary context switches is 208. Keep in mind that both counters are cumulative since the process started. This is a quick and dirty way of determining the number of context switches that a specific PID has had up to a specific moment.
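Since these counters only ever grow, the difference between two readings is often more telling than a single value. The following is a minimal sketch of that idea, not a finished tool. The function names ctxt_counts and ctxt_delta are made up for this illustration, and the sketch assumes a Linux /proc filesystem and a POSIX shell with awk.

```shell
# Print "voluntary nonvoluntary" from a /proc/<PID>/status style file.
# (ctxt_counts is a hypothetical helper name, not an existing command.)
ctxt_counts() {
    awk '/^voluntary_ctxt_switches:/    { v = $2 }
         /^nonvoluntary_ctxt_switches:/ { n = $2 }
         END { print v + 0, n + 0 }' "$1"
}

# Print how much both counters grew for one PID over an interval (seconds).
# The default interval of 5 seconds is an arbitrary choice.
ctxt_delta() {
    pid=$1
    interval=${2:-5}
    before=$(ctxt_counts "/proc/$pid/status")
    sleep "$interval"
    after=$(ctxt_counts "/proc/$pid/status")
    # $1 $2 are the first reading, $3 $4 the second reading.
    echo "$before $after" | awk -v pid="$pid" -v s="$interval" \
        '{ printf "pid %s: voluntary +%d, non-voluntary +%d in %ss\n", pid, $3 - $1, $4 - $2, s }'
}
```

For the PID used above, a 10 second measurement would be started with: ctxt_delta 25334 10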

Monitor context switches
You can monitor your systems for context switching. Being able to do so does not mean you should: even though it provides information about your system, in most cases and deployments there is no real need to monitor the number of context switches constantly. Having stated that, there are also many cases where monitoring context switching can be vital for ensuring the health of your server and/or for comparing nodes in a wide cluster.

A quick and dirty way of monitoring your context switches is by taking a sample. For example, you could take a sample of the average number of context switches for all processes on your Linux instance that execute a context switch in the sample timeframe.

The example below takes a single 2 second sample of the context switches (pidstat -w 2 1) and prints only the relevant data. For this we use the pidstat command, which is part of the sysstat package available from the Oracle Linux YUM repository.

pidstat -w 2 1 | grep Average | grep -v pidstat | sort -n -k4 | awk '{ if ($2 != "PID") print "ctxt sample:" $2" - "  $3 " - " $4 " - "  $5}'

The full example in our case looks like the one below:

[root@ce tmp]# pidstat -w 2 1 | grep Average | grep -v pidstat | sort -n -k4 | awk '{ if ($2 != "PID") print "ctxt sample:" $2" - "  $3 " - " $4 " - "  $5}'
ctxt sample:12 - 0.50 - 0.00 - watchdog/0
ctxt sample:13 - 0.50 - 0.00 - watchdog/1
ctxt sample:15 - 0.50 - 0.00 - ksoftirqd/1
ctxt sample:18 - 3.00 - 0.00 - rcuos/1
ctxt sample:2183 - 1.00 - 0.00 - memcached
ctxt sample:2220 - 1.00 - 0.00 - httpd
ctxt sample:52 - 1.00 - 0.00 - kworker/1:1
ctxt sample:56 - 1.50 - 0.00 - kworker/0:2
ctxt sample:7 - 14.00 - 0.00 - rcu_sched
ctxt sample:9 - 11.50 - 0.00 - rcuos/0
[root@ce tmp]#

To understand the output we have to look at how pidstat normally presents its output. Below is an example of the standard pidstat output:

[root@ce tmp]# pidstat -w 2 1
Linux 4.1.12-61.1.28.el6uek.x86_64 (testbox7.int)  05/23/2017  _x86_64_ (2 CPU)

03:24:37 PM       PID   cswch/s nvcswch/s  Command
03:24:39 PM         3      0.50      0.00  ksoftirqd/0
03:24:39 PM         7     14.43      0.00  rcu_sched
03:24:39 PM         9      9.45      0.00  rcuos/0
03:24:39 PM        18      3.98      0.00  rcuos/1
03:24:39 PM        52      1.00      0.00  kworker/1:1
03:24:39 PM        56      1.49      0.00  kworker/0:2
03:24:39 PM      1557      0.50      0.50  pidstat
03:24:39 PM      2183      1.00      0.00  memcached
03:24:39 PM      2220      1.00      0.00  httpd

Average:          PID   cswch/s nvcswch/s  Command
Average:            3      0.50      0.00  ksoftirqd/0
Average:            7     14.43      0.00  rcu_sched
Average:            9      9.45      0.00  rcuos/0
Average:           18      3.98      0.00  rcuos/1
Average:           52      1.00      0.00  kworker/1:1
Average:           56      1.49      0.00  kworker/0:2
Average:         1557      0.50      0.50  pidstat
Average:         2183      1.00      0.00  memcached
Average:         2220      1.00      0.00  httpd
[root@ce tmp]#

As you can see from the “script”, we print $2, $3, $4 and $5 for all Average lines where $2 is not “PID”, which skips the header row and leaves only the data. In our case the columns we show are the following:

$2 – the PID
$3 – average number of voluntary context switches per second (cswch/s) during the sample
$4 – average number of non-voluntary context switches per second (nvcswch/s) during the sample
$5 – the command name
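The same column selection can also be expressed as a single awk filter followed by a numeric sort. This is only a sketch under assumptions: ctxt_sample is a hypothetical function name, and the field numbers assume the column layout shown in the pidstat output above. Some sysstat versions print an extra UID column, which shifts every field by one.

```shell
# Read `pidstat -w` output on stdin and print "PID cswch/s nvcswch/s command"
# for every data row of the "Average:" block, busiest process first.
# Assumes the layout: Average: PID cswch/s nvcswch/s Command
ctxt_sample() {
    awk '$1 == "Average:" && $2 != "PID" && $5 != "pidstat" {
             print $2, $3, $4, $5
         }' | LC_ALL=C sort -k2,2nr -k1,1n
}

# Usage on a live system (requires the sysstat package):
#   pidstat -w 2 1 | ctxt_sample
```

Sorting on the voluntary rate in descending order, with the PID as a tie breaker, puts the processes that switch most often at the top of the list.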


How to use the monitor data
Collecting sample data via monitoring is great; however, data that is not used is worthless and does not justify the cost of running the collector. As collecting the number of context switches has a cost, you need to make sure you really need the data. A couple of ways you can use the data, and the value each can have in your maintenance and support effort, are described below.

Case 1 - Node comparison
This can be useful when you want to compare nodes in a wider cluster. Checking the number of context switches will be one part of a wider set of checks and samples. The number of context switches can be a good datapoint in the overall comparison of what is happening on each node and where the nodes differ.

Case 2 - Version comparison
This can be a good solution when you often deploy new versions (builds / deployments) of code to your systems and want to track subtle changes in how the systems behave.

Case 3 - Outlier detection
Outlier detection can reveal subtle changes in the way the system behaves over time. You can couple this with machine learning to detect such changes. A change in the number of context switches over time can be an indicator of a number of things and can be a pointer for a deeper investigation to tune your code.
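The simplest possible form of outlier detection is a fixed threshold, which can serve as a starting point before anything more advanced is put in place. The sketch below is an illustration only: ctxt_outliers is a hypothetical function name, the limit is an arbitrary value you would have to tune for your own systems, and the field numbers assume the pidstat -w column layout without a UID column.

```shell
# Read `pidstat -w` output on stdin and report every process in the
# "Average:" block whose combined voluntary + non-voluntary switch rate
# is above the limit (in switches per second) given as first argument.
ctxt_outliers() {
    awk -v limit="$1" '
        $1 == "Average:" && $2 != "PID" && ($3 + $4) > limit {
            printf "ALERT pid=%s cmd=%s total=%.2f/s\n", $2, $5, $3 + $4
        }'
}

# Usage on a live system (requires the sysstat package):
#   pidstat -w 2 1 | ctxt_outliers 5
```

A fixed limit will not catch slow drift; comparing each sample against a stored baseline per process would be a natural next step.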

Case 4 - (auto) scaling
The number of context switches, in combination with other datapoints, can be input for scaling the number of nodes up and down. Scaling is generally driven by CPU usage, transaction timing and other metrics. Adding context switching as an additional datapoint can be very valuable.

The site reliability engineering way
You can adopt the above in your SRE (site reliability engineering) strategy as one of the inputs to monitor your systems, automatically detect trends, prevent potential issues and give feedback to developers on how their code behaves in real production deployments.
