Eliminating OS-induced jitter on Linux-like platforms

Jitter in API calls shows up as inconsistent execution times for those calls. For example, the median call to return_loan() may take ~4,500 nanoseconds, while some outliers take 5-10x that median time. What could cause this kind of behavior? One possible cause is OS-induced jitter, to which thread context switches and CPU migrations are two main contributors.

In our internal testing, RTI eliminated jitter above a desired threshold (e.g., 2-3x the median) by ensuring the test program ran with as little interruption from the scheduler as possible.
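To check for the median-versus-outlier pattern in your own application, you can time each call individually and count how many calls are slower than a chosen multiple of the median. The following C sketch shows one way to do that under a few assumptions: placeholder_call() is a hypothetical stand-in for the API being measured (such as return_loan()), timing uses CLOCK_MONOTONIC, and the factor you pass (for example 3.0) is one point in the 2-3x range mentioned above.

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime() */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Current CLOCK_MONOTONIC time in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Hypothetical stand-in for the API call under test (e.g. return_loan()). */
static void placeholder_call(void *arg)
{
    volatile unsigned sink = 0;
    (void)arg;
    for (unsigned i = 0; i < 1000; ++i)
        sink += i;
}

/* Time each call of fn(arg) separately; samples[i] receives the i-th duration. */
static void time_calls(void (*fn)(void *), void *arg, int64_t *samples, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        int64_t start = now_ns();
        fn(arg);
        samples[i] = now_ns() - start;
    }
}

static int cmp_int64(const void *a, const void *b)
{
    int64_t x = *(const int64_t *)a;
    int64_t y = *(const int64_t *)b;
    return (x > y) - (x < y);
}

/* Median of n > 0 samples. NOTE: sorts samples in place. */
static int64_t median_ns(int64_t *samples, size_t n)
{
    assert(n > 0);
    qsort(samples, n, sizeof *samples, cmp_int64);
    return (n % 2 == 1) ? samples[n / 2]
                        : (samples[n / 2 - 1] + samples[n / 2]) / 2;
}

/* Number of samples slower than factor * median (e.g. factor = 3.0). */
static size_t count_outliers(int64_t *samples, size_t n, double factor)
{
    double limit = factor * (double)median_ns(samples, n);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        if ((double)samples[i] > limit)
            ++count;
    return count;
}
```

For example, fill an int64_t array with time_calls(placeholder_call, NULL, samples, n) and then call count_outliers(samples, n, 3.0). Because median_ns() sorts the array in place, save or print the raw samples first if you need them in call order.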

Perf stat

Running a program under perf stat collects statistics such as the number of CPU migrations and context switches. You can use this output to determine whether the tuning steps you’re taking are reducing jitter. On Linux-like platforms, run a program under perf stat as follows:

perf stat <path to program> <arguments to program>

At the end of the program (or when you hit Control-C), perf stat will print statistics like the following:

Performance counter stats for 'objs/x64Linux3gcc4.8.2/myType_publisher 107':

             37.90 msec task-clock          #    0.032 CPUs utilized
               327      context-switches    #    0.009 M/sec
                10      cpu-migrations      #    0.264 K/sec
             2,451      page-faults         #    0.065 M/sec
        59,063,105      cycles              #    1.559 GHz
        64,381,085      instructions        #    1.09  insn per cycle
        15,303,993      branches            #  403.840 M/sec
           380,709      branch-misses       #    2.49% of all branches

       1.184106390 seconds time elapsed

       0.025892000 seconds user
       0.016028000 seconds sys
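The "#" annotations on the right are rates that perf stat normalizes by task-clock, i.e., by the CPU time the program actually used, not by the elapsed wall-clock time. Checking that against the sample output:

$$
\frac{327}{37.90\ \text{ms}} \approx 8.6\times10^{3}\ \text{s}^{-1} \approx 0.009\ \text{M/sec},
\qquad
\frac{10}{37.90\ \text{ms}} \approx 264\ \text{s}^{-1} \approx 0.264\ \text{K/sec},
\qquad
\frac{37.90\ \text{ms}}{1184.1\ \text{ms}} \approx 0.032\ \text{CPUs utilized}
$$

Because the denominator is CPU time, a tuning change that alters how long the program runs also shifts these rates, so comparing the raw context-switches and cpu-migrations counts from runs of the same workload is usually the more direct comparison.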

The numbers of context switches and CPU migrations correlate strongly with the amount of jitter observed when making API calls.
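perf stat reports totals for the whole process, including startup, shutdown, and every thread. To get counts for just the code path that shows jitter, one option (separate from perf, and only a sketch) is to sample the calling thread's own counters: on Linux, getrusage(RUSAGE_THREAD) reports voluntary (ru_nvcsw) and involuntary (ru_nivcsw) context switches, and sched_getcpu() reports the CPU the thread is currently running on. As before, placeholder_call() is a hypothetical stand-in for the API call under test.

```c
#define _GNU_SOURCE /* for RUSAGE_THREAD and sched_getcpu() */
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <sys/resource.h>

struct sched_stats {
    long voluntary_switches;   /* ru_nvcsw delta: thread blocked or yielded */
    long involuntary_switches; /* ru_nivcsw delta: thread was preempted */
    long cpu_changes;          /* observed CPU changes (lower bound on migrations) */
};

/* Hypothetical stand-in for the API call under test. */
static void placeholder_call(void *arg)
{
    volatile unsigned sink = 0;
    (void)arg;
    for (unsigned i = 0; i < 1000; ++i)
        sink += i;
}

/* Run fn(arg) `iterations` times on the calling thread and report how many
 * context switches the thread experienced and how often it was found on a
 * different CPU than after the previous call. Returns 0 on success, -1 if
 * getrusage() fails. */
static int measure_sched_stats(void (*fn)(void *), void *arg, long iterations,
                               struct sched_stats *out)
{
    struct rusage before, after;
    long changes = 0;

    assert(fn != NULL && out != NULL);
    if (getrusage(RUSAGE_THREAD, &before) != 0)
        return -1;

    int last_cpu = sched_getcpu();
    for (long i = 0; i < iterations; ++i) {
        fn(arg);
        int cpu = sched_getcpu();
        if (cpu != last_cpu) {
            ++changes;
            last_cpu = cpu;
        }
    }

    if (getrusage(RUSAGE_THREAD, &after) != 0)
        return -1;
    out->voluntary_switches = after.ru_nvcsw - before.ru_nvcsw;
    out->involuntary_switches = after.ru_nivcsw - before.ru_nivcsw;
    out->cpu_changes = changes;
    return 0;
}
```

Involuntary switches are caused by preemption, so they are the most direct sign that the scheduler interrupted the thread. The CPU-change count only sees the CPU at each sample point and can miss a migration that returns to the original CPU before the next sample, so treat it as a lower bound and rely on perf stat's cpu-migrations for the full count.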

Tuning advice

Several knobs can make the execution of a process more deterministic, which ultimately reduces jitter. While tuning, use perf stat to track the number of context switches and CPU migrations so you can measure your progress. In general, fewer context switches and CPU migrations indicate reduced jitter, which suggests that the tuning change helped.

The items that seem to help the most in RTI’s internal testing are:

  • Increasing thread priority for the thread making the API call where the jitter is observed
    • We can configure this thread to use real-time priorities, e.g. RT priority 99 on Linux
    • How you configure this depends on which thread calls the API in question; for example, if the API is called from the receive thread, use RECEIVER_POOL’s thread settings.
  • Avoiding CPU migrations. On Linux we can use the taskset tool to set the CPU affinity for a process.
    • e.g. taskset -c 3 <path to program>
      • Sets the program to run on CPU 3
    • Connext also provides a QoS setting for configuring CPU affinity on its internal threads; for example, RECEIVER_POOL’s thread settings can set the CPU affinity to core 3.
  • Isolating the CPU core being used, so that no process other than the one calling the API in question runs on that core.
    • On general Linux platforms, the isolcpus kernel boot parameter serves this purpose.
    • In RedHawk, a real-time Linux OS, the concept of "CPU shielding" makes this easy to set up dynamically without using isolcpus. In our testing on RedHawk, CPU shielding was very effective at reducing jitter. To shield a CPU, run the shield command, e.g.:
      • shield -a 3
        • This command sets the shielding mask on CPU 3 for all possible shielding attributes. See “man shield” for more information.
        • After running this shield command, run the program you’re testing on the shielded CPU using taskset or CPU affinity, so that it is the only process running on that CPU.
  • Avoid CPU 0 and CPU 1 when trying to reduce jitter, because these cores tend to be busier than the other cores. We recommend trying different cores to find the one that gives the best performance.
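The taskset command acts on a whole process, and the RECEIVER_POOL settings act on Connext's internal threads. When the jittery call is made from one of your own application threads, you can apply the same two settings, CPU affinity and real-time priority, to just that thread from code. The following is a minimal Linux-only C sketch, assuming glibc's pthread_setaffinity_np() (a GNU extension) and a process that has the privileges needed for real-time scheduling (root, CAP_SYS_NICE, or a sufficient RLIMIT_RTPRIO); without them, pthread_setschedparam() fails with EPERM. The function names are hypothetical, and the CPU number and priority are parameters, e.g., CPU 3 and RT priority 99 as in the list above.

```c
#define _GNU_SOURCE /* for pthread_setaffinity_np() and the CPU_* macros */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <string.h>

/* Pin the calling thread to a single CPU: a per-thread equivalent of
 * `taskset -c <cpu>`. Returns 0 on success or an errno value, e.g. EINVAL
 * if the CPU number is out of range or not usable by this thread. */
static int pin_current_thread(int cpu)
{
    cpu_set_t set;

    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return EINVAL;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

/* Move the calling thread to the SCHED_FIFO real-time policy at the given
 * priority (1..99 on Linux). Returns 0 on success or an errno value; EPERM
 * means the process lacks the privilege to use real-time priorities. */
static int make_current_thread_realtime(int priority)
{
    struct sched_param param;

    memset(&param, 0, sizeof param);
    param.sched_priority = priority;
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
}

/* Apply both settings, pinning first so the thread is already on its
 * target CPU before it runs at real-time priority. */
static int tune_current_thread(int cpu, int priority)
{
    int rc = pin_current_thread(cpu);
    if (rc != 0)
        return rc;
    return make_current_thread_realtime(priority);
}
```

For example, call tune_current_thread(3, 99) at the start of the thread that makes the API call and check the return value. For a whole process, util-linux's chrt applies the same policy from the command line (e.g., chrt -f 99 taskset -c 3 <path to program>). A thread that never blocks at SCHED_FIFO 99 can monopolize its CPU, which is another reason to pair these settings with an isolated or shielded core.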