This morning I attended a two-hour presentation by Ankita Garg of IBM, Bangalore. The event was organized by ACM Bangalore at the premises of Computer Society of India (CSI), Bangalore Chapter on Infantry Road. It was about making Linux into an enterprise real-time OS. Here I present a summary of the talk. If you are more curious, you can download the complete presentation.
How does one define a real-time system? It’s all about guarantees made on latency and response times. Guarantees are made about an upper bound on the response time. For this to be possible the run-time behaviour of the system must be optimized by design. This leads to deterministic response times.
We have all heard of soft and hard real time systems. The difference between them is that the former is tolerant to occasional lapse in the guarantees while the latter is not. A hard real-time that doesn’t meet its deadline is said to have failed. If a system can meet its deadline 100% of the time then it can be formally called a hard real-time system.
Problems with Linux
Linux was not designed for real-time behaviour. For example, when a system call is made, kernel code is executed. Response time cannot be guaranteed because during its execution an interrupt could occur. This introduces non-determinism to response times.
The scheduling in Linux tries to achieve fairness while considering the priority of tasks. This is the case with priority based Round Robin scheduling. Scheduling is unpredictable because priorities are not always strict. A lower priority process could be running at times until the scheduler gets a chance to reschedule a higher priority process. Kernel is non-pre-emptive. When kernel is executing critical sections interrupts are disabled. One more problem is the resolution of timer which is at best 4 ms and usually 10 ms.
Solutions in the Real-Time Branch
A branch of development has been forked from the main branch. This is called CONFIG_PREEMPT_RT. This contains enhancements to support real-time requirements. Some of these enhancements have been ported to the main branch as well.
One important change is on spin locks. These are more like semaphores. Interrupts are not disabled so that these spin locks are pre-emptive. However, spin locks can be used in the old way as well. It all depends if the APIs are called for spinlock_t or raw_spinlock_t.
The sched_yield has a different behaviour too. A task that calls this is added back to the runnable queue but not at the end of the queue. Instead it is added to its right level of priority. This means that a lower priority process can face starvation. If such a thing does happen, it is only because design is faulty. Users need to consider setting correct priorities to their tasks. There is still the problem of priority inversion which is usually overcome using priority inheritance.
There is also the concept of push and pull. In a multiprocessor system, decision has to be made about the CPU where a task will run. A task waiting in the runnable queue of a particular CPU can be pushed or pulled to another CPU depending on tasks just completed.
Another area of change is IRQ. IRQ is kept simple while the processing is moved to an IRQ handler. There was some talk on soft IRQ, top half and bottom half, something I didn’t understand. I suppose these will be familiar to those who have worked on interrupt code on Linux.
In plain Linux, timers are based on the OS timer tick. This does not give high resolution. High resolution is achieved by using programmable interrupt timers, which requires support from hardware. Thus timers are separated from the OS timer ticks.
Futex is a new type of mutex that is fast if uncontested. It happens in the user space. Only if the mutex is busy it goes to kernel space and it takes the slower route.
In IBM, the speaker mentioned the tools she had used to tune the system for real-time: SystemTap, ftrace, oprofile, tuna
Other than what’s been discussed above, other solutions are available – RTLinux, L4Linux, Dual OS/Dual Core and using Virtual Logic, timesys… There was not a lot of discussion about these implementations.
Why is real-time behaviour important for enterprises? This is because enterprises make guarantees through Service Level Agreements (SLA). They guarantee certain maximum delays which can only be achieved on an RTOS. The greater issue here is that such delays are not limited to the OS. Delays are as perceived by users. This means that delays at the application layer have to be considered too. It is easy to see that designers have to first address issues of real-time at the OS level before considering the same at the application layer.
The presenter gave application examples based on Java. Java, rather than C or C++, is more common for enterprise solutions these days than perhaps a decade ago. The problem with Java is that there are factors that introduce non-determinism:
Garbage collection: runs when system is low on free memory
Dynamic class loading: loads when the class is first accessed
Dynamic compilation: compiled once when required
Solutions exist for all these. For example, the garbage collector is made to run periodically when the application task is inactive. This makes response times more deterministic. Static class loading can be performed in advance. Instead of just-in-time (JIT) compilation, ahead-of-time compilation can be done – this replaces the usual *.jar files with *.jxe files. Some of these are part of IBM’s solution named WebSphere Real-Time.
There is wider community that is looking at RTSJ – Real-Time Specifications for Java.
Real-time guarantee is not just about the software. It is for the system as a whole. Hardware may provide certain functionality to enable real-time, as we have seen for the case of higher resolution of timers. Since real-time behaviour is about response times, in some cases performance may be compromised. Certain tasks may be slower but this is necessarily so because they there far more important tasks that need a time guarantee. There is indeed a trade-off between performance and real-time requirement. On average, real-time systems may not have much better response times than normal systems. However, building a real-time system is not about averages. It is about an absolute guarantee that is far difficult to meet on a non-real-time system.