With the coming of age of packet-switched technology in preference to circuit-switched connections, many network systems are already based on soft-switch architecture. The fundamental concept in this architecture is that switching happens mostly in software. The role of software was previously minimal because switching used to happen in hardware. Connection circuits used to be established at the start of a call and torn down at call termination. With soft-switch architecture, the role of software is much more. Software therefore needs to be much more robust. Creating robust software involves many things but before that designers need to understand the requirements of such a software.
Yesterday at the British Council Library in Bangalore I picked up a wonderful book: Robust Communications Software by Greg Utas, published by John Wiley & Sons. The bulk of the book deals with the questions of “how” but the first chapter outlines the requirements of a carrier-grade system. I summarize these requirements with thoughts of my own.
Figure 1 shows the five basic attributes of a carrier-grade system. Although all are important in their own way, they are not necessarily of the same level of importance. It may be the case that reliability is more important that scalability. It may be that having a highly scalable system may require some compromise in terms of capacity. Some of these issues will be explained later in this post.
While I was working in the UK, one morning the office phone system was down for about a couple of hours. During that time, there was no way for customers to reach either the support staff or the sales staff. The loss to the business was somewhat mitigated because some customers had the mobile contacts of our staff. Internally, e-mails served well during that interim period. I don’t know if we had a softswitch architecture in place. Either way, availability means that public communication sytems should be operational 24/7 all days of the week. Imagine someone trying to place an emergency call only to find that the system is down.
Likewise, on another occasion, I was at a seminar and the presenter failed to give his demo to the audience. The reason was that there had been a planned power shutdown at his office, of which he had been unaware. Shut downs should be minimized and in some cases eliminated. Where shutdowns are necessary for upgrades and regular maintenance, alternative systems should be operational during the shutdown phase.
A rule of thumb is to achieve an availability of five-nines: the system is available 99.999% of the time. This equates to a stringent downtime of 6 seconds in a week or 5 minutes 15 seconds in a year. System availability is dependent on the availability of its components. A chain is only as strong as its weakest link. Thus, if a system needs five-nines availability, then the software should provide six-nines availability and the hardware should provide six-nines availability (0.999999 x 0.999999 = 0.999998).
This implies that the system does not breakdown, behave erratically or respond in unexpected ways. The system should perform its job as expected. It is functionally correct and of a certain guaranteed quality. Availability and reliability when taken together make for robustness of the system.
Some months ago I was writing a piece of software for decoding MAC-hs PDUs. These are PDUs received on the DL on HSDPA transport channel HS-DSCH. One of the requirements from this decoder was that it should be robust. This meant that given an illegal combination of input stream and channel configuration, the decoder does not crash the system. It meant that the decoder identifies the set of possible errors and takes corrective/preventive actions as appropriate. This meant that given any random input (used for testing robustness), the decoder does a best-effort decoding without crashing the thread.
In telephone systems, reliability means that calls are handled as expected – no wrong numbers, no premature call termination, no temporary loss of the link, no crosstalk, no loss of overall quality. A typical goal is achieve four-nines reliability: only one in 10,000 calls is mishandled. However, for financial transactions the requirement is a lot more stringent.
This is important from the outset but this is often overlooked. When an new cellular operator sources for equipment, his requirement may be only a dozen base stations to cover a city. His expected subscriber base may be no more than a million. What happens if his growth is phenomenal? What if within the first year of operation he has to expand his network, improve coverage and capacity, and satisfy more than 10 million subscribers? He would like to scale his current system rather than purchase a new and bigger one.
Scalability is often in reference to architecture. A system that has five units can scale to fifty easily because the architecture allows for it. On the other hand, a system designed specifically for five units cannot scale to fifty because its architecture is inadequate. Changing system architecture in a fundamental during system resizing is not scalability. Scalability implies resizing with minimal effort. The effort is minimal because the architecture takes care of the resizing. There is no extra design or development, only deployment and operation.
Scalability applies downwards as well. For example, a switch may have 100 parallel mobules operating at full capacity. At off-peak times, it should be possible to power down 80 modules and run only on the remaining 20. I refer to this as operational scalability. Overall, scalability is a highly attractive feature for any customer. The architecture may not necessarily be fixed. It could be configured appropriately to suit the purpose at a particular scale.
As an example, I recently came across a system with multiple threads and synchronous messaging from each thread. The system was not scalable for two reasons:
- Each user had a dedicated thread, all running on the same CPU. They were all doing the same task, only for different users. This is acceptable for a few users but with thousands of users, the overhead from context-switching will be high. A distributed architecture might also suit.
- Synchronous communication meant that the thread blocked while waiting for a reply from an external entity. Such idle times are not a problem for a lightly loaded system. When thousands of users are involved, asynchronous messaging must be preferred to make best use of idle times. What this means is that requests and responses are pipelined. Requests could be on one thread and responses processed in another thread.
One easy way to explain capacity is to look at the PC market. Intel 386 is far below in terms of performance when compared to Intel Pentium 4. However, if we consider the overall system capacity we will realize that the improvement is less than what the numbers suggest. As processors have got better, so too is the demand from applications. Today’s applications are probably running at about the same speed as yesterday’s applications. Only thing that’s really improved is the user experience. So, has the capacity of processors really improved?
In fact, capacity is closely linked to scalability. What if the system can scale if capacity drops due to increased overheads? What if the system can scale if lot more processors are required than what the competition can offer?
This relates to maintenance and upgrade of software. No system today is quite simple. Complex systems must contain intelligent partitions and well-defined interfaces. These are generally managed by large teams. While no single person will grasp the details of the entire system, he will be an expert in his own component or sub-system. Activities happen in parallel – bug fixes, new feature releases, modifying requirements or patches. Good documentation is needed. Good processes leading to better productivity will increase the lifespan of a software system.
Some elements of a good software architecture are:
- well-defined interfaces
- layering (vertical separation)
- partitioning (horizontal separation)
- high cohesion (within components)
- loose coupling (between components)
In conclusion, making a carrier-grade system does not happen right at the start. It is a process that involves continuous improvement. A good design is essential. Such a design will eventually meet all the five requirements that we have discussed. There are necessary choices to be made along the way. For example, a layered architecture suits well for large systems managed by large teams. It has good productivity but some parts of the system could have been of a higher capacity had they been of a leaner design. Likewise, application frameworks seek to provide a uniform API for applications. This increases productivity but at the expense of capacity.