Monday, March 29, 2010

Designing a Self-Healing Mechanism as a Layered Architecture

This post is a sequel to my previous post on achieving Reliability with Self-Healing Systems. Here, I go a step further and describe the architectural design of the Robot example we discussed in that post.

A robot comprises both hardware and software components. The hardware components are the camera, the sensors and the mobility components. Face Recognition and Obstacle Detection constitute the main software components of the robot. The data captured by the camera is sent to the Face Recognition component, which processes it and identifies the intended recipient. Similarly, the data sent by the sensors is processed by the Obstacle Detection component to identify objects on the path. For the robot application to perform successfully, all these software and hardware components have to operate together.

The robot application described above is just one example. Many such robot applications are designed and developed for various purposes, and each is composed of several components. So instead of providing a Self-Healing mechanism for each robot application, it is more appropriate to design one for the robot platform, which supports the design, implementation and execution of all component-based robot applications developed on it.

Below we study the software architecture of one such component-based robot platform. The platform is designed as a layered software architecture, structured into a service layer and a self-management layer. The service layer contains executors (threads) that run the various hardware and software components periodically within their defined cycle times, whereas the self-management layer monitors the executors and self-manages their failures. Figure 1 depicts the executors in the service layer, and the monitor and executor repair manager in the self-management layer. The monitor is composed of the monitor listener and the monitor handler, and it detects executors that violate their cycle time requirements; the executor repair manager repairs the failed executors.


Failure Situation: An Executor (a software thread) is assigned only components that share the same cycle time. In the figure, two camera components with a cycle time of 50 ms are executed by Executor 1, and the Face Recognition component and two Obstacle Detection components, each with a cycle time of 100 ms, are executed by Executor 2. The requirement is that each camera component is executed once every 50 ms, so Executor 1 must be able to execute both camera components within every 50 ms cycle. Similarly, Executor 2 has to execute its three components (one Face Recognition and two Obstacle Detection) once every 100 ms. If an Executor fails to meet this time constraint, it is considered failed; the components it executes then won’t operate as expected, and the robot malfunctions.
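
To make the assignment rule concrete, here is a minimal Python sketch of grouping components by cycle time, one executor per group. The Component class, the assign_executors function and the "A"/"B" component names are hypothetical and used only for illustration; the post does not describe how the platform actually performs the assignment.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    cycle_ms: int  # required period: run once every cycle_ms milliseconds


def assign_executors(components):
    """Group components so that each executor holds a single cycle time."""
    groups = defaultdict(list)
    for component in components:
        groups[component.cycle_ms].append(component)
    # Numbering executors by increasing cycle time is an arbitrary choice
    # made here so the result matches the figure's Executor 1 / Executor 2.
    ordered = sorted(groups.items(), key=lambda item: item[0])
    return {f"Executor {i}": group for i, (_, group) in enumerate(ordered, start=1)}


components = [
    Component("Camera A", 50),
    Component("Camera B", 50),
    Component("Face Recognition", 100),
    Component("Obstacle Detection A", 100),
    Component("Obstacle Detection B", 100),
]
executors = assign_executors(components)
```

With the five components from the figure, this puts the two camera components in Executor 1 and the three 100 ms components in Executor 2.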

Interaction between Service Layer and Self-Healing Layer: All the executors are monitored by the self-healing layer (the self-management layer described above). After executing each component, the Executor has to notify the Monitor with the details of the component executed. This interaction happens through a message queue. The executor puts the message at one end of the queue, and the Executor Monitor Listener retrieves it at the other end and stores the executed component information in the component sequence table and the time table.
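
The queue-based notification could look roughly like the following Python sketch. It uses the standard library queue.Queue; the message fields, the notify and drain_messages functions, and the layout of the two tables are assumptions, since the post does not specify them.

```python
import queue
import time

# Shared queue: executors put messages at one end,
# the monitor listener retrieves them at the other.
messages = queue.Queue()

sequence_table = {}  # executor name -> components executed so far (assumed layout)
time_table = {}      # executor name -> time of the latest completion (assumed layout)


def notify(executor, component, finished_at):
    """Called by an executor after it finishes running one component."""
    messages.put(
        {"executor": executor, "component": component, "finished_at": finished_at}
    )


def drain_messages():
    """Monitor listener: move every pending message into the two tables."""
    while True:
        try:
            msg = messages.get_nowait()
        except queue.Empty:
            return
        sequence_table.setdefault(msg["executor"], []).append(msg["component"])
        time_table[msg["executor"]] = msg["finished_at"]


notify("Executor 1", "Camera A", time.monotonic())
notify("Executor 1", "Camera B", time.monotonic())
drain_messages()
```

Because the executors only put messages and never wait on the monitor, a slow monitor does not add to an executor's cycle time; that decoupling is presumably the reason for using a queue here.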

Detecting the Failed Executor: When an Executor starts a cycle, the start time of the cycle and the total cycle duration are recorded in the time table. The Executor Monitor Handler is responsible for checking whether the Executor’s current cycle can be completed within the cycle duration. A violation means the Executor has failed, and the Executor Repair Manager has to be notified immediately.

Referring to the diagram, let’s assume Executor 2 fails while executing its first component, the Face Recognition component. This means the Face Recognition component took longer than expected to execute, and as a result the other components assigned to Executor 2 could not be executed within the cycle time.
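
A simple version of the handler's check can be sketched in Python as follows. It only flags a cycle once it has already overrun its duration; the post's handler may well predict a violation earlier, so treat this as one possible reading. CycleEntry and find_failed are hypothetical names, and the timestamps are made-up values in milliseconds.

```python
from dataclasses import dataclass


@dataclass
class CycleEntry:
    start_ms: float     # when the executor started its current cycle
    duration_ms: float  # total time allowed for the cycle
    completed: bool = False


def find_failed(time_table, now_ms):
    """Return the executors whose current cycle overran its allowed duration."""
    return [
        name
        for name, entry in time_table.items()
        if not entry.completed and now_ms - entry.start_ms > entry.duration_ms
    ]


time_table = {
    "Executor 1": CycleEntry(start_ms=0, duration_ms=50, completed=True),
    "Executor 2": CycleEntry(start_ms=0, duration_ms=100),
}
failed = find_failed(time_table, now_ms=120)
```

Here Executor 2 started its 100 ms cycle at time 0 and has still not finished at 120 ms, so it is the only executor reported as failed.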

Repair and Recovery Process: The failed executor is clearly not capable of executing all its assigned components within the given cycle time, so the components are split between it and a new executor. All the components from the first component up to and including the failed component are assigned to a new Executor (thread), and the remaining components continue to run in the original executor after it is restarted in the next cycle. In this example the first component itself failed, so it is removed from Executor 2 and assigned to a new executor (Executor 3). The workload of one executor is now split between two, which makes it more likely that all the components execute within their cycle time. If a further failure is detected, the splitting process is repeated.
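
The splitting rule itself is small enough to sketch in Python. The function name split_on_failure and the list-of-names representation are assumptions made for illustration; restarting the original executor and starting the new thread are left out.

```python
def split_on_failure(components, failed_index):
    """Split an executor's component list after a cycle-time failure.

    Components from the first one up to and including the failed one move
    to a new executor; the rest stay with the original executor.
    """
    if not 0 <= failed_index < len(components):
        raise IndexError("failed_index must point at one of the components")
    moved = components[: failed_index + 1]
    kept = components[failed_index + 1 :]
    return moved, kept


executor_2 = ["Face Recognition", "Obstacle Detection A", "Obstacle Detection B"]
executor_3, executor_2 = split_on_failure(executor_2, failed_index=0)
```

After the call, Executor 3 holds only Face Recognition and Executor 2 keeps the two Obstacle Detection components. If the failed component is the last one in the list, every component moves and the original executor is left empty; the post does not say how that case is handled.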

The approach described above has been prototyped and tested under practical conditions, and it showed a significant improvement in the performance of robot applications.

Your feedback and suggestions are greatly appreciated.

Source: Self-Healing component executors for OPRoS, Dr. Michael Shin, Hemanth Gowda, Taeghyun Kang, Texas Tech University and ETRI, South Korea.

Friday, March 5, 2010

Reliability with Self-Healing Mechanism

Reliability is the ability of a system to perform its operations in routine circumstances as well as in hostile or unexpected ones. Normally, while a system is being developed, attention is focused mainly on ensuring optimum performance under normal circumstances. Hostile scenarios are highly challenging to simulate because of their unpredictable nature, and most of them are understood only when they are encountered.

A Self-Healing mechanism is a technology that considers the worst-case scenarios a system could face and addresses them so that the system still delivers the desired performance. Self-healing means that failed objects cure their own problems and get back to work. To cure a problem, the mechanism first has to identify it, so a Self-Healing mechanism is composed of automatic error detection and recovery mechanisms.

Before looking at other, more detailed aspects of Self-Healing systems, let us understand the problem more clearly, and see how self-healing mechanisms already solve it to a large extent in the computer software we use every day.

Example 1: Consider the MS Word application. You start writing some notes and, in the process, forget to save the document. After you have written several pages, Word closes unexpectedly. This could happen for several reasons: a power outage, the Word process being killed by an external process, a system shutdown caused by a hardware failure, and so on. You would expect the end result to be the loss of some valuable data, as well as of the time spent on it.

Fortunately, this is not the case: Word has a self-healing mechanism that recovers unsaved or lost data. When the system or application resumes after the unexpected failure and you restart Word, a recovery window pops up showing all the documents that were recovered. If you open the document you hadn’t saved, you will be surprised to find most of the data you entered.

Example 2: Consider robots, which are increasingly used at home, in industry and in the military for a wide variety of tasks. Suppose a robot application is to deliver water from a kitchen to a human being. This application can be composed of several software components, including Camera, Face Recognition, Obstacle Detection, and Mobility components. The Camera component repeatedly takes pictures of the path between the kitchen and the human being, sending them to both the Face Recognition component and the Obstacle Detection component. The Face Recognition component analyzes the camera data to recognize the human being, whereas the Obstacle Detection component uses the data to detect obstacles on the path.

As the example above shows, a robotic application is composed of several components that together accomplish a particular task. If any one of these components fails, the task is not completed as expected. In this kind of adverse situation, a Self-Healing mechanism that monitors all the components in the robot application will immediately analyze the problem in the failed component and automatically take measures to repair and recover it. This greatly improves the performance of robotic applications even in adverse situations.

In summary, whether the system is a widely used application such as Word or something more sophisticated such as a robot, a Self-Healing mechanism adds reliability to the system and makes it more robust. The technology is still immature, however: you find this feature only in some widely used, popular products or in some critical applications.

In this post, I have just discussed the reliability issue in software systems for which Self-Healing Mechanism could be a possible solution.

In my next post, Designing a Self-Healing Mechanism as a Layered Architecture, I’ll discuss the design and implementation details of a Self-Healing system.