Monday, March 29, 2010

Designing a Self-Healing Mechanism as a Layered Architecture

This post is a sequel to my previous post on achieving Reliability with Self-Healing Systems. Here, I go a step further and describe the architectural design of the Robot example we discussed in that post.

A robot comprises both hardware and software components. The hardware components here are the camera, the sensors and the mobility components. Face recognition and obstacle detection are the main software components. Data captured by the camera is sent to the face recognition component, which processes it and identifies the desired recipient. Similarly, data from the sensors is processed by the obstacle detection component to identify objects in the robot's path. For the robot application to work, all of these hardware and software components have to operate together.

The robot application described above is just one example. Many such applications are designed and developed for various purposes, and each comprises several components. So instead of building a self-healing mechanism into each robot application, it is more appropriate to design one for the robot platform, which supports the design, implementation and execution of all component-based robot applications developed on it.

Below we study the software architecture of one such component-based robot platform. The platform is designed as a layered software architecture, structured into a service layer and a self-management layer. The service layer contains executors (threads) that run the various hardware and software components periodically within defined cycle times, whereas the self-management layer monitors the executors and self-manages their failures. Figure 1 depicts the executors in the service layer, and the monitor and the executor repair manager in the self-management layer. The monitor is composed of the monitor listener and the monitor handler, which together detect executors that violate their cycle time requirements; the executor repair manager handles the repair of failed executors.


Failure Situation: An executor (a software thread) is assigned only components that share the same cycle time. In the figure, two camera components with a cycle time of 50 ms are executed by Executor 1, and the face recognition component and two obstacle detection components, each with a cycle time of 100 ms, are executed by Executor 2. Each camera component must be executed once every 50 ms, so Executor 1 must be able to execute both camera components within every 50 ms cycle. Similarly, Executor 2 has to execute its three components (one face recognition and two obstacle detection) once every 100 ms. If an executor fails to meet this time constraint, it is considered failed; the components it executes then do not operate as expected, and the robot malfunctions.
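The grouping rule and the time constraint above can be sketched in a few lines of Python. This is only an illustration, not the platform's actual code: the names `Component`, `Executor`, `assign_executors` and `meets_deadline` are hypothetical, and the sketch assumes that an executor runs its components one after another, so the cycle is met when the summed run times fit into the cycle time.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    name: str
    cycle_ms: int  # the component must be executed once per cycle of this length


@dataclass
class Executor:
    executor_id: int
    cycle_ms: int
    components: list = field(default_factory=list)


def assign_executors(components):
    """Create one executor per distinct cycle time (hypothetical helper)."""
    by_cycle = defaultdict(list)
    for comp in components:
        by_cycle[comp.cycle_ms].append(comp)
    return [
        Executor(executor_id=i, cycle_ms=cycle_ms, components=comps)
        for i, (cycle_ms, comps) in enumerate(sorted(by_cycle.items()), start=1)
    ]


def meets_deadline(executor, run_times_ms):
    """Assumes sequential execution: the summed run times must fit in one cycle."""
    total = sum(run_times_ms[c.name] for c in executor.components)
    return total <= executor.cycle_ms
```

With the five components from the figure, `assign_executors` yields one executor with a 50 ms cycle for the two cameras and one with a 100 ms cycle for the other three components. The run times passed to `meets_deadline` are made-up inputs, not measurements.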

Interaction between the Service Layer and the Self-Healing Layer: All executors are monitored by the self-healing layer (the self-management layer described above). After executing each component, an executor has to notify the monitor with the details of the component it executed. This interaction happens through a message queue. The executor puts a message at one end of the queue, and the Executor Monitor Listener retrieves it at the other end and stores the executed component's information in the component sequence table and the time table.
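A minimal sketch of this hand-off, using Python's standard `queue.Queue` as the message queue. The message fields and the layout of the two tables are assumptions made for illustration; the source does not specify them.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutedMessage:
    # Hypothetical message format; the real fields are not given in the source.
    executor_id: int
    component: str
    finished_at_ms: float


class MonitorListener:
    def __init__(self, message_queue):
        self.message_queue = message_queue
        self.sequence_table = {}  # executor_id -> components executed, in order
        self.time_table = {}      # executor_id -> finish time of the latest component

    def drain(self):
        """Handle every message currently queued and return how many were handled."""
        handled = 0
        while True:
            try:
                msg = self.message_queue.get_nowait()
            except queue.Empty:
                return handled
            self.sequence_table.setdefault(msg.executor_id, []).append(msg.component)
            self.time_table[msg.executor_id] = msg.finished_at_ms
            handled += 1
```

An executor thread would call `message_queue.put(ExecutedMessage(...))` after each component. In a running system the listener would more likely block on `get()` in its own thread; `drain` is used here only to keep the example deterministic.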

Detecting the Failed Executor: When an executor starts a cycle, the start time of the cycle and the total cycle duration are recorded in the time table. The Executor Monitor Handler is responsible for checking whether the executor's current cycle can be completed within the cycle duration. A violation means the executor has failed, and the Executor Repair Manager must be notified immediately.
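The check can be sketched as a comparison of the current time against each cycle's deadline. The `CycleEntry` record, the `completed` flag and the function name are assumptions for illustration. The sketch also detects a violation only after the deadline has passed, which is a simplification of checking whether the cycle "can be completed" in time.

```python
from dataclasses import dataclass


@dataclass
class CycleEntry:
    start_ms: float          # start time of the executor's current cycle
    duration_ms: float       # total cycle duration
    completed: bool = False  # assumed flag: set once all components have run


def find_failed_executors(time_table, now_ms):
    """Return the ids of executors whose unfinished cycle has overrun its duration."""
    return sorted(
        executor_id
        for executor_id, entry in time_table.items()
        if not entry.completed and now_ms > entry.start_ms + entry.duration_ms
    )
```

The ids returned here are what the monitor handler would pass on to the executor repair manager.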

Referring to the diagram, let's assume Executor 2 fails while executing its first component, the face recognition component. This means the face recognition component took too long to execute, so the other components assigned to Executor 2 could not be executed within the cycle time.

Repair and Recovery Process: The failed executor is evidently not capable of executing all of its assigned components within the given cycle time, so its components are split and some are assigned to a new executor. All components from the first component up to and including the failed component are assigned to a new executor (thread); the remaining components continue to run in the original executor, which is restarted in the next cycle. In this example the first component itself failed, so it is removed from Executor 2 and assigned to a new executor (Executor 3). The workload of one executor is now split between two, which makes it more likely that all components execute within their cycle time. If a further failure is detected, the splitting process is applied again.
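The splitting rule can be written as a single list operation. This is a sketch of the rule as described above, not the repair manager's real code; `split_executor` is a hypothetical name, and components are represented by plain strings.

```python
def split_executor(components, failed_component):
    """Split an executor's ordered component list at the failed component.

    Returns (moved, kept): `moved` holds everything from the first component up to
    and including the failed one and goes to a new executor; `kept` stays on the
    original executor.
    """
    idx = components.index(failed_component)  # ValueError if it is not assigned here
    return components[: idx + 1], components[idx + 1:]
```

For Executor 2 in the example, splitting at the face recognition component moves only that component to Executor 3 and leaves the two obstacle detection components where they are. Note that if the last component in the list fails, this rule moves everything and leaves the original executor empty; the source does not say how that case is handled.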

The above approach has been prototyped and tested under practical conditions, where it showed a significant improvement in the performance of robot applications.

Your feedback and suggestions are greatly appreciated.

Source: "Self-Healing component executors for OPRoS", Dr. Michael Shin, Hemanth Gowda, Taeghyun Kang, Texas Tech University and ETRI, South Korea.
