gLite's Logging and Bookkeeping (L&B) service is a well-established tool for monitoring processes in the grid, typically compute jobs. It collects event information from various grid elements and aggregates it to determine the current status of any such process at a given moment. Aside from various types of compute jobs, L&B has also been used to monitor the states of sandbox transfers. With a large portion of the grid infrastructure becoming virtualized, a virtual machine is becoming a "process" just like a normal job, with a distinct lifetime, a state diagram with well-defined state transitions, and a similar requirement for status monitoring. Once the similarity is recognized, it follows that the same tools can do the monitoring. Using the same tool to monitor both virtual machines and the compute jobs running on them has a further advantage: the relationship between the two can be tracked at the same time, which simplifies troubleshooting.
Link for further information
Wider impact of this work
Although the solution has been prototyped with OpenNebula, it is intended to work with other widely used virtualization stacks as well. There are multiple options for instrumenting event delivery: the L&B client library provides C and Java bindings, and command-line tools are available for use from scripting languages. Some virtualization solutions, OpenStack for instance, already use messaging internally. Tapping into that source of information and using it "as is" to monitor the stack's processes is another opportunity for further investigation.
Using a common monitoring tool is advantageous for heterogeneous infrastructures relying on multiple virtualization solutions at once. L&B can become a common point of reference, or a source of unified-format notifications to be used by other elements for further processing.
Description of the work
The set of "job" types L&B recognizes has been extended with two additional types: one covering the states of an individual virtual machine (VM) and one corresponding to a collection of virtual machines, representing a cluster. All the other mechanisms, such as event delivery, storage, the querying API, notifications and statistics, are already in place in L&B. A single instance of an L&B server in its standard setup can be used to monitor compute jobs and VMs at the same time. The basic VM state diagram is a simplified version of the state diagram defined by the OpenNebula toolkit.
A prototype has been set up with OpenNebula as the virtualization stack. OpenNebula provides call-out hooks on major events, which makes it easy to instrument L&B event delivery. Events can be delivered either over L&B's legacy event delivery chain or through messaging (STOMP/OpenWire), which leaves potential adopters with a choice of mechanism. The L&B team is also looking at other messaging protocols applicable to event delivery.
L&B events can be sent not only from the virtualization stack, but also from Dom0 or even from within the virtual machine itself. This makes some of them partly redundant, but the redundancy builds on one of L&B's existing strengths: redundant events make state determination more reliable in case of localized failures, and distinguishing between similar events received from different sources allows fine-grained status monitoring.
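The benefit of redundant events can be illustrated with a minimal sketch. It is not L&B code: the state names, event kinds and source labels below are hypothetical placeholders, and the rule used (the VM is in the furthest lifecycle state that any received event indicates) is a simplifying assumption rather than L&B's actual state machine.

```python
from dataclasses import dataclass

# Hypothetical, simplified lifecycle; NOT the actual L&B or OpenNebula state names.
STATES = ["PENDING", "BOOT", "RUNNING", "SHUTDOWN", "DONE"]

# Hypothetical mapping from event kind to the state that event implies.
EVENT_TO_STATE = {
    "create": "PENDING",
    "host_boot": "BOOT",
    "running": "RUNNING",
    "shutdown": "SHUTDOWN",
    "done": "DONE",
}


@dataclass(frozen=True)
class Event:
    vm_id: str
    kind: str
    source: str  # hypothetical labels: "stack", "dom0" or "guest"


def vm_status(events):
    """Return (state, sources): the furthest state indicated by any event,
    and the set of sources whose events confirm that state."""
    best_rank = -1
    confirmed_by = set()
    for ev in events:
        state = EVENT_TO_STATE.get(ev.kind)
        if state is None:
            continue  # unknown event kinds are ignored in this sketch
        rank = STATES.index(state)
        if rank > best_rank:
            best_rank = rank
            confirmed_by = {ev.source}
        elif rank == best_rank:
            confirmed_by.add(ev.source)
    if best_rank < 0:
        return None, set()
    return STATES[best_rank], confirmed_by


# The stack's own "running" event is missing (for example, lost in transit),
# yet events from Dom0 and from inside the guest still establish the state.
received = [
    Event("vm-1", "create", "stack"),
    Event("vm-1", "running", "dom0"),
    Event("vm-1", "running", "guest"),
]
state, sources = vm_status(received)
```

Keeping the set of confirming sources alongside the state is what would allow the finer-grained distinction mentioned above, for instance telling "running according to Dom0 only" apart from "running and confirmed from within the guest".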
By abstraction, the VM state diagram is also applicable to physical machines. Since L&B can record relationships between the individual processes it monitors, such as compute jobs and sandbox transfers, or compute jobs and virtual machines, it can also report on the relationship between a virtual machine and the physical machine hosting it. Infrastructure administrators often find this information useful.
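As a rough illustration of why such relationships help administrators, the sketch below keeps a simple "runs on" index between jobs, virtual machines and physical hosts. The class, its methods and all identifiers are hypothetical and do not correspond to the L&B API or its data model.

```python
from collections import defaultdict


class RelationshipIndex:
    """Hypothetical in-memory index of 'runs on' relationships."""

    def __init__(self):
        self._parent = {}                  # entity -> what it runs on
        self._children = defaultdict(set)  # entity -> what runs on it

    def link(self, child, parent):
        """Record that `child` runs on `parent`."""
        self._parent[child] = parent
        self._children[parent].add(child)

    def chain(self, entity):
        """Follow 'runs on' links upwards, e.g. job -> VM -> physical host."""
        path = []
        seen = {entity}
        while entity in self._parent:
            entity = self._parent[entity]
            if entity in seen:  # guard against accidental cycles
                break
            seen.add(entity)
            path.append(entity)
        return path

    def affected_by(self, entity):
        """Return everything that directly or indirectly runs on `entity`."""
        affected = set()
        stack = [entity]
        while stack:
            for child in self._children.get(stack.pop(), ()):
                if child not in affected:
                    affected.add(child)
                    stack.append(child)
        return affected


index = RelationshipIndex()
index.link("job-1", "vm-a")
index.link("job-2", "vm-a")
index.link("vm-a", "host-1")
index.link("vm-b", "host-1")

where_is_job = index.chain("job-1")           # ["vm-a", "host-1"]
hit_by_failure = index.affected_by("host-1")  # VMs and jobs on host-1
```

With both directions indexed, one query answers "where does this job actually run?" and the other answers "what is affected if this physical machine fails?".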