Checkpointing contributes to the performance and flexibility of a distributed system by enabling processes to be migrated to other hosts during their execution. Checkpointing systems usually require the checkpointed program to be re-linked with a checkpointing library before it is executed. This requirement makes it impossible to checkpoint programs that are already running or programs that cannot be re-linked (such as proprietary executables).
We have constructed a checkpointing tool, called the Process Hijacker, that eliminates this restriction. The Process Hijacker makes any running process checkpointable by dynamically re-writing its code. Any process can be hijacked at any time during its execution. No prior preparation of the executable, such as re-linking, is necessary. Instead, the Hijacker dynamically injects a checkpointing library into the target process. The Hijacker also strips the original system call layer from the process and replaces it with a remote system call RPC layer. These RPCs ensure that the system calls of the hijacked process continue to execute correctly after the process migrates.
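The remote system call idea can be illustrated with a small sketch. It is not Condor's actual protocol or library interface: the names `ShadowServer`, `RemoteSyscallStub`, and `demo` are hypothetical, the message format is invented for illustration, and an in-process function call stands in for the network. The sketch shows only why forwarding calls to a fixed home host lets an open file survive a migration.

```python
import json
import os
import tempfile


class ShadowServer:
    """Hypothetical home-host agent that performs file operations for the job."""

    def __init__(self):
        self._files = {}   # remote descriptor -> open file object
        self._next_fd = 3  # arbitrary starting value for illustration

    def handle(self, request: bytes) -> bytes:
        msg = json.loads(request)
        op, args = msg["op"], msg["args"]
        if op == "open":
            fd = self._next_fd
            self._next_fd += 1
            self._files[fd] = open(args["path"], args["mode"])
            result = fd
        elif op == "write":
            f = self._files[args["fd"]]
            result = f.write(args["data"])
            f.flush()
        elif op == "close":
            self._files.pop(args["fd"]).close()
            result = 0
        else:
            raise ValueError(f"unsupported operation: {op}")
        return json.dumps({"result": result}).encode()


class RemoteSyscallStub:
    """Hypothetical replacement call layer: every call becomes a request message."""

    def __init__(self, transport):
        # transport: callable taking request bytes and returning reply bytes.
        # A real system would use a network connection here (assumption).
        self._send = transport

    def _call(self, op, **args):
        reply = self._send(json.dumps({"op": op, "args": args}).encode())
        return json.loads(reply)["result"]

    def open(self, path, mode):
        return self._call("open", path=path, mode=mode)

    def write(self, fd, data):
        return self._call("write", fd=fd, data=data)

    def close(self, fd):
        return self._call("close", fd=fd)


def demo():
    """Write to one file before and after a simulated migration."""
    shadow = ShadowServer()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.txt")

        # The job starts on host A and opens a file through the stub.
        on_host_a = RemoteSyscallStub(shadow.handle)
        fd = on_host_a.open(path, "w")
        on_host_a.write(fd, "before ")

        # Simulated migration: a fresh stub on host B reconnects to the
        # same shadow, so the descriptor held by the job is still valid.
        on_host_b = RemoteSyscallStub(shadow.handle)
        on_host_b.write(fd, "after")
        on_host_b.close(fd)

        with open(path) as f:
            return f.read()


if __name__ == "__main__":
    print(demo())
```

Because the file is opened and written on the shadow's side, the descriptor the job holds refers to state that never moves, which is why the second stub can keep using it.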
The Process Hijacker is a synthesis of technologies developed by two systems research groups at UW-Madison, Condor and Paradyn. Condor is a distributed batch processing system designed to support high throughput computing on large pools of commodity workstations. The Hijacker injects a variant of Condor's checkpointing and remote system call libraries into its target. Paradyn is a parallel program performance monitoring system that uses dynamic code re-writing to insert instrumentation into a running program. Its code re-writing technology, called DynInst, is the interface through which the Hijacker binds the Condor libraries to the target process.
In this talk I will explain how the Process Hijacker and its constituent Condor and Paradyn technologies work.