Many scientists and engineers operate in what they perceive to be a resource-scarce environment. They may run programs on their local workstation, which is rarely fast enough for their needs, or submit jobs to a departmental server or remote supercomputer, which is inevitably grossly oversubscribed. The adventurous may exploit a PC cluster or the workstations on a local area network, but these resources are often still far from sufficient for increasingly demanding computational tasks such as simulation, large-scale optimization, Monte Carlo computing, image processing, and rendering. Yet from a global perspective, computing is far from scarce: the number of unused computing cycles dwarfs the number used for productive work. In principle, we should be able to harness these cycles, exploiting inter-institutional resource pooling arrangements or calling upon commercial services to achieve order-of-magnitude increases in instantaneously available computing cycles via on-demand access to remote computing resources. Yet in practice, this rarely happens, because of the significant ``potential barrier'' associated with the diverse mechanisms, policies, failure modes, and performance uncertainties that inevitably arise when we cross institutional boundaries. Overcoming this potential barrier requires new methods and mechanisms that meet the following three key user requirements for computing in a ``computing Grid'' comprising resources from multiple locations:

* Users want to be able to discover, acquire, and manage computational resources dynamically, in the course of their everyday activities.
* They do not want to be bothered with where these resources are located, with what mechanisms are required to use them, with keeping track of the status of the computational tasks operating on these resources, or with reacting to failure.
* They do care about how long their tasks are likely to take to run and how much these tasks will cost.

In this article, we present an innovative distributed computing framework that addresses these three issues. The Condor-G system leverages the significant advances achieved in recent years in two distinct areas: (1) management of computation and harnessing of resources within a single administrative domain, specifically within the Condor system, and (2) security, resource discovery, and resource access in multi-domain environments, as supported within the Globus Toolkit. In brief, we combine the intradomain resource management methods of Condor with the interdomain resource management protocols of the Globus Toolkit to allow the user to harness multi-domain resources as if they all belonged to one personal domain. The user defines the tasks to be executed; Condor-G handles all aspects of discovering and acquiring appropriate resources, regardless of their location; initiating, monitoring, and managing execution on those resources; detecting and responding to failure; and notifying the user of termination.