GT 3.9.3 WS_GRAM Key Concepts

Overview

WS_GRAM provides a set of Web services, consistent with the WSRF model, for submitting, monitoring, and cancelling jobs on Grid computing resources. Jobs are computational tasks that may perform input/output operations while running, and these operations affect the state of the computational resource and its associated filesystems. In practice, such jobs may require coordinated staging of data into the resource prior to job execution and out of the resource following execution. Some users, particularly interactive ones, benefit from accessing output data files while the job is running. Monitoring consists of querying and subscribing for status information such as job state changes.

Grid computing resources are typically operated under the control of a scheduler which implements allocation and prioritization policies while optimizing the execution of all submitted jobs for efficiency and performance. WS_GRAM is not a resource scheduler, but rather a protocol engine for communicating with a range of different local resource schedulers using a standard message format.

Targeted Job Types

WS_GRAM is meant to address a range of jobs where reliable operation, stateful monitoring, credential management, and file staging are important. WS_GRAM is not meant to serve as a "remote procedure call" (RPC) interface for applications not requiring many of these features, and furthermore its interface model and implementation may be too costly for such uses.

The underlying WSRF core environment used for WS_GRAM should also be considered as a direct application-hosting solution for lightweight applications. In other words, if an application requires only modest input and output data transport in a stateless manner (all that matters is the result data or fault signal) and will be invoked many times, it may be a good candidate for exposure as an application-specific Web service.

Conceptual Details

A number of concepts underlie the purpose and motivation for WS_GRAM. These concepts are divided into broad categories below.

Component architecture

Rather than consisting of a monolithic solution, WS_GRAM is based on a component architecture at both the protocol and software implementation levels. This component approach serves as an ideal which shapes the implementation as well as the abstract design and features.

WS_GRAM service suite

Job management with WS_GRAM makes use of multiple types of service:

  • Job management services represent, monitor, and control the overall job lifecycle. These services are the job-management specific software provided by the WS_GRAM solution.

  • File transfer services support staging of files into and out of compute resources. WS_GRAM makes use of these existing services rather than providing redundant solutions.

  • Credential management services are used to explicitly control the delegation of rights among distributed elements of the WS_GRAM architecture based on users' application requirements. Again, WS_GRAM makes use of more general infrastructure rather than providing a redundant solution.

WSRF service model

The Globus Toolkit software development environment, and particularly WSRF core, is used to implement distributed communications and service state.

Scheduler adapters

A scripted plug-in architecture is provided by WS_GRAM to enable extension with adapters to control a variety of local schedulers.
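
The adapter idea can be illustrated with a minimal Python sketch. It is not the actual WS_GRAM plug-in interface: the class names SchedulerAdapter and InMemoryScheduler, the ADAPTERS registry, the load_adapter function, and the state strings are all hypothetical, chosen only to show how a common interface lets one service drive different local schedulers.

```python
from abc import ABC, abstractmethod


class SchedulerAdapter(ABC):
    """Hypothetical adapter interface; not the real WS_GRAM plug-in API."""

    @abstractmethod
    def submit(self, executable, args):
        """Hand the job to the local scheduler and return its local job ID."""

    @abstractmethod
    def cancel(self, local_id):
        """Ask the local scheduler to abort the job."""

    @abstractmethod
    def poll(self, local_id):
        """Return the job state as reported by the local scheduler."""


class InMemoryScheduler(SchedulerAdapter):
    """Toy stand-in for a real local scheduler, for illustration only."""

    def __init__(self):
        self._jobs = {}   # local job ID -> state string
        self._next = 0

    def submit(self, executable, args):
        self._next += 1
        local_id = f"job-{self._next}"
        self._jobs[local_id] = "Pending"
        return local_id

    def cancel(self, local_id):
        self._jobs[local_id] = "Cancelled"

    def poll(self, local_id):
        return self._jobs[local_id]


# Registry mapping a scheduler name to its adapter class (assumed design).
ADAPTERS = {"inmemory": InMemoryScheduler}


def load_adapter(name):
    """Instantiate the adapter registered under the given name."""
    return ADAPTERS[name]()
```

Supporting another scheduler in this sketch means adding one more SchedulerAdapter subclass and registering it in ADAPTERS; the calling code does not change.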

Security

Secure operation

WS_GRAM uses WSRF functionality to authenticate job management requests and to protect job requests from malicious interference. The use of WS_GRAM does not reduce the ability of system administrators to control access to their computing resources, nor does it reduce the safety of the jobs that users run on a given computing resource.

Local system protection domains

To protect users from each other, jobs are executed in appropriate local security contexts, e.g. under specific Unix user IDs based on details of the job request and authorization policies. Additionally, WS_GRAM mechanisms used to interact with the local resource are designed to minimize the privileges required and to minimize the risks of service malfunction or compromise.

Credential delegation and management

A client may delegate some of its rights to WS_GRAM services in order to facilitate the above functions, e.g. rights for WS_GRAM to access data on a remote storage element as part of the job execution. Additionally, the client may delegate rights for use by the job process itself.

Audit

To assist with normal accounting functions as well as to further mitigate risks from abuse or malfunction, WS_GRAM uses a range of audit and logging techniques to record a history of job submissions and critical system operations.

Job Management

Reliable job submission

WS_GRAM provides "at most once" job submission semantics. A client is able to check for and possibly resubmit jobs, in order to account for transient communication errors without risk of running more than one copy of the job.
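
One common way to obtain this kind of guarantee is for the client to choose a unique submission identifier once and reuse it on every retry, so that the service can recognize duplicates. The sketch below illustrates that general technique under stated assumptions; it does not show how WS_GRAM implements it. JobService, submit_with_retry, and the fail_first parameter (which simulates lost replies) are hypothetical names.

```python
import uuid


class JobService:
    """Toy in-memory service; hypothetical, not the WS_GRAM interface."""

    def __init__(self):
        self._jobs = {}    # submission ID -> job handle
        self.started = 0   # number of jobs actually started

    def submit(self, submission_id, description):
        # A repeated submission ID returns the existing job handle
        # instead of starting a second copy of the job.
        if submission_id in self._jobs:
            return self._jobs[submission_id]
        self.started += 1
        handle = f"job-{self.started}"
        self._jobs[submission_id] = handle
        return handle


def submit_with_retry(service, description, attempts=3, fail_first=0):
    """Submit a job, retrying with the same ID after simulated lost replies."""
    submission_id = str(uuid.uuid4())   # chosen once, reused on every retry
    for attempt in range(attempts):
        handle = service.submit(submission_id, description)
        if attempt < fail_first:
            # Simulated transient error: the service processed the request,
            # but the reply never reached the client, so the client retries.
            continue
        return handle
    raise RuntimeError("no reply received after all attempts")
```

In this sketch, even when the first replies are lost, the service starts the job only once, because every retry carries the same submission identifier.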

Job cancellation

While many jobs are allowed to run to their natural completion, WS_GRAM provides a mechanism for clients to cancel (abort) their jobs at any point in the job lifecycle.

Data management

Reliable data staging

WS_GRAM provides for reliable, high-performance transfers of files between the compute resource and external (GridFTP) data storage elements before and after the job execution.

Output monitoring

WS_GRAM provides a mechanism for incrementally transferring output file contents from the computation resource while the job is running.
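
Incremental transfer can be pictured as repeatedly reading whatever has been appended to an output file since the last read. The following sketch uses a local file and a byte offset to show the idea; it is an assumption-laden illustration, not the WS_GRAM transfer mechanism, and read_new_output and demo are hypothetical names.

```python
import os
import tempfile


def read_new_output(path, offset):
    """Return (data, new_offset) for bytes appended to path since offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    return data, offset + len(data)


def demo():
    """Simulate a job appending to its output while a client polls it."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        chunks, offset = [], 0
        for line in (b"step 1\n", b"step 2\n"):
            with open(path, "ab") as job_output:          # the "job" writes
                job_output.write(line)
            data, offset = read_new_output(path, offset)  # the client polls
            chunks.append(data)
        return chunks, offset
    finally:
        os.remove(path)
```

Each poll in this sketch returns only the new bytes, so the client never re-reads output it has already seen.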

Task coordination

Parallel jobs

Some jobs are parallel, meaning that they consist of more than one simultaneous task. These tasks are often hosted on parallel computing hardware to provide increased computational throughput.

Task rendezvous

Some parallel programming environments, such as MPI, provide mechanisms for all tasks in a parallel computation to find out about each other: to know the number of peer tasks as well as possibly to exchange some information between tasks. Native parallel programming models often support this within the local job start mechanism.

WS_GRAM provides a mechanism for task rendezvous which job applications may use if they do not have another more appropriate solution. This mechanism may be used directly by application code or by intervening middleware libraries, e.g. libraries that are designed to present a simplified programming model to applications run via WS_GRAM. WS_GRAM does not require tasks to be coordinated, but addresses this requirement in order to facilitate a wider range of applications.
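
The rendezvous concept itself can be sketched in a few lines of Python, with threads standing in for the tasks of a parallel job. This is only an in-process illustration of the idea (each task registers some information, waits for its peers, then sees everyone's registration). It is not the WS_GRAM rendezvous mechanism, and the names Rendezvous, register, and run_tasks are hypothetical.

```python
import threading


class Rendezvous:
    """Hypothetical in-process rendezvous point; not the WS_GRAM mechanism."""

    def __init__(self, task_count):
        self._barrier = threading.Barrier(task_count)
        self._lock = threading.Lock()
        self._info = {}   # task ID -> information registered by that task

    def register(self, task_id, info, timeout=5.0):
        """Publish this task's info, wait for all peers, return all info."""
        with self._lock:
            self._info[task_id] = info
        # Block until every task has registered (or the timeout expires).
        self._barrier.wait(timeout)
        return dict(self._info)


def run_tasks(task_count):
    """Start task_count threads that rendezvous and record what they see."""
    rendezvous = Rendezvous(task_count)
    results = {}

    def task(rank):
        # Each task learns the number of peers and their registered info.
        results[rank] = rendezvous.register(rank, f"host-{rank}")

    threads = [threading.Thread(target=task, args=(r,)) for r in range(task_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

After the rendezvous, every task in this sketch holds the same view: the number of peers and the information each one registered.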