GT 5.0.3 Release Notes: GRAM5


1. Component Overview

The Grid Resource Allocation and Management (GRAM5) component is used to locate, submit, monitor, and cancel jobs on Grid computing resources. GRAM5 is not a job scheduler, but rather a set of services and clients for communicating with a range of different batch/cluster job schedulers using a common protocol. GRAM5 is meant to address a range of jobs where reliable operation, stateful monitoring, credential management, and file staging are important.
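The split between a common interface and scheduler-specific back ends can be pictured with a minimal sketch. This is an illustration only, not GRAM5 code: the class names, method names, job ID format, and state labels below are all hypothetical, and the toy adapter keeps jobs in memory rather than starting anything.

```python
from abc import ABC, abstractmethod


class SchedulerAdapter(ABC):
    """Hypothetical adapter: one subclass per local batch/cluster scheduler."""

    @abstractmethod
    def submit(self, executable):
        """Hand the job to the local scheduler and return its job id."""

    @abstractmethod
    def cancel(self, job_id):
        """Ask the local scheduler to cancel the job."""


class FakeForkAdapter(SchedulerAdapter):
    """Toy in-memory stand-in; it does not start any real process."""

    def __init__(self):
        self.jobs = {}  # job id -> hypothetical state label

    def submit(self, executable):
        job_id = f"fork-{len(self.jobs) + 1}"
        self.jobs[job_id] = "ACTIVE"
        return job_id

    def cancel(self, job_id):
        self.jobs[job_id] = "CANCELED"


def run_and_cancel(adapter, executable):
    """Caller code uses the same two calls whatever the scheduler is."""
    job_id = adapter.submit(executable)
    adapter.cancel(job_id)
    return job_id
```

Supporting another scheduler in this sketch would mean adding another subclass, leaving run_and_cancel unchanged.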

2. Feature summary

New Features since 5.0.2: none

Other Standard Supported Features

  • Remote job execution and management
  • Uniform and flexible interface to local resource managers
  • File staging before and after job execution
  • File and directory clean up after job termination
  • Service auditing for each submitted job

Removed Features

  • Condor SEG module is no longer included. Its functionality has been moved into the core of the job manager program.

3. Summary of Changes in GRAM5

  • Improved diagnostics for RSL validation file parsing errors. If an RVF file contains an invalid attribute or syntax error, the job manager will include the parse error as an extended error message containing the path to the file and the invalid part of the file.
  • Improved diagnostics for missing -condor-arch and -condor-os configuration errors. If a job manager is misconfigured for Condor, it sends an error message indicating which configuration option is missing.
  • Removed confusing log messages about state file ownership when a job manager starts up on a multi-user system.
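The kind of diagnostic described for validation file errors (file path plus the invalid part of the file) can be sketched as follows. This is an assumption-laden illustration: it parses a simplified "Key: value" format with blank lines between records, not the actual RVF grammar, and the key set, exception name, and message layout are hypothetical.

```python
class ValidationFileError(Exception):
    """Parse failure carrying the file path and the offending text."""


# Hypothetical set of allowed keys for this simplified format.
KNOWN_KEYS = {"Attribute", "Description", "Default"}


def parse_validation_text(path, text):
    """Parse 'Key: value' lines; blank lines separate records."""
    records, current = [], {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            # A blank line closes the record collected so far.
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if not sep or key.strip() not in KNOWN_KEYS:
            # Report where the problem is and what could not be parsed.
            raise ValidationFileError(f"{path}:{number}: cannot parse {line!r}")
        current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records
```

A caller could attach the exception text to the error it returns to the client, so the user sees the path and the invalid line instead of a generic failure.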

4. Bug Fixes

4.1. General Bugs

  • GRAM-183: wrong entry in globus-gram-job-manager.rvf file
  • GRAM-188: Bad error message when job state directory isn't usable
  • GRAM-198: globus-personal-gatekeeper misparses -start -log
  • GRAM-196: Compilation fails on Solaris 10 x86 amd64
  • GRAM-82: printf of uninitialized value in logging
  • GRAM-199: job manager crash at deactivation for fork and condor
  • GRAM-200: Gatekeeper leaks extra copy of sockets to job manager
  • Job manager does not work on Solaris 10
  • Uninitialized variable in job manager can cause random sleeps
  • The job manager busy waits when job manager exits while another is trying to send it a job
  • GRAM-201: Gatekeeper doesn't unset security environment variables

4.2. PBS-Specific Bugs

  • Bug 7121: pbs.rvf doesn't contain the "name" attribute
  • GRAM-189: PBS LRM fails to detect end-of-job if PBS job history is enabled

4.3. Condor-Specific Bugs

5. Known Problems

The following problems and limitations are known to exist for GRAM5 at the time of the 5.0.3 release:

5.1. Limitations

  • None at this time.

5.2. Outstanding bugs

  • GRAM-2: Investigate how to setup GRAM5 services in a HA setup
  • GRAM-4: Add support for a "managed fork" service
  • GRAM-5: Add gram-level prologue and epilogue script execution for mpi jobs
  • GRAM-12: Gatekeeper's syslog output cannot be controlled
  • GRAM-15: transition from httpg to https
  • GRAM-22: client connections can't be timed out
  • GRAM-23: Improved error codes and error reporting for users
  • GRAM-24: Debug/verbose flags for globusrun, globus-job-run
  • GRAM-51: configurable control of number of perl scripts that can run simultaneously
  • GRAM-53: Generalize log path configuration
  • GRAM-79: Add support for OSG's "NFS Lite" concept
  • GRAM-99: Add a high-level diagram for the approach doc
  • GRAM-104: globus-job-manager-event-generator loads all historical events the first time run
  • GRAM-105: Held Condor jobs should be reported as SUSPENDED
  • GRAM-110: softenv extensions for GRAM5
  • GRAM-119: improve the GRAM LRM adapter doc
  • GRAM-122: tracking gram client software
  • GRAM-135: Improve developer doc for a reliable client
  • GRAM-138: GRAM5 job manager uses a lot of memory when SEG is pointed to incorrect log path
  • GRAM-139: SEG may deadlock with threads
  • GRAM-149: GRAM5 Unix domain socket misbehaves on Snow Leopard
  • GRAM-154: GASS Cache doesn't check for updates
  • GRAM-159: GRAM5 Migration guide is outdated
  • GRAM-163: improve error output for globusrun
  • Bug 5621: gram2 credential refresh problems in 4.0.5
  • Bug 1934: Gatekeeper's syslog output cannot be controlled
  • Bug 2739: Gatekeeper AuthZ/Gridmap Callout result logging
  • Bug 2741: catching SIGSEGV if dynamic loading of authorization modules fails
  • Bug 4199: Patch pre-WS GRAM to use individual condor logs for jobs
  • Bug 3795: jobmanager perl modules issues
  • Bug 4235: globus-job-manager doesn't exit if the job fails.
  • Bug 4730: MPI Jobs using Globus LSF in HP XC Cluster....
  • Bug 4747: Need evaluation of patch to JobManager.pm
  • Bug 4779: gram GT2 log files: timestamps are not ISO 8601 compatible
  • Bug 5143: DONE state never reported for Condor jobs when using Condor-G grid monitor
  • Bug 5429: stdin is lost when jobtype=multiple with jobmanager-lsf
  • Bug 5554: GRAM2 4.0.5 setup-globus-job-manager-fork.pl silent failure
  • Bug 5556: Audit directory setup instructions are insecure
  • Bug 5775: gram status of old jobs incorrect on some lsf systems
  • Bug 6184: pbs.pm jobmanager fails jobs on qstat failure
  • Bug 6337: Cannot configure globus to use different certificate path than default
  • Bug 6703: PBS scheduler adapter assumes that Globus is installed in the same location on the headnode of a cluster and on the work nodes.
  • Bug 6768: Held Condor jobs should be reported as SUSPENDED by GRAM
  • Bug 6815: Support standard install locations for globus-gram-protocol
  • Bug 6819: Missing metadata in globus-scheduler-event-generator
  • Bug 6820: Support standard install locations for globus-gatekeeper
  • Bug 6821: Support standard install locations for globus-gatekeeper-setup
  • Bug 6822: Support standard install locations for globus-gram-job-manager-scripts
  • Bug 6823: Support standard install locations for globus-gram-job-manager
  • Bug 6824: Support standard install locations for globus-gram-job-manager-setup
  • Bug 6825: Remove hardcoded paths in globus-gram-job-manager-setup-fork
  • Bug 6826: Remove hardcoded paths in globus-gram-job-manager-setup-condor
  • Bug 6840: The PBS job manager doesn't handle large environments well
  • Bug 6855: Undefined variable in Makefiles
  • Bug 6862: PBS job manager fails if job history is enabled
  • Bug 6927: A Loadleveler LRM for GRAM5 should be very welcome
  • Bug 720: allow gram client to detect the version of a gram server
  • Bug 851: Add "cleanup" RSL attribute for cleaning up a job submission
  • Bug 5536: Missing dependency in package globus_gram_job_manager_auditing
  • Bug 5537: Missing dependency in package globus_gram_job_manager_auditing
  • Bug 3373: globus removes the temporary job directory before pbs writes back into it
  • Bug 5200: GRAM (pre-webservices) from OSG 0.6.0 (VDT 1.6.1) has bad syslog format
  • Bug 5207: GRAM SoftEnv extension bug
  • Bug 5250: Does not support mpi jobtype of RSL script
  • Bug 5272: Invalid parsing of RSL file

6. Technology dependencies

GRAM depends on the following GT components:

  • Globus Common
  • GSI C
  • GridFTP server

7. Tested platforms

Tested platforms for GRAM5:

  • Linux

    • Debian 5.0.6 x86_64

  • Mac OS X

    • Mac OS X 10.6.6

  • Solaris

    • OpenSolaris 11 (November 2008)

8. Backward compatibility summary

Protocol changes in GRAM since GT4 series:

  • The GRAM5 service uses a superset of the GRAM2 protocol for communication between the client and service. The extensions supported in GRAM5 are implemented in such a way that they are ignored by GRAM2 services or clients. These extensions provide improved error messages and version detection.
  • GRAM5 does not support task coallocation using DUROC and its related protocols. Jobs submitted using DUROC directives will fail.
  • GRAM5 does not support file streaming. The standard output and standard error streams are sent after the job completes instead of during execution. As a special case, support for the Condor grid monitor program implements a small subset of the streaming capabilities of GRAM2 in GT 4.2.x.
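One common way a protocol superset stays compatible with older peers is for each peer to skip attributes it does not recognize. The sketch below illustrates that general idea only; the message layout and every attribute name in it are hypothetical and are not taken from the GRAM wire format.

```python
# Hypothetical attribute names; not the real GRAM wire format.
BASE_ATTRIBUTES = {"protocol-version", "status"}


def parse_message(text, understood):
    """Keep the attributes this peer understands; skip all others."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() in understood:
            fields[name.strip()] = value.strip()
    return fields


reply = "protocol-version: 2\nstatus: 4\nx-error-detail: bad path\n"

# An older peer knows only the base attributes and drops the extension.
old_peer = parse_message(reply, BASE_ATTRIBUTES)

# A newer peer also understands the (hypothetical) extension attribute.
new_peer = parse_message(reply, BASE_ATTRIBUTES | {"x-error-detail"})
```

Under these assumptions both peers parse the same reply without error, and only the newer one sees the extra error detail.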

9. Associated Standards

None

10. For More Information

See GRAM5 for more information about this component.

Glossary

J

job scheduler

See the term scheduler.