GT 5.0.2 GRAM5: Quality Profile


1. Test coverage reports

2. Code analysis reports

  • No code analysis reports have been generated at this time.

3. Outstanding bugs

  • GRAM-2: Investigate how to setup GRAM5 services in a HA setup
  • GRAM-4: Add support for a "managed fork" service
  • GRAM-5: Add gram-level prologue and epilogue script execution for mpi jobs
  • GRAM-12: Gatekeeper's syslog output cannot be controlled
  • GRAM-15: transition from httpg to https
  • GRAM-22: client connections can't be timed out
  • GRAM-23: Improved error codes and error reporting for users
  • GRAM-24: Debug/verbose flags for globusrun, globus-job-run
  • GRAM-51: configurable control of number of perl scripts that can run simultaneously
  • GRAM-53: Generalize log path configuration
  • GRAM-79: Add support for OSG's "NFS Lite" concept
  • GRAM-99: Add a high-level diagram for the approach doc
  • GRAM-104: globus-job-manager-event-generator loads all historical events the first time run
  • GRAM-105: Held Condor jobs should be reported as SUSPENDED
  • GRAM-110: softenv extensions for GRAM5
  • GRAM-119: improve the GRAM LRM adapter doc
  • GRAM-122: tracking gram client software
  • GRAM-135: Improve developer doc for a reliable client
  • GRAM-138: GRAM5 job manager uses a lot of memory when SEG is pointed to incorrect log path
  • GRAM-139: SEG may deadlock with threads
  • GRAM-149: GRAM5 Unix domain socket misbehaves on Snow Leopard
  • GRAM-154: GASS Cache doesn't check for updates
  • GRAM-159: GRAM5 Migration guide is outdated
  • GRAM-163: improve error output for globusrun
  • Bug 5621: gram2 credential refresh problems in 4.0.5
  • Bug 1934: Gatekeeper's syslog output cannot be controlled
  • Bug 2739: Gatekeeper AuthZ/Gridmap Callout result logging
  • Bug 2741: catching SIGSEGV if dynamic loading of authorization modules fails
  • Bug 4199: Patch pre-WS GRAM to use individual condor logs for jobs
  • Bug 3795: jobmanager perl modules issues
  • Bug 4235: globus-job-manager doesn't exit if the job fails.
  • Bug 4730: MPI Jobs using Globus LSF in HP XC Cluster....
  • Bug 4747: Need evaluation of patch to JobManager.pm
  • Bug 4779: gram GT2 log files: timestamps are not ISO 8601 compatible
  • Bug 5143: DONE state never reported for Condor jobs when using Condor-G grid monitor
  • Bug 5429: stdin is lost when jobtype=multiple with jobmanager-lsf
  • Bug 5554: GRAM2 4.0.5 setup-globus-job-manager-fork.pl silent failure
  • Bug 5556: Audit directory setup instructions are insecure
  • Bug 5775: gram status of old jobs incorrect on some lsf systems
  • Bug 6184: pbs.pm jobmanager fails jobs on qstat failure
  • Bug 6337: Cannot configure globus to use different certificate path than default
  • Bug 6703: PBS scheduler adapter assumes that Globus is installed in the same location on the headnode of a cluster and on the work nodes.
  • Bug 6768: Held Condor jobs should be reported as SUSPENDED by GRAM
  • Bug 6815: Support standard install locations for globus-gram-protocol
  • Bug 6819: Missing metatdata in globus-scheduler-event-generator
  • Bug 6820: Support standard install locations for globus-gatekeeper
  • Bug 6821: Support standard install locations for globus-gatekeeper-setup
  • Bug 6822: Support standard install locations for globus-gram-job-manager-scripts
  • Bug 6823: Support standard install locations for globus-gram-job-manager
  • Bug 6824: Support standard install locations for globus-gram-job-manager-setup
  • Bug 6825: Remove hardcoded paths in globus-gram-job-manager-setup-fork
  • Bug 6826: Remove hardcoded paths in globus-gram-job-manager-setup-condor
  • Bug 6840: The PBS job manager doesn't handle large environments well
  • Bug 6855: Undefined variable in Makefiles
  • Bug 6862: PBS job manager fails if job history is enabled
  • Bug 6927: A Loadleveler LRM for GRAM5 should be very welcome
  • Bug 720: allow gram client to detect the version of a gram server
  • Bug 851: Add "cleanup" RSL attribute for cleaning up a job submission
  • Bug 5536: Missing dependency in package globus_gram_job_manager_auditing
  • Bug 5537: Missing dependency in package globus_gram_job_manager_auditing
  • Bug 3373: globus removes the temporary job directory before pbs writes back into it
  • Bug 5200: GRAM (pre-webservices) from OSG 0.6.0 (VDT 1.6.1) has bad syslog format
  • Bug 5207: GRAM SoftEnv extension bug
  • Bug 5250: Does not support mpi jobtype of RSL script
  • Bug 5272: Invalid parsing of RSL file

4. Bug Fixes

  • GRAM-130: Individual Condor Logs per Job
  • GRAM-136: Error message not precise when disk quota is exceeded
  • GRAM-146: GRAM5 Usage stats ignores GLOBUS_USAGE_TARGETS environment
  • GRAM-155: Leak in file_clean_up
  • GRAM-156: job-failure-code not provided by jobmanager in response to status query
  • GRAM-157: RSL substitutions not updated on a restart request
  • GRAM-158: Remove references to setup-seg-job-manager.pl from documentation
  • GRAM-160: double free in job manager
  • GRAM-161: Don't count two phase commit time until state is sent
  • GRAM-164: Invalid RSL value error doesn't indicate what values are valid
  • GRAM-167: SGE LRM doesn't interpret path to stdout and stderr relative to directory
  • GRAM-168: zombie job manager processes on Ranger SGE
  • GRAM-169: Error messages in globus-job-get-output
  • GRAM-170: globus-job-submit fails
  • GRAM-171: Simplify debugging of script errors
  • GRAM-172: Job Manager doesn't exit when proxy expires if jobs are present
  • GRAM-173: SGE LRM doesn't set exit code for multiple jobs

5. GRAM5 Throughput Tests

5.1. Experiment Hardware and Software Configuration

The following experiments were run on the nomer.mcs.anl.gov virtual cluster. The cluster consists of 6 partitions, each having a single Intel(R) Xeon(R) CPU E5430 @ 2.66GHz core, and 2GB RAM. The virtual machines in the cluster each had a single virtual network interface. The cluster was configured as follows:

  • 1 node: Master node
  • 5 nodes: Test/execution nodes

All nodes ran an apache2 HTTP server, gmond (Ganglia monitoring), and pbs_mom (Torque LRM).

The master node also ran a globus-gatekeeper, globus-gridftp-server, globus-job-manager-event-generator, gmetad (Ganglia Meta Daemon), pbs_sched (Torque LRM scheduler), pbs_server (Torque LRM server), and nfsd (the Linux kernel NFSv4 server) for the execution nodes.

The test nodes (1 for single-client tests, 5 for multiple client tests) ran the gram-throughput-tester program. All 5 test/execution nodes executed the test job executables as scheduled by the LRM.

These tests were done with GT 5.0.2 beta1. The throughput tester was compiled from CVS trunk October 30, 2009.

5.2. Experiment Scenarios

The first set of tests submitted jobs to the GRAM5 service for one hour. After one hour elapsed, the client would terminate any jobs which were being managed by the GRAM5 service. The test client recorded the time of the experiment start, the time of the termination start, and the time after which all jobs had reached a DONE or FAILED state.

The throughput test client generated a maximum of 50 simultaneous job requests. For all but the uncapped test below, a maximum of 2000 jobs (pending or active) were managed by the job manager at any given time. Once the client had submitted 2000 jobs, it stopped submitting until a job completed. Only 10 execution slots were available for the LRM to schedule jobs into, so many jobs remained in the GRAM5 PENDING state for the duration of the test.
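The two limits described above can be sketched as a small admission check. This is a minimal illustration, not the gram-throughput-tester source: the class and method names are hypothetical, and only the two limits (50 simultaneous requests, 2000 managed jobs) come from the text. Counting in-flight requests toward the cap is an assumption made here so that the cap is never overshot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubmissionThrottle:
    """Hypothetical model of the client-side limits described in the text."""

    max_in_flight: int = 50            # simultaneous job requests
    max_managed: Optional[int] = 2000  # pending + active jobs; None means uncapped
    in_flight: int = 0                 # requests sent but not yet acknowledged
    managed: int = 0                   # jobs accepted and not yet finished

    def can_submit(self) -> bool:
        """Return True if one more job request may be issued now."""
        if self.in_flight >= self.max_in_flight:
            return False
        if self.max_managed is None:
            return True
        # Assumption: in-flight requests count toward the managed-job cap.
        return self.managed + self.in_flight < self.max_managed

    def request_sent(self) -> None:
        self.in_flight += 1

    def request_accepted(self) -> None:
        self.in_flight -= 1
        self.managed += 1

    def job_finished(self) -> None:
        self.managed -= 1


if __name__ == "__main__":
    throttle = SubmissionThrottle()
    sent = 0
    while throttle.can_submit():
        throttle.request_sent()
        sent += 1
    print(sent)  # 50: the in-flight limit is reached first
```

Passing `max_managed=None` models the uncapped experiment, in which only the limit on simultaneous requests applies.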

The different gram-throughput-tester experiments consisted of:

Table 1. GRAM5 Throughput Tester Experiments

Experiment Name        | LRM monitoring method | Number of clients | Number of users | Maximum number of simultaneous jobs / client
1-client-poll          | POLL                  | 1                 | 1               | 2000
1-client-seg           | SEG                   | 1                 | 1               | 2000
1-client-seg-uncapped  | SEG                   | 1                 | 1               | unlimited
5-client-seg           | SEG                   | 5                 | 1               | 2000
5-client-seg-diffusers | SEG                   | 5                 | 5               | 2000

5.3. Throughput Test Results

The following table contains a summary of the results of these experiments. The columns contain the following information:

Experiment
Experiment name, the same as in the previous section
Total Jobs
The total number of GRAM jobs that were submitted to the GRAM5 service by the throughput tester in one hour.
Termination Tasks After 1 Hour
The total number of jobs that were being managed by the GRAM5 service when the one hour test period elapsed. These jobs were terminated using the GRAM5 cancel protocol message by the throughput tester.
Termination Duration (hh:mm:ss)
The amount of time it took for the termination tasks to complete and all jobs to reach the DONE or FAILED state.
Master Node Max 1 min. Load Average
The maximum value of the one-minute load average on the master node, that is, the node running the GRAM5 and Torque services.
Master Node Average 1 min. Load Average
The average value of the one-minute load average on the master node over the duration of the test.
Errors
Description of any errors which occurred on the client side that prevented operations from completing as expected.

Table 2. GRAM5 Throughput Tester Results Summary

Experiment             | Total Jobs | Termination Tasks After 1 Hour | Termination Duration (hh:mm:ss) | Master Node Max 1 min. Load Average | Master Node Average 1 min. Load Average | Errors
1-client-poll          | 2110       | 2000                           | 00:14:18                        | 11.46                               | 7.96                                    | None
1-client-seg           | 2110       | 2000                           | 00:10:36                        | 2.82                                | 0.93                                    | None
1-client-seg-uncapped  | 6664       | 6584                           | 00:42:46                        | 3.26                                | 2.75                                    | None
5-client-seg           | 6800       | 6434                           | 00:57:20                        | 3.19                                | 2.49                                    | Connection refused during termination
5-client-seg-diffusers | 7226       | 6720                           | 00:45:41                        | 3.79                                | 3.13                                    | None
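To compare termination performance across experiments with different numbers of outstanding jobs, the termination columns of Table 2 can be combined into an average cancellation rate. The sketch below does only that arithmetic. The rate is a derived figure that does not appear in the original results, and the function names are illustrative.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cancels_per_minute(tasks: int, duration: str) -> float:
    """Average number of jobs terminated per minute (a derived figure)."""
    return tasks / (hms_to_seconds(duration) / 60)


# (experiment, termination tasks, termination duration) copied from Table 2.
RESULTS = [
    ("1-client-poll", 2000, "00:14:18"),
    ("1-client-seg", 2000, "00:10:36"),
    ("1-client-seg-uncapped", 6584, "00:42:46"),
    ("5-client-seg", 6434, "00:57:20"),
    ("5-client-seg-diffusers", 6720, "00:45:41"),
]

if __name__ == "__main__":
    for name, tasks, duration in RESULTS:
        rate = cancels_per_minute(tasks, duration)
        print(f"{name:24s} {rate:6.1f} jobs/min")
```

For the two capped single-client runs this gives roughly 140 jobs per minute with polling and 189 jobs per minute with the SEG.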

6. GRAM5 Condor-G Tests

6.1. Experiment Hardware and Software Configuration

The following experiments were run on the nomer.mcs.anl.gov virtual cluster. The cluster consists of 6 partitions, each having a single Intel(R) Xeon(R) CPU E5430 @ 2.66GHz core, and 2GB RAM. The virtual machines in the cluster each had a single virtual network interface. The cluster was configured as follows:

  • 1 node: Master node
  • 1 node: Test/execution node
  • 4 nodes: execution nodes

All nodes ran an apache2 HTTP server, gmond (Ganglia monitoring), and pbs_mom (Torque LRM).

The master node also ran a globus-gatekeeper, globus-gridftp-server, globus-job-manager-event-generator, gmetad (Ganglia Meta Daemon), pbs_sched (Torque LRM scheduler), pbs_server (Torque LRM server), and nfsd (the Linux kernel NFSv4 server) for the execution nodes.

The test/execution node ran the Condor-G daemons. The Condor job classified ad included attributes to submit the jobs to the GRAM5 service running on the master node. The tests were done with Condor version 7.4.1. The Condor-G configuration parameters in the condor_config file were as follows:

Figure 1. Condor-G Experiment Configuration

GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE=50
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE=2000
GRIDMANAGER_MAX_PENDING_REQUESTS=50
GRIDMANAGER_JOB_PROBE_INTERVAL=300
GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE=0
ENABLE_GRID_MONITOR=FALSE
GRIDMANAGER_DEBUG= D_FULLDEBUG
GRIDMANAGER_GLOBUS_COMMIT_TIMEOUT=12000

The execution nodes executed the test job executables as scheduled by the LRM.

For this test, the test/execution node and the execution nodes were each configured to run up to 20 job processes simultaneously.

6.2. Experiment Scenario

This test submitted a Condor job cluster of 2000 jobs, using the following classified ad:

Figure 2. Condor-G Classified Ad

Universe=grid
grid_resource = gt5 nomer1.mcs.anl.gov:2119/jobmanager-pbs
executable=/bin/sleep
arguments=300
transfer_executable=False
stream_output = False
stream_error  = False
output = test.out.$(Process)
error  = test.err.$(Process)
log    = test.log
notification=Never
queue 2000

The configuration parameters are similar to those of the GRAM5 tests described in the GRAM5 Throughput Tests section. The key difference is that this test runs until all 2000 jobs have completed and does not submit any further jobs once the maximum of 2000 has been reached.

To provide a point of comparison, another test using similar parameters was run using the gram-throughput-tester program in place of Condor-G. Note that the Condor-G service provides file staging and a scratch directory beyond what the throughput tester job did.

The two experiments consist of:

Table 3. GRAM5 Condor-G Experiments

Experiment Name   | LRM monitoring method | Number of clients | Number of users | Total number of jobs
Condor-G          | SEG                   | 1                 | 1               | 2000
Throughput Tester | SEG                   | 1                 | 1               | 2000

6.3. Condor-G Test Results

The following table contains a summary of the results of these experiments. The columns contain the following information:

Experiment
Experiment name, the same as in the previous section
Time to Submit 2000 Jobs
The amount of time it took for all 2000 jobs to be submitted to the GRAM5 service.
Time For All Jobs To Complete
The amount of time it took for all 2000 jobs to complete. The theoretical minimum value for this is 50 minutes, assuming all 2000 jobs were submitted instantaneously and there was no overhead in deploying them to 200 execution slots.
LRM Submit Rate (Jobs/Minute)
The rate at which jobs were submitted to the LRM, in jobs per minute.
Master Node Max 1 min. Load Average
The maximum value of the one-minute load average on the master node, that is, the node running the GRAM5 and Torque services.
Master Node Average 1 min. Load Average
The average value of the one-minute load average on the master node over the duration of the test.
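The 50-minute lower bound quoted for Time For All Jobs To Complete follows from simple arithmetic: each job is a `sleep 300`, so with enough capacity to run 200 jobs at once, 2000 jobs need 10 back-to-back waves of 5 minutes each. The sketch below reproduces that figure. The function name is illustrative, and the second call is only a comparison under the assumption that capacity is 5 nodes with 20 job processes each, as listed in the hardware configuration.

```python
import math


def min_completion_minutes(jobs: int, job_seconds: int, slots: int) -> float:
    """Lower bound on completion time: jobs run in full waves across all slots."""
    waves = math.ceil(jobs / slots)
    return waves * job_seconds / 60


if __name__ == "__main__":
    # 2000 jobs of 300 seconds, 200 jobs running at once (the figure used in the text).
    print(min_completion_minutes(2000, 300, 200))  # 50.0
    # Comparison only: 5 nodes x 20 job processes each = 100 concurrent jobs.
    print(min_completion_minutes(2000, 300, 100))  # 100.0
```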

Table 4. GRAM5 Condor-G Results Summary

Experiment        | Time to Submit 2000 Jobs (hh:mm:ss) | Time For All Jobs To Complete (hh:mm:ss) | LRM Submit Rate (Jobs/Minute) | Master Node Max 1 min. Load Average | Master Node Average 1 min. Load Average
Condor-G          | 00:32:34                            | 02:00:56                                 | 94                            | 6.56                                | 1.64
Throughput Tester | 00:17:58                            | 01:54:49                                 | 111                           | 2.43                                | 0.53
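One way to sanity-check the rate column is to divide the 2000 jobs by the time taken to submit them. This is an assumption about how the rate could be derived, not a description of how it was measured. The sketch below reproduces the Throughput Tester value of 111 but gives about 61 for Condor-G, so the reported Condor-G value of 94 is not explained by this formula alone.

```python
def jobs_per_minute(jobs: int, hms: str) -> float:
    """Average submission rate, given a job count and an hh:mm:ss duration."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return jobs / (total_seconds / 60)


if __name__ == "__main__":
    # Submit times copied from the results table above.
    print(round(jobs_per_minute(2000, "00:17:58")))  # 111 (Throughput Tester)
    print(round(jobs_per_minute(2000, "00:32:34")))  # 61 (Condor-G)
```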