GT 5.0.0 GRAM5: Quality Profile


1. Test coverage reports

2. Code analysis reports

  • No code analysis reports have been generated at this time.

3. Outstanding bugs

  • Bug 108: Fork perl zombies
  • Bug 106: Fix test failures with SGE LRM adapter
  • Bug 105: Held Condor jobs should be reported as SUSPENDED
  • Bug 104: globus-job-manager-event-generator loads all historical events the first time run
  • Bug 103: Ease two phase end commit timeout
  • Bug 102: Fix Two Phase Commit Semantics for Failed Jobs
  • Bug 100: one of the RSL parameters is not supported error doesn't indicate which it is
  • Bug 99: Add a high-level diagram for the approach doc
  • Bug 98: Add Condor-G doc for using GRAM 2 and 5
  • Bug 96: GRAM-106 SGE LRM mishandles invalid environment definition
  • Bug 95: GRAM-106 SGE LRM doesn't check for executable permissions
  • Bug 94: GRAM-106 SGE LRM doesn't check for executable existence
  • Bug 93: GRAM-106 SGE LRM script doesn't handle environment vars with whitespace
  • Bug 92: stdout to local file doesn't work if count >1
  • Bug 88: Missed two phase commit causes job to not be destroyed
  • Bug 86: Prioritize script invocations to improve throughput
  • Bug 80: GRAM 5 beta2 release
  • Bug 79: Add support for OSG's "NFS Lite" concept
  • Bug 77: GRAM zombie
  • Bug 71: GRAM protocol test package contains expired test certificate
  • Bug 70: globus-job-status acts strange for completed jobs in GRAM5
  • Bug 69: globus-job-get-output -f doesn't work in GRAM5
  • Bug 68: Bad error when proxy is too short-lived
  • Bug 54: make globus-job-manager-event-generator not require configuration by default
  • Bug 53: Generalize log path configuration
  • Bug 51: configurable control of number of perl scripts that can run simultaneously
  • Bug 47: simplify the throughput tester program and use improved version as doc
  • Bug 24: Debug/verbose flags for globusrun, globus-job-run
  • Bug 23: Improved error codes and error reporting for users
  • Bug 22: client connections can't be timed out
  • Bug 15: transition from httpg to https
  • Bug 14: increase availability of GRAM in linux distributions
  • Bug 12: Gatekeeper's syslog output cannot be controlled
  • Bug 5: Add gram-level prologue and epilogue script execution for mpi jobs
  • Bug 4: Add support for a "managed fork" service
  • Bug 2: Investigate how to setup GRAM5 services in a HA setup

4. Bug Fixes

  • All Fixes: This link lists all 67 of the improvements and bug fixes that were done for the GT 5.0.0 release

5. GRAM5 Throughput Tests

5.1. Experiment Hardware and Software Configuration

The following experiments were run on the nomer.mcs.anl.gov virtual cluster. The cluster consists of 6 partitions, each having a single Intel(R) Xeon(R) CPU E5430 @ 2.66GHz core, and 2GB RAM. The virtual machines in the cluster each had a single virtual network interface. The cluster was configured as follows:

  • 1 node: Master node
  • 5 nodes: Test/execution nodes

All nodes ran an apache2 HTTP server, gmond (Ganglia monitoring), and pbs_mom (Torque LRM).

The master node also ran a globus-gatekeeper, globus-gridftp-server, globus-job-manager-event-generator, gmetad (Ganglia Meta Daemon), pbs_sched (Torque LRM scheduler), pbs_server (Torque LRM server), and nfsd (the Linux kernel NFSv4 server) for the execution nodes.

The test nodes (1 for single-client tests, 5 for multiple client tests) ran the gram-throughput-tester program. All 5 test/execution nodes executed the test job executables as scheduled by the LRM.

These tests were done with GT 5.0.0 beta1. The throughput tester was compiled from CVS trunk October 30, 2009.

5.2. Experiment Scenarios

The first set of tests submitted jobs to the GRAM5 service for one hour. After one hour elapsed, the client would terminate any jobs which were being managed by the GRAM5 service. The test client recorded the time of the experiment start, the time of the termination start, and the time after which all jobs had reached a DONE or FAILED state.

The throughput test client would generate a maximum of 50 simultaneous job requests. For all but the uncapped test below, a maximum of 2000 jobs were managed by the job manager at any given time (pending or active). Once the client had submitted 2000 jobs, it would stop submitting until a job completed. A total of 10 execution slots were available for the LRM to schedule jobs into, so many jobs remained in the GRAM5 PENDING state for the duration of the test.
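The client-side throttling described above can be pictured with a small sketch. This is not the gram-throughput-tester source: the function name may_submit and its parameters are hypothetical, and only the two limits (50 simultaneous requests, 2000 managed jobs) come from the text.

```python
from typing import Optional

MAX_SIMULTANEOUS_REQUESTS = 50  # from the text: at most 50 job requests at once
MAX_MANAGED_JOBS = 2000         # from the text: cap on pending + active jobs


def may_submit(in_flight_requests: int, managed_jobs: int,
               job_cap: Optional[int] = MAX_MANAGED_JOBS) -> bool:
    """Return True if a hypothetical client may issue another job request.

    in_flight_requests: submit requests sent but not yet answered.
    managed_jobs: jobs submitted and not yet finished (pending or active).
    job_cap: None models the uncapped experiment.
    """
    if in_flight_requests >= MAX_SIMULTANEOUS_REQUESTS:
        return False
    if job_cap is not None and managed_jobs >= job_cap:
        return False
    return True
```

Under these assumptions, a capped client holding 2000 managed jobs waits for a completion before submitting again, while an uncapped client is limited only by the 50-request ceiling.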

The different gram-throughput-tester experiments consisted of:

Table 1. GRAM5 Throughput Tester Experiments

Experiment Name | LRM monitoring method | Number of clients | Number of users | Maximum number of simultaneous jobs / client
1-client-poll | POLL | 1 | 1 | 2000
1-client-seg | SEG | 1 | 1 | 2000
1-client-seg-uncapped | SEG | 1 | 1 | unlimited
5-client-seg | SEG | 5 | 1 | 2000
5-client-seg-diffusers | SEG | 5 | 5 | 2000

5.3. Throughput Test Results

The following table contains a summary of the results of these experiments. The columns contain the following information:

Experiment
Experiment name, the same as in the previous section
Total Jobs
The total number of GRAM jobs that were submitted to the GRAM5 service by the throughput tester in one hour.
Termination Tasks After 1 Hour
The total number of jobs that were being managed by the GRAM5 service when the one hour test period elapsed. These jobs were terminated using the GRAM5 cancel protocol message by the throughput tester.
Termination Duration (hh:mm:ss)
The amount of time it took for the termination tasks to complete and all jobs to reach the DONE or FAILED state.
Master Node Max 1 min. Load Average
The maximum value of the one-minute load average on the master node, that is, the node running the GRAM5 and Torque services.
Master Node Average 1 min. Load Average
The average value of the one-minute load average on the master node over the duration of the test.
Errors
Description of any errors which occurred on the client side that prevented operations from completing as expected.

Table 2. GRAM5 Throughput Tester Results Summary

Experiment | Total Jobs | Termination Tasks After 1 Hour | Termination Duration (hh:mm:ss) | Master Node Max 1 min. Load Average | Master Node Average 1 min. Load Average | Errors
1-client-poll | 2110 | 2000 | 00:14:18 | 11.46 | 7.96 | None
1-client-seg | 2110 | 2000 | 00:10:36 | 2.82 | 0.93 | None
1-client-seg-uncapped | 6664 | 6584 | 00:42:46 | 3.26 | 2.75 | None
5-client-seg | 6800 | 6434 | 00:57:20 | 3.19 | 2.49 | Connection refused during termination
5-client-seg-diffusers | 7226 | 6720 | 00:45:41 | 3.79 | 3.13 | None
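For comparing the experiments, the totals in Table 2 can be converted into per-minute rates. The following is a minimal sketch of that arithmetic. The helper names are illustrative, and the 60-minute divisor assumes submissions ran for exactly the one-hour test period described above.

```python
# Rows copied from Table 2:
# (experiment, total jobs, termination tasks, termination duration hh:mm:ss)
RESULTS = [
    ("1-client-poll", 2110, 2000, "00:14:18"),
    ("1-client-seg", 2110, 2000, "00:10:36"),
    ("1-client-seg-uncapped", 6664, 6584, "00:42:46"),
    ("5-client-seg", 6800, 6434, "00:57:20"),
    ("5-client-seg-diffusers", 7226, 6720, "00:45:41"),
]

TEST_MINUTES = 60  # assumption: the submission phase lasted exactly one hour


def to_seconds(hhmmss: str) -> int:
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def submit_rate(total_jobs: int) -> float:
    """Average jobs submitted per minute over the test period."""
    return total_jobs / TEST_MINUTES


def termination_rate(tasks: int, duration: str) -> float:
    """Average jobs terminated per minute during the termination phase."""
    return tasks / (to_seconds(duration) / 60)


for name, total, tasks, duration in RESULTS:
    print(f"{name:24s} {submit_rate(total):6.1f} jobs/min submitted, "
          f"{termination_rate(tasks, duration):6.1f} jobs/min terminated")
```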

6. GRAM5 Condor-G Tests

6.1. Experiment Hardware and Software Configuration

The following experiments were run on the nomer.mcs.anl.gov virtual cluster. The cluster consists of 6 partitions, each having a single Intel(R) Xeon(R) CPU E5430 @ 2.66GHz core, and 2GB RAM. The virtual machines in the cluster each had a single virtual network interface. The cluster was configured as follows:

  • 1 node: Master node
  • 1 node: Test/execution node
  • 4 nodes: Execution nodes

All nodes ran an apache2 HTTP server, gmond (Ganglia monitoring), and pbs_mom (Torque LRM).

The master node also ran a globus-gatekeeper, globus-gridftp-server, globus-job-manager-event-generator, gmetad (Ganglia Meta Daemon), pbs_sched (Torque LRM scheduler), pbs_server (Torque LRM server), and nfsd (the Linux kernel NFSv4 server) for the execution nodes.

The test/execution node ran the Condor-G daemons. The Condor job classified ad included attributes to submit the jobs to the GRAM5 service running on the master node. The tests were done with Condor version 7.4.1. The Condor-G configuration parameters in the condor_config file were as follows:

Figure 1. Condor-G Experiment Configuration

GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE=50
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE=2000
GRIDMANAGER_MAX_PENDING_REQUESTS=50
GRIDMANAGER_JOB_PROBE_INTERVAL=300
GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE=0
ENABLE_GRID_MONITOR=FALSE
GRIDMANAGER_DEBUG= D_FULLDEBUG
GRIDMANAGER_GLOBUS_COMMIT_TIMEOUT=12000

The execution nodes executed the test job executables as scheduled by the LRM.

For this test, the test/execution node and the execution nodes were configured to run up to 20 job processes each simultaneously.

6.2. Experiment Scenario

This test submitted a Condor job cluster of 2000 jobs, using the following classified ad:

Figure 2. Condor-G Classified Ad

Universe=grid
grid_resource = gt5 nomer1.mcs.anl.gov:2119/jobmanager-pbs
executable=/bin/sleep
arguments=300
transfer_executable=False
stream_output = False
stream_error  = False
output = test.out.$(Process)
error  = test.err.$(Process)
log    = test.log
notification=Never
queue 2000

The configuration parameters are similar to those of the GRAM5 tests described in the GRAM5 Throughput Tests section. The key difference is that this test runs until all 2000 jobs have completed and does not submit any further jobs once the maximum of 2000 has been reached.

To provide a point of comparison, another test using similar parameters was run using the gram-throughput-tester program in place of Condor-G. Note that the Condor-G service provides file staging and a scratch directory beyond what the throughput tester job did.

The two experiments consist of:

Table 3. GRAM5 Condor-G Experiments

Experiment Name | LRM monitoring method | Number of clients | Number of users | Total number of jobs
Condor-G | SEG | 1 | 1 | 2000
Throughput Tester | SEG | 1 | 1 | 2000

6.3. Condor-G Test Results

The following table contains a summary of the results of these experiments. The columns contain the following information:

Experiment
Experiment name, the same as in the previous section
Time to Submit 2000 Jobs
The amount of time it took to submit all 2000 jobs to the GRAM5 service.
Time For All Jobs To Complete
The amount of time it took for all 2000 jobs to complete. The theoretical minimum value for this is 50 minutes if all 2000 jobs were submitted instantaneously and there was no overhead for them to be deployed to the 200 execution nodes.
LRM Submit Rate (Jobs/Minute)
The rate, in jobs per minute, at which jobs were submitted to the LRM.
Master Node Max 1 min. Load Average
The maximum value of the one-minute load average on the master node, that is, the node running the GRAM5 and Torque services.
Master Node Average 1 min. Load Average
The average value of the one-minute load average on the master node over the duration of the test.
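The 50-minute theoretical minimum quoted above can be reproduced with a short calculation. This is a sketch under stated assumptions: the 300-second job length is taken from the arguments=300 line of the classified ad, the figure of 200 concurrently running jobs is the one used in the definition above, and the function name is illustrative.

```python
import math


def theoretical_minimum_minutes(total_jobs: int, job_seconds: int,
                                concurrent_jobs: int) -> float:
    """Lower bound on completion time with instant submission and no overhead.

    Jobs run in waves of `concurrent_jobs`; each wave lasts `job_seconds`.
    """
    waves = math.ceil(total_jobs / concurrent_jobs)
    return waves * job_seconds / 60


# 2000 jobs of 300 s each, 200 running at a time: 10 waves of 5 minutes.
print(theoretical_minimum_minutes(2000, 300, 200))  # prints 50.0
```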

Table 4. GRAM5 Condor-G Test Results Summary

Experiment | Time to Submit 2000 Jobs (hh:mm:ss) | Time For All Jobs To Complete (hh:mm:ss) | LRM Submit Rate (Jobs/Minute) | Master Node Max 1 min. Load Average | Master Node Average 1 min. Load Average
Condor-G | 00:32:34 | 02:00:56 | 94 | 6.56 | 1.64
Throughput Tester | 00:17:58 | 01:54:49 | 111 | 2.43 | 0.53