Table of Contents
- Bug 108: Fork perl zombies
- Bug 106: Fix test failures with SGE LRM adapter
- Bug 105: Held Condor jobs should be reported as SUSPENDED
- Bug 104: globus-job-manager-event-generator loads all historical events the first time run
- Bug 103: Ease two phase end commit timeout
- Bug 102: Fix Two Phase Commit Semantics for Failed Jobs
- Bug 100: one of the RSL parameters is not supported error doesn't indicate which it is
- Bug 99: Add a high-level diagram for the approach doc
- Bug 98: Add Condor-G doc for using GRAM 2 and 5
- Bug 96: GRAM-106 SGE LRM mishandles invalid environment definition
- Bug 95: GRAM-106 SGE LRM doesn't check for executable permissions
- Bug 94: GRAM-106 SGE LRM doesn't check for executable existence
- Bug 93: GRAM-106 SGE LRM script doesn't handle environment vars with whitespace
- Bug 92: stdout to local file doesn't work if count >1
- Bug 88: Missed two phase commit causes job to not be destroyed
- Bug 86: Prioritize script invocations to improve throughput
- Bug 80: GRAM 5 beta2 release
- Bug 79: Add support for OSG's "NFS Lite" concept
- Bug 77: GRAM zombie
- Bug 71: GRAM protocol test package contains expired test certificate
- Bug 70: globus-job-status acts strange for completed jobs in GRAM5
- Bug 69: globus-job-get-output -f doesn't work in GRAM5
- Bug 68: Bad error when proxy is too short-lived
- Bug 54: make globus-job-manager-event-generator not require configuration by default
- Bug 53: Generalize log path configuration
- Bug 51: configurable control of number of perl scripts that can run simultaneously
- Bug 47: simplify the throughput tester program and use improved version as doc
- Bug 24: Debug/verbose flags for globusrun, globus-job-run
- Bug 23: Improved error codes and error reporting for users
- Bug 22: client connections can't be timed out
- Bug 15: transition from httpg to https
- Bug 14: increase availability of GRAM in linux distributions
- Bug 12: Gatekeeper's syslog output cannot be controlled
- Bug 5: Add gram-level prologue and epilogue script execution for mpi jobs
- Bug 4: Add support for a "managed fork" service
- Bug 2: Investigate how to setup GRAM5 services in a HA setup
- All Fixes: This link lists all 67 of the improvements and bug fixes that were done for the GT 5.0.0 release
The following experiments were run on the nomer.mcs.anl.gov virtual cluster. The cluster consists of 6 partitions, each having a single Intel(R) Xeon(R) CPU E5430 @ 2.66GHz core and 2GB RAM. The virtual machines in the cluster each had a single virtual network interface. The cluster was configured as follows:
- 1 node: Master node
- 5 nodes: Test/execution nodes
All nodes ran an apache2 http server, gmond (Ganglia monitoring), and pbs_mom (Torque LRM).
The master node also ran a globus-gatekeeper, globus-gridftp-server, globus-job-manager-event-generator, gmetad (Ganglia Meta Daemon), pbs_sched (Torque LRM scheduler), pbs_server (Torque LRM server), and nfsd (the Linux kernel NFSv4 server) for the execution nodes.
The test nodes (1 for single-client tests, 5 for multiple client tests) ran the gram-throughput-tester program. All 5 test/execution nodes executed the test job executables as scheduled by the LRM.
These tests were done with GT 5.0.0 beta1. The throughput tester was compiled from CVS trunk October 30, 2009.
The first set of tests submitted jobs to the GRAM5 service for one hour. After one hour elapsed, the client would terminate any jobs which were being managed by the GRAM5 service. The test client recorded the time of the experiment start, the time of the termination start, and the time after which all jobs had reached a DONE or FAILED state.
The throughput test client would generate a maximum of 50 simultaneous job requests. For all but the uncapped test below, a maximum of 2000 jobs were managed by the job manager at any given time (pending or active). Once the client had submitted 2000 jobs, it would stop submitting until a job completed. A total of 10 execution slots were available for the LRM to schedule jobs into, so many jobs remained in the GRAM5 PENDING state for the duration of the test.
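The submission limits described above can be sketched as a simple admission check. This is only an illustrative model, not the gram-throughput-tester source: the names `ClientState` and `may_submit` are hypothetical, and counting in-flight requests against the 2000-job cap is an assumption made so the cap is never overshot.

```python
from dataclasses import dataclass

MAX_IN_FLIGHT_REQUESTS = 50  # simultaneous job requests (from the text)
MAX_MANAGED_JOBS = 2000      # cap on pending + active jobs (from the text)


@dataclass
class ClientState:
    """Hypothetical bookkeeping for one throughput test client."""
    in_flight_requests: int = 0  # requests sent but not yet acknowledged
    managed_jobs: int = 0        # jobs the job manager tracks (pending or active)


def may_submit(state, max_managed=MAX_MANAGED_JOBS):
    """Return True if the client may start another job request.

    Pass max_managed=None to model the uncapped experiment.
    """
    if state.in_flight_requests >= MAX_IN_FLIGHT_REQUESTS:
        return False
    if max_managed is None:
        return True
    # Assumption: in-flight requests count toward the cap.
    return state.managed_jobs + state.in_flight_requests < max_managed
```

With the cap in place, the client resumes submitting only after a job completes and `managed_jobs` drops below the limit, which matches the behavior described for the capped experiments.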
The different gram-throughput-tester experiments consisted of:
Table 1. GRAM5 Throughput Tester Experiments
| Experiment Name | LRM monitoring method | Number of clients | Number of users | Maximum number of simultaneous jobs / client |
|---|---|---|---|---|
| 1-client-poll | POLL | 1 | 1 | 2000 |
| 1-client-seg | SEG | 1 | 1 | 2000 |
| 1-client-seg-uncapped | SEG | 1 | 1 | unlimited |
| 5-client-seg | SEG | 5 | 1 | 2000 |
| 5-client-seg-diffusers | SEG | 5 | 5 | 2000 |
The following table contains a summary of the results of these experiments. The columns contain the following information:
- Experiment
- Experiment name, the same as in the previous section
- Total Jobs
- The total number of GRAM jobs that were submitted to the GRAM5 service by the throughput tester in one hour.
- Termination Tasks After 1 Hour
- The total number of jobs that were being managed by the GRAM5 service when the one hour test period elapsed. These jobs were terminated using the GRAM5 cancel protocol message by the throughput tester.
- Termination Duration (hh:mm:ss)
- The amount of time it took for the termination tasks to complete and all jobs to reach the DONE or FAILED state.
- Master Node Max 1 min. Load Average
- The maximum value of the one-minute load average on the master node, that is, the node running the GRAM5 and Torque service.
- Master Node Average 1 min. Load Average
- The average value of the one-minute load average on the master node over the duration of the test.
- Errors
- Description of any errors which occurred on the client side that prevented operations from completing as expected.
Table 2. GRAM5 Throughput Tester Results Summary
| Experiment | Total Jobs | Termination Tasks After 1 Hour | Termination Duration (hh:mm:ss) | Master Node Max 1 min. Load Average | Master Node Average 1 min. Load Average | Errors |
|---|---|---|---|---|---|---|
| 1-client-poll | 2110 | 2000 | 00:14:18 | 11.46 | 7.96 | None |
| 1-client-seg | 2110 | 2000 | 00:10:36 | 2.82 | 0.93 | None |
| 1-client-seg-uncapped | 6664 | 6584 | 00:42:46 | 3.26 | 2.75 | None |
| 5-client-seg | 6800 | 6434 | 00:57:20 | 3.19 | 2.49 | Connection refused during termination |
| 5-client-seg-diffusers | 7226 | 6720 | 00:45:41 | 3.79 | 3.13 | None |
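To compare termination performance across experiments with different numbers of termination tasks, the figures in Table 2 can be converted to an average termination rate. The sketch below is plain arithmetic on the table values; the function names are illustrative, and the result is only an average over the whole termination phase, not a measured instantaneous rate.

```python
def hms_to_seconds(hms):
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def termination_rate_per_minute(tasks, duration):
    """Average number of jobs terminated per minute."""
    return tasks / (hms_to_seconds(duration) / 60)


# (termination tasks after 1 hour, termination duration) from Table 2
RESULTS = {
    "1-client-poll": (2000, "00:14:18"),
    "1-client-seg": (2000, "00:10:36"),
    "1-client-seg-uncapped": (6584, "00:42:46"),
    "5-client-seg": (6434, "00:57:20"),
    "5-client-seg-diffusers": (6720, "00:45:41"),
}

for name, (tasks, duration) in RESULTS.items():
    rate = termination_rate_per_minute(tasks, duration)
    print(f"{name}: {rate:.1f} jobs/minute")
```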
The following experiments were run on the nomer.mcs.anl.gov virtual cluster. The cluster consists of 6 partitions, each having a single Intel(R) Xeon(R) CPU E5430 @ 2.66GHz core and 2GB RAM. The virtual machines in the cluster each had a single virtual network interface. The cluster was configured as follows:
- 1 node: Master node
- 1 node: Test/execution node
- 4 nodes: execution nodes
All nodes ran an apache2 http server, gmond (Ganglia monitoring), and pbs_mom (Torque LRM).
The master node also ran a globus-gatekeeper, globus-gridftp-server, globus-job-manager-event-generator, gmetad (Ganglia Meta Daemon), pbs_sched (Torque LRM scheduler), pbs_server (Torque LRM server), and nfsd (the Linux kernel NFSv4 server) for the execution nodes.
The test/execution node ran the Condor-G daemons. The Condor job classified ad included attributes to submit the jobs to the GRAM5 service running on the master node. The tests were done with Condor version 7.4.1. The Condor-G configuration parameters in the condor_config file were as follows:
Figure 1. Condor-G Experiment Configuration
    GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE=50
    GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE=2000
    GRIDMANAGER_MAX_PENDING_REQUESTS=50
    GRIDMANAGER_JOB_PROBE_INTERVAL=300
    GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE=0
    ENABLE_GRID_MONITOR=FALSE
    GRIDMANAGER_DEBUG=D_FULLDEBUG
    GRIDMANAGER_GLOBUS_COMMIT_TIMEOUT=12000
The execution nodes executed the test job executables as scheduled by the LRM.
For this test, the test/execution node and the execution nodes were configured to run up to 20 job processes each simultaneously.
This test submitted a 2000-job Condor job cluster, using the following classified ad:
Figure 2. Condor-G Classified Ad
    Universe=grid
    grid_resource = gt5 nomer1.mcs.anl.gov:2119/jobmanager-pbs
    executable=/bin/sleep
    arguments=300
    transfer_executable=False
    stream_output = False
    stream_error = False
    output = test.out.$(Process)
    error = test.err.$(Process)
    log = test.log
    notification=Never
    queue 2000
The configuration parameters are similar to those of the GRAM5 tests described in the GRAM5 Throughput Tests section. The key difference is that this test runs until all 2000 jobs have completed and does not submit any further jobs once the maximum of 2000 has been reached.
To provide a point of comparison, another test using similar parameters was run using the gram-throughput-tester program in place of Condor-G. Note that Condor-G provides file staging and a scratch directory, which the throughput tester jobs did not use.
The two experiments consist of:
Table 3. GRAM5 Condor-G Experiments
| Experiment Name | LRM monitoring method | Number of clients | Number of users | Total number of jobs |
|---|---|---|---|---|
| Condor-G | SEG | 1 | 1 | 2000 |
| Throughput Tester | SEG | 1 | 1 | 2000 |
The following table contains a summary of the results of these experiments. The columns contain the following information:
- Experiment
- Experiment name, the same as in the previous section
- Time to Submit 2000 Jobs
- The amount of time it took for all 2000 jobs to be submitted to the GRAM5 service.
- Time For All Jobs To Complete
- The amount of time it took for all 2000 jobs to complete. The theoretical minimum value for this is 50 minutes, assuming all 2000 jobs were submitted instantaneously and deployed with no overhead to 200 execution slots.
- LRM Submit Rate (Jobs/Minute)
- The rate at which jobs were submitted to the LRM, in jobs per minute.
- Master Node Max 1 min. Load Average
- The maximum value of the one-minute load average on the master node, that is, the node running the GRAM5 and Torque service.
- Master Node Average 1 min. Load Average
- The average value of the one-minute load average on the master node over the duration of the test.
Table 4. GRAM5 Condor-G Results Summary
| Experiment | Time to Submit 2000 Jobs (hh:mm:ss) | Time For All Jobs To Complete (hh:mm:ss) | LRM Submit Rate (Jobs/Minute) | Master Node Max 1 min. Load Average | Master Node Average 1 min. Load Average |
|---|---|---|---|---|---|
| Condor-G | 00:32:34 | 02:00:56 | 94 | 6.56 | 1.64 |
| Throughput Tester | 00:17:58 | 01:54:49 | 111 | 2.43 | 0.53 |
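The theoretical minimum completion time quoted in the column descriptions can be reproduced with a short calculation. The sketch below is a simplified model that assumes every job runs for exactly 300 seconds (the `sleep` argument in the classified ad), that all jobs are queued at time zero, and that there is no scheduling overhead. The function name and the `slots` parameter are illustrative. The 50-minute figure corresponds to 200 simultaneous slots; under the same model, five nodes running 20 job processes each (100 slots) would give 100 minutes.

```python
import math


def min_completion_minutes(total_jobs, job_seconds, slots):
    """Lower bound on completion time when jobs run in full waves."""
    waves = math.ceil(total_jobs / slots)  # each wave fills every slot once
    return waves * job_seconds / 60


print(min_completion_minutes(2000, 300, 200))  # 200 slots
print(min_completion_minutes(2000, 300, 100))  # 100 slots (5 nodes x 20)
```

Both measured completion times in the table (02:00:56 and 01:54:49) exceed either bound, as expected once submission time and LRM overhead are included.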