Table of Contents
- GRAM-2: Investigate how to setup GRAM5 services in a HA setup
- GRAM-4: Add support for a "managed fork" service
- GRAM-5: Add gram-level prologue and epilogue script execution for mpi jobs
- GRAM-12: Gatekeeper's syslog output cannot be controlled
- GRAM-15: transition from httpg to https
- GRAM-22: client connections can't be timed out
- GRAM-23: Improved error codes and error reporting for users
- GRAM-24: Debug/verbose flags for globusrun, globus-job-run
- GRAM-51: configurable control of number of perl scripts that can run simultaneously
- GRAM-53: Generalize log path configuration
- GRAM-79: Add support for OSG's "NFS Lite" concept
- GRAM-99: Add a high-level diagram for the approach doc
- GRAM-104: globus-job-manager-event-generator loads all historical events the first time run
- GRAM-105: Held Condor jobs should be reported as SUSPENDED
- GRAM-110: softenv extensions for GRAM5
- GRAM-119: improve the GRAM LRM adapter doc
- GRAM-122: tracking gram client software
- GRAM-135: Improve developer doc for a reliable client
- GRAM-138: GRAM5 job manager uses a lot of memory when SEG is pointed to incorrect log path
- GRAM-139: SEG may deadlock with threads
- GRAM-149: GRAM5 Unix domain socket misbehaves on Snow Leopard
- GRAM-154: GASS Cache doesn't check for updates
- GRAM-159: GRAM5 Migration guide is outdated
- GRAM-163: improve error output for globusrun
- Bug 5621: gram2 credential refresh problems in 4.0.5
- Bug 1934: Gatekeeper's syslog output cannot be controlled
- Bug 2739: Gatekeeper AuthZ/Gridmap Callout result logging
- Bug 2741: catching SIGSEGV if dynamic loading of authorization modules fails
- Bug 4199: Patch pre-WS GRAM to use individual condor logs for jobs
- Bug 3795: jobmanager perl modules issues
- Bug 4235: globus-job-manager doesn't exit if the job fails.
- Bug 4730: MPI Jobs using Globus LSF in HP XC Cluster....
- Bug 4747: Need evaluation of patch to JobManager.pm
- Bug 4779: gram GT2 log files: timestamps are not ISO 8601 compatible
- Bug 5143: DONE state never reported for Condor jobs when using Condor-G grid monitor
- Bug 5429: stdin is lost when jobtype=multiple with jobmanager-lsf
- Bug 5554: GRAM2 4.0.5 setup-globus-job-manager-fork.pl silent failure
- Bug 5556: Audit directory setup instructions are insecure
- Bug 5775: gram status of old jobs incorrect on some lsf systems
- Bug 6184: pbs.pm jobmanager fails jobs on qstat failure
- Bug 6337: Cannot configure globus to use different certificate path than default
- Bug 6703: PBS scheduler adapter assumes that Globus is installed in the same location on the headnode of a cluster and on the work nodes.
- Bug 6768: Held Condor jobs should be reported as SUSPENDED by GRAM
- Bug 6815: Support standard install locations for globus-gram-protocol
- Bug 6819: Missing metatdata in globus-scheduler-event-generator
- Bug 6820: Support standard install locations for globus-gatekeeper
- Bug 6821: Support standard install locations for globus-gatekeeper-setup
- Bug 6822: Support standard install locations for globus-gram-job-manager-scripts
- Bug 6823: Support standard install locations for globus-gram-job-manager
- Bug 6824: Support standard install locations for globus-gram-job-manager-setup
- Bug 6825: Remove hardcoded paths in globus-gram-job-manager-setup-fork
- Bug 6826: Remove hardcoded paths in globus-gram-job-manager-setup-condor
- Bug 6840: The PBS job manager doesn't handle large environments well
- Bug 6855: Undefined variable in Makefiles
- Bug 6862: PBS job manager fails if job history is enabled
- Bug 6927: A Loadleveler LRM for GRAM5 should be very welcome
- Bug 720: allow gram client to detect the version of a gram server
- Bug 851: Add "cleanup" RSL attribute for cleaning up a job submission
- Bug 5536: Missing dependency in package globus_gram_job_manager_auditing
- Bug 5537: Missing dependency in package globus_gram_job_manager_auditing
- Bug 3373: globus removes the temporary job directory before pbs writes back into it
- Bug 5200: GRAM (pre-webservices) from OSG 0.6.0 (VDT 1.6.1) has bad syslog format
- Bug 5207: GRAM SoftEnv extension bug
- Bug 5250: Does not support mpi jobtype of RSL script
- Bug 5272: Invalid parsing of RSL file
- GRAM-130: Individual Condor Logs per Job
- GRAM-136: Error message not precise when disk quota is exceeded
- GRAM-146: GRAM5 Usage stats ignores GLOBUS_USAGE_TARGETS environment
- GRAM-155: Leak in file_clean_up
- GRAM-156: job-failure-code not provided by jobmanager in response to status query
- GRAM-157: RSL substitutions not updated on a restart request
- GRAM-158: Remove references to setup-seg-job-manager.pl from documentation
- GRAM-160: double free in job manager
- GRAM-161: Don't count two phase commit time until state is sent
- GRAM-164: Invalid RSL value error doesn't indicate what values are valid
- GRAM-167: SGE LRM doesn't interpret path to stdout and stderr relative to directory
- GRAM-168: zombie job manager processes on Ranger SGE
- GRAM-169: Error messages in globus-job-get-output
- GRAM-170: globus-job-submit fails
- GRAM-171: Simplify debugging of script errors
- GRAM-172: Job Manager doesn't exit when proxy expires if jobs are present
- GRAM-173: SGE LRM doesn't set exit code for multiple jobs
The following experiments were run on the nomer.mcs.anl.gov virtual cluster. The cluster consists of 6 partitions, each having a single Intel(R) Xeon(R) CPU E5430 @ 2.66GHz core and 2GB of RAM. The virtual machines in the cluster each had a single virtual network interface. The cluster was configured as follows:
- 1 node: Master node
- 5 nodes: Test/execution nodes
All nodes ran an apache2 HTTP server, gmond (Ganglia monitoring), and pbs_mom (Torque LRM).
The master node also ran a globus-gatekeeper, globus-gridftp-server, globus-job-manager-event-generator, gmetad (Ganglia Meta Daemon), pbs_sched (Torque LRM scheduler), pbs_server (Torque LRM server), and nfsd (the Linux kernel NFSv4 server) for the execution nodes.
The test nodes (1 for single-client tests, 5 for multiple-client tests) ran the gram-throughput-tester program. All 5 test/execution nodes executed the test job executables as scheduled by the LRM.
These tests were done with GT 5.0.2 beta1. The throughput tester was compiled from the CVS trunk as of October 30, 2009.
The first set of tests submitted jobs to the GRAM5 service for one hour. After one hour elapsed, the client would terminate any jobs which were being managed by the GRAM5 service. The test client recorded the time of the experiment start, the time of the termination start, and the time after which all jobs had reached a DONE or FAILED state.
The throughput test client generated at most 50 simultaneous job requests. For all but the uncapped test below, at most 2000 jobs (pending or active) were managed by the job manager at any given time: once the client had submitted 2000 jobs, it stopped submitting until a job completed. A total of 10 execution slots were available for the LRM to schedule jobs into, so many jobs were in the GRAM5 PENDING state for the duration of the test.
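The two client-side limits described above can be modelled with a small sketch. This is not the gram-throughput-tester source; the class name `SubmissionThrottle` and its methods are hypothetical, and the sketch assumes that a request counts against the 2000-job cap from the moment it is sent.

```python
from dataclasses import dataclass


@dataclass
class SubmissionThrottle:
    """Toy model of the two limits: 50 requests in flight, 2000 managed jobs."""
    max_pending_requests: int = 50   # simultaneous job requests in flight
    max_managed_jobs: int = 2000     # jobs (pending or active) held by the job manager
    pending_requests: int = 0
    managed_jobs: int = 0

    def can_submit(self) -> bool:
        # Assumption: an in-flight request already counts against the job cap.
        below_request_cap = self.pending_requests < self.max_pending_requests
        below_job_cap = (self.managed_jobs + self.pending_requests
                         < self.max_managed_jobs)
        return below_request_cap and below_job_cap

    def submit(self) -> bool:
        """Send one job request if both limits allow it."""
        if not self.can_submit():
            return False
        self.pending_requests += 1
        return True

    def request_accepted(self) -> None:
        """The service answered a request; the job is now managed."""
        self.pending_requests -= 1
        self.managed_jobs += 1

    def job_finished(self) -> None:
        """A managed job reached DONE or FAILED, freeing room under the cap."""
        self.managed_jobs -= 1
```

In this model the client can send 50 requests before it has to wait for replies, and once 2000 jobs are managed `can_submit()` stays false until `job_finished()` is called.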
The different gram-throughput-tester experiments consisted of:
Table 1. GRAM5 Throughput Tester Experiments
| Experiment Name | LRM monitoring method | Number of clients | Number of users | Maximum number of simultaneous jobs / client |
|---|---|---|---|---|
| 1-client-poll | POLL | 1 | 1 | 2000 |
| 1-client-seg | SEG | 1 | 1 | 2000 |
| 1-client-seg-uncapped | SEG | 1 | 1 | unlimited |
| 5-client-seg | SEG | 5 | 1 | 2000 |
| 5-client-seg-diffusers | SEG | 5 | 5 | 2000 |
The following table contains a summary of the results of these experiments. The columns contain the following information:
- Experiment
- Experiment name, the same as in the previous section
- Total Jobs
- The total number of GRAM jobs that were submitted to the GRAM5 service by the throughput tester in one hour.
- Termination Tasks After 1 Hour
- The total number of jobs that were being managed by the GRAM5 service when the one hour test period elapsed. These jobs were terminated using the GRAM5 cancel protocol message by the throughput tester.
- Termination Duration (hh:mm:ss)
- The amount of time it took for the termination tasks to complete and all jobs to reach the DONE or FAILED state.
- Master Node Max 1 min. Load Average
- The maximum value of the one-minute load average on the master node, that is, the node running the GRAM5 and Torque service.
- Master Node Average 1 min. Load Average
- The average value of the one-minute load average on the master node over the duration of the test.
- Errors
- Description of any errors which occurred on the client side that prevented operations from completing as expected.
Table 2. GRAM5 Throughput Tester Results Summary
| Experiment | Total Jobs | Termination Tasks After 1 Hour | Termination Duration (hh:mm:ss) | Master Node Max 1 min. Load Average | Master Node Average 1 min. Load Average | Errors |
|---|---|---|---|---|---|---|
| 1-client-poll | 2110 | 2000 | 00:14:18 | 11.46 | 7.96 | None |
| 1-client-seg | 2110 | 2000 | 00:10:36 | 2.82 | 0.93 | None |
| 1-client-seg-uncapped | 6664 | 6584 | 00:42:46 | 3.26 | 2.75 | None |
| 5-client-seg | 6800 | 6434 | 00:57:20 | 3.19 | 2.49 | Connection refused during termination |
| 5-client-seg-diffusers | 7226 | 6720 | 00:45:41 | 3.79 | 3.13 | None |
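As a rough aid for comparing the rows of Table 2, the termination duration can be converted into an average termination rate. The sketch below only re-derives numbers from the table; the helper names `hms_to_seconds` and `termination_rate_per_minute` are illustrative, and the rate is a plain average that assumes nothing about how cancellations were distributed over time.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def termination_rate_per_minute(tasks: int, duration: str) -> float:
    """Average number of termination tasks completed per minute."""
    return tasks / hms_to_seconds(duration) * 60


# (experiment, termination tasks, termination duration) copied from Table 2
TABLE_2 = [
    ("1-client-poll", 2000, "00:14:18"),
    ("1-client-seg", 2000, "00:10:36"),
    ("1-client-seg-uncapped", 6584, "00:42:46"),
    ("5-client-seg", 6434, "00:57:20"),
    ("5-client-seg-diffusers", 6720, "00:45:41"),
]

for name, tasks, duration in TABLE_2:
    rate = termination_rate_per_minute(tasks, duration)
    print(f"{name:24s} {rate:7.1f} jobs/min")
```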
The following experiments were run on the nomer.mcs.anl.gov virtual cluster. The cluster consists of 6 partitions, each having a single Intel(R) Xeon(R) CPU E5430 @ 2.66GHz core and 2GB of RAM. The virtual machines in the cluster each had a single virtual network interface. The cluster was configured as follows:
- 1 node: Master node
- 1 node: Test/execution node
- 4 nodes: Execution nodes
All nodes ran an apache2 HTTP server, gmond (Ganglia monitoring), and pbs_mom (Torque LRM).
The master node also ran a globus-gatekeeper, globus-gridftp-server, globus-job-manager-event-generator, gmetad (Ganglia Meta Daemon), pbs_sched (Torque LRM scheduler), pbs_server (Torque LRM server), and nfsd (the Linux kernel NFSv4 server) for the execution nodes.
The test/execution node ran the Condor-G daemons. The Condor job classified ad included attributes to submit the jobs to the GRAM5 service running on the master node. The tests were done with Condor version 7.4.1. The Condor-G configuration parameters in the condor_config file were as follows:
Figure 1. Condor-G Experiment Configuration
    GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE=50
    GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE=2000
    GRIDMANAGER_MAX_PENDING_REQUESTS=50
    GRIDMANAGER_JOB_PROBE_INTERVAL=300
    GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE=0
    ENABLE_GRID_MONITOR=FALSE
    GRIDMANAGER_DEBUG=D_FULLDEBUG
    GRIDMANAGER_GLOBUS_COMMIT_TIMEOUT=12000
The execution nodes executed the test job executables as scheduled by the LRM.
For this test, the test/execution node and the execution nodes were each configured to run up to 20 job processes simultaneously.
This test submitted a Condor job cluster of 2000 jobs, using the following classified ad:
Figure 2. Condor-G Classified Ad
    Universe=grid
    grid_resource = gt5 nomer1.mcs.anl.gov:2119/jobmanager-pbs
    executable=/bin/sleep
    arguments=300
    transfer_executable=False
    stream_output = False
    stream_error = False
    output = test.out.$(Process)
    error = test.err.$(Process)
    log = test.log
    notification=Never
    queue 2000
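For repeating the experiment with a different job count or sleep time, the submit description in Figure 2 could be generated rather than edited by hand. The following is a minimal sketch; the function `condor_g_submit_description` and its parameters are hypothetical helpers, and the output only reproduces the attributes shown in Figure 2.

```python
def condor_g_submit_description(gatekeeper_contact: str, job_count: int,
                                sleep_seconds: int) -> str:
    """Build a Condor-G submit description mirroring Figure 2."""
    lines = [
        "universe = grid",
        f"grid_resource = gt5 {gatekeeper_contact}",
        "executable = /bin/sleep",
        f"arguments = {sleep_seconds}",
        "transfer_executable = False",
        "stream_output = False",
        "stream_error = False",
        # $(Process) is expanded by Condor to the job's index in the cluster
        "output = test.out.$(Process)",
        "error = test.err.$(Process)",
        "log = test.log",
        "notification = Never",
        f"queue {job_count}",
    ]
    return "\n".join(lines) + "\n"


# Values taken from Figure 2
description = condor_g_submit_description(
    "nomer1.mcs.anl.gov:2119/jobmanager-pbs", job_count=2000, sleep_seconds=300)
print(description)
```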
The configuration parameters are similar to those of the GRAM5 throughput tests described above. The key difference is that this test runs until all 2000 jobs have completed and does not submit any further jobs once the maximum of 2000 has been reached.
To provide a point of comparison, another test using similar parameters was run using the gram-throughput-tester program in place of Condor-G. Note that the Condor-G jobs involve file staging and a scratch directory, which the throughput tester jobs did not.
The two experiments consist of:
Table 3. GRAM5 Condor-G Experiments
| Experiment Name | LRM monitoring method | Number of clients | Number of users | Total number of jobs |
|---|---|---|---|---|
| Condor-G | SEG | 1 | 1 | 2000 |
| Throughput Tester | SEG | 1 | 1 | 2000 |
The following table contains a summary of the results of these experiments. The columns contain the following information:
- Experiment
- Experiment name, the same as in the previous section
- Time to Submit 2000 Jobs
- The amount of time it took for all 2000 jobs to be submitted.
- Time For All Jobs To Complete
- The amount of time it took for all 2000 jobs to complete. The theoretical minimum value for this is 50 minutes, assuming all 2000 jobs were submitted instantaneously and there was no overhead in deploying them to the 200 execution slots.
- LRM Submit Rate (Jobs/Minute)
- The rate at which jobs were submitted to the LRM, in jobs per minute.
- Master Node Max 1 min. Load Average
- The maximum value of the one-minute load average on the master node, that is, the node running the GRAM5 and Torque service.
- Master Node Average 1 min. Load Average
- The average value of the one-minute load average on the master node over the duration of the test.
Table 4. GRAM5 Condor-G Results Summary
| Experiment | Time to Submit 2000 Jobs (hh:mm:ss) | Time For All Jobs To Complete (hh:mm:ss) | LRM Submit Rate (Jobs/Minute) | Master Node Max 1 min. Load Average | Master Node Average 1 min. Load Average |
|---|---|---|---|---|---|
| Condor-G | 00:32:34 | 02:00:56 | 94 | 6.56 | 1.64 |
| Throughput Tester | 00:17:58 | 01:54:49 | 111 | 2.43 | 0.53 |
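The theoretical minimum quoted in the column descriptions can be reproduced with a short calculation, and the observed completion times in Table 4 can be compared against it. The sketch below assumes each job runs for exactly 300 seconds (the `arguments=300` passed to `/bin/sleep` in Figure 2) and treats the slot count as a parameter; 200 is the figure given in the text, while 5 nodes running up to 20 processes each would give 100 slots and a 100-minute bound under the same formula. The function names are illustrative.

```python
import math


def hms_to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def min_completion_seconds(total_jobs: int, job_seconds: int, slots: int) -> int:
    """Lower bound: jobs run in full waves of `slots`, with zero overhead."""
    waves = math.ceil(total_jobs / slots)
    return waves * job_seconds


# Observed completion times copied from Table 4
OBSERVED = {
    "Condor-G": "02:00:56",
    "Throughput Tester": "01:54:49",
}

for slots in (200, 100):
    bound = min_completion_seconds(2000, 300, slots)
    print(f"{slots} slots: lower bound {bound / 60:.0f} minutes")
    for name, completed in OBSERVED.items():
        ratio = hms_to_seconds(completed) / bound
        print(f"  {name:18s} observed / bound = {ratio:.2f}")
```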