Introduction
This guide is intended to help a developer create compatible GRAM4 clients and alternate service implementations.
The key concepts for the GRAM component have not changed. Its purpose is still to provide the mechanisms to execute remote applications for a user. Given an XML job description, GRAM submits the job to a scheduling system such as PBS or Condor, or to a simple fork-based way of spawning processes, and monitors it until completion. More details can be found here:
Table of Contents
- 1. Before you begin
- 2. Selective Concepts
- 3. Scenarios
- 1. C
- 1.1. Loading the job description
- 1.2. Setting the security attributes
- 1.3. Creating the factory client handle
- 1.4. Querying for factory resource properties
- 1.5. Creating the notification consumer
- 1.6. Creating the job resource
- 1.7. Subscribing for job state notifications
- 1.8. Releasing any state holds (if necessary)
- 1.9. Destroying resources
- 1.10. Building a client
- 2. Java
- 2.1. Class imports
- 2.2. Loading the job description
- 2.3. Creating the factory service stub
- 2.4. Loading a proxy from a file
- 2.5. Setting stub security parameters
- 2.6. Querying for factory resource properties
- 2.7. Delegating credentials (if needed)
- 2.8. Creating the job resource
- 2.9. Creating the job service stub
- 2.10. Subscribing for job state notifications
- 2.11. Releasing any state holds (if necessary)
- 2.12. Destroying resources
- 4. Tutorials
- 5. Architecture and design overview
- 6. APIs
- 7. Services and WSDL
- 8. Debugging
- 9. Troubleshooting
- 10. Related Documentation
- 11. Internal Components
- Glossary
- Index
New features since 4.2.0
- New terminate method in the client-side GramJob API
- Improved job lifetime management for users and admins
- Added configuration for "default" Local Resource Managers
Other Standard Supported Features
- Remote job execution and management
- Uniform and flexible interface to batch scheduling systems
- File staging before and after job execution
- File / directory clean up after job execution (after file stage out)
- Service auditing for each submitted job
Deprecated Features
- With the addition of the new terminate method in the GramJob API, the destroy method is no longer necessary. For backward compatibility, the destroy method was left in the GramJob API, but it simply calls the terminate method. During the 4.2.x series, clients using the destroy method should change to instead use terminate. In GT 4.4, the plan is to remove the destroy method.
Tested platforms for GRAM4:
Linux
- Fedora Core 1 i686
- Fedora Core 3 i686
- Fedora Core 3 yup xeon
- RedHat 7.3 i686
- RedHat 9 x86
- Debian Sarge x86
- Debian 3.1 i686
Tested containers for GRAM4:
- Java WS Core container
Protocol changes since GRAM4 in the GT4.0 series:
- The Java WS Core Framework has been updated from the draft versions of the WSRF/WSN and WS Addressing specifications to the final versions WSRF 1.2, WSN 1.3 and WS Addressing 1.0. There is no backward compatibility between this version and any previous versions.
GRAM depends on the following GT components:
- Java WS Core
- Transport-Level Security
- Delegation Service
- RFT
- GridFTP
- MDS - internal libraries
- The XML::Parser Perl module is required http://search.cpan.org/~msergeant/XML-Parser/Parser.pm
Other scheduler adapters available for GT 4.2.1 release:
- SGE scheduler adapter interface
- IBM LoadLeveler (As of release 3.3.1). For more information see "What's new" in the LoadLeveler product documentation
- other batch schedulers... (where the GRAM scheduler interface has been implemented)
This chapter gives an overview and specifics about various topics and concepts of GRAM4.
A client can generate a job ID and pass it in the call to ManagedJobFactoryService.createManagedJob(). This ID has been the subject of misunderstandings. If you have doubts about it, please check the section Client-Side Generated Submission ID in the User's guide.
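The code excerpts later in this guide build the ID as the string "uuid:" followed by a UUID. The following is a minimal sketch of that convention using java.util.UUID from the JDK instead of the Axis UUIDGenFactory used in those excerpts. The class and method names are hypothetical and not part of the GRAM4 API.

```java
import java.util.UUID;

/** Hypothetical helper for client-side submission IDs; not part of the GRAM4 API. */
final class SubmissionIds {
    private static final String PREFIX = "uuid:";

    private SubmissionIds() {}

    /** Returns an ID of the form "uuid:<random UUID>", as in the excerpts in this guide. */
    static String newSubmissionId() {
        return PREFIX + UUID.randomUUID();
    }

    /** Returns true if the string has the "uuid:<UUID>" shape produced above. */
    static boolean isWellFormed(String id) {
        if (id == null || !id.startsWith(PREFIX)) {
            return false;
        }
        try {
            UUID.fromString(id.substring(PREFIX.length()));
            return true;
        } catch (IllegalArgumentException e) {
            // Also covers NumberFormatException, which is a subclass.
            return false;
        }
    }
}
```

A client would typically generate the ID once and keep it, so that the same value can be passed again if the create call has to be repeated.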
A client can provide a lifetime for a job in job submission. However, sometimes it's hard or impossible to estimate an appropriate lifetime, especially with factors beyond the user's knowledge, like queuing time in the remote local resource manager or load in the remote GT4 container. Please check the following links for information about job lifetime concepts in GRAM4 in GT 4.2:
Job Lifetime in the User's guide
A client can subscribe for job status notifications of a job in two ways:
- Subscribe in the call to ManagedJobFactoryService.createManagedJob(): GRAM4 creates the subscription resource for the client and returns its EPR as part of the response to the createManagedJob() call.
- Subscribe in a separate WS call after the call to ManagedJobFactoryService.createManagedJob() has returned.
The second approach has two disadvantages:
- It needs two WS calls instead of one.
- Notifications can be missed, because the subscribe call is made after the job has been created.
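The risk of missed notifications can be illustrated with a toy in-memory model. This is only a sketch of the ordering problem, not GRAM4 code: ToyJob and its methods are hypothetical, and the state names used in the example are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy model of a job that publishes state changes; not GRAM4 code. */
final class ToyJob {
    private final List<Consumer<String>> subscribers = new ArrayList<>();

    /** Models "subscribe on create": the listener is registered before any state is published. */
    static ToyJob createWithSubscription(Consumer<String> listener) {
        ToyJob job = new ToyJob();
        job.subscribe(listener);
        return job;
    }

    /** Models a separate subscribe call made after the job exists. */
    void subscribe(Consumer<String> listener) {
        subscribers.add(listener);
    }

    /** Delivers a state change to whoever is subscribed at this moment. */
    void publish(String state) {
        for (Consumer<String> subscriber : subscribers) {
            subscriber.accept(state);
        }
    }
}
```

If a state change is published between creating the job and calling subscribe(), the late listener never sees it, whereas a listener registered through createWithSubscription() sees every change.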
Job termination can happen for three reasons:
Processing errors: Any error that occurs during job processing. This can be an error when files are staged in, an invalid executable, an error of the job in the local resource manager, an error while staging files out, etc.
Job resource expiration: If the lifetime specified by a client expires, the job will be terminated.
Client cancellation: A client requests the termination of a job.
If a job is still running and not already fully processed, termination will cause the job to go through a series of cleanup steps in GRAM4 before the job-related data is destroyed. The cleanup steps performed depend on the job and the state it is in. In general this includes cancellation of a running job at the local resource manager and running fileCleanUp if so specified in the job description. Termination at the local resource manager, however, will only be performed if the job has not already finished executing. The same applies to fileCleanUp: if no fileCleanUp is specified in the job description, or if the job has already passed fileCleanUp when the termination request comes in, this step is skipped.
As in normal processing, errors may also happen in the cleanup phase. A user might, for example, want to know whether the cancellation of a running job at the local resource manager was successful, or whether the specified fileCleanUp was processed successfully.
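The two conditions described above can be written down as a small decision function. This is a sketch of the stated rules only: the class, enum, and parameter names are hypothetical, and GRAM4 may perform further cleanup steps that are not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of which cleanup steps a termination triggers; not GRAM4 code. */
final class CleanupPlan {
    enum Step { CANCEL_AT_LOCAL_RESOURCE_MANAGER, FILE_CLEAN_UP }

    private CleanupPlan() {}

    static List<Step> stepsFor(boolean finishedExecuting,
                               boolean fileCleanUpSpecified,
                               boolean fileCleanUpAlreadyPassed) {
        List<Step> steps = new ArrayList<>();
        // Cancellation is only attempted while the job has not finished executing.
        if (!finishedExecuting) {
            steps.add(Step.CANCEL_AT_LOCAL_RESOURCE_MANAGER);
        }
        // fileCleanUp runs only if it was requested and has not been passed yet.
        if (fileCleanUpSpecified && !fileCleanUpAlreadyPassed) {
            steps.add(Step.FILE_CLEAN_UP);
        }
        return steps;
    }
}
```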
This section focuses on the interface GRAM4 offers a client to terminate a job.
In GRAM4 in 4.2 a client can no longer request synchronous destruction of a job, as was possible in GRAM4 in the 4.0 series. The reason is that many concurrent destroy requests can cause the GT4 container to become unresponsive. Unlike destroy(), the call to terminate() is asynchronous: it returns quickly and does not block until all cleanup steps are done. The new termination method applies to both ManagedExecutableJobResources (MEJRs) and ManagedMultiJobResources (MMJRs) and replaces the destroy() method of GRAM4 in the 4.0 series.
The new termination method is supposed to cope with different scenarios for jobs in different states, which resulted in an interface that might not be intuitive at first glance. This section explains parameters, return value and faults of the call to terminate() and the implications for jobs in various states.
The terminate() method is defined as follows:
public TerminateOutputType terminate(
    TerminateInputType parameters)
    throws RemoteException, ResourceUnknownFaultType,
           DelegatedCredentialDestroyFaultType,
           ResourceNotTerminatedFaultType
TerminateInputType contains:
- boolean destroyAfterCleanup
- boolean continueNotifying
- boolean destroyDelegatedCredentials
TerminateOutputType contains:
- boolean terminationCompleted
destroyAfterCleanup: If set to true the job resource will be destroyed once all cleanup steps are done.
continueNotifying: If set to true a client will be notified about the success of the termination. This happens via a notification bound to the same topic a client subscribes to for normal state change information. A client who is not interested in the success of the termination can set this to false.
destroyDelegatedCredentials: If this is set to true, all delegated credentials that are specified in the job description (if any) will be destroyed after all cleanup steps are done. They must not be destroyed earlier because staging credentials are needed during failureFileCleanUp. Setting this parameter and destroyAfterCleanup both to true enables a client to completely go away after the termination call in case the success of the cleanup steps is not of importance.
terminationCompleted: Indicates whether termination of the MEJR completed or not when the call to terminate() returns. It is true in case the job is already in a final state, i.e. no cleanup steps had to be done. Otherwise it is false.
If it is true no further notifications will be sent, even if the client requested it, because the job is in a final state where no state transition is happening anymore. If destroyAfterCleanup had been set to true the client can be sure that the job resource has been destroyed when the call to terminate() returns.
If it is false, the job entered the clean up phase and the client can find out about success or failure of the termination either by listening to state change notifications or by querying the status of the MEJR.
ResourceUnknownFaultType: Thrown when the job resource to be terminated does not exist.
DelegatedCredentialDestroyFaultType: Thrown if the client demanded the destruction of delegated credentials and this failed. If this exception is thrown all termination steps succeeded, and only the destruction of the delegated credential failed.
ResourceNotTerminatedFaultType: An error occurred during termination. This should not happen at all, but if something unforeseen happens, this fault indicates that termination failed.
destroyAfterCleanup: If this is set to true the job resource will be destroyed once all cleanup steps are done. This is the case when the termination calls to all subjobs (SJs) succeeded and the MMJR has received final notifications from all SJs.
continueNotifying: If set to true a client will be notified about the success of the termination. This happens via a notification bound to the same topic a client subscribes to for normal state change information. A client who is not interested in the success of the termination can set this to false.
destroyDelegatedCredentials: If this is set to true, all delegated credentials for the MMJR that are specified in the job description will be destroyed when the MMJR is destroyed. They must not be destroyed earlier because job credentials are needed for potential repeated termination calls. Otherwise the MMJR would no longer be able to interact with its SJs.
Setting this parameter and destroyAfterCleanup both to true enables a client to completely go away after the termination call in case the success of the cleanup steps is not of importance.
terminationCompleted: Indicates whether termination of the MMJR completed or not. It is true in case the job is already in a final state, i.e. all SJs are in a final state. Otherwise it is false.
If it is true no further notifications will be sent, even if the client requested it, because the job is in a final state where no state transition is happening anymore. If destroyAfterCleanup had been set to true the client can be sure that the job resource has been destroyed when the call to terminate() returns.
If it is false, the job entered the clean up phase, i.e. termination calls to all subjobs had been sent, and the client can find out about success or failure of the termination either by listening to state change notifications or by querying the status of the MMJR.
ResourceUnknownFaultType: Thrown when the resource to be terminated does not exist.
DelegatedCredentialDestroyFaultType: Thrown if the client demanded the destruction of delegated credentials and this failed. If this exception is thrown all termination steps succeeded, and only the destruction of the delegated credential failed.
ResourceNotTerminatedFaultType: An error occurred during the call to terminate(), e.g. a termination call to at least one SJ caused a ResourceNotTerminatedFaultType exception. Note that a failure to terminate one SJ does not prevent the terminate calls to the other SJs. In this case, however, the MMJR will not be destroyed.
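The meaning of terminationCompleted for a client can be sketched as a small decision helper. The names below are hypothetical and not part of the GRAM4 API. The sketch assumes that the job resource still exists when polling is chosen, that is, destroyAfterCleanup was set to false.

```java
/** Hypothetical client-side helper; not part of the GRAM4 API. */
final class TerminateFollowUp {
    enum Action { NOTHING_TO_WAIT_FOR, AWAIT_FINAL_NOTIFICATION, POLL_STATE_AND_FAULT }

    private TerminateFollowUp() {}

    /**
     * @param terminationCompleted value returned by terminate()
     * @param willBeNotified true if the client subscribed and set continueNotifying=true
     */
    static Action after(boolean terminationCompleted, boolean willBeNotified) {
        if (terminationCompleted) {
            // The job was already in a final state; no further notifications are sent.
            return Action.NOTHING_TO_WAIT_FOR;
        }
        // The job entered the cleanup phase; the outcome arrives later.
        return willBeNotified ? Action.AWAIT_FINAL_NOTIFICATION : Action.POLL_STATE_AND_FAULT;
    }
}
```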
In addition to Done and Failed, two new final states, UserTerminateDone and UserTerminateFailed, are introduced. These states are part of the notification message sent if a client subscribed for notifications, and they are values of the ResourceProperty state that can be queried by a client. Final states and their meaning:
Done: The MEJR has been fully processed
Failed: A processing error occurred that resulted in a termination initiated by GRAM4.
UserTerminateDone: The client had called terminate() and all cleanup steps have been processed successfully.
UserTerminateFailed: The client had called terminate() and at least one cleanup step had not been processed successfully.
Done: All SJs are in state Done
Failed: Failed is not necessarily a final state, because an MMJR transitions into state Failed as soon as the first SJ fails. Failed is only a final state if at least one SJ failed and all other SJs are in state Done or Failed. If just one out of N (N>1) SJs failed and the client terminates the MMJR, the state Failed will transition to state UserTerminateDone in case of termination success or to state UserTerminateFailed in case of termination failure.
UserTerminateDone: At least one SJ is in state UserTerminateDone and all other SJs are in state UserTerminateDone, Failed or Done.
UserTerminateFailed: At least one SJ is in state UserTerminateFailed and all other SJs are in state UserTerminateDone, UserTerminateFailed, Done or Failed.
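The rules for the final state of an MMJR can be read as an aggregation over the states of its SJs. The following is a sketch of that reading, not the GRAM4 implementation. The enum and method names are hypothetical, and ACTIVE stands in for any non-final SJ state.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical aggregation of subjob states into an MMJR final state; not GRAM4 code. */
final class MultiJobStates {
    enum State { ACTIVE, DONE, FAILED, USER_TERMINATE_DONE, USER_TERMINATE_FAILED }

    private MultiJobStates() {}

    /** Returns the final state of the multi-job, or empty if some subjob is not final yet. */
    static Optional<State> finalState(List<State> subJobs) {
        if (subJobs.isEmpty() || subJobs.contains(State.ACTIVE)) {
            return Optional.empty();
        }
        // From here on, every subjob is in one of the four final states.
        if (subJobs.contains(State.USER_TERMINATE_FAILED)) {
            return Optional.of(State.USER_TERMINATE_FAILED);
        }
        if (subJobs.contains(State.USER_TERMINATE_DONE)) {
            return Optional.of(State.USER_TERMINATE_DONE);
        }
        if (subJobs.contains(State.FAILED)) {
            return Optional.of(State.FAILED);
        }
        return Optional.of(State.DONE);
    }
}
```

The order of the checks encodes the precedence implied by the list: one UserTerminateFailed SJ outweighs everything else, then UserTerminateDone, then Failed.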
UserTerminateDone and UserTerminateFailed indicate only whether termination was successful or not. In case of errors in the cleanup phase, the following faults indicate what happened. They are part of the notification message sent to the client in case termination failed, and they are set in the ResourceProperty fault.
These faults are only set in a resource, or sent to a client as part of a notification message, if the cleanup steps had to be performed and the job had not already passed the corresponding state. For example, for a job that ran to completion in the local resource manager, no attempt is made to cancel it in the local resource manager.
No automated action can be taken when these faults occur; they just indicate problems that have to be taken care of manually.
StagingTerminateFaultType: Interruption of a running transfer failed.
LocalResourceManagerJobTerminateFaultType: Cancellation of the job at the local resource manager failed.
DelegatedCredentialDestroyFaultType: Destroying delegated credentials failed.
The discussion above is fairly abstract. This section summarizes what are probably the most common scenarios and shows which settings to choose in each of them. Since the Java API GramJob is commonly used, some of these use cases refer to it.
1. A client let GramJob delegate and wants to terminate a job and just go away. Information about the success of the termination is unimportant.
destroyAfterCleanup=true, continueNotifying=false, destroyDelegatedCredentials=true
2. A client delegated itself, does not want the delegated credential to be destroyed, and wants to terminate a job and just go away. Information about the success of the termination is unimportant.
Same as (1), but destroyDelegatedCredentials=false
3. A client let GramJob delegate, wants to terminate a job and get information about the success of the termination, and subscribed for notifications.
destroyAfterCleanup=false, destroyDelegatedCredentials=true, continueNotifying=true
If the call to terminate() returns true the job has been fully terminated and destroyed. If it returns false the client has to wait for the final notification message of the job (state UserTerminateDone or UserTerminateFailed). In case of problems it can find out what happened from the faults that are part of the notification message. If it does not hear about the job, the notification message might have been lost, and the client should query the RP 'state' for the job status and the RP 'fault' to check what happened.
Once termination has finished, the client should send a second termination call to request destruction of the job resource, with the parameters
destroyAfterCleanup=true, destroyDelegatedCredentials=false, continueNotifying=false
4. A client delegated itself, does not want the delegated credential to be destroyed, wants to terminate a job and get information about the success of the termination, and subscribed for notifications.
Same as (3), but destroyDelegatedCredentials=false
5. A client let GramJob delegate, wants to terminate a job and get information about the success of the termination, and did not subscribe for notifications.
Same as (3), but periodically querying the RPs 'state' and 'fault' is necessary in case the call to terminate() did not return true.
6. A client delegated itself, does not want the delegated credential to be destroyed, wants to terminate a job and get information about the success of the termination, and did not subscribe for notifications.
Same as (4), but periodically querying the RPs 'state' and 'fault' is necessary in case the call to terminate() did not return true.
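The settings for the six scenarios above follow a simple pattern, which can be sketched as below. The class is hypothetical and only models the three boolean parameters, not the generated TerminateInputType. It follows the text literally, so continueNotifying is true whenever the outcome is wanted, whether or not the client subscribed.

```java
/** Hypothetical holder for the three terminate() flags; not the generated GRAM4 type. */
final class TerminateSettings {
    final boolean destroyAfterCleanup;
    final boolean continueNotifying;
    final boolean destroyDelegatedCredentials;

    TerminateSettings(boolean destroyAfterCleanup,
                      boolean continueNotifying,
                      boolean destroyDelegatedCredentials) {
        this.destroyAfterCleanup = destroyAfterCleanup;
        this.continueNotifying = continueNotifying;
        this.destroyDelegatedCredentials = destroyDelegatedCredentials;
    }

    /**
     * Settings for the first terminate() call.
     *
     * @param gramJobDelegated true if GramJob delegated (scenarios 1, 3, 5)
     * @param wantsOutcome true if the client wants to learn whether termination succeeded
     */
    static TerminateSettings firstCall(boolean gramJobDelegated, boolean wantsOutcome) {
        // Keep the resource around only if the outcome still has to be read from it.
        return new TerminateSettings(!wantsOutcome, wantsOutcome, gramJobDelegated);
    }

    /** Settings for the second call in scenarios 3 to 6, which only destroys the job resource. */
    static TerminateSettings destroyCall() {
        return new TerminateSettings(true, false, false);
    }
}
```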
Termination is a kind of one-way street: once a cleanup step is passed there is no way back, and there is no way to repeat a certain step. The only thing GRAM4 provides is information about what went wrong, by providing the faults along the way. Repeated termination calls, for example, do not cause repeated cancellations at the local resource manager.
Repetitive termination calls however can make sense for scenario 3 in the last section: Terminate a job, but don't destroy after clean up to ensure that no information is lost. Then do a second termination call to destroy the job resource.
The following is a general scenario for submitting a job using the C stubs and APIs. Please consult the C WS Core API, GRAM4 API documentation for details on the APIs used in the code excerpts.
const char * file = "job.xml";
globus_soap_message_handle_t message;
xsd_QName element;
wsgram_CreateManagedJobInputType input;

/* Read the root element name to decide between a job and a multiJob */
globus_soap_message_handle_init_from_file(&message, file);
globus_soap_message_deserialize_element_unknown(message, &element);
if (strcmp(element.local, "job") == 0)
{
    wsgram_JobDescriptionType * jd;
    input.choice_value.type = wsgram_CreateManagedJobInputType_job;
    jd = &input.choice_value.value.job;
    wsgram_JobDescriptionType_deserialize(&element, jd, message, 0);
}
else if (strcmp(element.local, "multiJob") == 0)
{
    wsgram_MultiJobDescriptionType * mjd;
    input.choice_value.type = wsgram_CreateManagedJobInputType_multiJob;
    mjd = &input.choice_value.value.multiJob;
    wsgram_MultiJobDescriptionType_deserialize(&element, mjd, message, 0);
}
xsd_QName_destroy_contents(&element);
globus_soap_message_handle_destroy(message);
This code sets the choice value of the wsgram_CreateManagedJobInputType to be the appropriate type depending on whether the job description is a job or multijob request.
globus_soap_message_attr_t message_attr;
globus_soap_message_attr_init(&message_attr);
/*
 * Set the authorization mode to host authorization: other possibilities are
 * GLOBUS_SOAP_MESSAGE_AUTHZ_HOST_IDENTITY or
 * GLOBUS_SOAP_MESSAGE_AUTHZ_HOST_SELF.
 */
globus_soap_message_attr_set(
message_attr,
GLOBUS_SOAP_MESSAGE_AUTHZ_METHOD_KEY,
NULL,
NULL,
(void *) GLOBUS_SOAP_MESSAGE_AUTHZ_HOST);
/*
* Set message protection level. GLOBUS_SOAP_MESSAGE_AUTH_PROTECTION_PRIVACY
* for encryption.
*/
globus_soap_message_attr_set(
message_attr,
GLOBUS_SOAP_MESSAGE_AUTH_PROTECTION_KEY,
NULL,
NULL,
(void *) GLOBUS_SOAP_MESSAGE_AUTH_PROTECTION_PRIVACY);
globus_result_t result;
ManagedJobFactoryService_client_handle_t factory_handle;
result = ManagedJobFactoryService_client_init(
    &factory_handle,
    message_attr,
    NULL);
/*
* localResourceManager, or other resource property names as defined in the
* WSDL
*/
xsd_QName property_name =
{
"http://www.globus.org/namespaces/2008/03/gram/job",
"localResourceManager"
};
wsrp_GetResourcePropertyResponseType * property_response;
int fault_type;
xsd_any * fault;
ManagedJobFactoryPortType_GetResourceProperty(
factory_handle,
endpoint,
&property_name,
&property_response,
(ManagedJobFactoryPortType_GetResourceProperty_fault_t *) &fault_type,
&fault);
If this is successful, then property_response's any field will contain the deserialized data in the value field of the first element in the array.
xsd_string * localResourceManager = property_response->any.elements[0].value;
printf("local resource manager is %s\n", *localResourceManager);
The notification consumer can be passed in either as part of the wsgram_CreateManagedJobInputType or through a separate invocation of ManagedJobPortType_Subscribe_epr().
globus_service_engine_t engine;
wsa_EndpointReferenceType consumer_reference;
globus_service_engine_init(&engine, NULL, NULL, NULL, NULL, NULL);
globus_notification_create_consumer(
&consumer_reference,
engine,
notify_callback,
NULL);
First, prepare the other parts of the wsgram_CreateManagedJobInputType structure.
/*
* You can set input.InitialTerminationTime to be a timeout if interested.
* The xsd_dateTime type is a struct tm pointer.
*/
time_t term_time = time(NULL);
globus_uuid_t uuid;
wsa_AttributedURI * job_id;
wsa_EndpointReferenceType * factory_epr;
xsd_any * reference_property;
wsgram_CreateManagedJobOutputType * output = NULL;
xsd_QName factory_reference_id_qname =
{
"http://www.globus.org/namespaces/2008/03/gram/job",
"ResourceID"
};
term_time += 60 * 60; /* 1 hour later */
xsd_dateTime_copy(&input.InitialTerminationTime, gmtime(&term_time));
/*
* Set unique JobID. This is used to reliably create jobs and check for status.
*/
globus_uuid_create(&uuid);
wsa_AttributedURI_init(&job_id);
job_id->base_value = globus_common_create_string("uuid:%s", uuid.text);
/* To subscribe to notifications at create time, add the consumer's EPR to
 * the input message. Otherwise, use the EPR created above in a
 * call to ManagedJobPortType_Subscribe_epr().
 */
wsnt_SubscribeType_init(&input.Subscribe);
wsa_EndpointReferenceType_copy_contents(
    &input.Subscribe->ConsumerReference,
    &consumer_reference);
xsd_any_init(&input.Subscribe->TopicExpression.any);
input.Subscribe->TopicExpression.any->any_info =
    &xsd_QName_contents_info;
xsd_QName_copy(
    (xsd_QName **) &input.Subscribe->TopicExpression.any->value,
    &ManagedJobPortType_state_rp_qname);
xsd_anyURI_copy_cstr(
    &input.Subscribe->TopicExpression._Dialect,
    "http://docs.oasis-open.org/wsn/2004/06/TopicExpression/Simple");
xsd_boolean_init(&input.Subscribe->UseNotify);
*input.Subscribe->UseNotify = GLOBUS_TRUE;
/* Construct the EPR of the job factory */
wsa_EndpointReferenceType_init(&factory_epr);
wsa_AttributedURI_init_contents(&factory_epr->Address);
xsd_anyURI_init_contents_cstr(&factory_epr->Address.base_value,
    globus_common_create_string(
        "https://%s:%hu/wsrf/services/%s",
        factory_host,
        factory_port,
        MANAGEDJOBFACTORYSERVICE_BASE_PATH));
wsa_ReferenceParametersType_init_contents(&factory_epr->ReferenceParameters);
/* The ResourceID reference parameter selects the local resource manager */
reference_property = xsd_any_array_push(
    &factory_epr->ReferenceParameters.any);
reference_property->any_info = &xsd_string_info;
xsd_QName_copy(
    &reference_property->element,
    &factory_reference_id_qname);
xsd_string_copy_cstr(
    (xsd_string **) &reference_property->value,
    "Fork");
/* Submit the request to the service container */
ManagedJobFactoryPortType_createManagedJob_epr(
factory_handle,
factory_epr,
input,
&output,
(ManagedJobFactoryPortType_createManagedJob_fault_t *) &fault_type,
&fault);
If this is successful, then the output structure will be initialized with the results of the operation. Of particular interest is the managedJobEndpoint, which contains the reference to the newly created job resource.
In order to subscribe for job state change notifications on an existing job resource, initialize the subscribe_input used below in the same way as input.Subscribe was initialized above.
ManagedJobService_client_handle_t job_handle;
wsnt_SubscribeType subscribe_input;
wsnt_SubscribeResponseType * subscribe_response;
ManagedJobService_client_init(
&job_handle,
message_attr,
NULL);
ManagedJobPortType_Subscribe_epr(
job_handle,
output->managedJobEndpoint,
subscribe_input,
&subscribe_response,
(ManagedJobPortType_Subscribe_fault_t *) &fault_type,
&fault);
wsgram_ReleaseInputType release;
wsgram_ReleaseOutputType * release_response = NULL;
wsgram_ReleaseInputType_init_contents(&release);
ManagedJobPortType_release_epr(
job_handle,
output->managedJobEndpoint,
&release,
&release_response,
(ManagedJobPortType_release_fault_t *) &fault_type,
&fault);
/* destroy subscription resource */
SubscriptionManagerService_client_handle_t subscription_handle;
wsnt_DestroyType destroy;
wsnt_DestroyResponseType * destroy_response = NULL;
SubscriptionManagerService_client_init(
&subscription_handle,
message_attr,
NULL);
/* if subscription done at job creation time, use
* output->subscriptionEndpoint in place of
* subscribe_response->SubscriptionReference,
*/
SubscriptionManager_Destroy_epr(
subscription_handle,
subscribe_response->SubscriptionReference,
&destroy,
&destroy_response,
(SubscriptionManager_Destroy_fault_t *) &fault_type,
&fault);
/* destroy the job resource */
ManagedJobPortType_Destroy_epr(
job_handle,
output->managedJobEndpoint,
&destroy,
&destroy_response,
(ManagedJobPortType_Destroy_fault_t *) &fault_type,
&fault);
In order to build a client application, certain flags must be passed to the compiler and linker to enable them to be able to locate headers and libraries. The easiest way to do so is to generate a makefile header, which is a fragment of a Makefile which includes all of the necessary flags needed to build the application. To do this, issue the command:
% globus-makefile-header --flavor=gcc32dbg globus_c_gram_client_bindings > Makefile.inc
Then, write your makefile to include this file and use the GLOBUS_CC, GLOBUS_LD, GLOBUS_CFLAGS, GLOBUS_LDFLAGS, and GLOBUS_PKG_LIBS macros. For example:
GLOBUS_FLAVOR_NAME=gcc32dbg
include Makefile.inc
CC = $(GLOBUS_CC)
LD = $(GLOBUS_LD)
CFLAGS = $(GLOBUS_CFLAGS)
LDFLAGS = $(GLOBUS_LDFLAGS) $(GLOBUS_PKG_LIBS)
client: client.c
The following is a general scenario for submitting a job using the Java stubs and APIs. Please consult the Java WS Core API, Delegation API, Reliable File Transfer API, and GRAM4 API documentation for details on classes referenced in the code excerpts.
Also, it will probably be helpful to look at the GramJob class source code as a functioning example.
The following imports will be needed for these examples:
import java.io.File;
import java.io.FileInputStream;
import java.net.URL;
import java.util.LinkedList;
import java.util.List;
import java.util.Vector;
import java.security.cert.X509Certificate;
import javax.xml.namespace.QName;
import javax.xml.rpc.Stub;
import javax.xml.soap.SOAPElement;
import org.apache.axis.components.uuid.UUIDGenFactory;
import org.apache.axis.message.addressing.AttributedURI;
import org.apache.axis.message.addressing.EndpointReferenceType;
import org.globus.delegation.DelegationUtil;
import org.globus.exec.generated.CreateManagedJobInputType;
import org.globus.exec.generated.CreateManagedJobOutputType;
import org.globus.exec.generated.ManagedJobFactoryPortType;
import org.globus.exec.generated.ManagedJobPortType;
import org.globus.exec.generated.ReleaseInputType;
import org.globus.exec.utils.ManagedJobConstants;
import org.globus.exec.utils.ManagedJobFactoryConstants;
import org.globus.exec.utils.client.ManagedJobClientHelper;
import org.globus.exec.utils.client.ManagedJobFactoryClientHelper;
import org.globus.exec.utils.rsl.RSLHelper;
import org.globus.wsrf.NotificationConsumerManager;
import org.globus.wsrf.WSNConstants;
import org.globus.wsrf.encoding.ObjectDeserializer;
import org.globus.wsrf.impl.security.authentication.Constants;
import org.globus.wsrf.impl.security.authorization.Authorization;
import org.globus.wsrf.impl.security.authorization.HostAuthorization;
import org.globus.wsrf.impl.security.authorization.IdentityAuthorization;
import org.globus.wsrf.impl.security.authorization.SelfAuthorization;
import org.globus.wsrf.impl.security.descriptor.ClientSecurityDescriptor;
import org.globus.wsrf.impl.security.descriptor.GSISecureMsgAuthMethod;
import org.globus.wsrf.impl.security.descriptor.GSITransportAuthMethod;
import org.globus.wsrf.impl.security.descriptor.ResourceSecurityDescriptor;
import org.gridforum.jgss.ExtendedGSSCredential;
import org.gridforum.jgss.ExtendedGSSManager;
import org.ietf.jgss.GSSCredential;
import org.oasis.wsn.Subscribe;
import org.oasis.wsn.SubscribeResponse;
import org.oasis.wsn.SubscriptionManager;
import org.oasis.wsn.TopicExpressionType;
import org.oasis.wsn.WSBaseNotificationServiceAddressingLocator;
import org.oasis.wsrf.lifetime.Destroy;
import org.oasis.wsrf.properties.GetMultipleResourceProperties_Element;
import org.oasis.wsrf.properties.GetMultipleResourcePropertiesResponse;
import org.oasis.wsrf.properties.GetResourcePropertyResponse;
File jobDescriptionFile = new File("myjobdesc.xml");
JobDescriptionType jobDescription = RSLHelper.readRSL(jobDescriptionFile);
The object jobDescription will be of sub-type MultiJobDescriptionType if the file contains a multi-job description.
URL factoryUrl = ManagedJobFactoryClientHelper.getServiceURL(
contactString).getURL();
String factoryType
= ManagedJobFactoryConstants.FACTORY_TYPE.<factory type constant>;
EndpointReferenceType factoryEndpoint
= ManagedJobFactoryClientHelper.getFactoryEndpoint(factoryUrl, factoryType);
ManagedJobFactoryPortType factoryPort
= ManagedJobFactoryClientHelper.getPort(factoryEndpoint);
The format of contactString is [protocol://]host[:port][/servicepath].
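To make the contact string format concrete, here is a minimal parser sketch. It is not the implementation behind ManagedJobFactoryClientHelper.getServiceURL(). The class is hypothetical, and the default protocol, port, and service path are assumptions chosen for illustration. IPv6 literals are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for "[protocol://]host[:port][/servicepath]"; not the GRAM4 helper. */
final class ContactString {
    // Assumed defaults, used only when the corresponding part is omitted.
    static final String DEFAULT_PROTOCOL = "https";
    static final int DEFAULT_PORT = 8443;
    static final String DEFAULT_PATH = "/wsrf/services/ManagedJobFactoryService";

    private static final Pattern FORMAT = Pattern.compile(
            "(?:([A-Za-z][A-Za-z0-9+.-]*)://)?([^:/]+)(?::(\\d+))?(/.*)?");

    final String protocol;
    final String host;
    final int port;
    final String path;

    private ContactString(String protocol, String host, int port, String path) {
        this.protocol = protocol;
        this.host = host;
        this.port = port;
        this.path = path;
    }

    /** Splits a contact string into its parts, filling in the assumed defaults. */
    static ContactString parse(String contact) {
        Matcher m = FORMAT.matcher(contact.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a valid contact string: " + contact);
        }
        String protocol = m.group(1) != null ? m.group(1) : DEFAULT_PROTOCOL;
        int port = m.group(3) != null ? Integer.parseInt(m.group(3)) : DEFAULT_PORT;
        String path = m.group(4) != null ? m.group(4) : DEFAULT_PATH;
        return new ContactString(protocol, m.group(2), port, path);
    }

    /** Reassembles the full service URL. */
    String toUrl() {
        return protocol + "://" + host + ":" + port + path;
    }
}
```

With these assumed defaults, a bare host name such as example.org expands to a full service URL, while any part given explicitly is kept as written.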
Default proxy file:
ExtendedGSSManager manager = (ExtendedGSSManager) ExtendedGSSManager.getInstance();
GSSCredential cred = manager.createCredential(GSSCredential.INITIATE_AND_ACCEPT);
Specific proxy file:
File proxyFile = new File("proxy_file");
byte[] proxyData = new byte[(int) proxyFile.length()];
FileInputStream inputStream = new FileInputStream(proxyFile);
inputStream.read(proxyData);
inputStream.close();
ExtendedGSSManager manager = (ExtendedGSSManager) ExtendedGSSManager.getInstance();
GSSCredential proxy = manager.createCredential(
    proxyData,
    ExtendedGSSCredential.IMPEXP_OPAQUE,
    GSSCredential.DEFAULT_LIFETIME,
    null,
    GSSCredential.ACCEPT_ONLY);
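A single call to InputStream.read(byte[]) is not guaranteed to fill the whole buffer. As a hedged alternative for loading the proxy bytes, the sketch below reads the file with java.nio.file.Files.readAllBytes. The class and method names are hypothetical. The resulting array would be passed to createCredential() in place of proxyData.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for loading proxy bytes; not part of the GRAM4 or security APIs. */
final class ProxyFiles {
    private ProxyFiles() {}

    /** Reads the complete proxy file and rejects an empty one. */
    static byte[] readProxyBytes(Path proxyFile) throws IOException {
        // readAllBytes loops internally until end of file and closes the file afterwards.
        byte[] data = Files.readAllBytes(proxyFile);
        if (data.length == 0) {
            throw new IOException("Proxy file is empty: " + proxyFile);
        }
        return data;
    }
}
```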
ClientSecurityDescriptor secDesc = new ClientSecurityDescriptor();
secDesc.setGSITransport(Constants.<protection level constant>);
secDesc.setAuthz(<Authorization sub-class instance>);
if (proxy != null) {
secDesc.setGSSCredential(proxy);
}
((Stub) port)._setProperty(Constants.CLIENT_DESCRIPTOR, secDesc);
Use setGSISecureMsg() for GSI Secure Message.
GetResourcePropertyResponse response
= factoryPort.getResourceProperty(ManagedJobConstants.<RP constant>);
SOAPElement[] any = response.get_any();
... = ObjectDeserializer.toObject(any[0], <RP type>.class);
GetMultipleResourceProperties_Element rpRequest
= new GetMultipleResourceProperties_Element();
rpRequest.setResourceProperty(new QName[] {
ManagedJobFactoryConstants.<RP constant #1>,
ManagedJobFactoryConstants.<RP constant #2>,
ManagedJobFactoryConstants.<RP constant #N>
});
GetMultipleResourcePropertiesResponse response
= factoryPort.getMultipleResourceProperties(rpRequest);
SOAPElement[] any = response.get_any();
... = ObjectDeserializer.toObject(any[0], <RP #1 type>.class);
... = ObjectDeserializer.toObject(any[1], <RP #2 type>.class);
... = ObjectDeserializer.toObject(any[N-1], <RP #N type>.class);
X509Certificate certToSign = DelegationUtil.getCertificateChainRP(
delegFactoryEndpoint, //EndpointReferenceType
secDesc //ClientSecurityDescriptor
)[0]; //first element in the returned array
EndpointReferenceType credentialEndpoint = DelegationUtil.delegate(
delegFactoryurl, //String
credential, //GlobusCredential
certToSign, //X509Certificate
lifetime, //int (seconds)
fullDelegation, //boolean
secDesc); //ClientSecurityDescriptor
There are three types of delegated credentials:
Credential used by the job to generate user-owned proxy:
jobDescription.setJobCredential(credentialEndpoint);
Credential used to contact RFT for staging and file clean up:
jobDescription.setStagingCredentialEndpoint(credentialEndpoint);
Credential used by RFT to contact GridFTP servers:
TransferRequestType stageOut = jobDescription.getFileStageOut();
stageOut.setTransferCredential(credentialEndpoint);
Do the same for fileStageIn and fileCleanUp.
CreateManagedJobInputType jobInput = new CreateManagedJobInputType();
jobInput.setJobID(new AttributedURI("uuid:" + UUIDGenFactory.getUUIDGen().nextUUID()));
jobInput.setInitialTerminationTime(<Calendar instance>);
if (multiJob) {
jobInput.setMultiJob((MultiJobDescriptionType) jobDescription);
} else {
jobInput.setJob(jobDescription);
}
if (subscribeOnCreate) jobInput.setSubscribe(subscriptionReq);
CreateManagedJobOutputType createResponse
= factoryPort.createManagedJob(jobInput);
EndpointReferenceType jobEndpoint = createResponse.getManagedJobEndpoint();
ManagedJobPortType jobPort = ManagedJobClientHelper.getPort(jobEndpoint);
You must set the appropriate security parameters for the job service stub
(jobPort) as well.
NotificationConsumerManager notifConsumerManager
= NotificationConsumerManager.getInstance();
notifConsumerManager.startListening();
List topicPath = new LinkedList();
topicPath.add(ManagedJobConstants.RP_STATE);
ResourceSecurityDescriptor resourceSecDesc = new ResourceSecurityDescriptor();
resourceSecDesc.setAuthz(Authorization.<authz type constant>);
Vector authMethods = new Vector();
authMethods.add(GSITransportAuthMethod.BOTH);
resourceSecDesc.setAuthMethods(authMethods);
EndpointReferenceType notificationConsumerEndpoint
= notifConsumerManager.createNotificationConsumer(
topicPath,
this,
resourceSecDesc);
Subscribe subscriptionReq = new Subscribe();
subscriptionReq.setConsumerReference(
notificationConsumerEndpoint);
TopicExpressionType topicExpression = new TopicExpressionType(
WSNConstants.SIMPLE_TOPIC_DIALECT,
ManagedJobConstants.RP_STATE);
subscriptionReq.setTopicExpression(topicExpression);
EndpointReferenceType subscriptionEndpoint;
Subscribe on creation
jobInput.setSubscribe(subscriptionReq);
Subscribe after creation
SubscribeResponse subscribeResponse = jobPort.subscribe(subscriptionReq);
subscriptionEndpoint = subscribeResponse.getSubscriptionReference();
/*destroy subscription resource*/
SubscriptionManager subscriptionManagerPort
= new WSBaseNotificationServiceAddressingLocator()
.getSubscriptionManagerPort(subscriptionEndpoint);
//set stub security parameters on subscriptionManagerPort
subscriptionManagerPort.destroy(new Destroy());
/*destroy the job resource*/
jobPort.destroy(new Destroy());
The following tutorials are available for GRAM4 developers:
The GRAM services in GT 4.2.1 are WSRF compliant. One of the key concepts in the WSRF specification is the decoupling of a service from its public "state", which is exposed in the interface via the implied resource pattern. Following this concept, the data of GT 4.2.1 GRAM jobs is published as part of WSRF resources, while there is only one service to start jobs or to query and monitor their state. This differs from the OGSI model of GT3, where each job was represented as a separate service. There is still a job factory service that can be called to create job instances (represented as WSRF resources). Each scheduling system that GRAM interfaces with is represented as a separate factory resource. By calling the factory service and associating the call with the appropriate factory resource, the submitting actor creates a job resource that maps to a job in the chosen scheduling system.
The Managed Executable Job Service (MEJS) relies on a state machine to handle state transitions. There are two sets of states: external and internal. The external states are those that the user gets in notifications and can be queried as a resource property. The internal states are those that are strictly used by the state machine to step through all the necessary internal tasks that need to be performed for a particular job.
The Managed Multi Job Service does not rely on a state machine, but instead makes judgements after receiving notifications from the sub-jobs about which external state it should be in. The external states for the MMJS are identical to the ones used by the MEJS.
External states of the Managed Job Services:
- Unsubmitted
- StageIn
- Pending
- Active
- Suspended
- StageOut
- CleanUp
- Done
- Failed
- UserTerminateDone
- UserTerminateFailed
Internal states of the MEJS:
- None
- Start
- StageIn
- StageInHold
- StageInResponse
- Submit
- PendingHold
- WaitingForStateChanges
- Suspend
- Resume
- OpenStdout
- OpenStderr
- MergeStdout
- StageOut
- StageOutHold
- StageOutResponse
- CleanUp
- CleanUpHold
- FileCleanUp
- FileCleanUpResponse
- CacheCleanUp
- UserTerminate
- SystemTerminate
- FailureFileCleanUp
- FailureFileCleanUpResponse
- FailureCacheCleanUp
- FinalizeTermination
- Done
- Restart
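Client code that reacts to state notifications often needs the external states as a typed value. The enum below is a minimal sketch built from the external state names listed above; the type ExternalJobState is hypothetical and not part of the GRAM4 API, and treating Done, Failed, UserTerminateDone and UserTerminateFailed as final states is an assumption based on their names.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical client-side mirror of the external job states (not a GRAM4 type). */
enum ExternalJobState {
    Unsubmitted, StageIn, Pending, Active, Suspended, StageOut, CleanUp,
    Done, Failed, UserTerminateDone, UserTerminateFailed;

    // Assumption: no further state changes are expected after one of these states.
    private static final Set<ExternalJobState> TERMINAL =
            EnumSet.of(Done, Failed, UserTerminateDone, UserTerminateFailed);

    /** Returns true if this sketch considers the state final. */
    boolean isTerminal() {
        return TERMINAL.contains(this);
    }

    /** Converts a state name as it appears in the list above; throws IllegalArgumentException if unknown. */
    static ExternalJobState fromName(String name) {
        return valueOf(name.trim());
    }

    public static void main(String[] args) {
        for (ExternalJobState s : values()) {
            System.out.println(s + " terminal=" + s.isTerminal());
        }
    }
}
```

A monitoring loop could, for example, stop waiting for notifications once isTerminal() returns true.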
This component consists abstractly of two interfaces: the Managed Job Factory Port Type (MJFPT) and the Managed Job Port Type (MJPT).
In actuality there are three service/resource implementations, two of which implement the basic MJPT. The first is the service that actually talks to a particular local resource manager to execute a process on the remote computer or cluster. It is called the Managed Executable Job Service (MEJS), and its resource is called the Managed Executable Job Resource (MEJR). The second is a special implementation that accepts a multi-job description, breaks it up into single-job descriptions, and then submits each of these so-called "sub-jobs" to an MEJS. This implementation is called the Managed Multi Job Service (MMJS), and its resource is called the Managed Multi-Job Resource (MMJR).
Because these two job services use the same port type, the API for accessing the MEJR and the MMJR is identical. The MJFS creates the appropriate job resource depending on the factory resource used to qualify the operation call. Most of the factory resources represent local resource managers used by the MEJS (PBS, LSF, Condor). There is a special Multi factory resource which represents an abstract multi-job resource manager. Each of the two types of managed job requires the matching job description type.
GRAM4 allows for remote execution and management of programs through the creation of a managed job. The management of the job is taken care of primarily by core toolkit functionality (the WS-ResourceLifetime and WS-BaseNotification implementations). See the Java WS Core documentation on notifications and resource lifetime (destruction) for more information.
A single MJFS is used to create all jobs for all users. For each local resource manager, a dedicated Managed Job Factory Resource (MJFR) enables the MJFS to publish information about the characteristics of the compute resource, for example:
- host information
- GridFTP URL (for file staging and streaming)
- compute cluster size and configuration, and so on...
In addition, there is a special MJFR which is used for creating MMJRs.
A single MEJS is used to manage all executable jobs for all users. Each Managed Executable Job Resource (MEJR) enables the MEJS to publish information about the individual job the MEJR represents. This information can be accessed by querying the MEJS for the resource properties of a given MEJR, such as the:
- current job state
- stdout location
- stderr location
- exit code, and so on.
A single MMJS is used to manage all multi-jobs for all users. Each Managed Multi-Job Resource (MMJR) enables the MMJS to publish information about the individual multi-job the MMJR represents. This information can be accessed by querying the MMJS for the resource properties of a given MMJR, such as the:
- current overall job state
- list of sub-job EPRs
There are just two operations defined in the GRAM port types (not counting the Rendezvous port type which is used for MPI job synchronization): "createManagedJob" in the Managed Job Factory port type, and "release" in the Managed Job port type. All other operations (such as canceling/killing the job and querying for resource properties) are provided by the underlying WSRF implementation of the toolkit.
createManagedJob: This operation creates either a MEJR or MMJR, subscribes the client for notifications if requested, and replies with one or two endpoint references (EPRs). The input of this operation consists of a job description, an optional initial termination time for the job resource, and an optional state notification subscription request.
The first EPR:
- is qualified with the identifier to the newly created MEJR or MMJR
- points to either the MEJS or MMJS.
The second EPR:
- is only present if a notification subscription was requested
- is qualified with the identifier to the newly created subscription resource
- points to the subscription manager service.
Using the optional subscription request provides an efficient means of subscribing to the newly created MEJR or MMJR without additional round-trip messages. Clients who subscribe afterwards must check the current status of the job, since the inherent race-condition means some state-changes may have occurred prior to the separate subscription request. In any event, there is a slight risk of lost notifications due to the lack of reliability guarantees in the notification delivery mechanism from WS-BaseNotification.
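The race described above, for clients that subscribe after creation, can be handled by subscribing first, then querying the state resource property once, and letting any notification that has already arrived win over the queried value. The sketch below shows only this merging logic. JobStateTracker is a hypothetical class, not part of the GRAM4 client API, and it assumes that notifications are delivered in order; as noted above, delivery itself is not guaranteed.

```java
/** Hypothetical helper (not a GRAM4 class) that merges a queried state with notifications. */
final class JobStateTracker {
    private String current;            // null until a state is known
    private boolean notificationSeen;  // true once any notification has arrived

    /** Called from the notification callback with the state carried by the notification. */
    synchronized void onNotification(String newState) {
        notificationSeen = true;
        current = newState;
    }

    /**
     * Called once with the result of querying the state resource property
     * after the subscription has been set up. The queried value is only used
     * if no notification has arrived yet.
     */
    synchronized void seedFromQuery(String queriedState) {
        if (!notificationSeen) {
            current = queriedState;
        }
    }

    /** Returns the most recent known state, or null if none is known yet. */
    synchronized String current() {
        return current;
    }

    public static void main(String[] args) {
        JobStateTracker tracker = new JobStateTracker();
        tracker.onNotification("Active");   // notification arrives first
        tracker.seedFromQuery("Pending");   // ignored by this sketch
        System.out.println(tracker.current());
    }
}
```

Ignoring the queried value after a notification may briefly hide a newer state, but under the in-order assumption the notification for that newer state is still in flight and will update the tracker when it arrives.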
The ManagedJobFactoryPortType also has all the operations and publishes all the resource properties (via the MJFR) defined in the following WS-ResourceProperties port types:
- GetResourceProperty
- GetMultipleResourceProperties
- QueryResourceProperties
This port type does not define any new operations itself, but has all the operations and publishes all the resource properties defined in the following port types:
ReleaseManagedJob port type:
- release: This operation takes no parameters and returns nothing. Its purpose is to release a hold placed on a state through the use of the "holdState" field in the job description. See the domain-specific GRAM4 component documentation for more information on the "holdState" field.
TerminateManagedJob port type:
- terminate: This operation terminates a job. Depending on the arguments and the state of the job, this may result in immediate destruction of the job resource, or in the start of clean up steps followed by resource destruction once the clean up is done.
WS-ResourceProperties port types:
- GetResourceProperty
- GetMultipleResourceProperties
- QueryResourceProperties
WS-ResourceLifetime port types:
- ScheduledResourceTermination
WS-BaseNotification port type:
- NotificationProducer
This port type does not define any new operations. See "Resources Properties" under Services and WSDL.
{http://www.globus.org/namespaces/2008/03/gram/job}condorArchitecture: Condor architecture label.
{http://www.globus.org/namespaces/2008/03/gram/job}condorOS: Condor OS label.
{http://www.globus.org/namespaces/2008/03/gram/job}delegationFactoryEndpoint: The endpoint reference to the delegation factory used to delegate credentials to the job.
{http://mds.globus.org/glue/ce/1.1}GLUECE: GLUE data.
{http://mds.globus.org/glue/ce/1.1}GLUECESummary: GLUE data summary.
{http://www.globus.org/namespaces/2008/03/gram/job}globusLocation: The location of the Globus Toolkit installation that these services are running under.
{http://www.globus.org/namespaces/2008/03/gram/job}hostCPUType: The job host CPU architecture (i686, x86_64, etc.).
{http://www.globus.org/namespaces/2008/03/gram/job}hostManufacturer: The host manufacturer name. May be "unknown".
{http://www.globus.org/namespaces/2008/03/gram/job}hostOSName: The host OS name (Linux, Solaris, etc.).
{http://www.globus.org/namespaces/2008/03/gram/job}hostOSVersion: The host OS version.
{http://www.globus.org/namespaces/2008/03/gram/job}localResourceManager: The local resource manager type (e.g. Condor, Fork, LSF, Multi, PBS).
{http://www.globus.org/namespaces/2008/03/gram/job}availableLocalResourceManager: All local resource managers that are configured in this GRAM4 instance.
{http://www.globus.org/namespaces/2008/03/gram/job}jobTTLAfterProcessing: Time in seconds that a job resource stays alive after the job has finished processing in GRAM4 (including fileStageOut and fileCleanUp). When this time has elapsed, the job resource is destroyed and is no longer available to clients. A negative value means that the job resource will never be destroyed.
{http://www.globus.org/namespaces/2008/03/gram/job}maxJobLifetime: Maximum time in seconds that a user can set as the initial lifetime in a job submission or in subsequent setTerminationTime calls. A negative value means that there is no limit.
{http://mds.globus.org/metadata/2005/02}ServiceMetaDataInfo: Service start time, Globus Toolkit(R) version, service type name.
{http://www.globus.org/namespaces/2008/03/gram/job}scratchBaseDirectory: The directory recommended by the system administrator to be used for temporary job data.
{http://www.globus.org/namespaces/2008/03/gram/job}stagingDelegationFactoryEndpoint: The endpoint reference to the delegation factory used to delegate credentials to the staging service (RFT).
{http://www.globus.org/namespaces/2008/04/rendezvous}Capacity: Used for Rendezvous.
{http://docs.oasis-open.org/wsrf/rl-2}CurrentTime: Time of creation.
{http://docs.oasis-open.org/wsrf/rp-2}QueryExpressionDialect: From the QueryResourceProperties port type.
{http://www.globus.org/namespaces/2008/03/gram/job/faults}fault: Faults (if generated) that occur during job processing and cause the job to fail.
{http://www.globus.org/namespaces/2008/03/gram/job/types}holding: Indicates whether a hold has been placed on this job.
{http://www.globus.org/namespaces/2008/03/gram/job/types}localUserId: The job owner's local user account name.
{http://www.globus.org/namespaces/2008/04/rendezvous}RegistrantData: Used for Rendezvous.
{http://www.globus.org/namespaces/2008/04/rendezvous}RendezvousCompleted: Used for Rendezvous.
{http://www.globus.org/namespaces/2008/03/gram/job/description}serviceLevelAgreement: A wrapper around fields containing the single-job and multi-job descriptions or RSLs. Only one of these sub-fields shall have a non-null value.
{http://www.globus.org/namespaces/2008/03/gram/job/types}state: The current state of the job.
{http://docs.oasis-open.org/wsrf/rl-2}TerminationTime: Time when the resource expires.
{http://www.globus.org/namespaces/2008/03/gram/job/types}userSubject: The GSI certificate DN of the job owner.
{http://docs.oasis-open.org/wsrf/rl-2}CurrentTime: Time of creation.
{http://docs.oasis-open.org/wsrf/rl-2}TerminationTime: Time when the resource expires.
{http://www.globus.org/namespaces/2008/03/gram/job/exec}credentialPath: The path (relative to the job process) to the file containing the user proxy used by the job to authenticate out to other services.
{http://www.globus.org/namespaces/2008/03/gram/job/types}exitCode: The exit code generated by the job process.
{http://www.globus.org/namespaces/2008/03/gram/job/faults}fault: The fault (if generated) indicating the reason for failure of the job to complete.
{http://www.globus.org/namespaces/2008/03/gram/job/types}holding: Indicates whether a hold has been placed on this job.
{http://www.globus.org/namespaces/2008/03/gram/job/types}localUserId: The job owner's local user account name.
{http://www.globus.org/namespaces/2008/03/gram/job/exec}localJobId: The job ID(s) of the job in the local resource manager. Note that for Fork jobs these IDs are prefixed with the uuid of the job.
{http://www.globus.org/namespaces/2008/03/gram/job/description}serviceLevelAgreement: A wrapper around fields containing the single-job and multi-job descriptions or RSLs. Only one of these sub-fields shall have a non-null value.
{http://www.globus.org/namespaces/2008/03/gram/job/types}state: The current state of the job.
{http://www.globus.org/namespaces/2008/03/gram/job/exec}stderrURL: A GridFTP URL to the file generated by the job which contains the stderr.
{http://www.globus.org/namespaces/2008/03/gram/job/exec}stdoutURL: A GridFTP URL to the file generated by the job which contains the stdout.
{http://www.globus.org/namespaces/2008/03/gram/job/types}userSubject: The GSI certificate DN of the job owner.
{http://www.globus.org/namespaces/2008/04/rendezvous}Capacity: Used for Rendezvous.
{http://www.globus.org/namespaces/2008/04/rendezvous}RegistrantData: Used for Rendezvous.
{http://www.globus.org/namespaces/2008/04/rendezvous}RendezvousCompleted: Used for Rendezvous.
{http://docs.oasis-open.org/wsrf/rp-2}QueryExpressionDialect: From the QueryResourceProperties port type.
{http://docs.oasis-open.org/wsrf/rl-2}CurrentTime: Time of creation.
{http://docs.oasis-open.org/wsrf/rl-2}TerminationTime: Time when the resource expires.
{http://www.globus.org/namespaces/2008/03/gram/job/faults}fault: The fault (if generated) indicating the reason for failure of the job to complete.
{http://www.globus.org/namespaces/2008/03/gram/job/types}holding: Indicates whether a hold has been placed on this job.
{http://www.globus.org/namespaces/2008/03/gram/job/types}localUserId: The job owner's local user account name.
{http://www.globus.org/namespaces/2008/03/gram/job/description}serviceLevelAgreement: A wrapper around fields containing the single-job and multi-job descriptions or RSLs. Only one of these sub-fields shall have a non-null value.
{http://www.globus.org/namespaces/2008/03/gram/job/types}state: The current state of the job.
{http://www.globus.org/namespaces/2008/03/gram/job/multi}subJobEndpoint: A set of endpoint references to the sub-jobs created by this multi-job.
{http://www.globus.org/namespaces/2008/03/gram/job/types}userSubject: The GSI certificate DN of the job owner.
{http://www.globus.org/namespaces/2008/04/rendezvous}Capacity: Used for Rendezvous.
{http://www.globus.org/namespaces/2008/04/rendezvous}RegistrantData: Used for Rendezvous.
{http://www.globus.org/namespaces/2008/04/rendezvous}RendezvousCompleted: Used for Rendezvous.
{http://docs.oasis-open.org/wsrf/rp-2}QueryExpressionDialect: From the QueryResourceProperties port type.
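The sign convention of the factory resource properties maxJobLifetime and jobTTLAfterProcessing (a negative value means "no limit" or "never destroyed") is easy to get wrong in client code. The following sketch encodes just that convention. The class JobLifetimeRules and its method names are hypothetical, and how the service itself reacts to a request that exceeds the limit is not covered here.

```java
import java.time.Instant;
import java.util.Optional;

/** Hypothetical client-side helper (not a GRAM4 class) for the two lifetime-related values. */
final class JobLifetimeRules {
    private JobLifetimeRules() { }

    /** Returns true if the requested lifetime fits within maxJobLifetime; negative means no limit. */
    static boolean isLifetimeAllowed(long requestedSeconds, long maxJobLifetimeSeconds) {
        return maxJobLifetimeSeconds < 0 || requestedSeconds <= maxJobLifetimeSeconds;
    }

    /**
     * Returns the time at which the job resource would be destroyed after processing
     * has finished, or an empty Optional if the TTL is negative (never destroyed).
     */
    static Optional<Instant> destructionTime(Instant processingFinished, long jobTtlAfterProcessingSeconds) {
        if (jobTtlAfterProcessingSeconds < 0) {
            return Optional.empty();
        }
        return Optional.of(processingFinished.plusSeconds(jobTtlAfterProcessingSeconds));
    }

    public static void main(String[] args) {
        Instant finished = Instant.parse("2008-06-01T12:00:00Z");
        System.out.println(isLifetimeAllowed(7200, 3600));   // exceeds a one-hour limit
        System.out.println(isLifetimeAllowed(7200, -1));     // no limit configured
        System.out.println(destructionTime(finished, 600));  // ten minutes after processing
    }
}
```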
WSDL links:
Schema links:
Log output from GRAM4 is a useful tool for debugging issues. Because GRAM4 is built on top of Java WS Core, developer debugging is the same as described in Chapter 10, Debugging. For sys admin logging information, see Chapter 11, Admin Debugging.
The following information applies to Java WS Core and those services built on it.
Logging in the Java WS Core is based on the Jakarta Commons Logging API. Commons Logging provides a consistent interface for instrumenting source code while at the same time allowing the user to plug-in a different logging implementation. Currently we use Log4j as a logging implementation. Log4j uses a separate configuration file to configure itself. Please see Log4j documentation for details on the configuration file format.
Server side logging can be configured in $GLOBUS_LOCATION/container-log4j.properties when the container runs as a standalone container. For Tomcat-level logging, refer to Logging for Tomcat. The logger log4j.appender.A1 is used for developer logging and by default writes output to the system output. By default it is set to display all warnings in the Globus Toolkit package.
Additional logging can be enabled for a package by adding a new line to the configuration file. Example:
#for debug level logging from org.globus.package.name.FooClass
log4j.category.org.globus.package.name.FooClass=DEBUG
#for warnings from org.some.warn.package
log4j.category.org.some.warn.package=WARN
Client side logging can be configured in $GLOBUS_LOCATION/log4j.properties. The logger log4j.appender.A1 is used for developer logging and by default writes output to the system output. By default it is set for all warnings in the Globus Toolkit package to be displayed.
To turn on debug logging for the Managed Executable Job Service (MEJS), add the
following entry to the container-log4j.properties file:
log4j.category.org.globus.exec.service.exec=DEBUG
To turn on debug logging for the delegated proxy management code, add the
following entry to the container-log4j.properties file:
log4j.category.org.globus.exec.service.utils=DEBUG
To turn on debug logging for the Managed Multi Job Service (MMJS), add the
following entry to the container-log4j.properties file:
log4j.category.org.globus.exec.service.multi=DEBUG
To turn on debug logging for the Managed Job Factory Service (MJFS), add the
following entry to the container-log4j.properties file:
log4j.category.org.globus.exec.service.factory=DEBUG
To turn on debug logging for all GRAM code, add the following entry to the
container-log4j.properties file:
log4j.category.org.globus.exec=DEBUG
Follow the pattern to turn on logging for other specific packages or classes.
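The entries above rely on Log4j's dotted category hierarchy: a logger without its own entry inherits the level of its nearest configured ancestor. The stand-alone sketch below reproduces that lookup for log4j.category.* entries only, using java.util.Properties rather than Log4j itself. The class CategoryLevels and the rootDefault parameter are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical illustration (not Log4j code) of level inheritance along dotted category names. */
final class CategoryLevels {
    private static final String PREFIX = "log4j.category.";

    private CategoryLevels() { }

    /** Parses properties-file text such as the entries shown above. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory reader
        }
        return props;
    }

    /** Walks from the logger name up through its parent packages until an entry is found. */
    static String effectiveLevel(Properties config, String loggerName, String rootDefault) {
        String name = loggerName;
        while (true) {
            String level = config.getProperty(PREFIX + name);
            if (level != null) {
                return level.trim();
            }
            int dot = name.lastIndexOf('.');
            if (dot < 0) {
                return rootDefault; // no ancestor configured
            }
            name = name.substring(0, dot);
        }
    }

    public static void main(String[] args) {
        Properties config = parse(
                "log4j.category.org.globus.exec=DEBUG\n"
              + "log4j.category.org.globus.exec.service.multi=WARN\n");
        System.out.println(effectiveLevel(config, "org.globus.exec.service.exec.StateMachine", "WARN"));
        System.out.println(effectiveLevel(config, "org.globus.exec.service.multi.SomeClass", "WARN"));
    }
}
```

With this configuration, a logger under org.globus.exec.service.exec resolves to DEBUG through the org.globus.exec entry, while anything under org.globus.exec.service.multi resolves to the more specific WARN entry.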
Both the service and Java client API code contain special debugging statements which output certain timing data to help in determining performance bottlenecks.
The service code uses the PerformanceLog class to output the timings information. To turn on service timings logging without triggering full debug logging for the service code, add the following lines to the container-log4j.properties file:
log4j.category.org.globus.exec.service.factory.ManagedJobFactoryService.performance=DEBUG
log4j.category.org.globus.exec.service.exec.ManagedExecutableJobResource.performance=DEBUG
log4j.category.org.globus.exec.service.exec.StateMachine.performance=DEBUG
The Java client API has not been converted over to using the PerformanceLog class, so the debug statements are sent at the INFO level to avoid having to turn on full debug logging. To turn on client timings logging without triggering full debug logging for the client code, add the following line to the container-log4j.properties file:
log4j.category.org.globus.exec.client.e=INFO
Two parsing scripts for summarizing the service and client timing data are
available in the source distribution; they are not distributed in any GPT
package. They are located in
ws-gram/service/java/test/throughput/ and are
named parse-service-timings.pl and
parse-client-timings.pl. Both simply take
the path of the log file that contains the timing data. These
scripts work fine with log files that have other logging statements mixed in with
the timing data.
It may be necessary to debug the scheduler scripts if jobs aren't being submitted correctly and either no fault or a less-than-helpful fault is generated. Ideally this should not be necessary, so if you find that you must resort to it, please file a bug report or let us know on the discuss e-mail list.
By turning on debug logging for the MEJS (see above), you should be able to search for "Perl Job Description" in the logging output to find the perl form of the job description that is sent to the scheduler scripts.
Also by turning on debug logging for the MEJS, you should be able to search for "Executing command" in the logging output to find the specific commands that are executed when the scheduler scripts are invoked from the service code. If you saved the perl job description from the previous paragraph, then you can use this to manually run these commands.
There is a perl job description attribute named
logfile that isn't currently supported in the
XML job description that can be used to print debugging info about the
execution of the perl scripts. The value for this attribute is a path to a file
that will be created. You can add this to the perl job description file that
you created from the service debug logging before manually running the script
commands.
Beyond the above advice, you may want to edit the perl scripts themselves to print more detailed information. For more information on the location and composition of the scheduler scripts, please consult the GRAM4 Scheduler Interface Tutorial.
For a list of common errors in GT, see Error Codes.
For information about sys admin logging, see Chapter 11, Admin Debugging in the GRAM4 Admin Guide.
If you run into problems, you can do the following:
- Check the GRAM4 documentation. You may find hints there that solve your problem.
- Send e-mail to one of the Globus e-mail lists. You must subscribe to a list before you can send e-mail to it. See here for general e-mail lists and information on how to subscribe to a list and here for GRAM specific lists. The best lists for GRAM4-related problems are probably gt-user@globus.org and gram-user@globus.org.
- Check the container log for errors. If you don't find anything suspicious, increase the log level of GRAM4 or other relevant components; the additional logging information may tell you what is going wrong. General information about container logging can be found in the Logging in Java WS Core section.
To get debug information from GRAM4, un-comment the following line in
$GLOBUS_LOCATION/container-log4j.properties by removing the leading '#', and restart the GT4 server:
# log4j.category.org.globus.exec=DEBUG
The logging output can be found either on the console, if you started the container using
globus-start-container (maybe with arguments), or in $GLOBUS_LOCATION/var/container.log, if you started the container using the command globus-start-container-detached.
Table 9.1. Java WS Core Errors
| Error Code | Definition | Possible Solutions |
|---|---|---|
| Failed to acquire notification consumer home instance from registry | Caused by javax.naming.NameNotFoundException: Name services is not bound in this Context error. | Please see Running client programs from any directory if a client fails with this error. |
| The WS-Addressing 'To' request header is missing | This warning is logged by the container if the request did not contain the necessary WS-Addressing headers. The client either did not attempt to send those headers at all or is somehow misconfigured. | If you are using a Java client and launching it directly using the java executable, take a look at Appendix B, Running client programs from any directory. |
| java.io.IOException: Token length X > 33554432 | If you see this error in the container log, it usually means you are trying to connect to an HTTPS server using HTTP. For example, the service address specifies 8443 as the port number and http as the protocol name. | In general, use port number 8443 with the https protocol, and port number 8080 with the http protocol. |
| java.lang.NoSuchFieldError: DOCUMENT | This error usually indicates a mismatch between the version of Apache Axis that the code was compiled with and the version of Axis that the code is currently running with. | Make sure that the same version of Axis is used at compile time and at runtime. |
| org.globus.wsrf.InvalidResourceKeyException: Argument key is null / Resource key is missing | These errors usually indicate that a resource key was not passed with the request or that an invalid resource key was passed with the request (that is, the element QName of the resource key did not match what the service expected). | Make sure that the EPR used to invoke the service contains the appropriate resource key. If you are using a command-line tool, make sure to specify the resource key using the -k option or pass a complete EPR from a file using the -e option. |
| Unable to connect to localhost:xxx | Cannot resolve localhost. The machine's /etc/hosts isn't set up correctly and/or you do not have DNS for these machines. | There should always be an entry in /etc/hosts (or /etc/hostname/ on Debian) for localhost in the following format (IP address/fully qualified domain name/short name): 140.221.8.109 cognito.mcs.anl.gov cognito |
| org.globus.common.ChainedIOException: Failed to initialize security context | This may indicate that the user's proxy is invalid. | To correct the error, the user must properly initialize the user proxy. See grid-proxy-init for more information on proxy initialization. |
| Error: org.xml.sax.SAXException: Unregistered type: class xxx | This may indicate that an Axis generated XML type, defined by the WS RLS XSD, was not properly registered. While all the XML types should get registered upon deployment without intervention by the user, sometimes they do not. | To remedy the situation add a typeMapping to the server-config.wsdd file under globus_wsrf_replicalocation_service. Use the format shown here. |
| No socket factory for 'https' protocol | When a client fails with the following exception: java.io.IOException: No socket factory for 'https' protocol at org.apache.axis.transport.http.HTTPSender.getSocket(HTTPSender.java:179) at org.apache.axis.transport.http.HTTPSender.writeToSocket(HTTPSender.java:397) at org.apache.axis.transport.http.HTTPSender.invoke(HTTPSender.java:135) FIXME - it may have happened because... | Add the following to the client: import org.globus.axis.util.Util; ... static { Util.registerTransport(); } ... |
| No client transport named 'https' found | When a client fails with the following exception: No client transport named 'https' found at org.apache.axis.client.AxisClient.invoke(AxisClient.java:170) at org.apache.axis.client.Call.invokeEngine(Call.java:2726) The client is most likely loading an incorrect | Ensure that the GT4 installation directory is listed as the first entry in the CLASSPATH=/usr/local/globus-4.2.0:/foo/bar/others.jar:... If you are seeing this problem in Tomcat, copy the |
| ConcurrentModificationException in Tomcat 5.0.x | If the following exception is visible in the Tomcat logs at startup, it might cause the java.util.ConcurrentModificationException at java.util.HashMap$HashIterator.nextEntry(HashMap.java:782) at java.util.HashMap$EntryIterator.next(HashMap.java:824) at java.util.HashMap.putAllForCreate(HashMap.java:424) at java.util.HashMap.clone(HashMap.java:656) at mx4j.server.DefaultMBeanRepository.clone(DefaultMBeanRepository.java:56) The java.lang.NullPointerException at org.apache.coyote.tomcat5.CoyoteRequest.setAttribute(CoyoteRequest.java:1472) at org.apache.coyote.tomcat5.CoyoteRequestFacade.setAttribute(CoyoteRequestFacade.java:351) at org.globus.tomcat.coyote.valves.HTTPSValve.expose(HTTPSValve.java:99) These exceptions will prevent the transport security from working properly in Tomcat. | This is a Tomcat bug. Keep restarting Tomcat until it starts without the |
| java.net.SocketException: Invalid argument or cannot assign requested address | FIXME - what causes this? | If you see the $ export GLOBUS_OPTIONS="-Djava.net.preferIPv4Stack=true" |
| GAR deploy/undeploy fails with container is running error | A GAR file can only be deployed or undeployed locally while the container is off. However, GAR deployment/undeployment might still sometimes fail with this error even if the container is off. This usually happens if the container has crashed or was stopped improperly, preventing the container from cleaning up its state files. | To resolve this problem, delete any files under the |
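The "Token length" entry in Table 9.1 comes down to a mismatch between protocol and port. A client could catch the two documented combinations before connecting, as in the sketch below. The class EndpointSanityCheck is hypothetical, and it assumes that the container uses the ports mentioned in the table (8443 for https, 8080 for http); a site may configure other ports, in which case the check simply reports nothing.

```java
import java.net.URI;
import java.util.Optional;

/** Hypothetical pre-flight check (not a GRAM4 class) for protocol/port mismatches. */
final class EndpointSanityCheck {
    private EndpointSanityCheck() { }

    /** Returns a warning if the address pairs http with 8443 or https with 8080. */
    static Optional<String> mismatchWarning(String serviceAddress) {
        URI uri = URI.create(serviceAddress);
        String scheme = uri.getScheme();
        int port = uri.getPort(); // -1 if the address has no explicit port
        if ("http".equalsIgnoreCase(scheme) && port == 8443) {
            return Optional.of("Port 8443 is normally served over https, not http");
        }
        if ("https".equalsIgnoreCase(scheme) && port == 8080) {
            return Optional.of("Port 8080 is normally served over http, not https");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Example address only; the host name is a placeholder.
        System.out.println(mismatchWarning("http://example.org:8443/wsrf/services/ManagedJobFactoryService"));
    }
}
```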
Table 9.2. GRAM4 Errors
| Error Code | Definition | Possible Solutions |
|---|---|---|
globusrun-ws - error querying job state | During job submission, an error like this occurs:
globusrun-ws failed: Delegating user credentials...Done.
Submitting job...Done.
Job ID: xxxx
Termination time: xxxx
Current job state: Unsubmitted
globusrun-ws: Error querying job state
globus_soap_message_module: Failed sending request
ManagedJobPortType_GetMultipleResourceProperties.
globus_xio: An end of file occurred
| Periodically, globusrun-ws will query the GRAM service to check on the job state. The "End of file" indicates that the GRAM server dropped a connection when globusrun-ws tried to read a response. This could be caused by temporary network issues between the client and service, or possibly caused by an overloaded service host. |
globusrun-ws - error querying job state | During job submission, an error like this occurs:
globusrun-ws failed: Delegating user credentials...Done.
Submitting job...Done.
Job ID: xxxx
Termination time: xxxx
Current job state: Unsubmitted
globusrun-ws: Error querying job state
globus_soap_message_module: Failed sending request ManagedJobPortType_GetMultipleResourceProperties.
globus_xio: System error in read: Connection reset by peer
globus_xio: A system call failed: Connection reset by peer
|
Periodically, globusrun-ws will query the GRAM service to check on the job state. The "System error in read: Connection reset by peer" message indicates that the GRAM server dropped the connection while trying to write the response. This could be caused by temporary network issues between the client and service, or possibly by an overloaded service host. |
globusrun-ws - error submitting job | During job submission, an error like this occurs:
globusrun-ws -Ft PBS -F https://host.teragrid.org:8444 -submit -b -f /tmp/wsgram.rsl -o /tmp/wsgram.epr failed: Submitting job...Failed.
globusrun-ws: Error submitting job
globus_soap_message_module: Failed sending request ManagedJobFactoryPortType_createManagedJob.
globus_xio: Operation was canceled
globus_xio: Operation timed out
|
The "Operation timed out" message indicates that the GRAM service was not able to accept the job request and respond in time. This could be caused by temporary network issues between the client and service, or possibly by an overloaded service host. |
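The three errors above are all attributed to temporary network issues or an overloaded service host. A custom client may therefore want to retry the failed call after a short delay. The sketch below illustrates a generic retry with exponential backoff. It is an assumption about how a client could react, not a documented GRAM4 feature, and the names `TransientRetry` and `callWithRetry` are hypothetical. The operation being retried is supplied by the caller, and only `IOException` is treated as transient here.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper (not part of GT4): retries an operation that fails with a transient I/O error.
final class TransientRetry {

    /**
     * Calls operation up to maxAttempts times, doubling the delay after each IOException.
     * Any other exception, or an IOException on the last attempt, is rethrown to the caller.
     */
    static <T> T callWithRetry(Callable<T> operation, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up: the failure does not look temporary
                }
                Thread.sleep(delay);
                delay *= 2; // back off so an overloaded host is not hit harder
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated operation: fails twice with an I/O error, then succeeds.
        int[] calls = {0};
        String state = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("Connection reset by peer");
            }
            return "Done";
        }, 5, 100L);
        System.out.println("Result after " + calls[0] + " attempts: " + state);
    }
}
```

Bounding the number of attempts matters: if the service host stays overloaded, the client should report the error rather than keep adding load.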
C
- Condor
A job scheduler mechanism supported by GRAM. See http://www.cs.wisc.edu/condor/ for more information.
L
- LSF
A job scheduler mechanism supported by GRAM.
For more information, see http://www.platform.com/Products/Platform.LSF.Family/Platform.LSF/.
M
- Managed Executable Job Service (MEJS)
[FIXME]
- Managed Job Factory Service (MJFS)
[FIXME]
- Managed Multi Job Service (MMJS)
[FIXME]
- multijob
A job that is itself composed of several executable jobs; these are processed by the MMJS subjob.
See also MMJS subjob.
P
- Portable Batch System (PBS)
A job scheduler mechanism supported by GRAM. For more information, see http://www.openpbs.org.
S
- scheduler
Term used to describe a job scheduler mechanism to which GRAM interfaces. It is a networked system for submitting, controlling, and monitoring the workload of batch jobs on one or more computers. The jobs or tasks are scheduled for execution at a time chosen by the subsystem according to the applicable policy and the availability of resources. Popular job schedulers include Portable Batch System (PBS), Platform LSF, and IBM LoadLeveler.
W
- Web Services Addressing (WSA)
The WS-Addressing specification defines transport-neutral mechanisms to address web services and messages. Specifically, it defines XML elements to identify web service endpoints and to secure end-to-end endpoint identification in messages. See the W3C WS Addressing Working Group for details.
A
- apis, APIs
- links, Component API
- overview, Programming Model Overview
C
- compatibility, Backward compatibility summary
- containers, tested, Tested platforms
D
- debugging, Debugging
- dependencies, Technology dependencies
E
- errors, Errors
F
- features, Feature summary
P
- platforms, tested, Tested platforms
R
- resource properties, Resource properties
- Managed Executable Job Port Type, Managed Executable Job Port Type
- Managed Job Factory Port Type, Managed Job Factory Port Type
- Managed Job Port Type, Managed Job Port Type
- Managed Multi-Job Port Type, Managed Multi-Job Port Type
S
- services, Services and WSDL
T
- troubleshooting, Troubleshooting
- check container log, Troubleshooting tips
- check documentation, Troubleshooting tips
- errors, Troubleshooting
- mailing lists, Troubleshooting tips
W
- WSDL, Services and WSDL
