Usage Statistics Collection by the Globus Alliance
Beginning with GT 3.9.5, the Globus Toolkit has the capability to send usage statistics back to the Globus Alliance.
- Why are we doing this?
- The overview
- What is sent?
- How is the data sent?
- When is it sent?
- What will the data be used for?
The Globus Alliance receives support from government funding agencies. In a time of funding scarcity, these agencies must be able to demonstrate that the scientific community is benefiting from their investment. To this end, we want to provide generic usage data about such things as the following:
- how many people use GridFTP
- how many jobs run using GRAM
- how many GT4 web services containers are running
- Components affected (for GT 3.9.5) are:
- GridFTP
- Java WS Core
- C WS Core
- WS GRAM
- Reliable File Transfer (RFT) service
- The data sent is as generic as possible (see What is Sent? below).
- Every component affected has a section titled "Usage Statistics" in its Users and Admin guides that lists precisely what is sent and the configuration control that is available (which you can use to disable the ability to send the data).
- To make this a win-win proposition, we have made the receiver for the data available from CVS (follow the directions here). This means that a (virtual) organization can set up its own listener and collect organization-wide usage statistics.
By not opting out, and allowing these statistics to be reported back, you are explicitly supporting the further development of the Globus Toolkit.
The components affected (for GT 3.9.5) are GridFTP, Java WS Core, C WS Core, WS GRAM, and the Reliable File Transfer (RFT) Service. We send the "how much" data, not "the what" data.
For instance, GridFTP sends the number of bytes, how long the transfer took, how many streams were used, etc. It does NOT send filenames, usernames, or even the destination IP, because that would mean the source site was deciding to send information about the destination site.
Each component has a section in its Users and Admin guides listing the component-specific data that is sent, and the Admin guide explains the configuration options related to usage statistics. The exception is C WS Core, which sends no component-specific data beyond the header data listed below.
Header data that may be sent by every component, not including the component-specific data listed above, is:
- Component identifier
- Usage data format identifier
- Time stamp
- Source IP address
- Source hostname (to differentiate between hosts with identical private IP addresses)
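As an illustration only, the header fields listed above can be pictured as a small record. This is a minimal sketch, not the toolkit's actual wire format: the class name `UsageHeader`, the field names and types, the helper `host_key`, and the sample values are all assumptions made for the example. It also shows why the hostname is useful next to the IP address when counting hosts.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Tuple


@dataclass(frozen=True)
class UsageHeader:
    """Hypothetical container for the header fields listed above."""
    component_id: int       # component identifier (numeric type is an assumption)
    format_id: int          # usage data format identifier (assumed numeric)
    timestamp: datetime     # time stamp
    source_ip: str          # source IP address
    source_hostname: str    # separates hosts that share a private IP address


def host_key(header: UsageHeader) -> Tuple[str, str]:
    """Return the (IP, hostname) pair used to tell hosts apart."""
    return (header.source_ip, header.source_hostname)


# Made-up sample data: two different hosts behind the same private IP address,
# one of which reports twice.
headers = [
    UsageHeader(1, 1, datetime(2005, 4, 1, tzinfo=timezone.utc),
                "192.168.0.10", "node1.site-a.example"),
    UsageHeader(1, 1, datetime(2005, 4, 2, tzinfo=timezone.utc),
                "192.168.0.10", "node1.site-b.example"),
    UsageHeader(1, 1, datetime(2005, 4, 3, tzinfo=timezone.utc),
                "192.168.0.10", "node1.site-a.example"),
]

# Counting by IP alone would merge both hosts; counting by (IP, hostname) does not.
distinct_by_ip = len({h.source_ip for h in headers})
distinct_hosts = len({host_key(h) for h in headers})
print(distinct_by_ip, distinct_hosts)
```

Keying on the pair rather than the IP alone is the design point here: hosts on different private networks can report the same address.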
The data will be used for answering questions such as:
- How many jobs were run with GRAM last month?
- How many gigabytes of data has GridFTP moved?
We will also try to mine the data to answer operational questions such as:
- What percentage of the jobs run complete successfully?
- Of the ones that fail, what is the most common fault code returned?
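The two operational questions above amount to simple aggregate counts. The sketch below shows one way such counts could be computed; the record layout (a success flag plus a fault code), the fault code strings, and the function names are hypothetical and are not taken from any Globus component.

```python
from collections import Counter

# Hypothetical job records: (succeeded, fault_code). No site or user data is involved.
jobs = [
    (True, None),
    (False, "FAULT_A"),
    (True, None),
    (False, "FAULT_B"),
    (False, "FAULT_A"),
]


def success_percentage(records):
    """Percentage of records whose success flag is set (0.0 for no records)."""
    if not records:
        return 0.0
    succeeded = sum(1 for ok, _ in records if ok)
    return 100.0 * succeeded / len(records)


def most_common_fault(records):
    """Most frequent fault code among failed records, or None if none failed."""
    faults = Counter(code for ok, code in records if not ok)
    return faults.most_common(1)[0][0] if faults else None


print(success_percentage(jobs))
print(most_common_fault(jobs))
```

Both functions work on totals only, which matches the stated intent of answering aggregate questions rather than per-site ones.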
The data will NOT be used to answer questions such as "IP 123.456.789.012 sent 10 TB of data last month."
Our intent is to make the data that we get generic enough that we do not have to worry about what is done with it. We record the IP only for counting purposes, to know how many sites there are, but we will not produce site-specific statistics. Feedback from our user communities will be useful in determining our path forward with this in the future. If you have concerns or objections, we ask that you be specific in your feedback. For example, "Our site has a policy against sending such data" is good information for us to know in the future. A link to such a policy would be even better.