Managing Files on a Grid (GridFTP Quickstart)

1. Basic procedure for using GridFTP (globus-url-copy)

If you just want the "rules of thumb" on getting started (without all the details), the following options using globus-url-copy will normally give acceptable performance:

globus-url-copy -vb -tcp-bs 2097152 -p 4 source_url destination_url

where:

-vb

specifies verbose mode and displays:

  • number of bytes transferred,
  • performance since the last update (currently every 5 seconds), and
  • average performance for the whole transfer.

-tcp-bs

specifies the size (in bytes) of the TCP buffer to be used by the underlying FTP data channels. This is critical to good performance over the WAN.

How do I pick a value?

-p

specifies the number of parallel data connections to use. This is one of the most commonly used options.

How do I pick a value?
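
This section does not say how to choose a value for -tcp-bs. A common rule of thumb (an assumption here, not something this quickstart states) is to size the buffer at roughly the bandwidth-delay product of the path. The sketch below computes that figure; the function name and the sample link numbers are illustrative only.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s: float, rtt_seconds: float) -> int:
    """Return the bandwidth-delay product in bytes, rounded to the nearest byte."""
    # bits/s * s = bits in flight; divide by 8 to get bytes.
    return round(bandwidth_bits_per_s * rtt_seconds / 8)


# Hypothetical path: a 1 Gbit/s link with a 16 ms round-trip time.
buffer_bytes = bandwidth_delay_product_bytes(1_000_000_000, 0.016)
print(f"-tcp-bs {buffer_bytes}")
```

For this hypothetical path the result is close to the 2097152 used in the command above. Measure your own bandwidth and round-trip time before relying on any particular value.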

The source/destination URLs will normally be one of the following:

  • file:///path/to/my/file if you are accessing a file on a file system accessible by the host on which you are running your client.
  • gsiftp://hostname/path/to/remote/file if you are accessing a file from a GridFTP server.
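
If you call globus-url-copy from a script, the rule-of-thumb command can be assembled as an argument list. This is a minimal sketch: build_copy_command is a hypothetical helper name, and its defaults simply repeat the values shown above.

```python
import shlex


def build_copy_command(source_url, destination_url,
                       tcp_buffer_bytes=2097152, parallel_streams=4):
    """Return the argv list for the rule-of-thumb globus-url-copy call."""
    return [
        "globus-url-copy",
        "-vb",                                # verbose progress output
        "-tcp-bs", str(tcp_buffer_bytes),     # TCP buffer size in bytes
        "-p", str(parallel_streams),          # parallel data connections
        source_url,
        destination_url,
    ]


argv = build_copy_command("file:///path/to/my/file",
                          "gsiftp://hostname/path/to/remote/file")
print(shlex.join(argv))
# To run it for real (needs a GridFTP client and valid credentials):
# import subprocess; subprocess.run(argv, check=True)
```

Passing a list rather than a single string avoids shell quoting problems if a path contains spaces.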

1.1. Putting files

One of the most basic tasks in GridFTP is to "put" a file, i.e., to move a file from your file system to the server. For example, if you want to move the file /tmp/foo from a file system accessible to the host on which you are running your client to a file named /tmp/bar on a host named remote.machine.my.edu running a GridFTP server, you would use this command:

globus-url-copy -vb -tcp-bs 2097152 -p 4 file:///tmp/foo gsiftp://remote.machine.my.edu/tmp/bar

Note

In theory, remote.machine.my.edu could be the same host as the one on which you are running your client, but that is normally only done in testing situations.

1.2. Getting files

A get, i.e., moving a file from a server to your file system, just reverses the source and destination URLs:

globus-url-copy -vb -tcp-bs 2097152 -p 4 gsiftp://remote.machine.my.edu/tmp/bar file:///tmp/foo

Tip

Remember that file: always refers to your file system.

1.3. Third party transfers

Finally, if you want to move a file between two GridFTP servers (a third party transfer), both URLs would use gsiftp: as the protocol:

globus-url-copy -vb -tcp-bs 2097152 -p 4 gsiftp://other.machine.my.edu/tmp/foo gsiftp://remote.machine.my.edu/tmp/bar
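
The three cases above differ only in the URL schemes of the source and destination. The sketch below makes that mapping explicit; classify_transfer is a hypothetical helper, not part of any Globus tool, and it deliberately rejects scheme pairs this section does not discuss.

```python
from urllib.parse import urlparse


def classify_transfer(source_url, destination_url):
    """Name the kind of transfer implied by the two URL schemes."""
    schemes = (urlparse(source_url).scheme, urlparse(destination_url).scheme)
    kinds = {
        ("file", "gsiftp"): "put",             # local file system to server
        ("gsiftp", "file"): "get",             # server to local file system
        ("gsiftp", "gsiftp"): "third-party",   # server to server
    }
    if schemes not in kinds:
        raise ValueError(f"scheme pair not covered in this section: {schemes}")
    return kinds[schemes]


print(classify_transfer("file:///tmp/foo",
                        "gsiftp://remote.machine.my.edu/tmp/bar"))
print(classify_transfer("gsiftp://other.machine.my.edu/tmp/foo",
                        "gsiftp://remote.machine.my.edu/tmp/bar"))
```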

1.4. For more information

If you want more information and details on URLs and the command line options, the Key Concepts section gives basic definitions and an overview of the GridFTP protocol as well as our implementation of it.

2. Accessing data from other data interfaces

2.1. Accessing data in a non-POSIX file data source that has a POSIX interface

If you want to access data in a non-POSIX file data source that has a POSIX interface, the standard server will do just fine. Just make sure the interface is really POSIX-like (it supports out-of-order writes, contiguous byte writes, etc.).

2.2. GridFTP and DSIs

The following information is helpful if you want to use GridFTP to access data through DSIs (for storage systems such as HPSS and SRB) and in non-POSIX data sources.

Architecturally, the Globus GridFTP server can be divided into 3 modules:

  • the GridFTP protocol module,
  • the (optional) data transform module, and
  • the Data Storage Interface (DSI).

In the GT 4.2.0 implementation, the data transform module and the DSI have been merged, although we plan to have separate, chainable, data transform modules in the future.

Note

This architecture does NOT apply to the WU-FTPD implementation (GT3.2.1 and lower).

2.2.1. GridFTP Protocol Module

The GridFTP protocol module is the module that reads from and writes to the network and implements the GridFTP protocol. This module should not need to be modified, since doing so would make the server non-compliant with the protocol and unable to communicate with other servers.

2.2.2. Data Transform Functionality

The data transform functionality is invoked by using the ERET (extended retrieve) and ESTO (extended store) commands. It is seldom used and bears careful consideration before it is implemented, but in the right circumstances it can be very useful. In theory, any computation could be invoked this way, but it was primarily intended for cases where some simple pre-processing (such as a partial get or sub-sampling) can greatly reduce the network load. The disadvantage is that you remove any real option for planning, brokering, etc., and any significant computation could adversely affect the data transfer performance. Note that the client must also support the ESTO/ERET functionality.
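
To see why such pre-processing reduces network load, consider the conceptual sketch below. It is not the ERET wire syntax or the server's implementation; partial_get and subsample are hypothetical functions that only show how much less data leaves the server when the reduction happens before the transfer.

```python
def partial_get(data: bytes, offset: int, length: int) -> bytes:
    """Return only the requested byte range instead of the whole object."""
    return data[offset:offset + length]


def subsample(data: bytes, stride: int) -> bytes:
    """Return every stride-th byte, a crude stand-in for sub-sampling."""
    return data[::stride]


stored = bytes(range(256)) * 4096          # a 1 MiB object held by the server
window = partial_get(stored, 1024, 4096)   # client only needs 4 KiB of it
coarse = subsample(stored, 16)             # or a 1-in-16 sample

print(len(stored), len(window), len(coarse))
```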

2.2.3. Data Storage Interface (DSI) / Data Transform module

The Data Storage Interface (DSI) / Data Transform module knows how to read and write to the "local" storage system and can optionally transform the data. We put local in quotes because in a complicated storage system, the storage may not be directly attached, but for performance reasons, it should be relatively close (for instance on the same LAN).

The interface consists of functions to be implemented, such as send (get), receive (put), command (simple commands that simply succeed or fail, like mkdir), etc.

Once these functions have been implemented for a specific storage system, a client should not need to know or care what is actually providing the data. The server can be configured with a specific DSI, in which case it knows how to interact with a single class of storage system, or it can load and configure a DSI on the fly, which is one particularly useful application of the ESTO/ERET functionality mentioned above.
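
The idea of a storage interface can be sketched as follows. This is a conceptual illustration in Python only: the real DSI is written against the server's own API, and every class and method signature here (StorageInterface, InMemoryStorage, serve_get) is an assumption made for the example.

```python
from abc import ABC, abstractmethod


class StorageInterface(ABC):
    """Hypothetical shape of a DSI: the three kinds of functions named above."""

    @abstractmethod
    def send(self, path: str) -> bytes:
        """Read data from storage so it can be sent to a client (get)."""

    @abstractmethod
    def receive(self, path: str, data: bytes) -> None:
        """Write data received from a client into storage (put)."""

    @abstractmethod
    def command(self, name: str, path: str) -> bool:
        """Run a simple command that either succeeds or fails."""


class InMemoryStorage(StorageInterface):
    """Toy backend; a real one would talk to a system such as HPSS or SRB."""

    def __init__(self):
        self.files = {}
        self.directories = {"/"}

    def send(self, path):
        return self.files[path]

    def receive(self, path, data):
        self.files[path] = data

    def command(self, name, path):
        if name == "mkdir" and path not in self.directories:
            self.directories.add(path)
            return True
        return False


def serve_get(storage: StorageInterface, path: str) -> bytes:
    """Protocol-side code depends only on the interface, not the backend."""
    return storage.send(path)


backend = InMemoryStorage()
backend.receive("/tmp/bar", b"payload")
print(serve_get(backend, "/tmp/bar"))
```

Because serve_get depends only on the interface, swapping the backend does not change anything the client sees.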

See Appendix A, Developing DSIs for GridFTP for more information.

2.3. Latest information about HPSS

Last Update: August 2005

Working with Los Alamos National Laboratory and the High Performance Storage System (HPSS) collaboration (http://www.hpss-collaboration.org), we have written a Data Storage Interface (DSI) for read/write access to HPSS. This DSI would allow an existing application that uses a GridFTP-compliant client to utilize HPSS data resources.

This DSI is currently in testing. Due to changes in the HPSS security mechanisms, it requires HPSS 6.2 or later, which is due to be released in Q4 2005. Distribution for the DSI has not been worked out yet, but it will *probably* be available from both Globus and the HPSS collaboration. While this code will be open source, it requires underlying HPSS libraries which are NOT open source (proprietary).

Note

This is a purely server-side change. The client does not know what DSI is running, so only a site that is already running HPSS and wants to allow GridFTP access needs to worry about access to these proprietary libraries.

2.4. Latest information about SRB

Last Update: August 2005

Working with the SRB team at the San Diego Supercomputer Center, we have written a Data Storage Interface (DSI) for read/write access to data in the Storage Resource Broker (SRB) (http://www.npaci.edu/DICE/SRB). This DSI will enable GridFTP-compliant clients to read and write data to an SRB server, similar in functionality to the sput/sget commands.

This DSI is currently in testing and is not yet publicly available, but it will be available from both the SRB web site and the Globus web site. It will also be included in the next stable release of the toolkit. We are working on performance tests, but early results indicate that for wide area network (WAN) transfers, the performance is comparable.

You might want to use this functionality when:

  • You have existing tools that use GridFTP clients and you want to access data that is in SRB.
  • You have distributed data sets that have some of the data in SRB and some of the data available from GridFTP servers.

3. Pipelining

Pipelining allows the client to have many outstanding, unacknowledged transfer commands at once. Instead of being forced to wait for the "Finished response" message, the client is free to send transfer commands at any time.

Pipelining is enabled by using the -pp option:

globus-url-copy -pp
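
A rough timing model shows why this helps when moving many small files. The model is an assumption made for illustration: it charges one idle round trip per transfer command without pipelining, and only the first round trip with pipelining. Real behavior depends on the network and the server.

```python
def total_ms_without_pipelining(file_count, round_trip_ms, transfer_ms):
    """Each file waits a full round trip before its data starts moving."""
    return file_count * (round_trip_ms + transfer_ms)


def total_ms_with_pipelining(file_count, round_trip_ms, transfer_ms):
    """Commands are sent ahead, so only the first round trip is exposed."""
    return round_trip_ms + file_count * transfer_ms


# Hypothetical workload: 1000 small files, 100 ms round trip, 50 ms per file.
serial = total_ms_without_pipelining(1000, 100, 50)
pipelined = total_ms_with_pipelining(1000, 100, 50)
print(serial, pipelined)
```

Under these assumed numbers the idle round trips account for about two thirds of the unpipelined total, which is the cost pipelining is meant to remove.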

4. GridFTP Where There Is FTP (GWTFTP)

GridFTP Where There Is FTP (GWTFTP) is an intermediate program that acts as a proxy between existing FTP clients and GridFTP servers. Users can connect to GWTFTP with their favorite standard FTP client, and GWTFTP will then connect to a GridFTP server on the client’s behalf. To clients, GWTFTP looks much like an FTP proxy server. When wishing to contact a GridFTP server, FTP clients instead contact GWTFTP.

Clients tell GWTFTP their ultimate destination via the FTP USER <username> command. Instead of entering only their username, client users send the following:

USER <GWTFTP username>::<GridFTP server URL>

This command tells GWTFTP the GridFTP endpoint with which the client wants to communicate. For example:

USER bresnaha::gsiftp://wiggum.mcs.anl.gov:2811/

Note

Requires GSI C security.
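
The USER argument described above is just the two values joined by a double colon. The sketch below builds and splits that string; both function names are hypothetical, and the splitting rule is an assumption about the format rather than a description of how GWTFTP itself parses it.

```python
def gwtftp_user_command(gwtftp_username, gridftp_url):
    """Build the USER line an FTP client would send to GWTFTP."""
    return f"USER {gwtftp_username}::{gridftp_url}"


def split_gwtftp_user_argument(argument):
    """Split '<username>::<url>' at the first double colon."""
    username, separator, url = argument.partition("::")
    if not separator or not username or not url:
        raise ValueError(f"expected '<username>::<url>', got {argument!r}")
    return username, url


line = gwtftp_user_command("bresnaha", "gsiftp://wiggum.mcs.anl.gov:2811/")
print(line)
print(split_gwtftp_user_argument(line[len("USER "):]))
```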

5. Multicasting

To transfer a single file to many destinations in a multicast/broadcast, use the new -mc option.

Note

To use this option, the admin must enable multicasting.

globus-url-copy -vb -tcp-bs 2097152 -p 4 -mc filename source_url

The file named by filename must contain a line-separated list of destination URLs. For example:

gsiftp://localhost:5000/home/user/tst1
gsiftp://localhost:5000/home/user/tst3
gsiftp://localhost:5000/home/user/tst4
 

For more flexibility, you can also specify a single destination URL on the command line in addition to the URLs in the file. The first example below uses only the file; the second adds a destination on the command line:

globus-url-copy -mc multicast.file gsiftp://localhost/home/user/src_file

or

globus-url-copy -mc multicast.file gsiftp://localhost/home/user/src_file gsiftp://localhost/home/user/dest_file1
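
Generating the destination file from a script keeps the list and the command in step. The following is a minimal sketch; the helper names and the temporary directory are assumptions, and the command is only assembled, not run.

```python
import os
import tempfile


def write_destination_file(path, destination_urls):
    """Write one destination URL per line, as the -mc option expects."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(destination_urls) + "\n")


def build_multicast_command(destination_file, source_url):
    """Return the argv list for a multicast transfer of one source file."""
    return ["globus-url-copy", "-vb", "-tcp-bs", "2097152", "-p", "4",
            "-mc", destination_file, source_url]


destinations = [
    "gsiftp://localhost:5000/home/user/tst1",
    "gsiftp://localhost:5000/home/user/tst3",
    "gsiftp://localhost:5000/home/user/tst4",
]

with tempfile.TemporaryDirectory() as workdir:
    list_path = os.path.join(workdir, "multicast.file")
    write_destination_file(list_path, destinations)
    with open(list_path, encoding="utf-8") as handle:
        written = handle.read()
    argv = build_multicast_command(list_path,
                                   "gsiftp://localhost/home/user/src_file")

print(written, end="")
print(argv)
```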

5.1. Advanced multicasting options

Along with the list of destination URLs in a file, you can specify a set of options for each URL. To do this, append a ? to the resource string in the URL, followed by semicolon-separated key=value pairs. For example:

gsiftp://dst1.domain.com:5000/home/user/tst1?cc=1;tcpbs=10M;P=4

This indicates that the receiving host dst1.domain.com will use 4 parallel streams and a TCP buffer size of 10 MB, and will select 1 host when forwarding on data blocks. This URL is specified in the -mc file as described above.

The following is a list of key=value options and their meanings:

P=integer
The number of parallel streams this node will use when forwarding.
cc=integer
The number of URLs to which this node will forward data.
tcpbs=formatted integer
The TCP buffer size this node will use when forwarding.
urls=string list
The list of URLs that must be children of this node when the spanning tree is complete.
local_write=boolean: y|n
Determines whether this data will be written to a local disk or just forwarded on to the next hop. This is explained more in the Network Overlay section.
subject=string
The DN (distinguished name) to expect from the servers this node is connecting to.
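
The option syntax described above is simple enough to build and split with a few lines of code. This is a sketch under stated assumptions: both helper names are hypothetical, values are kept as plain strings, and the urls= option (whose list format is not shown here) is not given any special handling.

```python
def add_multicast_options(url, options):
    """Append '?key=value;key=value' to a destination URL."""
    if not options:
        return url
    pairs = ";".join(f"{key}={value}" for key, value in options.items())
    return f"{url}?{pairs}"


def split_multicast_options(url):
    """Return (base_url, options) where options maps keys to string values."""
    base, separator, query = url.partition("?")
    options = {}
    if separator:
        for pair in query.split(";"):
            if pair:
                key, _, value = pair.partition("=")
                options[key] = value
    return base, options


entry = add_multicast_options(
    "gsiftp://dst1.domain.com:5000/home/user/tst1",
    {"cc": 1, "tcpbs": "10M", "P": 4},
)
print(entry)
print(split_multicast_options(entry))
```

Each resulting string is one line of the -mc destination file.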

5.2. Network Overlay

In addition to allowing multicast, this function also allows for creating user-defined network routes.

If the local_write option is set to n, then no data will be written to the local disk; the data will only be forwarded on.

If the local_write option is set to n and is used with the cc=1 option, the data will be forwarded on to exactly 1 location.

This allows the user to create a network overlay of data hops using each GridFTP server as a router to the ultimate destination.