Name
globus-url-copy — Multi-protocol data movement
Synopsis
globus-url-copy
Tool description
globus-url-copy is a scriptable command line tool that can do multi-protocol data movement. It supports gsiftp:// (GridFTP), ftp://, http://, https://, and file:/// protocol specifiers in the URL. For GridFTP, globus-url-copy supports all implemented functionality. GT 3.2 and later versions support file globbing and directory moves.
Before you begin
Important: First, as with all things Grid, you must have a valid proxy certificate to run globus-url-copy with certain protocols (gsiftp:// and https://). If you are using the ftp:// or http:// protocols, security is not mandatory and you may skip the rest of this section.
If you do not have a certificate, you must obtain one.
If you are doing this for testing in your own environment, the SimpleCA provided with the Globus Toolkit should suffice.
If not, you must contact the Virtual Organization (VO) with which you are associated to find out whom to ask for a certificate.
One common source is the DOE Science Grid CA, although you must confirm whether or not the resources you wish to access will accept their certificates.
Instructions for proper installation of the certificate should be provided from the source of the certificate.
Please note when your certificates expire; they will need to be renewed or you may lose access to your resources.
Now that you have a certificate, you must generate a temporary proxy. Do this by running:
grid-proxy-init
See the grid-proxy-init documentation for further details.
You are now ready to use globus-url-copy! See the following sections for syntax and command line options and other considerations.
Command syntax
The basic syntax for globus-url-copy is:
globus-url-copy [optional command line switches] Source_URL Destination_URL
where:
| [optional command line switches] | See Command line options below for a list of available options. |
| Source_URL | Specifies the original URL of the file(s) to be copied. If this is a directory, all files within that directory will be copied. |
| Destination_URL | Specifies the URL where you want to copy the files. If you want to copy multiple files, this must be a directory. |
Note: Any URL specifying a directory must end with /.
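As a minimal sketch of the syntax above, the following Python snippet assembles a globus-url-copy argument list and enforces the trailing-slash rule for directory URLs. The helper name build_copy_command, the host, and the paths are hypothetical. The snippet only builds the command line; it does not run the tool.

```python
import shlex

def build_copy_command(source_url, dest_url, switches=()):
    """Return an argv list: globus-url-copy [switches] Source_URL Destination_URL."""
    # A directory source (trailing /) copies multiple files, so the
    # destination must also be a directory URL ending with /.
    if source_url.endswith("/") and not dest_url.endswith("/"):
        raise ValueError("destination must end with / when the source is a directory")
    return ["globus-url-copy", *switches, source_url, dest_url]

# Hypothetical host and paths, for illustration only.
cmd = build_copy_command(
    "gsiftp://myhost.mydomain.com/data/",
    "file:///tmp/data/",
    switches=["-vb"],
)
print(shlex.join(cmd))
```

Passing the resulting list to a process launcher avoids shell quoting problems with unusual characters in URLs.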
URL prefixes
As of GT 3.2, we support the following URL prefixes:
- file:// (on a local machine only)
- ftp://
- gsiftp://
- http://
- https://
By default, globus-url-copy expects the same kind of host certificates that globusrun expects from gatekeepers.
Note: We do not provide an interactive client similar to the generic FTP client provided with Linux. See the Interactive clients for GridFTP section below for information on an interactive client developed by NCSA/NMI/TeraGrid.
URL formats
URLs can be any valid URL as defined by RFC 1738 that uses a protocol we support. In general, they have the following format: protocol://host:port/path.
Note: If the path ends with a trailing / (i.e.
Table 1. URL formats
| URL | Description |
|---|---|
| gsiftp://myhost.mydomain.com:2812/data/foo.dat | Fully specified. |
| http://myhost.mydomain.com/mywebpage/default.html | Port is not specified; therefore, GridFTP uses the protocol default (in this case, 80). |
| file:///foo.dat | Host is not specified; therefore, GridFTP uses your local host. No port applies to a local file. |
| file:/foo.dat | This is also valid but is not recommended: many servers (including ours) accept this format, but it is not RFC conformant. |
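To see how these URL forms break down into protocol, host, port, and path, here is a small sketch using Python's standard urllib.parse module. The describe helper and the DEFAULT_PORTS table are illustrative names, not part of globus-url-copy. The sketch treats file URLs as having no port and a missing host as the local host.

```python
from urllib.parse import urlsplit

# Well-known default ports assumed for this sketch; file has none.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21, "gsiftp": 2811}

def describe(url):
    """Split a URL into (protocol, host, port, path), applying defaults."""
    parts = urlsplit(url)
    host = parts.hostname or "localhost"   # no host means the local machine
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, host, port, parts.path

for url in (
    "gsiftp://myhost.mydomain.com:2812/data/foo.dat",
    "http://myhost.mydomain.com/mywebpage/default.html",
    "file:///foo.dat",
    "file:/foo.dat",
):
    print(describe(url))
```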
Important: For GridFTP ( gsiftp://
If you are using GSI security, then you may specify the username (but you may not include the
If you are using anonymous FTP, the username must be one of the usernames listed as a valid anonymous name and the password can be anything.
If you are using password authentication, you must specify both your username and password. THIS IS HIGHLY DISCOURAGED, AS YOU ARE SENDING YOUR PASSWORD IN THE CLEAR ON THE NETWORK. This is worse than no security; it is a false illusion of security.
Command line options
Informational Options
- -help | -usage
Prints help.
- -version
Prints the version of this program.
- -versions
Prints the versions of all modules that this program uses.
- -q | -quiet
Suppresses all output for successful operation.
- -vb | -verbose
During the transfer, displays:
- number of bytes transferred,
- performance since the last update (currently every 5 seconds), and
- average performance for the whole transfer.
- -dbg | -debugftp
Debugs FTP connections and prints the entire control channel protocol exchange to STDERR.
Very useful for debugging. Please provide this any time you are requesting assistance with a globus-url-copy problem.
- -list <url>
This option will display a directory listing for the given url.
Utility Ease of Use Options
- -a | -ascii
Converts the file to/from ASCII format to/from local file format.
- -b | -binary
Does not apply any conversion to the files. This option is turned on by default.
- -f <filename>
Reads a list of URL pairs from a file.
Each line should contain:
sourceURL destURL
Enclose URLs with spaces in double quotes ("). Blank lines and lines beginning with the hash sign (#) will be ignored.
- -r | -recurse
Copies files in subdirectories.
- -notpt | -no-third-party-transfers
Turns third-party transfers off (on by default).
Site firewall and/or software configuration may prevent a connection between the two servers (a third party transfer). If this is the case, globus-url-copy will "relay" the data. It will do a GET from the source and a PUT to the destination.
This obviously causes a performance penalty but will allow you to complete a transfer you otherwise could not do.
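The file format read by the -f option can be illustrated with a short parser. This is only a sketch of the rules stated above (one source/destination pair per line, double quotes around URLs with spaces, blank lines and # lines ignored); it is not the tool's own parser, and the function name, hosts, and paths are hypothetical. It also ignores leading whitespace before the hash sign, which is an assumption.

```python
import shlex

def parse_url_pairs(text):
    """Parse 'sourceURL destURL' lines into a list of (source, dest) tuples."""
    pairs = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines and comment lines are skipped.
        if not stripped or stripped.startswith("#"):
            continue
        # shlex.split honours double quotes, so quoted URLs may contain spaces.
        fields = shlex.split(stripped)
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 URLs, got {len(fields)}")
        pairs.append((fields[0], fields[1]))
    return pairs

sample = '''\
# nightly transfers (hypothetical hosts and paths)
gsiftp://src.example.org/data/a.dat gsiftp://dst.example.org/data/a.dat

"gsiftp://src.example.org/data/my file.dat" "file:///tmp/my file.dat"
'''
pairs = parse_url_pairs(sample)
print(pairs)
```

Validating a list this way before handing it to -f can catch a missing destination or an unquoted space early.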
Reliability Options
- -rst | -restart
Restarts failed FTP operations.
- -rst-retries <retries>
Specifies the maximum number of times to retry the operation before giving up on the transfer.
Use 0 for infinite.
The default value is 5.
- -rst-interval <seconds>
Specifies the interval in seconds to wait after a failure before retrying the transfer.
Use 0 for an exponential backoff.
The default value is 0.
- -rst-timeout <seconds>
Specifies the maximum time after a failure to keep retrying.
Use 0 for no timeout.
The default value is 0.
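The interaction of the three -rst-* values can be sketched as a wait schedule. The following is an illustration under stated assumptions, not the tool's actual algorithm: the source says only that an interval of 0 selects exponential backoff, so the 1 second starting wait and the doubling factor are assumptions, as is measuring the timeout as cumulative waiting time. The function name retry_waits is hypothetical.

```python
import itertools

def retry_waits(retries=5, interval=0, timeout=0):
    """Yield the wait in seconds before each retry, under the assumptions above."""
    # retries == 0 means retry indefinitely.
    attempts = itertools.count(1) if retries == 0 else range(1, retries + 1)
    elapsed = 0
    for attempt in attempts:
        # interval == 0 selects exponential backoff; the 1 s base and the
        # doubling are assumptions, not documented behaviour.
        wait = interval if interval > 0 else 2 ** (attempt - 1)
        elapsed += wait
        # timeout == 0 means no time limit.
        if timeout > 0 and elapsed > timeout:
            return
        yield wait

print(list(retry_waits()))                          # defaults
print(list(retry_waits(retries=3, interval=10)))    # fixed interval
print(list(retry_waits(retries=0, timeout=60)))     # unlimited retries, time-bounded
```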
Performance Options
- -tcp-bs <size> | -tcp-buffer-size <size>
Specifies the size (in bytes) of the TCP buffer to be used by the underlying ftp data channels.
Important: This is critical to good performance over the WAN.
- -p <parallelism> | -parallel <parallelism>
Specifies the number of parallel data connections that should be used.
Note: This is one of the most commonly used options.
- -bs <block size> | -block-size <block size>
Specifies the size (in bytes) of the buffer to be used by the underlying transfer methods.
- -pp
(New starting with GT 4.1.3) Allows pipelining. GridFTP is a command response protocol. A client sends one command and then waits for a "Finished response" before sending another. Adding this overhead on a per-file basis for a large data set partitioned into many small files makes the performance suffer. Pipelining allows the client to have many outstanding, unacknowledged transfer commands at once. Instead of being forced to wait for the "Finished response" message, the client is free to send transfer commands at any time.
- -mc <filename> <source_url>
(New starting with GT 4.2.0) Transfers a single file to many destinations. <filename> is a line-separated list of destination URLs.
Multicasting must be enabled for use on the server side.
Security Related Options
- -s <subject> | -subject <subject>
Specifies a subject to match with both the source and destination servers.
Note: Used when the server does not have access to the host certificate (usually when you are running the server as a user). See the section called “If you run a GridFTP server by hand...”.
- -ss <subject> | -source-subject <subject>
Specifies a subject to match with the source server.
Note: Used when the server does not have access to the host certificate (usually when you are running the server as a user). See the section called “If you run a GridFTP server by hand...”.
- -ds <subject> | -dest-subject <subject>
Specifies a subject to match with the destination server.
Note: Used when the server does not have access to the host certificate (usually when you are running the server as a user). See the section called “If you run a GridFTP server by hand...”.
- -nodcau | -no-data-channel-authentication
Turns off data channel authentication for FTP transfers (the default is to authenticate the data channel).
Warning: We do not recommend this option, as it is a security risk.
- -dcsafe | -data-channel-safe
Sets data channel protection mode to SAFE.
Otherwise known as integrity or checksumming.
Guarantees that the data channel has not been altered, though a malicious party may have observed the data.
Warning: Rarely used as there is a substantial performance penalty.
- -dcpriv | -data-channel-private
Sets data channel protection mode to PRIVATE.
The data channel is encrypted and checksummed.
Guarantees that the data channel has not been altered and, if observed, it won't be understandable.
Warning: VERY rarely used due to the VERY substantial performance penalty.
Default globus-url-copy usage
A globus-url-copy invocation using the gsiftp protocol with no options (i.e., using all the defaults) will perform a transfer with the following characteristics:
- binary
- stream mode (which implies no parallelism)
- host default TCP buffer size
- encrypted and checksummed control channel
- an authenticated data channel
MODES in GridFTP
GridFTP (as well as normal FTP) defines multiple wire protocols, or MODES, for the data channel.
Most normal FTP servers only implement stream mode (MODE S), i.e. the bytes flow in order over a single TCP connection. GridFTP defaults to this mode so that it is compatible with normal FTP servers.
However, GridFTP has another MODE, called Extended Block Mode, or MODE E. This mode sends the data over the data channel in blocks. Each block consists of 8 bits of flags, a 64-bit integer indicating the offset from the start of the transfer, and a 64-bit integer indicating the length of the block in bytes, followed by a payload of that many bytes. Because the offset and length are provided, out-of-order arrival is acceptable, i.e. the 10th block could arrive before the 9th because you know explicitly where it belongs. This allows us to use multiple TCP channels. If you use the -p | -parallel option, globus-url-copy automatically puts the servers into MODE E.
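The following sketch shows why a self-describing block header makes out-of-order arrival harmless. It packs blocks with a flags byte, an offset, and a length, shuffles them, and reassembles the original data. The byte order and the field order (taken from the order in which the paragraph above lists the fields) are assumptions of this sketch; the GridFTP protocol specification is the authority on the actual wire layout. All function names are hypothetical.

```python
import struct

# flags (8 bits), offset (64 bits), length (64 bits); big-endian is assumed.
HEADER = struct.Struct(">BQQ")

def pack_block(offset, payload, flags=0):
    """Prefix a payload with its header."""
    return HEADER.pack(flags, offset, len(payload)) + payload

def unpack_block(block):
    """Return (flags, offset, payload) from a packed block."""
    flags, offset, length = HEADER.unpack_from(block)
    payload = block[HEADER.size:HEADER.size + length]
    return flags, offset, payload

def reassemble(blocks, total_size):
    """Place each payload at its stated offset, whatever the arrival order."""
    buf = bytearray(total_size)
    for block in blocks:
        _, offset, payload = unpack_block(block)
        buf[offset:offset + len(payload)] = payload
    return bytes(buf)

data = b"The quick brown fox jumps over the lazy dog"
chunk = 8
blocks = [pack_block(i, data[i:i + chunk]) for i in range(0, len(data), chunk)]
blocks.reverse()  # simulate blocks arriving out of order over several channels
restored = reassemble(blocks, len(data))
print(restored == data)
```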
If you run a GridFTP server by hand...
If you run a GridFTP server by hand, you will need to explicitly specify the subject name to expect. The subject option provides globus-url-copy with a way to validate the remote servers with which it is communicating. Not only must the server trust globus-url-copy, but globus-url-copy must trust that it is talking to the correct server. The validation is done by comparing host DNs or subjects.
If the GridFTP server in question is running under a host certificate, then the client assumes a subject name based on the server's canonical DNS name. However, if it was started under a user certificate, as is the case when a server is started by hand, then the expected subject name must be explicitly stated. This is done with the -ss, -ds, and -s options.
- -ss
Sets the sourceURL subject.
- -ds
Sets the destURL subject.
- -s
If you use this option alone, it will set the subject for both URLs to the same value. You can see an example of this usage under the Troubleshooting section.
Note: This is an unusual use of the client. Most times you need to specify both URLs.
How do I choose a value?
How do I choose a value for the TCP buffer size (-tcp-bs) option?
The value you should pick for the TCP buffer size (-tcp-bs) depends on how fast you want to go (your bandwidth) and how far you are moving the data (as measured by the Round Trip Time (RTT), the time it takes a packet to get to the destination and back).
To calculate the value for -tcp-bs, use the following formula (this assumes that Mega means 1000^2 rather than 1024^2, which is typical for bandwidth):
-tcp-bs = bandwidth in Megabits per second (Mb/s) * RTT in milliseconds (ms) * 1000 / 8
As an example, if you are using Fast Ethernet (100 Mb/s) and the RTT is 50 ms, it would be:
-tcp-bs = 100 * 50 * 1000 / 8 = 625,000 bytes.
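The same arithmetic can be written as a small helper, which is convenient when scripting transfers. The function name tcp_buffer_size is hypothetical; the formula is the one given above.

```python
def tcp_buffer_size(bandwidth_mbps, rtt_ms):
    """Bandwidth delay product in bytes, with Mega meaning 1000**2."""
    # Mb/s * ms gives thousands of bits; * 1000 gives bits; / 8 gives bytes.
    return int(bandwidth_mbps * rtt_ms * 1000 / 8)

size = tcp_buffer_size(100, 50)       # Fast Ethernet with a 50 ms RTT
switches = ["-tcp-bs", str(size)]     # switch and value for the command line
print(size, switches)
```

The resulting switches list can be placed ahead of the source and destination URLs when the command line is assembled.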
So, how do you come up with values for bandwidth and RTT? To determine RTT, use either ping or traceroute. They both list RTT values.
Note: You must be on one end of the transfer and ping the other end. This means that if you are doing a third party transfer you have to run the ping or traceroute between the two server hosts, not from your client.
The bandwidth is a little trickier. Any point in the network can be the bottleneck, so you either need to talk with your network engineers to find out what the bottleneck link is or just assume that your host is the bottleneck and use the speed of your network interface card (NIC).
So where does this formula come from? Because it uses the bandwidth and the RTT (also known as the latency or delay) it is called the bandwidth delay product. The very simple explanation is this: TCP is a reliable protocol. It must save a copy of everything it sends out over the network until the other end acknowledges that it has been received.
As a simple example, if I can put one byte per second onto the network, and it takes 10 seconds for that byte to get there, and 10 seconds for the acknowledgment to get back (RTT = 20 seconds), then I would need at least 20 bytes of storage. Then, hopefully, by the time I am ready to send byte 21, I have received an acknowledgment for byte 1 and I can free that space in my buffer. For a more detailed explanation, consult a guide on TCP tuning.
How do I choose a value for the parallelism (-p) option?
For most instances, using 4 streams is a very good rule of thumb. Unfortunately, there is not a good formula for picking an exact answer. A plot of bandwidth against the number of streams has a very characteristic shape.
You get a strong, nearly linear increase in bandwidth, then a sharp knee, after which additional streams have very little impact. Where this knee is depends on many things, but it is generally between 2 and 10 streams. Higher bandwidth, longer round trip times, and more congestion in the network (which you usually can only guess at based on how applications are behaving) will move the knee higher (more streams needed).
In practice, between 4 and 8 streams are usually sufficient. If things look really bad, try 16 and see how much difference that makes over 8. However, anything above 16, other than for academic interest, is basically wasting resources.
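Since there is no formula, one practical approach is to time a few test transfers at different -p values and look for the knee. The sketch below picks the stream count after which doubling the streams gains less than a chosen threshold. The function name, the 10% threshold, and the throughput figures are all made up for illustration; they are not measurements.

```python
def find_knee(throughput_by_streams, min_gain=0.10):
    """Return the stream count after which more streams gain less than min_gain."""
    counts = sorted(throughput_by_streams)
    for current, nxt in zip(counts, counts[1:]):
        gain = throughput_by_streams[nxt] / throughput_by_streams[current] - 1
        if gain < min_gain:
            return current
    return counts[-1]  # no knee seen within the tested range

# Made-up sample data (streams -> Mb/s), for illustration only.
measured = {1: 110, 2: 215, 4: 400, 8: 430, 16: 435}
print(find_knee(measured))
```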
Interactive clients for GridFTP
The Globus Project does not provide an interactive client for GridFTP. Any normal FTP client will work with a GridFTP server, but it cannot take advantage of the advanced features of GridFTP. The interactive clients listed below take advantage of the advanced features of GridFTP.
There is no endorsement implied by their presence here. We make no assertion as to the quality or appropriateness of these tools; we simply provide this list for your convenience. We will not answer questions, accept bugs, or in any way, shape, or form be responsible for these tools, although they should have mechanisms of their own for such things.
UberFTP was developed at the NCSA under the auspices of NMI and TeraGrid:
- NCSA Uberftp only download: http://dims.ncsa.uiuc.edu/set/uberftp/download.html
- UberFTP User's Guide: http://dims.ncsa.uiuc.edu/set/uberftp/userdoc.html
