August 8, 2013 | Ian Foster

Researchers who move data over the Internet with tools such as FTP rely on the TCP communication protocol to detect and retransmit data packets that have become corrupted in transit. It turns out that in doing so, they are leaning on an extremely weak reed. TCP's 16-bit checksum means that roughly 1 in 65,536 bad packets will be erroneously accepted as correct. You might think that corrupted packets are rare. But in 1999, Vern Paxson reported (see “End-to-End Internet Packet Dynamics”) that around 1 in 5,000 Internet data packets is corrupted in transit, meaning that around 1 in every 65K × 5K ≈ 300M packets is accepted with corruption. Interestingly, he suggested that the source of these corrupted packets was not long-haul networks (which, with the advent of optical fiber, are highly reliable) but the digital circuitry in routers.
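To make that arithmetic concrete, here is a small back-of-the-envelope sketch using only the two figures quoted above. The 1,500-byte packet size and the 1 TB transfer are illustrative assumptions of mine, not numbers from Paxson's paper, and the calculation assumes a corrupted packet is equally likely to land on any of the 65,536 checksum values.

```python
# Figures quoted in the text above.
CHECKSUM_BITS = 16                 # width of the TCP checksum
PACKETS_PER_CORRUPTION = 5_000     # about 1 in 5,000 packets is corrupted in transit

# A corrupted packet slips through if it happens to match the checksum:
# about 1 chance in 2**16, assuming all checksum values are equally likely.
CORRUPTIONS_PER_MISS = 2 ** CHECKSUM_BITS   # 65,536

# Packets sent, on average, per corrupted packet that is accepted as correct.
packets_per_undetected_error = PACKETS_PER_CORRUPTION * CORRUPTIONS_PER_MISS


def expected_undetected_packets(transfer_bytes: int, packet_bytes: int = 1_500) -> float:
    """Expected number of corrupted-but-accepted packets in one transfer.

    packet_bytes is an assumed typical packet size, used only for illustration.
    """
    packets_sent = transfer_bytes / packet_bytes
    return packets_sent / packets_per_undetected_error


print(f"1 undetected bad packet per {packets_per_undetected_error:,} packets")
print(f"Expected undetected bad packets in a 1 TB transfer: "
      f"{expected_undetected_packets(10 ** 12):.2f}")
```

Under these assumptions, a single terabyte-scale transfer is already large enough that an undetected bad packet is more likely than not.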

Recognizing the potential for corruption, we incorporated an additional 128-bit checksum computation into Globus Online. This reduces the number of undetected bad packets dramatically, to just one in 2 × 10^13. In December 2012, we went one step further, and turned checksum computations on by default. It is fortunate that we did, given the following announcement from the XSEDE national eScience infrastructure operator:

XSEDE was notified recently by Internet2 that an error was discovered on the devices that Internet2 uses on its AL2S network that could possibly lead to data corruption. This error could have affected approximately 0.001% of the data that traversed each AL2S device and was undetectable by the standard TCP packet checksum.

Fortunately, as this announcement also noted, data transfers initiated with Globus Online were not affected, due to our default checksum computations.
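For readers curious how corruption can slip past a 16-bit checksum yet be caught by a longer one, here is a minimal sketch. It is not the Globus Online implementation: the `internet_checksum` function is a simplified ones'-complement sum in the style of RFC 1071 (it ignores the TCP pseudo-header), MD5 is used only as an assumed stand-in for "a 128-bit checksum", and the sample payload and the `strong_digest` helper are hypothetical.

```python
import hashlib


def internet_checksum(data: bytes) -> int:
    """Simplified 16-bit ones'-complement checksum (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def strong_digest(data: bytes) -> str:
    """128-bit digest; MD5 is an assumption, chosen only for its width."""
    return hashlib.md5(data).hexdigest()


original = b"ABCDEFGH"
# Hypothetical corruption: two aligned 16-bit words ("AB" and "CD") trade places.
corrupted = b"CDABEFGH"

# Addition is order-independent, so the 16-bit sum cannot see the swap.
print("16-bit checksums match: ",
      internet_checksum(original) == internet_checksum(corrupted))
# The 128-bit digest depends on byte order, so the two payloads differ.
print("128-bit digests match:  ",
      strong_digest(original) == strong_digest(corrupted))
```

Word reordering is only one class of error the 16-bit sum misses, but it shows why an end-to-end check computed over the whole file, independent of the network path, is worth the extra computation.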

At a time when science is under ever-more intensive scrutiny, it is scary to think that precious research data might be corrupted as a side effect of transferring it over a network. This story emphasizes the importance of using high-quality research data management tools such as Globus Online—and the lengths to which we go to ensure that your research data is safe.