Optimising network transfers to and from Queen Mary University of London, a large WLCG Tier-2 grid site

Optimising network performance is key to the high bandwidth data transfers required by a Tier-2 site. We describe the techniques we have used to obtain good performance. Monitoring plays a key part, as does the elimination of bottlenecks and the tuning of TCP window sizes. Multiple parallel transfers allowed us to saturate a 1 Gbit/s link for 24 hours whilst still achieving acceptable download speeds. Source based routing and multiple data transfer servers allowed us to use an otherwise unused "resilient" link.


Introduction
Analysis of the large quantities of data from the Large Hadron Collider (LHC) is performed using a distributed network of computing centres, the Worldwide LHC Computing Grid (WLCG). Queen Mary University of London (QMUL) hosts a Tier-2 WLCG site with 1.8 PB of storage. Making optimum use of Wide Area Network (WAN) links is key to filling the storage, which would take approximately six months at an average of 1 Gbit/s, our nominal WAN capacity until September 2012. Six months is commensurate with the reprocessing cycle of the experimental data.
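The six-month figure follows directly from the storage capacity and the nominal link speed; a quick sanity check (taking 1 PB as 10^15 bytes):

```python
# Time to fill 1.8 PB of storage over a sustained 1 Gbit/s WAN link.
storage_bytes = 1.8e15        # 1.8 PB (decimal petabytes)
link_bits_per_s = 1e9         # 1 Gbit/s nominal WAN capacity

seconds = storage_bytes * 8 / link_bits_per_s
days = seconds / 86400
print(f"{days:.0f} days (~{days / 30:.1f} months)")
```

This gives roughly 167 days, i.e. about five and a half months of continuous transfer at full line rate.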
As can be seen from figure 1, data transfers, and hence bandwidth requirements, are continually increasing. Also notable is the large fraction of international and intercontinental transfers, which are particularly sensitive to packet loss and TCP tuning. Increased use of technologies such as WebDAV [1] and xrootd [2] that enable remote interactive access to the storage, coupled with federated storage technologies such as FAX [3], may also increase bandwidth requirements.
A previous paper [4] describes the tuning of network and disk access within the cluster. This paper describes the steps we have taken to ensure high performance of data transfers over the WAN.

Network Monitoring
Monitoring is key to understanding, and therefore improving, network performance. We use a variety of monitoring and diagnostic tools to help us in this task.

Active network monitoring
We have deployed both PerfSONAR [5] and RIPE [6] probes for active network monitoring. Three PerfSONAR machines are deployed: one for bandwidth monitoring, one for latency, and a third to test IPv6 [7] and jumbo frames performance (see section 4). The RIPE probe performs a similar job to the PerfSONAR latency instance, but is deployed at many more sites.
Figure 2 shows monitoring of the WAN link provided by JANET, the UK National Research and Education Network (NREN). This allows us to track traffic over the WAN link and therefore see whether it is the limiting factor. The figure shows saturation of a 1 Gbit/s link in March 2012 (top), so the link clearly is the limiting factor there; in February 2013 (bottom), however, the average inbound rate is 6 Gbit/s on a 10 Gbit/s link, so the bottleneck lies elsewhere, and may be outside QMUL.
The ATLAS experiment monitors data transfer rates for individual files between grid sites. Figure 3 shows rates for files larger than 1 GB from Taiwan. There is a clear reduction in transfer rates for individual files between 7 and 19 September. The reduction coincided with the upgrade of the WAN link from 1 Gbit/s to 10 Gbit/s, but only affected transfers from some sites, while others saw increased transfer rates. A PerfSONAR host deployed at the Taiwan Tier-1 enabled us to establish that traffic to QMUL was not being routed via the preferred route, but was taking a congested fallback route instead. Further debugging using RIPE probes established that this was because route advertisements from QMUL were not being propagated correctly. Once this problem was fixed, transfer rates were higher than before the network upgrade.

Network tuning
The monitoring described previously allowed us to measure actual network performance.
Comparing actual with expected performance enabled us to find and fix bottlenecks. Our strategy was to obtain high performance on individual transfers, then to increase the number of parallel transfers to increase aggregate data rates.

QMUL Tier-2 Cluster
The Tier-2 has 3500 CPU cores, 1.8 PB of Lustre [8] disk storage, and a 10 Gbit/s connection to the internet. The cluster has been optimised for parallel IO using the Lustre filesystem, and should not present a bottleneck to data transfers. Details of the IO performance of the cluster are described in the previous paper [4]. Two data transfer nodes are used, with GridFTP [9] as the main protocol for transferring data over the WAN.

Buffer sizes
High bandwidth links to international and intercontinental sites have high latency, and hence large amounts of data need to be in transit unacknowledged to obtain maximum performance. We have therefore followed the ESnet [10] recommendations for high bandwidth delay product links and increased TCP buffer sizes to obtain good performance.
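The ESnet guidance amounts to raising the kernel's TCP buffer limits so that the congestion window can grow to the bandwidth-delay product of the path. A sketch of the kind of /etc/sysctl.conf settings involved (the values are illustrative, in the spirit of the ESnet recommendations for a 10 Gbit/s path, and are not QMUL's exact configuration):

```
# Illustrative TCP buffer tuning for a high bandwidth-delay product path
# (values follow the ESnet style of recommendation; not QMUL's exact config).
net.core.rmem_max = 67108864              # max socket receive buffer (64 MB)
net.core.wmem_max = 67108864              # max socket send buffer (64 MB)
net.ipv4.tcp_rmem = 4096 87380 33554432   # min/default/max TCP receive window
net.ipv4.tcp_wmem = 4096 65536 33554432   # min/default/max TCP send window
```

As a rule of thumb, a 100 ms intercontinental path at 1 Gbit/s needs roughly 12.5 MB in flight, well above default buffer maxima.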

Router technology
The commercial router initially used proved incapable of advertising routes while handling 1 Gbit/s of traffic. The resulting "route flapping" caused considerable performance degradation. A replacement Linux PC-based router running Quagga [11] was deployed. This has recently been upgraded to a Xeon X5670 CPU with Intel X520 10 Gbit/s network cards and is capable of handling our 10 Gbit/s WAN link.
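For illustration, the BGP side of such a Quagga router is configured in bgpd.conf; a minimal sketch (the AS numbers, neighbour address, and prefix below are hypothetical, not QMUL's actual values):

```
! Minimal Quagga bgpd.conf sketch (all AS numbers, addresses and
! prefixes are hypothetical examples, not QMUL's configuration).
router bgp 64512
 neighbor 192.0.2.1 remote-as 64513
 ! advertise the site's address space to the upstream
 network 198.51.100.0/24
```

Running routing on commodity Linux hardware made it straightforward to scale the router with CPU and NIC upgrades as the WAN link grew.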

Source based routing
We were able to make use of a backup 1 Gbit/s link by using two data transfer servers on different IP addresses. Inbound traffic was routed separately by advertising a higher priority route via the Border Gateway Protocol (BGP) for a subset (a "/27") of our address space. Source based routing was then used to ensure that outbound traffic used both links. One problem we encountered was that the link to our machine room was provisioned as a bonded pair of 1 Gbit/s links, and all the traffic was going down one of them, still capping performance at 1 Gbit/s: the bond distributed traffic across its member links by hashing on MAC addresses, and both servers' traffic hashed to the same link. Changing the MAC address of one of the data transfer servers fixed this, allowing us to use both halves of the bonded link.
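On Linux, source based routing of this kind is typically expressed with iproute2 policy routing: a second routing table holds the backup link's default route, and a rule selects that table by source address. A sketch (the table name, gateway, and server address are hypothetical):

```
# Policy routing sketch: traffic *from* the second data transfer server's
# address leaves via the backup link (all values are hypothetical examples).
echo "200 backup" >> /etc/iproute2/rt_tables         # name table 200 "backup"
ip route add default via 198.51.100.254 table backup # backup link's gateway
ip rule add from 198.51.100.10/32 table backup       # select table by source
```

Because the rule matches on source address, each data transfer server's outbound traffic follows its own link without any change to the applications.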
Since upgrading to a resilient pair of 10 Gbit/s links, this technique has been used to route our bulk data traffic over a different link to general university traffic. This ensures that any additional latency introduced by our traffic filling the link doesn't affect interactive applications used by general university users.

Future work

Jumbo frames
In principle, jumbo frames (a Maximum Transmission Unit (MTU) greater than the default of 1500 bytes, typically 9000 bytes) reduce per-packet overhead and permit increased performance. Performance improvements are most likely on latency limited links, but only if jumbo frames are enabled end to end. Performance measurements are ongoing, but preliminary results do not indicate an increase in transfer failures.
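The per-packet saving can be quantified. Counting 40 bytes of IPv4 and TCP headers per segment (and ignoring Ethernet framing and TCP options), payload efficiency rises from roughly 97% at a 1500-byte MTU to nearly 99.6% at 9000 bytes:

```python
# Fraction of each packet that is payload, for standard vs jumbo MTU.
# Assumes 40 bytes of IPv4 + TCP headers; ignores Ethernet framing/options.
def payload_efficiency(mtu: int, headers: int = 40) -> float:
    return (mtu - headers) / mtu

print(f"1500-byte MTU: {payload_efficiency(1500):.1%}")
print(f"9000-byte MTU: {payload_efficiency(9000):.1%}")
```

The gain of a few percent in throughput is modest; the larger benefit on long paths comes from fewer packets per byte transferred, which reduces per-packet processing load.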

IPv6
We currently have two monitoring machines, a RIPE probe and a PerfSONAR host, running dual stack IPv4/IPv6. Performance has generally been good. We have, however, found poor performance over IPv6 to one site due to a routing problem. This has now been fixed, but it illustrates the importance of testing. We plan to deploy some production machines as dual stack in the near future.

Conclusions
We have carefully monitored our network performance and found and fixed bottlenecks causing packet loss. We have also tuned the TCP stack of our data transfer nodes and run multiple transfers in parallel. We are now able to transfer large amounts of data at high speeds over long distances.