On the evening of Oct 29, 2012, the storm surge caused by Hurricane Sandy hit the east coast of North America, in particular New Jersey and New York. Wikipedia reports that in total 22 US states and a large portion of Eastern Canada are suffering from the direct consequences of the storm. The impact on the people living in these areas can hardly be imagined, and we send our sympathies and our best hopes for a fast recovery from the damage. Indeed, the damage is large enough to have a global impact. Even Internet reachability is impaired, and the results can be perceived from far-away Europe, as detailed in this blog post.
Dominik Schatzmann, a Ph.D. student in Bernhard Plattner’s group (EDIT 2012-11-30: now Dr Schatzmann), has been investigating Internet reachability problems for his thesis for the last four years. His approach is orthogonal to conventional techniques, which rely on control plane information or active probing to find outages. In contrast, Dominik uses the information contained in NetFlow data from live traffic to detect which destinations could not be reached. The rationale behind this approach is two-fold. First, because it uses passive measurements that are conducted for billing purposes anyway, there is no additional load on the Internet. Second, as the outage detection is driven by actual connection attempts, it is very simple to see which of the unreachable portions of the Internet are most important to the users. Both of these properties are desirable from an ISP’s perspective, as they reduce measurement cost and allow the most important problems to be solved first. A summary of the Flow-Based Approach for Connectivity Tracking (FACT), which we used to analyze the impact of the hurricane on the Internet, was recently presented at the Passive and Active Measurement Conference (PAM). In the meantime, a master’s student of ours (thesis to be published) reduced the noise level of the approach by priming FACT with “stable server sockets”.
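To give a feeling for what "driven by actual connection attempts" can mean in practice, here is a minimal Python sketch. It is not the FACT implementation. It assumes, purely for illustration, that an outbound flow from a local client counts as a failed attempt when no flow in the reverse direction shows up in the same batch of records. The `Flow` record, its field names, and the `unanswered_flows` helper are hypothetical; real NetFlow records carry many more fields, and a real system would also have to deal with timeouts, sampling, and asymmetric routing.

```python
from collections import namedtuple

# Hypothetical, heavily simplified flow record (not an actual NetFlow format).
Flow = namedtuple("Flow", "ts src dst sport dport proto")


def unanswered_flows(flows, internal):
    """Return outbound flows from internal hosts that have no reverse flow.

    flows:    list of Flow records observed at the network border
    internal: set of IP address strings belonging to local clients (assumption)
    """
    # Index every observed flow by its 5-tuple so reverse lookups are cheap.
    seen = {(f.src, f.dst, f.sport, f.dport, f.proto) for f in flows}
    result = []
    for f in flows:
        if f.src not in internal:
            continue  # only connection attempts made by local clients matter
        reverse = (f.dst, f.src, f.dport, f.sport, f.proto)
        if reverse not in seen:
            result.append(f)  # nobody answered: candidate unreachable destination
    return result


flows = [
    Flow(0, "10.0.0.1", "192.0.2.5", 40000, 443, 6),     # request ...
    Flow(1, "192.0.2.5", "10.0.0.1", 443, 40000, 6),     # ... and its reply
    Flow(2, "10.0.0.2", "198.51.100.7", 40001, 80, 6),   # request without reply
]
internal = {"10.0.0.1", "10.0.0.2"}
unanswered = unanswered_flows(flows, internal)
```

In this toy example only the flow towards 198.51.100.7 is reported, because the flow towards 192.0.2.5 has a matching reply.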
Remarkably, the technique is sensitive enough to detect data center and routing outages on a global level based on measurement data collected at the border of the SWITCH network, the Swiss national research and education network. The plot below shows how many BGP prefixes have been found “unreachable” within 5-minute time bins by a certain number of local clients. The gray area corresponds to at least a single client, the red area to at least ten clients, and the blue area to at least ten clients, but filtered for US destinations only. The picture actually shows two independent incidents: first, the impact of Sandy on US reachability, and second, a router outage several days later.
As can be seen, a certain level of permanent background noise manifests in the gray area, showing a time-of-day pattern induced by the local client population. However, when comparing the outage periods that affect at least 10 clients, we see that the noise is mostly canceled out and almost all of the affected prefixes are located in the US. The outages started only a few hours after Sandy made landfall and continued for a couple of days, loosely following the time-of-day pattern with peaks around noon.
On Nov 2nd around midnight UTC a different kind of incident happened: a router inside the SWITCH network crashed due to a bug in the router firmware. This pattern looks quite different. First, the fraction of US-based prefixes affected is a lot lower (note the log scale!). Second, the effect manifests in the early morning hours, when there is typically a minimum in the time-of-day pattern. Yet the overall magnitude is higher than for the outages induced by Sandy, thus highlighting the effect of locality.
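For readers curious how curves like the ones in the plot can be derived from raw observations, the following Python sketch counts, per 5-minute bin, the prefixes that at least a given number of distinct clients failed to reach. It is an illustration under assumptions, not our production code: the input format (a list of `(timestamp, client_ip, dest_ip)` tuples for failed attempts), the function names, and the linear longest-prefix match are all made up for this example. A real deployment would match against a full BGP table with a proper prefix tree.

```python
import ipaddress
from collections import defaultdict

BIN_SECONDS = 300  # 5-minute time bins, as in the plot


def prefix_of(ip, prefixes):
    """Longest-prefix match of ip against a list of ip_network objects.

    A linear scan keeps the sketch short; it does not scale to a full table.
    """
    addr = ipaddress.ip_address(ip)
    matches = [p for p in prefixes if addr in p]
    return max(matches, key=lambda p: p.prefixlen) if matches else None


def unreachable_prefix_counts(attempts, prefixes, min_clients):
    """Map each bin start time to the number of prefixes that at least
    min_clients distinct clients failed to reach within that bin."""
    clients = defaultdict(set)  # (bin_start, prefix) -> set of client IPs
    for ts, client, dest in attempts:
        prefix = prefix_of(dest, prefixes)
        if prefix is None:
            continue  # destination not covered by any known prefix
        bin_start = int(ts) // BIN_SECONDS * BIN_SECONDS
        clients[(bin_start, prefix)].add(client)

    counts = defaultdict(int)
    for (bin_start, _prefix), who in clients.items():
        if len(who) >= min_clients:
            counts[bin_start] += 1
    return dict(counts)


prefixes = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]
attempts = [
    (10, "10.0.0.1", "192.0.2.5"),
    (20, "10.0.0.2", "192.0.2.9"),
    (30, "10.0.0.1", "198.51.100.7"),
    (310, "10.0.0.3", "192.0.2.5"),
]
any_client = unreachable_prefix_counts(attempts, prefixes, min_clients=1)
two_clients = unreachable_prefix_counts(attempts, prefixes, min_clients=2)
```

Raising `min_clients` plays the role of the red curve in the plot: prefixes that only a single client failed to reach drop out, which is how most of the background noise disappears.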
We further broke down the locations of the affected US prefixes on the first day of the Sandy outage using the MaxMind geolocation database. The result is shown in the picture below.
There are two interesting observations. First, the prefixes hosted on Bermuda, completely covered by the large red dot on the picture, caused a lot of discontent among SWITCH clients. One may wonder why this is the case. There are two factors involved. For one, the QuoVadis certificate service, part of which is hosted on Bermuda, was no longer reachable. For another, connectivity to Bermuda is relayed through New York, either directly or via New Jersey or, a few hops longer, through Brazil. Second, like Bermuda, other locations in the US were affected as well, including the center of the country and the west coast. Checking the global picture, even some reachability impairment towards China could be observed.
Overall, we think it is fair to claim that Sandy had a global impact on Internet reachability. FACT can be used to detect such problems and quantify their importance from an ISP’s perspective. As such, it is complementary to control plane analysis and to approaches based on active measurement.
Dominik will defend his Ph.D. thesis tomorrow. We all wish him the best of luck! Moreover, we want to thank SWITCH and in particular Simon Leinen for supporting us and making this research possible.
EDITED 2012-12-18: Gave Dr Schatzmann his proper title — neuhaus