Collaborative Network Troubleshooting in Practice

In my last article, I discussed the theoretical benefits of MPC for collaborative network security and how basic MPC primitives can be optimized for practical performance. In this final article of the series (for now), I’d like to report on our experience applying MPC protocols to real network traces from six SWITCH customers.

The SWITCH network

SWITCH is the Swiss national research and education network (NREN), which connects all Swiss universities and colleges, as well as several governmental institutions (currently around 46 customers), to the Internet. At CSG, we have been collecting NetFlow data exported by the five border routers of SWITCH since 2003. We used this comprehensive data set to evaluate the usefulness of privacy-preserving collaboration in network troubleshooting. In particular, we correlated traffic data from the six biggest customers of SWITCH using SEPIA protocols.

Privacy-Preserving Distributed Troubleshooting

We observed 11 days of traffic around the global Skype outage in August 2007. The outage was triggered by a Windows update that caused a large number of system restarts. In response, Skype nodes scanned cached host lists to find supernodes, causing a huge distributed scanning event that lasted two days. The following plot shows the local UDP flow count for the six customer networks and, at the bottom, the aggregate flow count as computed by the SEPIA protocol (ALL):

The 2-day Skype anomaly is clearly visible in the middle. By comparing its local traffic with the aggregate traffic, each customer gains several benefits. First of all, a network can quickly decide whether the anomaly is local or global (distributed) by comparing the size of the local peak with the global peak. Knowing whether an anomaly is global or local is important for steering further troubleshooting steps. Moreover, the networks learn how much they are affected compared to the average. Network 2, for instance, saw no anomaly and can conclude that a large anomaly was taking place elsewhere but that it was not affected. Most of the organizations conclude that they were indeed affected, but less than average. Organizations 3 and 4, however, have to ask why the anomaly was so disproportionately strong in their networks. Very interestingly, one of the most affected networks (number 6) could even have learned of the anomaly almost one day ahead, as it was visible in the aggregate plot 23h before it hit that network.
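To make the comparison concrete, here is a minimal sketch of how an aggregate flow count can be computed without any party revealing its local count, and how each organization can then relate its own peak to the global one. It uses toy additive secret sharing simulated in a single process; this is a stand-in for illustration, not the SEPIA implementation. The modulus, the function names and all flow counts are assumptions made up for this example.

```python
import secrets

# Assumption: a prime modulus far larger than any realistic flow count.
PRIME = 2**61 - 1

def make_shares(value, n_parties):
    """Split value into n_parties additive shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def secure_sum(local_counts):
    """Simulate all parties in one process and return the aggregate count."""
    n = len(local_counts)
    # Row i holds the shares that organization i sends out, one per party.
    sent = [make_shares(count, n) for count in local_counts]
    # Party j adds up the shares it received; each share looks uniformly random.
    partial = [sum(sent[i][j] for i in range(n)) % PRIME for j in range(n)]
    # Publishing only the partial sums reveals the total, not the inputs.
    return sum(partial) % PRIME

def peak_factor(baseline, peak):
    """How many times the peak exceeds the normal level."""
    return peak / baseline

# Hypothetical flow counts per interval for six organizations.
baseline = [1000, 800, 500, 400, 1200, 900]
peak = [2000, 800, 5000, 6000, 2400, 2700]

agg_baseline = secure_sum(baseline)
agg_peak = secure_sum(peak)
global_factor = peak_factor(agg_baseline, agg_peak)

for org, (b, p) in enumerate(zip(baseline, peak), start=1):
    local_factor = peak_factor(b, p)
    print(f"Org {org}: local x{local_factor:.1f}, global x{global_factor:.1f}")
```

With these made-up numbers, the second organization sees a local factor of 1.0 against a clearly elevated global factor, which corresponds to the "not affected by a global anomaly" case described above, while the third and fourth organizations sit far above the global factor.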

Root-Cause Analysis

So far, the networks have learned that some global anomaly is going on. However, to actually troubleshoot the anomaly, more detailed information is needed. We applied PPTKS (a SEPIA protocol for privacy-preserving distributed top-k queries) to the networks’ UDP traffic and show top-k statistics for incoming destination ports below:

[Figure: Incoming UDP destination ports. Panels: Aggregate, Org 1 to Org 6.]

The plots show each port's share of the total traffic in terms of flow count. The covered period spans 6 days around the Skype outage. Before the anomaly, the traffic mix is dominated by NTP (port 123) and DNS (port 53). Ports 1434 and 1026 also have significant support across several organizations. Port 1434 is associated with Microsoft SQL Monitor and the Slammer worm, which is still trying to propagate. Port 1026 is presumably used for attempted spamming of the Windows Messenger service.

When the anomaly starts, organizations see a sudden increase in activity on specific high port numbers. Connections also originate mainly from a series of dynamic ports. Some of the scanned high ports are extremely prevalent: destination port 19690, for example, accounts for 93% of all flows of organization #4 at the peak rate. Investigation of the traffic shows that most of the anomalous flows within organizations #3 and #4 are targeted at a single IP address and originate from thousands of distinct source addresses connecting repeatedly up to 13 times per minute. These patterns indicate that the two organizations host popular supernodes, attracting a lot of traffic to specific ports. Other organizations mainly host client nodes and see uniform scanning, while organization #2 has banned Skype completely.

Interestingly, each organization is affected on distinct dynamic destination ports that are not shared with other organizations. For instance, organization #3 is affected on ports 1562 and 17145, whereas #4 is affected on port 19690 and #6 on port 27550. From the aggregate plot, each organization can conclude that its anomalous port is not shared with others. Using the absolute flow counts from above, it can calculate the absolute number of flows for each port in the aggregate plot and conclude that it is the main contributor to the count of its anomalous port. Furthermore, the organizations learn that others are affected in a similar way, because the aggregate plot reveals an increase on a number of different dynamic ports similar to their own local anomaly.
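The calculation described above is simple arithmetic: a port's share multiplied by the total flow count gives its absolute flow count, and dividing the local number by the aggregate number gives the organization's contribution. The following sketch illustrates this. Only the 93% share of port 19690 comes from the text; the totals, the aggregate share and the function names are hypothetical.

```python
def absolute_flows(share_by_port, total_flows):
    """Convert per-port traffic shares (0..1) into absolute flow counts."""
    return {port: share * total_flows for port, share in share_by_port.items()}

def own_contribution(local_shares, local_total, agg_shares, agg_total):
    """Fraction of the aggregate flows on each port that stem from this organization."""
    local_abs = absolute_flows(local_shares, local_total)
    agg_abs = absolute_flows(agg_shares, agg_total)
    return {
        port: local_abs[port] / agg_abs[port]
        for port in local_abs
        if agg_abs.get(port, 0) > 0  # skip ports missing from the aggregate top-k
    }

# Hypothetical peak-time numbers for one organization.
local_shares = {19690: 0.93, 53: 0.02}   # from the local top-k plot
local_total = 1_000_000                  # local UDP flow count
agg_shares = {19690: 0.31, 53: 0.05}     # from the aggregate top-k plot
agg_total = 3_100_000                    # aggregate UDP flow count

contribution = own_contribution(local_shares, local_total, agg_shares, agg_total)
for port, fraction in sorted(contribution.items()):
    print(f"port {port}: {fraction:.0%} of aggregate flows are local")
```

A contribution close to 100% on the anomalous port, as for port 19690 in this made-up example, tells the organization that nobody else sees significant traffic on that port.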

Unlike anomalous ports, which are unique in each organization, the top-k IP address plots reveal that there is a subset of anomalous external IP addresses that all organizations have in common:

[Figure: External destination IP addresses (replaced by identifiers 1-25). Panels: Aggregate, Org 1 to Org 6.]

In particular, addresses 7 and 23-25 become active at the start of the anomaly and cause a major part of the anomalous traffic in each organization. Organizations can deduce from the aggregate plot that these external hosts are not unique to their own network. This gives a strong indication that all organizations indeed see parts of a single global anomaly instead of unrelated local events.
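One way an organization could make this deduction is to compare its local flow count per external host with the aggregate count: if most of the aggregate flows for a host come from elsewhere, the host is also seen by other networks. The sketch below assumes per-host flow counts are available from the local and aggregate top-k results; the threshold, the function name and all counts are hypothetical.

```python
def shared_hosts(local_counts, aggregate_counts, min_external_share=0.5):
    """Return hosts seen locally whose aggregate flows mostly come from other networks."""
    shared = []
    for host, agg in aggregate_counts.items():
        local = local_counts.get(host, 0)
        if local == 0 or agg <= 0:
            continue  # not seen locally, or no aggregate traffic
        external_share = (agg - local) / agg
        if external_share >= min_external_share:
            shared.append(host)
    return sorted(shared)

# Hypothetical flow counts keyed by anonymized host identifier.
local_counts = {7: 100, 23: 80, 3: 50}
aggregate_counts = {7: 700, 23: 500, 3: 55, 24: 300}

print(shared_hosts(local_counts, aggregate_counts))
```

In this example, hosts 7 and 23 are flagged as shared, host 3 is almost exclusively local, and host 24 is ignored because this organization did not see it.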

Based on these types of collaborative analyses, organizations can easily determine the scope of distributed anomalies, learn their details, identify their probable root causes and take appropriate measures to mitigate damage. Local anomalies can also be identified as such, by learning that other organizations are not affected.

The Future

These results are indeed promising and we believe that MPC will play an important role in the future of multi-domain network management. The shaping of this future has only just begun. Yet, as of today, SEPIA is already used in three continuing projects:

DEMONS: An EU-FP7 project with the goal of building a decentralized, cooperative, and privacy-preserving network monitoring infrastructure.
PPFA: Funded by the Swiss Commission for Technology and Innovation (CTI), ETH Zurich is developing a privacy-preserving network flow analyzer (PPFA) together with IBM Research.
ASMONIA: Funded by the German federal ministry of education and research, this project develops security concepts for mobile network infrastructures based on collaborative information exchange.

If you’re interested in this new technology and want to perform your own experiments, the SEPIA library is open source (LGPL) and available for download from the SEPIA web page.
