In the fable “The Blind Men and the Elephant” by the American poet John Godfrey Saxe, six blind men from Indostan heard of a thing called “an elephant” but did not know what it was. To satisfy their minds, they went to observe a real elephant. Each of them approached the elephant from a different side and came to his own conclusion about what an elephant is. The one who touched the side found “It’s very like a wall!”, while the one examining the tusk shouted “It’s very like a spear!”. The knee was judged to be like a tree, the trunk like a snake, the ear like a fan, and the tail like a rope. When they finally came together to discuss their observations, they had a long dispute about what an elephant was. However, as Saxe put it: “Though each was partly in the right, all were in the wrong!”
Is the Internet an Elephant?
The situation in today’s Internet research bears quite some similarity to the blind men’s fable. The Internet has grown into a veritable elephant over the past 20 years, driven mainly by global commercialization in the 1990s and 2000s. According to the Internet Systems Consortium (ISC), there were only 56,000 hosts connected to the Internet in 1988. In 1992 it passed the 1 million host mark, in 1996 the 10 million mark, and in 2001 the 100 million mark. In January 2011, there were already more than 800 million hosts connected. Today, the Internet is rapidly expanding to include mobile devices, such as smartphones. According to a report by Initiative, the mobile phone will overtake the computer as the most common web access device worldwide by 2013, with an estimated 1.82 billion Internet-enabled phones in use.
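The milestones above imply staggering, but steadily slowing, growth rates. A quick back-of-the-envelope computation over the ISC figures cited in the text (using the standard compound annual growth formula; the calculation itself is not from the source) makes this concrete:

```python
# Host counts as cited above (ISC Internet Domain Survey figures).
host_counts = {  # year -> approximate number of connected hosts
    1988: 56_000,
    1992: 1_000_000,
    1996: 10_000_000,
    2001: 100_000_000,
    2011: 800_000_000,
}

def annual_growth(start_year: int, end_year: int) -> float:
    """Compound annual growth rate between two surveyed years."""
    years = end_year - start_year
    return (host_counts[end_year] / host_counts[start_year]) ** (1 / years) - 1

# Growth rate per consecutive interval: roughly 106%, 78%, 58%, 23% per year.
years = sorted(host_counts)
for a, b in zip(years, years[1:]):
    print(f"{a}-{b}: {annual_growth(a, b):.0%} per year")
```

The host count more than doubled every year in the late 1980s and early 1990s; by the 2000s, growth had slowed to roughly a quarter per year, yet in absolute terms it was larger than ever.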
Though the Internet is entirely man-made, its distributed nature, huge size, and strong dynamics have made it impossible to describe its state in simple terms and from a single point of view. It has become a complex phenomenon people have opinions about. Consequently, methodologies used today in Internet measurement research are often empirical: large amounts of data are captured, analyzed in depth, and then interpreted.
Studying the Elephant
Similar to the reports of the blind men, Internet measurement studies are limited in scope and accuracy. Each study examines the Internet at a specific location, e.g., a university network, at a specific point in time, using specific tools. Typically, results from these studies are generalized to some degree, i.e., they are believed to reflect at least parts of the Internet. However, there are many parameters constraining generalization. First of all, the Internet is constantly evolving. It is difficult enough to obtain and process high-quality traffic data. But getting data spanning months or even years, which would allow analysis of temporal evolution and trends, is close to impossible. Moreover, measurements in one network cover just a tiny fraction of the global Internet. To compensate for this, researchers from CAIDA proposed establishing periodic “Day in the Life of the Internet” events with the goal of measuring the Internet core simultaneously from all over the world. Such a setup would allow correlation of different measurements taken at the same point in time. However, depending on what we measure and where we are, the Internet might actually look different. For instance, most studies are carried out in academic settings, which makes it difficult to draw conclusions about residential networks. Also, statistical methods used in anomaly detection or traffic classification are prone to learning site-specific patterns. Due to a lack of reference data sets, it often remains unclear how well these methods generalize. Even seemingly easy questions, such as “How big is the Internet?”, are hard to answer. Odlyzko shows that the Internet growth rate, although substantial, was severely overestimated (by about a factor of 10) in the late 1990s, contributing to the inflation of the dot-com and telecom bubbles.
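The danger of learning site-specific patterns can be illustrated with a toy example. The sketch below (entirely synthetic numbers, not from the source) calibrates a naive mean-plus-three-sigma anomaly threshold on one site’s traffic volume and applies it to a second site with a different baseline; every perfectly normal sample at the second site is flagged as anomalous:

```python
# Purely illustrative: a detector calibrated on one network misfires on
# another whose "normal" simply looks different. All numbers are made up.
import statistics

site_a = [100, 110, 95, 105, 98, 102]    # e.g. flows/sec on a campus link
site_b = [310, 290, 305, 300, 295, 315]  # a busier link: different baseline

def threshold(training_samples):
    """Flag anything above mean + 3 standard deviations as anomalous."""
    mu = statistics.mean(training_samples)
    sigma = statistics.stdev(training_samples)
    return mu + 3 * sigma

t = threshold(site_a)
alarms = [x for x in site_b if x > t]
print(f"threshold learned at site A: {t:.1f}")
print(f"site B samples flagged as anomalies: {len(alarms)} of {len(site_b)}")
```

Real detectors are of course more sophisticated, but without cross-site reference data sets the same failure mode is hard to even measure.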
Protecting the Elephant
Unfortunately, the dark side of the Internet has also grown dramatically over the past years. The cybercrime scene has professionalized, and governments around the world are preparing for cyber-warfare. Studies show that coordinated wide-scale attacks are prevalent: 20% of the studied malicious addresses and 40% of the IDS alerts are attributed to coordinated wide-scale attacks. According to the 2009 CSI Computer Crime and Security Survey, 23% of responding organizations found botnet zombies, 29% experienced DoS attacks, 14% dealt with webpage defacement, and 14% reported system penetration by outsiders. Moreover, there is an imbalance in the cyber arms race. While cybercriminals act globally and are well coordinated, e.g., by using botnets, operators protecting their networks often have to resort to local information only. Yet, many network security and monitoring problems would benefit substantially if a group of organizations aggregated their local network data. For example, IDS alert correlation requires the joint analysis of distributed local alerts. Similarly, aggregation of local data is useful for alert signature extraction, collaborative anomaly detection, multi-domain traffic engineering, or detecting traffic discrimination. Even the difficult problem of detecting fast-fluxing P2P botnets seems to become tractable with cross-AS collaboration.
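To see why aggregation helps, consider the simplest form of cross-organization IDS alert correlation: a source address reported independently by several networks is a much stronger indicator of a coordinated, wide-scale attack than any single local alert. The sketch below uses hypothetical site names and documentation-reserved IP addresses; it is a minimal illustration of the idea, not any particular system:

```python
# Minimal sketch of cross-organization alert correlation. Each operator
# only sees its own alerts; the coordinated source stands out only after
# the local views are aggregated. All sites and addresses are invented.
from collections import defaultdict

# (site, source_ip, signature) -- local alerts, pooled across operators
alerts = [
    ("net-A", "198.51.100.7", "ssh-bruteforce"),
    ("net-B", "198.51.100.7", "ssh-bruteforce"),
    ("net-C", "198.51.100.7", "ssh-bruteforce"),
    ("net-A", "203.0.113.42", "port-scan"),
    ("net-B", "192.0.2.9",    "port-scan"),
]

def coordinated_sources(alerts, min_sites=2):
    """Return source IPs reported by at least `min_sites` distinct networks."""
    sites_per_src = defaultdict(set)
    for site, src, _sig in alerts:
        sites_per_src[src].add(site)
    return {src for src, sites in sites_per_src.items() if len(sites) >= min_sites}

print(coordinated_sources(alerts))
```

Each operator in isolation sees one or two unremarkable alerts; only the aggregated view reveals that one source is probing three networks at once.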
All these examples clearly illustrate the need for large-scale distributed Internet measurements. Only by combining many individual pieces will we get the big picture of the Internet and of the threats therein.
Great! Let’s All Share Our Data!
Now one might ask: If data sharing brings all these benefits, why is it not done in practice? Part of the problem is certainly a lack of standards and coordination. That is, data captured in different networks might not be directly comparable due to different tools, data formats, or measurement techniques. Another issue is the large amount of data involved. The storage and processing of traffic data require substantial resources, especially if packet data are involved. It is therefore not trivial to ship data around or gather it in a central repository. However, these obstacles can be overcome with community initiative, coordination, and engineering, as has been done in other data-driven disciplines such as astronomy or particle physics. By far the most difficult problem is how to address the privacy concerns associated with network data.
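The standardization problem can be made concrete with a toy example. Suppose two sites export flow records with different field names and units; before any joint analysis, both must be mapped into a common schema. The record layouts and field names below are invented for illustration and do not correspond to any specific exporter:

```python
# Two hypothetical flow-record formats from different measurement tools.
site_a_record = {"srcaddr": "192.0.2.1", "dstaddr": "198.51.100.2", "dOctets": 1500}
site_b_record = {"src": "203.0.113.5", "dst": "192.0.2.1", "kbytes": 2}

def normalize_a(r):
    """Map site A's fields onto the shared schema (bytes already in octets)."""
    return {"src": r["srcaddr"], "dst": r["dstaddr"], "bytes": r["dOctets"]}

def normalize_b(r):
    """Map site B's fields onto the shared schema, converting kilobytes to bytes."""
    return {"src": r["src"], "dst": r["dst"], "bytes": r["kbytes"] * 1000}

# Only after normalization can the records be analyzed jointly.
combined = [normalize_a(site_a_record), normalize_b(site_b_record)]
print(combined)
```

Multiplied across dozens of tools, formats, and unit conventions, this mapping effort is exactly the kind of coordination and engineering work that standardization initiatives must carry out before sharing can scale.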
In a series of follow-up articles, I will elaborate more on how the research community has tried (and failed) to solve these privacy problems.