In my last article, I discussed why collaboration among networks is essential for monitoring the Internet and maintaining its security in the future. Unfortunately, such collaboration is very difficult in practice due to privacy concerns.
Privacy Concerns Block the Road
Traffic data contain very sensitive information about users, servers, and networks. With packet data, the entire network communication of a user is captured. But even if payload is stripped away, as with packet headers or NetFlow data, stored IP addresses still allow the identification of users and hosts. The associated connection information allows the creation of precise communication profiles, e.g., containing information about who is communicating with whom and when, or which websites a person visits.
From a legal perspective, network data are “personal data”. For instance, European law defines personal data as data identifying a person either directly or indirectly (i.e., through the use of additional information in the possession of third parties). IP addresses and user profiles, for example, fall into this category. The law restricts the processing allowed on personal data and mandates anonymization before subsequent storage or further processing.
Ohm et al. discuss many subtleties regarding legal issues in network research, pertaining mainly to U.S. law. They find that many research papers fall short of clear legal compliance due to a disconnect between legislation and current academic practice. For example, applying data reduction or anonymization does not necessarily make an analysis legally compliant.
Furthermore, Burstein et al. point out that the flow of data to be analyzed poses additional problems. In the U.S., researchers (mostly working for governmental institutions) are, in principle, not allowed to analyze data from entities regulated by the Stored Communications Act (SCA), such as commercial ISPs. As a result, there is much uncertainty in the networking community, and operators often choose the safe path: they refrain from data sharing entirely. Sharing data among international partners adds the further complication of heterogeneous data protection legislation across countries.
Even if the ambiguity in legislation is resolved in the future, organizations will not easily engage in data sharing. After all, their internal network data are at stake. Security policies might deny sharing because of a high risk of information disclosure. Even though collaboration might be useful, organizations have to carefully balance its benefits with the risk of potential damage. Even anonymized data may contain topological information, hint at particular services deployed, or reveal policies in place. In a competitive setting, overall statistics might reveal information about a participant’s customer base. In summary, the situation is intricate. Even if the men inspecting the Internet elephant are not blind, they refuse to exchange their observations for fear of privacy breaches.
Anonymization to the Rescue!
A seemingly easy solution to these privacy problems would be to anonymize local data before sharing. The anonymized data would then be exchanged, either bilaterally or via some sort of central (or distributed) repository, and analyses could be run on the entire data set instead of local data only.
In fact, many tools and techniques for anonymizing network data have been developed (e.g., FLAIM). The basic principles for anonymizing IP addresses are blackmarking, permutation, or truncation. Permutations can either be random or (partially) prefix-preserving, i.e., common prefixes of arbitrary length are preserved under the permutation function.
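To make these principles concrete, here is a minimal Python sketch of the three basic approaches. It is a toy illustration only, not the implementation used by FLAIM or by production schemes such as Crypto-PAn; the keyed-hash construction for the prefix-preserving permutation is my own simplification of the idea.

```python
import hashlib
import ipaddress

def blackmark(ip: str) -> str:
    """Replace the address with a fixed constant; all hosts collapse to one."""
    return "0.0.0.0"

def truncate(ip: str, bits: int) -> str:
    """Zero out the lowest `bits` bits; e.g. bits=8 merges a whole /24."""
    addr = int(ipaddress.IPv4Address(ip))
    mask = (0xFFFFFFFF >> bits) << bits
    return str(ipaddress.IPv4Address(addr & mask))

def prefix_preserving(ip: str, key: bytes) -> str:
    """Toy prefix-preserving permutation: each bit is flipped (or not)
    depending on a keyed hash of the bits preceding it, so two addresses
    sharing an n-bit prefix map to outputs sharing an n-bit prefix."""
    addr = int(ipaddress.IPv4Address(ip))
    out = 0
    for i in range(32):                  # walk bits from MSB to LSB
        prefix = addr >> (32 - i)        # the i bits already processed
        h = hashlib.sha256(key + prefix.to_bytes(4, "big") + bytes([i]))
        flip = h.digest()[0] & 1         # keyed pseudorandom flip bit
        bit = (addr >> (31 - i)) & 1
        out = (out << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(out))
```

For example, `truncate("192.168.1.77", 8)` yields `"192.168.1.0"`, making all hosts of that /24 indistinguishable, while `prefix_preserving` keeps subnet structure intact: addresses from the same subnet remain recognizable as neighbors in the anonymized trace, which preserves utility but, as discussed below, also aids an attacker.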
Anonymization is Unexplored
Alas, creating an anonymization policy for a specific data set typically involves a mix of expertise, heuristics, and gut feeling. The security guarantees of particular methods are difficult to quantify. Furthermore, data owners mandating an anonymization policy have to supervise the use of the anonymized data and negotiate with data users whether a certain type of analysis is still possible, and if so, to what degree. On the one hand, the more information is removed from the data, the better privacy is protected. On the other hand, removing information makes the data less useful for analysis. Tuning this privacy-utility tradeoff is very delicate, especially when quantitative measures are missing.
The IETF addresses this problem in the specification of IPFIX, the future format for network flow data and successor to NetFlow (RFC3917, RFC6235). It is the IETF's goal to require anonymization support on routers, so that anonymized data can be exported directly and privacy breaches during transport, processing, and storage are avoided. However, due to insufficient understanding of the basic properties of existing techniques, the requirement for anonymization support is not qualified with “must” but with “may” (RFC3917, §6.7). This is in line with Ohm et al., who demand a thorough assessment of traditional strategies for privacy protection. A clearer understanding of these techniques is a first step towards closing the gap between legislation and research practice.
The Privacy-Utility Tradeoff
We took up this challenge and performed several studies shedding light on quantitative utility and privacy properties of state-of-the-art anonymization techniques for IP addresses.
To evaluate the privacy guarantees of IP address permutations in the presence of a strong attacker, we performed active traffic injection attacks (see our ACM CCR 2010 paper). The goal of these attacks is to inject known traffic patterns into networks and then recover those patterns from the anonymized data, allowing the attacker to de-anonymize IP addresses. Our results show that traffic injection attacks are indeed easy to carry out in practice, even in large networks such as SWITCH. Moreover, the attacks were successful even though secondary flow fields were randomized and coarse-grained to blur patterns. In particular, by stretching injected patterns over time, an attacker can evade detection.
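To convey the flavor of the recovery step, here is a heavily simplified sketch; it is not the attack from our paper, and the fingerprint, port numbers, and flow records are all made up. The attacker has earlier injected flows to a distinctive set of destination ports and now searches the anonymized trace for the (permuted) source address carrying that fingerprint, exploiting the fact that permutation hides addresses but leaves secondary fields like ports intact.

```python
from collections import defaultdict

# Hypothetical fingerprint: the attacker injected flows to these ports.
FINGERPRINT = {40001, 40002, 40003}

def recover_injected_host(anon_flows):
    """Given anonymized flow records (anon_src, dst_port, n_packets),
    return the anonymized address that carries the injected fingerprint."""
    ports_by_src = defaultdict(set)
    for src, dport, _ in anon_flows:
        ports_by_src[src].add(dport)
    for src, ports in ports_by_src.items():
        if FINGERPRINT <= ports:         # all fingerprint ports observed
            return src                   # this pseudonym is de-anonymized
    return None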
We also evaluated the utility of anonymized data by performing statistical network anomaly detection on original and anonymized NetFlow data. Network anomaly detection seemed a good application to study, as it is a prominent use of NetFlow data and has attracted a lot of research interest in recent years. We analyzed the specific privacy-utility tradeoff of IP address truncation, which goes beyond permutation and actually deletes information from traces. Interestingly, there is an asymmetry between IP addresses in internal and external address ranges. For internal addresses, fewer bits need to be truncated to provide acceptable privacy, simply because these networks are more densely filled with active addresses. For instance, by truncating 8 bits, all addresses within the same /24 network become indistinguishable. The external address range is much sparser, requiring roughly 7 additional bits of truncation for the same privacy level. Regarding data utility, we found that entropy metrics are more robust against truncation than count metrics. But overall, only 3 of 20 combinations of metric and anonymization strength achieve somewhat acceptable utility and privacy at the same time.
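The different robustness of entropy and count metrics can be illustrated with a toy example (the trace below is invented, not data from our study). Truncation merges addresses, so a unique-address count collapses quickly, while the Shannon entropy of the address distribution retains more of its shape as long as the traffic volume distribution across the surviving prefixes stays uneven.

```python
import ipaddress
import math
from collections import Counter

def truncated_prefix(ip, bits):
    """Drop the lowest `bits` bits; only the surviving prefix remains."""
    return int(ipaddress.IPv4Address(ip)) >> bits

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Toy trace: eight distinct hosts in one /24 plus one very active host.
ips = ["10.0.0.%d" % i for i in range(1, 9)] + ["10.0.1.1"] * 8

for bits in (0, 8, 16):
    prefixes = [truncated_prefix(ip, bits) for ip in ips]
    print(bits, len(set(prefixes)), round(entropy(prefixes), 2))
```

In this toy trace, truncating 8 bits collapses the unique-address count from 9 to 2, while the entropy only drops from 2.5 to 1.0 bits; an entropy-based anomaly detector would thus retain a more useful signal than a count-based one.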
The Role of Anonymization Reconsidered
The success of injection attacks, the impossibility of defending against them, and the difficulty of achieving a well-balanced privacy-utility tradeoff lead us to call into question the role of anonymization as a solution to the problem of data protection. Moreover, current anonymization techniques do not necessarily provide “anonymization” in the legal sense. Consequently, legal scholars have begun to argue that legislation has to abandon the concept of PII (personally identifiable information) and move on to more holistic definitions that consider a series of factors in context-specific solutions. If applied at all, anonymization must be combined with legal means (e.g., Acceptable Use Policies) and social means (“know the people”) to achieve the aim of better data sharing for research and operations.
Looking for good alternatives to anonymization, my next article will explore a completely different approach to privacy-preserving data sharing: Secure Multiparty Computation (MPC). Although MPC has been studied in theory for almost 30 years, real-world applications are extremely rare. Stay tuned to learn why this is the case and what can be done about it.