A single trajectory is a tragedy, 1.2 million is Big Data.

Pieter Fourie, Cuauhtèmoc Anda and Sergio Ordonez

There is still such a thing as bad publicity, as a recent New York Times exposè on app-driven person tracking confirms. Here’s how to stay out of the headlines by rolling your own data. We have developed methods that allow data stewards to stream completely synthetic location trails, which can fulfil the needs of many location-based services, and unconditionally guarantee individual privacy.

Three days, 1.2 million devices, 235 million locations, reads the tagline of a recent article on mobile data privacy in the New York Times. As a mobility researcher in the age of big data, one becomes inured to the staggering numbers involved in location services data gathering (one of many euphemisms for persistent individual location tracking). The bigger the numbers are, the better! More data means better models. Moreover, nobody is forcing the consumer to use these apps and services, right? Everyone is a knowing, willing participant. However, viewed through less familiar eyes, these numbers represent an Orwellian nightmare in the making.

The NY Times Daily podcast does an excellent job of shaking us out of our complacency; revealing the tragedy of vulnerable individuals whose privacy gets sold en masse to the highest bidder. It raises the question: what are our alternatives? The hottest topics in urban research and responsive cities all hint at some degree of surveillance: connected devices, digital twins, Internet-of-Things, Mobility-as-a-Service, all require us to become more connected. Information about us and our movements is put at risk of becoming more widely used by an increasing number of actors.

The NY Times article raises many issues around a lack of policy and oversight in the field of location tracking and exposes its personal, societal, institutional and corporate dimensions. These are all tough but pertinent problems. As we come to terms with living in an ever-more connected world, it is worthwhile to discuss some technological considerations to inform our decisions.

Data privacy preservation techniques

In the initial phase of their investigation, the journalists’ queries to data providers were met with claims that data were being aggregated or anonymised. What do aggregation and anonymisation mean? Generally, it either means bundling data points together so individuals cannot be told apart, or masking identifying information about them. When it comes to data on people’s movement, this becomes a tricky task, as the growing literature on the topic attests. Nevertheless, let’s assume that you somehow have a sufficiently robust approach to protecting people’s privacy. Then the next question becomes: how should you apply this protection?

Privacy-by-design vs post-processing

A data collector generally has two options when enforcing privacy preservation: either embedded into the device, meaning that no individual identifying information is ever recorded, or as a post-processing technique, that is applied after obtaining a fully detailed data set. The second case is vulnerable to compromise, as evidenced by the NY Times journalists who managed to get hold of a motherlode of raw data.

It may then be preferable to have devices that can be hard-coded and certified by an authority in order to enforce privacy preservation by design. Such data can be recorded in an auditable distributed ledger to uniquely associate each data point with an identified device. Encryption techniques make the device and its data tamper-proof. Authorities can start to insist on and enforce such end-to-end protection standards and certifications, similar to how they enforce the installation of sealed, tamper-proof taxi and electricity meters. Several projects are underway to enable this technology, some with a specific orientation towards mobility application, such as IOTA, TravelSpirit and IoMob.

Synthetic data: an alternative to privacy preservation techniques.

As the technological development in protecting and anonymising individual trajectory privacy grows, so does the body of shadow literature on de-anonymising algorithms that attempt to reconstruct back the individual traces. This means that a responsible data collector might go ahead and invest in an array of certified devices, only to find that their privacy protection gets defeated sometime later, in an unending privacy protection arms race.

This is the motivation for our interest in developing an alternative to typical location masking techniques. What if we could create synthetic location data streams with the same resolution in time and space as what is actually sensed through devices without reproducing any given trajectory in the real world? In our research on building such synthetic data streams, we use techniques that intentionally restrict the actual raw data view for machine-eyes-only. They can be hard-coded into purpose-built, certified devices that are only capable of recording and releasing aggregate statistics.

The synthetic location data generation operates in two steps: an encoding step which produces the aggregates, and can be audited and certified; and a reconstruction step, that produces synthetic data with the same aggregate statistics as the real data. We are developing two distinct techniques to implement this.

Multiple histogram matching

The first technique is repeated histogram matching in a high-dimensional space. This approach re-purposes an old statistical technique to ‘sculpt’ a synthetic dataset until it looks like the real data. This is achieved by repeatedly ‘raking’ the synthetic data along various directions in a multidimensional space.

Figure 1 illustrates the first four steps for a two-dimensional spatial data example. Note how the blue points (synthetic) grow to resemble the orange ones (real) more as we go from images 1 to 4.

histogram matching
Figure 1: Illustration of the iterative multiple histogram matching process against a two-dimensional target histogram.

Graphical generative techniques

The second approach is to generate synthetic data using a so-called ‘Traveller Generation Machine’.  This approach belongs to the domain of machine-learning or a so-called ‘graphical generative technique’. Here, the word graphical refers to a graph in the computer science sense, i.e. a map of relationships (edges) between quantities (nodes).

The ‘Traveller Generation Machine’ identifies a minimal set of aggregate information to be released in order to produce synthetic data that closely resembles the real thing. Unlike histogram matching, this approach requires structural knowledge of the data, that is, how does one thing relate to another in the dataset.

Figure 2: Generative Model for urban mobility data

Take E1 in the model in Figure 2 as an example. The arrows leading to E1 can be interpreted as follows: given S1 (defined as the start time of the first  activity in the day) and Z1 (the geographical area of the first  activity in the day), E1 (the end time of the first travel activity in the day) has the following likely values: … Note how the description intentionally reflects the fact that the machine does not record the information of any given individual, only aggregates.

Limitations

The two approaches were developed with the objective of providing reassurance to data providers when releasing synthetic data streams in the form of complete day travel trajectories of individuals. However, this data will not satisfy the needs of certain service providers, who insist on communicating directly with specific individuals who are in a specific place at a specific time.

If, as a society, we still want to sign on for this invasive form of direct marketing, in spite of the dangers reported in the NY Times article, then we should at least insist on limits to the number of locations that may be recorded in sequence. According to De Montjoye et al. (2013) , knowing only four location points in a sequence may be sufficient to uniquely identify most people in a dataset.

However, we are growing increasingly confident that this approach of relying on entirely synthetic datasets is sufficient for very detailed urban and transport planning, as well as location-based services that do not rely on real-time interaction with individuals. Feeding this synthetic data into a state-of-the-art mobility simulation such as MATSim represents the next step in producing an entire ‘doppelgänger city’ to test, probe and experiment with policy decisions, while leaving people in the real world safe and surveillance-free.

We thank our editors Tanvi Maheshwari and Geraldine Ee for their efforts in compiling this post.

Author: Pieter J. Fourie

Pieter is a simulation modeller at the Future Cities Laboratory, specializing in transportation. His interests include planning for future transport modes, mobility data privacy, the role of distributed ledger tech in transport and logistics, and the convergence of all these technologies.

Leave a Reply

Your email address will not be published. Required fields are marked *