A single trajectory is a tragedy, 1.2 million is Big Data.

Pieter Fourie, Cuauhtèmoc Anda and Sergio Ordonez

There is still such a thing as bad publicity, as a recent New York Times exposé on app-driven person tracking confirms. Here’s how to stay out of the headlines by rolling your own data. We have developed methods that allow data stewards to stream completely synthetic location trails, which can fulfil the needs of many location-based services and unconditionally guarantee individual privacy.

Three days, 1.2 million devices, 235 million locations, reads the tagline of a recent article on mobile data privacy in the New York Times. As a mobility researcher in the age of big data, one becomes inured to the staggering numbers involved in location services data gathering (one of many euphemisms for persistent individual location tracking). The bigger the numbers are, the better! More data means better models. Moreover, nobody is forcing the consumer to use these apps and services, right? Everyone is a knowing, willing participant. However, viewed through less familiar eyes, these numbers represent an Orwellian nightmare in the making.

The NY Times Daily podcast does an excellent job of shaking us out of our complacency; revealing the tragedy of vulnerable individuals whose privacy gets sold en masse to the highest bidder. It raises the question: what are our alternatives? The hottest topics in urban research and responsive cities all hint at some degree of surveillance: connected devices, digital twins, Internet-of-Things, Mobility-as-a-Service, all require us to become more connected. Information about us and our movements is put at risk of becoming more widely used by an increasing number of actors.

The NY Times article raises many issues around a lack of policy and oversight in the field of location tracking and exposes its personal, societal, institutional and corporate dimensions. These are all tough but pertinent problems. As we come to terms with living in an ever-more connected world, it is worthwhile to discuss some technological considerations to inform our decisions.

Data privacy preservation techniques

In the initial phase of their investigation, the journalists’ queries to data providers were met with claims that data were being aggregated or anonymised. What do aggregation and anonymisation mean? Generally, it either means bundling data points together so individuals cannot be told apart, or masking identifying information about them. When it comes to data on people’s movement, this becomes a tricky task, as the growing literature on the topic attests. Nevertheless, let’s assume that you somehow have a sufficiently robust approach to protecting people’s privacy. Then the next question becomes: how should you apply this protection?

Privacy-by-design vs post-processing

A data collector generally has two options when enforcing privacy preservation: either embed it into the device, meaning that no individually identifying information is ever recorded, or apply it as a post-processing technique after obtaining a fully detailed data set. The second option is vulnerable to compromise, as evidenced by the NY Times journalists who managed to get hold of a motherlode of raw data.

It may then be preferable to have devices that can be hard-coded and certified by an authority in order to enforce privacy preservation by design. Such data can be recorded in an auditable distributed ledger to uniquely associate each data point with an identified device. Encryption techniques make the device and its data tamper-proof. Authorities can start to insist on and enforce such end-to-end protection standards and certifications, similar to how they enforce the installation of sealed, tamper-proof taxi and electricity meters. Several projects are underway to enable this technology, some with a specific orientation towards mobility applications, such as IOTA, TravelSpirit and IoMob.

Synthetic data: an alternative to privacy preservation techniques

As the technology for protecting and anonymising individual trajectories develops, so does the shadow body of literature on de-anonymising algorithms that attempt to reconstruct the individual traces. This means that a responsible data collector might go ahead and invest in an array of certified devices, only to find that their privacy protection gets defeated sometime later, in an unending privacy protection arms race.

This is the motivation for our interest in developing an alternative to typical location masking techniques. What if we could create synthetic location data streams with the same resolution in time and space as what is actually sensed through devices, without reproducing any given trajectory in the real world? In our research on building such synthetic data streams, we use techniques that intentionally restrict the view of the actual raw data to machine eyes only. They can be hard-coded into purpose-built, certified devices that are only capable of recording and releasing aggregate statistics.

The synthetic location data generation operates in two steps: an encoding step, which produces the aggregates and can be audited and certified; and a reconstruction step, which produces synthetic data with the same aggregate statistics as the real data. We are developing two distinct techniques to implement this.

Multiple histogram matching

The first technique is repeated histogram matching in a high-dimensional space. This approach re-purposes an old statistical technique to ‘sculpt’ a synthetic dataset until it looks like the real data. This is achieved by repeatedly ‘raking’ the synthetic data along various directions in a multidimensional space.

Figure 1 illustrates the first four steps for a two-dimensional spatial data example. Note how the blue points (synthetic) grow to resemble the orange ones (real) more as we go from images 1 to 4.

Figure 1: Illustration of the iterative multiple histogram matching process against a two-dimensional target histogram.
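To make the encode-then-reconstruct idea concrete, here is a minimal sketch in plain Python. It is not our production implementation: the function names, the fixed set of four directions, the bin count and the unit-square starting cloud are all assumptions chosen for illustration. The encoding step releases only one histogram per direction; the reconstruction step repeatedly rakes a random point cloud along each direction until its projected histogram matches the released one.

```python
import math
import random


def make_directions(count):
    """Unit vectors spread evenly over half a circle (hypothetical choice)."""
    return [(math.cos(math.pi * i / count), math.sin(math.pi * i / count))
            for i in range(count)]


def encode(points, directions, bins, lo, hi):
    """Encoding step: one histogram of projected coordinates per direction.
    Only these bin counts are released, never the points themselves."""
    width = (hi - lo) / bins
    histograms = []
    for dx, dy in directions:
        counts = [0] * bins
        for x, y in points:
            index = int((x * dx + y * dy - lo) / width)
            counts[min(bins - 1, max(0, index))] += 1
        histograms.append(counts)
    return histograms


def histogram_quantiles(counts, n, lo, hi):
    """n evenly spaced quantiles of a histogram, interpolated within bins."""
    width = (hi - lo) / len(counts)
    total = sum(counts)
    values, below, b = [], 0, 0
    for i in range(n):
        target = (i + 0.5) / n * total
        while below + counts[b] < target:
            below += counts[b]
            b += 1
        values.append(lo + (b + (target - below) / counts[b]) * width)
    return values


def reconstruct(histograms, directions, n, lo, hi, sweeps, seed=0):
    """Reconstruction step: rake a random cloud along each direction in turn."""
    rng = random.Random(seed)
    # Assumption: start from points scattered uniformly over the unit square.
    synth = [[rng.random(), rng.random()] for _ in range(n)]
    for _ in range(sweeps):
        for (dx, dy), counts in zip(directions, histograms):
            targets = histogram_quantiles(counts, n, lo, hi)
            proj = [x * dx + y * dy for x, y in synth]
            order = sorted(range(n), key=proj.__getitem__)
            for rank, i in enumerate(order):
                # Slide the point along the direction to its target quantile.
                shift = targets[rank] - proj[i]
                synth[i][0] += shift * dx
                synth[i][1] += shift * dy
    return synth


# Toy "real" data: a cluster of locations around (0.7, 0.3).
data_rng = random.Random(1)
real = [(data_rng.gauss(0.7, 0.05), data_rng.gauss(0.3, 0.05))
        for _ in range(500)]
directions = make_directions(4)
released = encode(real, directions, bins=200, lo=-2.0, hi=2.0)
synthetic = reconstruct(released, directions, n=500, lo=-2.0, hi=2.0, sweeps=20)
mean_x = sum(p[0] for p in synthetic) / len(synthetic)
mean_y = sum(p[1] for p in synthetic) / len(synthetic)
print("synthetic centre:", round(mean_x, 3), round(mean_y, 3))
```

In this sketch the reconstruction never sees `real`, only `released`. More directions and finer bins give a closer match, at the cost of releasing more aggregate information.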

Graphical generative techniques

The second approach is to generate synthetic data using a so-called ‘Traveller Generation Machine’. This approach belongs to the domain of machine learning, specifically to what are called ‘graphical generative techniques’. Here, the word graphical refers to a graph in the computer science sense, i.e. a map of relationships (edges) between quantities (nodes).

The ‘Traveller Generation Machine’ identifies a minimal set of aggregate information to be released in order to produce synthetic data that closely resembles the real thing. Unlike histogram matching, this approach requires structural knowledge of the data, that is, of how one thing relates to another in the dataset.

Figure 2: Generative Model for urban mobility data

Take E1 in the model in Figure 2 as an example. The arrows leading to E1 can be interpreted as follows: given S1 (defined as the start time of the first activity in the day) and Z1 (the geographical area of the first activity in the day), E1 (the end time of the first travel activity in the day) has the following likely values: … Note how the description intentionally reflects the fact that the machine does not record the information of any given individual, only aggregates.
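The E1 example can be sketched as a tiny conditional-frequency model. This is a toy under stated assumptions, not the Traveller Generation Machine itself: the function names and record layout are hypothetical, only the three nodes S1, Z1 and E1 are modelled, and S1 and Z1 are treated as independent starting nodes, which the full model in Figure 2 may not do. The encoding step keeps counts only; the generation step samples S1 and Z1 first and then draws E1 from the counts observed for that combination.

```python
import random
from collections import Counter, defaultdict


def encode_aggregates(records):
    """Keep counts only: marginals for S1 and Z1, and E1 counts per (S1, Z1)."""
    s1_counts = Counter(r["S1"] for r in records)
    z1_counts = Counter(r["Z1"] for r in records)
    e1_counts = defaultdict(Counter)
    for r in records:
        e1_counts[(r["S1"], r["Z1"])][r["E1"]] += 1
    return s1_counts, z1_counts, e1_counts


def draw(counts, rng):
    """Sample one value with probability proportional to its count."""
    values = list(counts)
    weights = list(counts.values())
    return rng.choices(values, weights=weights, k=1)[0]


def generate_travellers(tables, n, seed=0):
    """Follow the arrows: sample the parents S1 and Z1, then E1 given both."""
    s1_counts, z1_counts, e1_counts = tables
    rng = random.Random(seed)
    travellers = []
    while len(travellers) < n:
        s1 = draw(s1_counts, rng)
        z1 = draw(z1_counts, rng)
        if (s1, z1) not in e1_counts:
            continue  # combination never observed, so no E1 counts exist for it
        e1 = draw(e1_counts[(s1, z1)], rng)
        travellers.append({"S1": s1, "Z1": z1, "E1": e1})
    return travellers


# Invented example records: start hour, zone, and end hour of the first activity.
records = [
    {"S1": 7, "Z1": "north", "E1": 8},
    {"S1": 7, "Z1": "north", "E1": 9},
    {"S1": 8, "Z1": "north", "E1": 9},
    {"S1": 8, "Z1": "centre", "E1": 10},
    {"S1": 7, "Z1": "centre", "E1": 8},
]
tables = encode_aggregates(records)
people = generate_travellers(tables, 50, seed=3)
print(people[:3])
```

Once `tables` has been built, the individual `records` are no longer needed: every synthetic traveller is drawn from the counts alone.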


The two approaches were developed with the objective of providing reassurance to data providers when releasing synthetic data streams in the form of complete day travel trajectories of individuals. However, this data will not satisfy the needs of certain service providers, who insist on communicating directly with specific individuals who are in a specific place at a specific time.

If, as a society, we still want to sign on for this invasive form of direct marketing, in spite of the dangers reported in the NY Times article, then we should at least insist on limits to the number of locations that may be recorded in sequence. According to De Montjoye et al. (2013), knowing only four location points in a sequence may be sufficient to uniquely identify most people in a dataset.
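The intuition behind this uniqueness result can be checked on toy data. The sketch below is an illustration only, not the method or data of De Montjoye et al.: the function name is hypothetical and the traces are randomly invented (hour, zone) pairs. For each person it reveals k points of their trace and counts how many traces in the dataset are consistent with those points.

```python
import random


def fraction_unique(traces, k, seed=0):
    """Share of traces that are the only match for k of their own points."""
    rng = random.Random(seed)
    unique = 0
    for trace in traces:
        known = rng.sample(sorted(trace), k)  # the k points an observer knows
        matches = sum(1 for other in traces
                      if all(point in other for point in known))
        if matches == 1:
            unique += 1
    return unique / len(traces)


# Invented data: 300 people, one (hour, zone) point per hour, 20 zones.
data_rng = random.Random(42)
traces = [frozenset((hour, data_rng.randrange(20)) for hour in range(24))
          for _ in range(300)]
for k in (1, 2, 3, 4):
    print(k, "known points ->", fraction_unique(traces, k), "uniquely identified")
```

Real movement is far more regular than this random toy, so the exact fractions carry no meaning; the point is only how quickly the set of matching traces shrinks as more points are known.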

However, we are growing increasingly confident that this approach of relying on entirely synthetic datasets is sufficient for very detailed urban and transport planning, as well as location-based services that do not rely on real-time interaction with individuals. Feeding this synthetic data into a state-of-the-art mobility simulation such as MATSim represents the next step in producing an entire ‘doppelgänger city’ to test, probe and experiment with policy decisions, while leaving people in the real world safe and surveillance-free.

We thank our editors Tanvi Maheshwari and Geraldine Ee for their efforts in compiling this post.

Data Ecosystems, Transport, and Urban Transformation in Sao Paulo

-Notes on the ESRC urban modelling workshop

From 20 to 24 June 2016, I had the opportunity to participate in the urban modelling workshop organised by the ESRC Strategic Network: Data and Cities as Complex Adaptive Systems (DACAS). The workshop was held at the ICTP-South American Institute for Fundamental Research in the municipality of Sao Paulo, Brazil. The event brought together researchers from many disciplines, all interested in how Data and Complex Adaptive Systems can be applied to describe and understand the underlying emergent behaviours in cities and, ultimately, to plan for smarter cities that are sustainable and resilient.

Data and urban challenges

On the opening day, Tomás Wissenbach from Sao Paulo’s urban development agency talked about the challenges of urban transformation in Brazil and explained the recent efforts of Sao Paulo’s administration to collect all available datasets on Sao Paulo’s population and infrastructure from across the different governmental authorities. This data fusion and processing endeavour culminated in an online interactive application (Figure 1) through which anyone can access and download the datasets. For the second phase of the project, Wissenbach announced the possibility of collaborating on projects that can capture the urban transformation experienced in the city and help the government make informed decisions to plan for a better city.

Figure 1. Screenshot of interactive Map of Sao Paulo. Blue for the metro lines, red for the bicycle lanes, and in orange tones the population density

Prof. Ana Bazzan, from the Institute of Informatics at Universidade Federal do Rio Grande do Sul (UFRGS), presented her work on agents and multi-agent systems in traffic and transportation in her keynote. The talk started with the rise of cities and their inherent transportation challenges. Prof. Bazzan then introduced the idea of a data ecosystem, triggered by people’s participatory sensing, as the key to developing analytical applications that improve the transportation system. In a smart city, citizens interact directly with the system instead of just passively receiving information. This change in paradigm requires a human/agent approach to the information, modelling and control challenges, in which humans act as both targets and active subjects (i.e. sensors).

Putting all together: Data and Complex Adaptive Systems for Transportation Planning

My presentation on our research project Engaging Big Data supplemented the prior presentations quite nicely. This ongoing project, conducted at the Future Cities Laboratory of the Singapore ETH Centre, seeks to build an agent-based simulation framework for transport planning using MATSim that can benefit from both urban mobility sensors (e.g. mobile phone and smart card data) and traditional data inputs (e.g. household travel survey and census information) (Figure 2). In the era of ubiquitous sensing and big data, the first challenge in developing the next generation of predictive, large-scale transport simulation models lies in designing a data mining pipeline that can fuse the knowledge from these different datasets into an enriched and full explanation of urban mobility dynamics. The second challenge lies in using this information to set the parameters of a MATSim scenario automatically, which would not only significantly lower the effort required to set up simulation scenarios but would also lead to more realistic results. This will ultimately serve as a platform to test the viability of policy and infrastructure decisions before they are implemented, and to guide and inform the urban and transport planning process.

Figure 2. Big Data-driven MATSim

Witnessing Sao Paulo’s Mobility transformation

Besides the workshop, I took the opportunity to experience some of the results of the city of Sao Paulo’s recent push to adopt sustainable transportation policies. These initiatives primarily target the notorious traffic congestion that the 21 million inhabitants of the metropolitan area suffer from. With the introduction of the ‘Bilhete Único’ in 2004, a smart card automatic fare collection system for public transport, citizens are being incentivised to opt for public transport through standard fares regardless of distance or number of connections. The data on mobility patterns that this system generates every day would also be an ideal source for setting up a Big Data-driven urban transport simulation. In addition, Sao Paulo’s municipality has recently made major investments in bicycle infrastructure along the main avenues of the city, including the symbolic Avenida Paulista (Figure 3), although my colleagues at FCL who study how street design can support active mobility think that there is potential to make cyclists feel more comfortable and safe on this major artery ;-).

Figure 3. Bicycle lane in Avenida Paulista