A single trajectory is a tragedy, 1.2 million is Big Data.

Pieter Fourie, Cuauhtèmoc Anda and Sergio Ordonez

There is still such a thing as bad publicity, as a recent New York Times exposé on app-driven person tracking confirms. Here’s how to stay out of the headlines by rolling your own data. We have developed methods that allow data stewards to stream completely synthetic location trails, which can fulfil the needs of many location-based services, and unconditionally guarantee individual privacy.

Three days, 1.2 million devices, 235 million locations, reads the tagline of a recent article on mobile data privacy in the New York Times. As a mobility researcher in the age of big data, one becomes inured to the staggering numbers involved in location services data gathering (one of many euphemisms for persistent individual location tracking). The bigger the numbers are, the better! More data means better models. Moreover, nobody is forcing the consumer to use these apps and services, right? Everyone is a knowing, willing participant. However, viewed through less familiar eyes, these numbers represent an Orwellian nightmare in the making.

The NY Times Daily podcast does an excellent job of shaking us out of our complacency, revealing the tragedy of vulnerable individuals whose privacy gets sold en masse to the highest bidder. It raises the question: what are our alternatives? The hottest topics in urban research and responsive cities all hint at some degree of surveillance. Connected devices, digital twins, the Internet of Things and Mobility-as-a-Service all require us to become more connected. As a result, information about us and our movements is at risk of being used more widely by an increasing number of actors.

The NY Times article raises many issues around a lack of policy and oversight in the field of location tracking and exposes its personal, societal, institutional and corporate dimensions. These are all tough but pertinent problems. As we come to terms with living in an ever-more connected world, it is worthwhile to discuss some technological considerations to inform our decisions.

Data privacy preservation techniques

In the initial phase of their investigation, the journalists’ queries to data providers were met with claims that data were being aggregated or anonymised. What do aggregation and anonymisation mean? Generally, they mean either bundling data points together so individuals cannot be told apart, or masking identifying information about them. When it comes to data on people’s movement, this becomes a tricky task, as the growing literature on the topic attests. Nevertheless, let’s assume that you somehow have a sufficiently robust approach to protecting people’s privacy. Then the next question becomes: how should you apply this protection?

Privacy-by-design vs post-processing

A data collector generally has two options for enforcing privacy preservation. It can be embedded in the device, meaning that no individually identifying information is ever recorded, or it can be applied as a post-processing technique after a fully detailed data set has been obtained. The second option is vulnerable to compromise, as evidenced by the NY Times journalists who managed to get hold of a motherlode of raw data.

It may then be preferable to have devices that can be hard-coded and certified by an authority in order to enforce privacy preservation by design. Such data can be recorded in an auditable distributed ledger to uniquely associate each data point with an identified device. Encryption techniques make the device and its data tamper-proof. Authorities can start to insist on and enforce such end-to-end protection standards and certifications, similar to how they enforce the installation of sealed, tamper-proof taxi and electricity meters. Several projects are underway to enable this technology, some with a specific orientation towards mobility applications, such as IOTA, TravelSpirit and IoMob.

Synthetic data: an alternative to privacy preservation techniques

As the technology for protecting and anonymising individual trajectories develops, so does the body of shadow literature on de-anonymising algorithms that attempt to reconstruct the individual traces. This means that a responsible data collector might go ahead and invest in an array of certified devices, only to find that their privacy protection gets defeated sometime later, in an unending privacy protection arms race.

This is the motivation for our interest in developing an alternative to typical location masking techniques. What if we could create synthetic location data streams with the same resolution in time and space as what is actually sensed through devices, without reproducing any given trajectory in the real world? In our research on building such synthetic data streams, we use techniques that intentionally restrict the view of the actual raw data to machine eyes only. They can be hard-coded into purpose-built, certified devices that are only capable of recording and releasing aggregate statistics.

The synthetic location data generation operates in two steps: an encoding step, which produces the aggregates and can be audited and certified, and a reconstruction step, which produces synthetic data with the same aggregate statistics as the real data. We are developing two distinct techniques to implement this.

Multiple histogram matching

The first technique is repeated histogram matching in a high-dimensional space. This approach re-purposes an old statistical technique to ‘sculpt’ a synthetic dataset until it looks like the real data. This is achieved by repeatedly ‘raking’ the synthetic data along various directions in a multidimensional space.

Figure 1 illustrates the first four steps for a two-dimensional spatial data example. Note how the blue points (synthetic) grow to resemble the orange ones (real) more as we go from images 1 to 4.

Figure 1: Illustration of the iterative multiple histogram matching process against a two-dimensional target histogram.
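
To make the raking idea more concrete, here is a minimal sketch in Python for the two-dimensional case. It is an illustration under our own assumptions, not the implementation used in the research: each aggregate is stored as the bin edges of an equal-mass histogram (a quantile table) of the real points projected onto one direction, the directions are evenly spaced in the plane, and the function names and parameters (unit_directions, encode_tables, match_1d, reconstruct, n_bins, sweeps) are hypothetical.

```python
import numpy as np

def unit_directions(n_dirs):
    """Evenly spaced unit vectors in the plane (an assumed choice of directions)."""
    angles = np.linspace(0.0, np.pi, n_dirs, endpoint=False)
    return np.column_stack([np.cos(angles), np.sin(angles)])

def encode_tables(real, directions, n_bins=32):
    """Encoding step: keep one 1-D quantile table per direction.

    Each table holds the edges of an equal-mass histogram of the projected
    real points; only these aggregates are returned, not the points.
    """
    levels = np.linspace(0.0, 1.0, n_bins + 1)
    tables = [np.quantile(real @ d, levels) for d in directions]
    return levels, tables

def match_1d(values, levels, table):
    """Reshape values so that their distribution follows the target table."""
    # Rank of each value, mapped to a quantile level strictly inside (0, 1).
    ranks = (np.argsort(np.argsort(values)) + 0.5) / len(values)
    # Inverse CDF of the target, by linear interpolation between bin edges.
    return np.interp(ranks, levels, table)

def reconstruct(levels, tables, directions, n_points, sweeps=20, seed=0):
    """Reconstruction step: rake random points along each direction in turn."""
    rng = np.random.default_rng(seed)
    synth = rng.normal(size=(n_points, directions.shape[1]))
    for _ in range(sweeps):
        for k in rng.permutation(len(directions)):
            d = directions[k]
            proj = synth @ d
            target = match_1d(proj, levels, tables[k])
            # Shift every point along d only, so its projection hits the target.
            synth += np.outer(target - proj, d)
    return synth
```

Given real points as an (n, 2) array, one would call encode_tables once on the real data and then hand only the resulting tables and directions to reconstruct. How many directions, bins and sweeps are needed for a good match is something to tune for the data at hand.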

Graphical generative techniques

The second approach is to generate synthetic data using a so-called ‘Traveller Generation Machine’. This approach comes from the domain of machine learning and is a so-called ‘graphical generative technique’. Here, the word graphical refers to a graph in the computer science sense, i.e. a map of relationships (edges) between quantities (nodes).

The ‘Traveller Generation Machine’ identifies a minimal set of aggregate information to be released in order to produce synthetic data that closely resembles the real thing. Unlike histogram matching, this approach requires structural knowledge of the data, that is, how one quantity relates to another in the dataset.

Figure 2: Generative Model for urban mobility data

Take E1 in the model in Figure 2 as an example. The arrows leading to E1 can be interpreted as follows: given S1 (defined as the start time of the first activity in the day) and Z1 (the geographical area of the first activity in the day), E1 (the end time of the first travel activity in the day) has the following likely values: … Note how the description intentionally reflects the fact that the machine does not record the information of any given individual, only aggregates.
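
The E1 relationship described above can be sketched in a few lines of Python. This is only an illustration of the encode-then-sample idea for that one fragment of the graph, not the Traveller Generation Machine itself. We assume that times and zones have already been discretised into categories, and, as a simplification, we count the two parents S1 and Z1 jointly rather than following the rest of the structure in Figure 2. The function names (encode_counts, draw, generate) are hypothetical.

```python
import random
from collections import Counter, defaultdict

def encode_counts(records):
    """Encoding step: reduce (s1, z1, e1) records to count tables only."""
    parents = Counter()                 # how often each (S1, Z1) pair occurs
    e1_given = defaultdict(Counter)     # counts of E1 for each (S1, Z1) pair
    for s1, z1, e1 in records:
        parents[(s1, z1)] += 1
        e1_given[(s1, z1)][e1] += 1
    return parents, e1_given

def draw(counter, rng):
    """Draw one value with probability proportional to its count."""
    values = list(counter)
    weights = [counter[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]

def generate(tables, n, seed=0):
    """Reconstruction step: sample parents first, then E1 given the parents."""
    parents, e1_given = tables
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        s1, z1 = draw(parents, rng)
        e1 = draw(e1_given[(s1, z1)], rng)
        synthetic.append((s1, z1, e1))
    return synthetic
```

Only the two count tables cross from the encoding step to the generation step, which mirrors the point made above: the generator sees likely values of E1 for each combination of S1 and Z1, never an individual record.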

Limitations

The two approaches were developed with the objective of providing reassurance to data providers when releasing synthetic data streams in the form of complete day travel trajectories of individuals. However, this data will not satisfy the needs of certain service providers, who insist on communicating directly with specific individuals who are in a specific place at a specific time.

If, as a society, we still want to sign on for this invasive form of direct marketing, in spite of the dangers reported in the NY Times article, then we should at least insist on limits to the number of locations that may be recorded in sequence. According to De Montjoye et al. (2013), knowing only four location points in a sequence may be sufficient to uniquely identify most people in a dataset.
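
To give a feel for how such a claim can be checked on a dataset, here is a rough sketch in Python of a uniqueness test. It is our own simplified illustration and not the procedure from the cited paper: each person’s trace is assumed to be a set of (place, hour) points, and the function name unicity and its parameters are hypothetical.

```python
import random

def unicity(traces, p=4, seed=0):
    """Fraction of people singled out by p points drawn from their own trace.

    traces maps a person identifier to a set of (place, hour) points.
    """
    rng = random.Random(seed)
    eligible = [person for person, pts in traces.items() if len(pts) >= p]
    if not eligible:
        return 0.0
    unique = 0
    for person in eligible:
        # Pretend an observer knows p points of this person's trace.
        known = set(rng.sample(sorted(traces[person]), p))
        # Count every trace in the dataset that contains all known points.
        matches = sum(1 for pts in traces.values() if known <= pts)
        if matches == 1:
            unique += 1
    return unique / len(eligible)
```

Running this for increasing values of p on a real dataset would show how quickly a handful of recorded points narrows the candidates down to a single person, which is the reason for limiting the length of recorded sequences.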

However, we are growing increasingly confident that this approach of relying on entirely synthetic datasets is sufficient for very detailed urban and transport planning, as well as location-based services that do not rely on real-time interaction with individuals. Feeding this synthetic data into a state-of-the-art mobility simulation such as MATSim represents the next step in producing an entire ‘doppelgänger city’ to test, probe and experiment with policy decisions, while leaving people in the real world safe and surveillance-free.

We thank our editors Tanvi Maheshwari and Geraldine Ee for their efforts in compiling this post.

Meet us at TRB 2019

Our group’s research will be presented at the 98th Annual Meeting of the Transportation Research Board in Washington, DC, January 13-17, 2019 in the following sessions:

Hands-On Workshop for Virtual Reality in Stated Response Research

Sunday, January 13, 2019, 1:30 PM-4:30 PM, Convention Center

Zachary Patterson, Concordia University, presiding, Michael van Eggermond

Sponsored by Standing Committee on Travel Survey Methods; Standing Committee on Urban Transportation Data and Information Systems; and Standing Committee on Traveler Behavior and Values

A main challenge of the use of virtual reality (VR) in stated response surveys is actually putting together a VR environment. After short presentations on recent VR surveys, attendees will learn how to set up a basic virtual environment for stated response survey applications with the soon-to-be open-source Virtual Immersive Reality Environment platform developed by Bilal Farooq of Ryerson University. Other VR platforms also will be sought for inclusion in the workshop.

Processing cycling risk under different elicitation methods: comparing 2D and 3D in virtual reality choice environments

Martyna Bogacz, Chiara Calastri, Charisma Choudhury, Stephane Hess, Alex Erath, Michael Van Eggermond, Faisal Mushtaq

Collecting and Analyzing Pedestrian and Bicyclist Data, Monday, January 14, 10:15 AM-12:00 PM, Hall A / Convention Center

The aim of this study is to provide a better understanding of cyclists’ risk perception in different scenarios under different elicitation methods. In particular, 2D computer-based videos and 3D virtual reality simulations of road situations are contrasted. We collect data on cyclists’ behavioural responses in risky conditions and their stated responses on propensity to cycle and risk perception. Electroencephalography (EEG) is used to gain insight into the temporal sequence of cortical risk processing, which gives a better understanding of neural mechanisms underlying choices. In addition, this study provides the validation of virtual reality as a tool for risk preference elicitation. Our results are in line with expectations: they show behavioural responses in line with the stimuli of the scenarios and an effect of the elicitation method, e.g. the perception of the riskiest elements seems to be exacerbated in 3D. Overall, we show that the 3D presentation method has an impact on the neural processing of risk and that it changes not only the way people perceive risk but also their behaviour. The findings provide useful insights about data collection in the context of cycling behaviour and beyond.

Operator and User Perspectives on Fleet Mix, Parking Strategy and Drop-Off Bay Size for Autonomous Transit on Demand

Biyu Wang, Sergio Arturo Ordonez Medina, Pieter Jacobus Fourie
Parking Potpourri, Monday, January 14, 1:30 PM-3:15 PM, Hall A / Convention Center

Autonomous vehicles (AVs), and in particular shared autonomous transit on demand (ATOD), promise many efficiencies in future transport provision, and may lead to concomitant changes in urban form. Considering the effects of car-oriented planning on the livability, efficiency and sustainability of 20th century cities, there is growing interest in how we may anticipate the changes that this disruption will bring about. Parking and pick-up drop-off infrastructures are among the aspects which may change travel behaviour in the upcoming era of AVs. In the paper, three different parking strategies as well as four types of pick-up drop-off infrastructures are simulated to assess their influence on users and operators. The studied parking strategies include demand-based roaming, parking on the street and parking in depots. The four types of pick-up drop-off interfaces are infinity bay, demand-based bay, curbside and single vehicle. The proposed fleet includes 3 vehicle sizes: 4-, 10- and 20-seaters for sharing mobility, and 1-seaters for private mobility. Combinations of different parking strategies and different pick-up drop-off infrastructures were evaluated from the perspective of travel time, walk distance, vehicle occupancy, rejected requests and vehicle kilometers traveled. Results show that strategies produce radically different utilization of vehicles to provide the same minimum service level for a particular study area in Singapore. We conclude that urban designers and policy-makers need to consider these as important parameters when designing or retrofitting neighborhoods if they want to maximize potential benefits from this new transportation mode.

Studying Cyclists’ Behavior in a Non-naturalistic Experiment Utilizing Cycling Simulator with Immersive Virtual Reality

Transportation Issues and Solutions in Major Cities, Wednesday, January 16, 2019, 2:30 PM-4:00 PM, Hall A / Convention Center

Mohsen Nazemi, Michael van Eggermond, Alex Erath, Kay W. Axhausen

This study investigates the combination of immersive virtual reality (VR) and an instrumented cycling simulator for in-depth behavioral studies of cyclists. To this end, a cycling simulator was developed, and virtual environments resembling Singapore were created and combined with the output of a traffic microsimulation. This set-up was created with the specific objective of evaluating the effects of environment properties and road infrastructure designs on cyclists’ perceived safety. Forty participants, mainly university students, were recruited for the experiment. Results showed that the average speed of the participants changes between scenes with different bicycle facilities, with the highest value for the segregated bicycle path. The braking and head movement activities also changed within each scene, occurring significantly more often before arrival at the intersections. Questionnaire results revealed that adding a painted bicycle path to a sidewalk increases the level of perceived safety. Moreover, participants felt safest cycling on the segregated bicycle path, in line with findings from previous research. This study provides evidence that cyclists’ behavior and perceptions in VR are very similar to reality and that VR, combined with a cycling simulator, is suitable to communicate (future) cycling facilities.