The Privacy Risks of Compiling Big Data

Repost of the MIT News Office article by Rob Matheson

A new study by MIT researchers finds that the growing practice of compiling massive, anonymized datasets about people’s movement patterns is a double-edged sword: While it can provide deep insights into human behavior for research, it could also put people’s private data at risk.

Companies, researchers, and other entities are beginning to collect, store, and process anonymized data that contains “location stamps” (geographical coordinates and time stamps) of users. Data can be grabbed from mobile phone records, credit card transactions, public transportation smart cards, Twitter accounts, and mobile apps. Merging those datasets could provide rich information about how humans travel, for instance, to optimize transportation and urban planning, among other things.

But with big data come big privacy issues: Location stamps are extremely specific to individuals and can be used for nefarious purposes. Recent research has shown that, given only a few randomly selected points in mobility datasets, someone could identify and learn sensitive information about individuals. With merged mobility datasets, this becomes even easier: An agent could potentially match users' trajectories in anonymized data from one dataset with deanonymized data in another to unmask the anonymized data.

In a paper published in IEEE Transactions on Big Data, MIT researchers show how this can happen in the first-ever analysis of so-called user “matchability” in two large-scale datasets from Singapore, one from a mobile network operator and one from a local transportation system.

The researchers use a statistical model that tracks location stamps of users in both datasets and provides a probability that data points in both sets come from the same person. In experiments, the researchers found the model could match around 17 percent of individuals in one week’s worth of data, and more than 55 percent of individuals after one month of collected data. The work demonstrates an efficient, scalable way to match mobility trajectories in datasets, which can be a boon for research. But, the researchers warn, such processes can increase the possibility of deanonymizing real user data.
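The paper's statistical model is more sophisticated than this, but the core idea of scoring how strongly two pseudonymous users' location stamps co-occur can be sketched in a few lines. Everything below (the grid cells, time bins, Jaccard similarity, and threshold) is an illustrative assumption, not the authors' actual method:

```python
from collections import defaultdict

def stamp_sets(records):
    """Group (user, location_cell, time_bin) records into per-user sets of stamps."""
    by_user = defaultdict(set)
    for user, cell, t in records:
        by_user[user].add((cell, t))
    return by_user

def match_users(dataset_a, dataset_b, threshold=0.5):
    """For each user in A, find the user in B whose location stamps
    overlap most (Jaccard similarity), keeping matches above a threshold."""
    a, b = stamp_sets(dataset_a), stamp_sets(dataset_b)
    matches = {}
    for ua, sa in a.items():
        best, best_score = None, 0.0
        for ub, sb in b.items():
            score = len(sa & sb) / len(sa | sb)
            if score > best_score:
                best, best_score = ub, score
        if best_score >= threshold:
            matches[ua] = best
    return matches

# Toy data: anonymized IDs in dataset A, pseudonymous IDs in dataset B.
A = [("a1", "cell_3", 9), ("a1", "cell_7", 18), ("a2", "cell_5", 12)]
B = [("u9", "cell_3", 9), ("u9", "cell_7", 18), ("u4", "cell_1", 8)]
print(match_users(A, B))  # a1 and u9 share identical stamps, so they match
```

Even this crude overlap score hints at why longer observation windows help: the more stamps a user accumulates, the more distinctive (and hence matchable) their trajectory becomes.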

“As researchers, we believe that working with large-scale datasets can allow discovering unprecedented insights about human society and mobility, allowing us to plan cities better. Nevertheless, it is important to show if identification is possible, so people can be aware of potential risks of sharing mobility data,” says Daniel Kondor, a postdoc in the Future Urban Mobility Group at the Singapore-MIT Alliance for Research and Technology.

“In publishing the results — and, in particular, the consequences of deanonymizing data — we felt a bit like ‘white hat’ or ‘ethical’ hackers,” adds co-author Carlo Ratti, a professor of the practice in MIT’s Department of Urban Studies and Planning and director of MIT’s Senseable City Lab. “We felt that it was important to warn people about these new possibilities [of data merging] and [to consider] how we might regulate it.”

The co-authors of the study are Behrooz Hashemian, a postdoc at the Senseable City Lab, and Yves-Alexandre de Montjoye of the Department of Computing and Data Science Institute at Imperial College London.

Read Rob Matheson's full article here.