I want to debunk the myth that location data can be anonymized. This is false in most practical circumstances where the movement of individuals is concerned, and any statement to the contrary should raise suspicion. Here’s why.
This week the Toyota Motor Corporation disclosed a data breach exposing the location information of more than 2 million of its customers’ vehicles. Due to a configuration error, the data was exposed on the internet for more than ten years. The leaked data included the vehicle’s unique chassis number (also called vehicle identification number or VIN), location information, and time of day.
Reporting on the leak, Bleeping Computer was quick to reassure its readers, explaining that: “It is important to note that the exposed details do not constitute personally identifiable information, so it wouldn’t be possible to use this data leak to track individuals unless the attacker knew the VIN (vehicle identification number) of their target’s car.”
In its original notice, Toyota was a bit more cautious in its choice of words: “This time, customer information that may have been viewed from the outside will not identify the customer based on this data alone, even if accessed from the outside” (machine translation). Note the disclaimer “based on this data alone”.
Except that the first claim is wrong and the second one is misleading.
Over the past few years, we have seen many similar datasets containing so-called anonymous or de-identified location data, some leaked as a result of a data breach, others sold on the open market. I would like to take this opportunity to clarify that, in most cases, the claim that the data is not personally identifiable is false. This is because the only way to anonymize location data is to reduce its accuracy, which makes it lose its commercial value.
Location data collected through GPS-enabled devices such as smartphones, wearables, or vehicles is often considered anonymized because it does not directly contain any PII. However, this assumption is far from accurate. In fact, it is relatively trivial to de-anonymize any dataset containing the location data of a set of individuals and reveal their identities. This is because most people spend the majority of their weekdays at work and most of their nights at home, allowing for the identification of their home and work addresses.
By analyzing the patterns of the location data in Toyota’s or any of the other datasets, one can easily discern the location of each individual’s home and workplace. This is possible through a process known as “clustering,” where data points are grouped based on their proximity. For example, if a particular individual’s data points are frequently found in a specific area during nighttime hours, it is reasonable to assume that this area represents their home address. Similarly, if the data points are frequently found in another area during daytime working hours, it is likely to be their workplace.
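The idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used on any particular dataset: each record is a hypothetical (timestamp, latitude, longitude) tuple belonging to a single pseudonymous ID, proper clustering is replaced by a simple grid (rounding to three decimals, roughly 100 m), and "night" and "working hours" are arbitrarily taken as 22:00 to 06:00 and weekdays 09:00 to 17:00. The function names, thresholds, and sample coordinates are all made up for the example.

```python
from collections import Counter
from datetime import datetime


def cell(lat, lon, places=3):
    """Snap a coordinate to a grid cell (3 decimals is roughly 100 m)."""
    return (round(lat, places), round(lon, places))


def infer_home_and_work(points):
    """Guess home and work cells for ONE pseudonymous ID.

    points: iterable of (timestamp: datetime, lat: float, lon: float).
    Returns (home_cell, work_cell); either may be None if there is no data.
    """
    night, day = Counter(), Counter()
    for ts, lat, lon in points:
        if ts.hour >= 22 or ts.hour < 6:
            # Assumed "at home" window.
            night[cell(lat, lon)] += 1
        elif ts.weekday() < 5 and 9 <= ts.hour < 17:
            # Assumed "at work" window (Monday to Friday).
            day[cell(lat, lon)] += 1
    home = night.most_common(1)[0][0] if night else None
    work = day.most_common(1)[0][0] if day else None
    return home, work


# Made-up sample track for a single vehicle.
sample = [
    (datetime(2023, 5, 15, 23, 0), 35.68123, 139.76712),   # Monday night
    (datetime(2023, 5, 16, 2, 0), 35.68119, 139.76708),    # early Tuesday
    (datetime(2023, 5, 16, 10, 0), 35.65861, 139.74543),   # Tuesday morning
    (datetime(2023, 5, 16, 14, 0), 35.65858, 139.74539),   # Tuesday afternoon
    (datetime(2023, 5, 16, 12, 30), 35.67000, 139.75000),  # lunch elsewhere
]
home, work = infer_home_and_work(sample)
print("home cell:", home, "work cell:", work)
```

A real analysis would likely use a density-based clustering algorithm and weeks of data, but even this crude grid count is enough to show why the most frequent night-time and daytime locations stand out.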
Once each individual’s home and work addresses have been identified, it is only a matter of cross-referencing this information with publicly available datasets, such as voter registration records, property ownership databases, and social media profiles. Worse, this process can be done automatically for the entire dataset.
The process can also be reversed, starting with a particular individual we want to target. We first find their home and work addresses. This can be achieved by utilizing various online resources, such as search engines, social media platforms, and public records. We can then find the coordinates of these addresses using, for example, Google Maps. Once the coordinates are obtained, one can locate the individual’s entries in the anonymized location dataset and analyze their every movement, potentially revealing sensitive information about their habits, preferences, and social interactions.
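The reverse lookup can be sketched the same way. The snippet below assumes a hypothetical dataset of (pseudonymous ID, latitude, longitude) records and a target whose home and work coordinates are already known; it returns the IDs that appear within a chosen radius of both places. The record layout, the function names, and the 150 m radius are assumptions for illustration only.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * r * math.asin(math.sqrt(min(1.0, a)))


def candidate_ids(records, home, work, radius_m=150):
    """Return pseudonymous IDs seen near BOTH the known home and work.

    records: iterable of (pseudo_id, lat, lon)
    home, work: (lat, lon) tuples for the target
    """
    near_home, near_work = set(), set()
    for pid, lat, lon in records:
        if haversine_m(lat, lon, home[0], home[1]) <= radius_m:
            near_home.add(pid)
        if haversine_m(lat, lon, work[0], work[1]) <= radius_m:
            near_work.add(pid)
    return near_home & near_work


# Made-up records: "A" visits both places, "B" only passes near the home.
records = [
    ("A", 35.68125, 139.76710),
    ("A", 35.65860, 139.74540),
    ("B", 35.68120, 139.76715),
    ("B", 35.70000, 139.80000),
]
print(candidate_ids(records, home=(35.6812, 139.7671), work=(35.6586, 139.7454)))
```

Requiring a match at two locations is what makes the lookup so selective: many vehicles may park near a given home, but very few will also park near the same workplace.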
The issue is not that location data cannot be anonymized; rather, doing so puts companies in a catch-22. Location data can be anonymized by decreasing the precision of each data point, for instance, by rounding the coordinates to reduce accuracy from a few meters to about a kilometer. This approach would make our de-anonymization method ineffective, as it would be challenging to differentiate between multiple individuals’ home addresses appearing in the same location.
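As a rough illustration of this kind of coarsening: one degree of latitude is about 111 km, so rounding to two decimal places yields cells of roughly 1.1 km north to south (the east to west size shrinks with latitude). The sketch below uses a hypothetical `coarsen` helper and two made-up addresses a couple of hundred meters apart; it shows one simple way to reduce precision, not a complete anonymization scheme.

```python
def coarsen(lat, lon, places=2):
    """Round a coordinate to `places` decimals (2 decimals is about 1.1 km of latitude)."""
    return (round(lat, places), round(lon, places))


# Two made-up homes a few hundred meters apart.
house_a = (35.68123, 139.76712)
house_b = (35.68290, 139.76850)

# At roughly 100 m precision the two homes are still distinguishable...
print(coarsen(*house_a, places=3), coarsen(*house_b, places=3))

# ...but at roughly 1 km precision they collapse into the same cell.
print(coarsen(*house_a), coarsen(*house_b))
```

Once many households share a cell, the most frequent night-time location no longer points to a single address.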
However, this method of anonymization also makes the location data lose its commercial value. This is because the anonymized data cannot be used to learn anything interesting about the individual and, therefore, cannot be used to provide personalized services or be sold to other businesses, such as data brokers – both of which are the primary reasons companies collect location data in the first place. While anonymized location data remains useful for other objectives, like collecting statistics, these can be achieved without retaining a detailed location history.
Our working assumption should therefore be that an individual’s location data is Personally Identifiable Information.
Photo by Dennis Kummer on Unsplash