“…oh, no don’t worry, we’ve anonymised the data so it’s no problem. We’ve removed all the identifying information, we’ve removed the name and personal number.” – Far too common.
On May 25, 2018, the European Union General Data Protection Regulation (GDPR) becomes law in all EU member states. Among other things, it requires both data controllers – the people who collect and process personal data – and data processors – those who supply infrastructure to store and process personal data – to take continuous measures to ensure that personal data doesn’t fall into the wrong hands.
The requirements set by the law are neither groundbreaking nor excessive from a technical standpoint. For those of us working in security, the requirements are meat and potatoes:
Taking into account the state of the art, the costs of implementation and the nature, scope, context and purposes of processing as well as the risk of varying likelihood and severity for the rights and freedoms of natural persons, the controller and the processor shall implement appropriate technical and organisational measures to ensure a level of security appropriate to the risk, including inter alia as appropriate:
(a) the pseudonymisation and encryption of personal data;
(b) the ability to ensure the ongoing confidentiality, integrity, availability and resilience of processing systems and services;
(c) the ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident;
(d) a process for regularly testing, assessing and evaluating the effectiveness of technical and organisational measures for ensuring the security of the processing. (Article 32, General Data Protection Regulation, Regulation (EU) 2016/679)
This is what the IT security industry has been preaching since forever. Nothing new under the sun, except that it is now written into law, at least for personal data, and with a side of hefty fines.
Much can and will be said about the GDPR and about what impact it will have on society from an IT security standpoint. In this blog post, however, we would like to address one retort that, at least in our experience, is common when it comes to securing platforms that handle personal data:
“No need, we have anonymised the data”
Now, if the data is truly anonymised, then of course it is no longer personal data and the GDPR doesn’t apply (although securing your platform may still be a good idea). But more often than not, data that is thought to be anonymised turns out not to be. This is the reason that the legislation instead uses the term pseudonymised – given a false name.
Identifiers
Identifiers are those attributes that can be used to directly identify a person. A name and a personal number are prime examples.
The GDPR removes a few grey areas when it comes to identifiers. For instance, it makes clear that technical and online identifiers are indeed identifiers and thus personal data. Log files containing IP addresses, IMEI numbers and the like therefore contain personal data and need to be handled appropriately.
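As a rough illustration of what pseudonymising such a log file could look like, the sketch below replaces IPv4 addresses with a keyed hash. The key, the function names and the sample log line are our own assumptions for the example, not something the regulation prescribes. Note that the result is pseudonymised, not anonymised: anyone holding the key can recompute the mapping, and the same visitor still gets the same pseudonym on every line.

```python
import hashlib
import hmac
import re

# Hypothetical secret key for this example; it would have to be kept
# separate from the logs, otherwise the pseudonyms are easy to reverse.
SECRET_KEY = b"replace-with-a-random-secret"

# Simple IPv4 matcher, good enough for an illustration (no IPv6, no validation).
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def pseudonymise_ip(ip: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable pseudonym for an IP address using HMAC-SHA256."""
    digest = hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
    return "ip-" + digest[:16]


def pseudonymise_log_line(line: str, key: bytes = SECRET_KEY) -> str:
    """Replace every IPv4 address in a log line with its pseudonym."""
    return IPV4_PATTERN.sub(lambda m: pseudonymise_ip(m.group(0), key), line)


if __name__ == "__main__":
    # Made-up log line using a documentation-range address.
    print(pseudonymise_log_line("203.0.113.7 - - GET /index.html"))
```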
Quasi-identifiers
Quasi-identifiers are a set of attributes that can be used to identify a person indirectly. The main purpose of an identifier (like a name or personal number) is to identify a person. The main purpose of a quasi-identifier, however, is not to identify a person, yet it can still be used to do so.
Quasi-identifiers are attributes that, in combination with other quasi-identifiers, are unique to a single individual. Which attributes act as quasi-identifiers may vary from person to person, depending on how rare an attribute is or how rare the combination of attributes is.
An example of such attributes:
- Age
- Occupation
- Municipality (sv. Kommun)
- Gender
These are enough to uniquely identify approximately 1% of the Swedish population – 85% are identified down to a group of 256 individuals.
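(Flashover based on SCB tables: “Anställda 16-64 år med bostad i regionen (nattbef) efter län, yrke (3-siffrig SSYK 2012), ålder och kön. År 2014” [Employees aged 16-64 residing in the region (night-time population) by county, occupation (3-digit SSYK 2012), age and sex. Year 2014] and “Folkmängden efter region, civilstånd, ålder och kön. År 1968 – 2015” [Population by region, marital status, age and sex. Years 1968 – 2015])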
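To make the idea of group sizes concrete, here is a minimal sketch of how one might count how many records share each combination of quasi-identifiers. The records are made up for the example and have nothing to do with the SCB tables; the function names are ours.

```python
from collections import Counter

# Made-up records: (age, occupation, municipality, gender).
records = [
    (34, "nurse", "Uppsala", "F"),
    (34, "nurse", "Uppsala", "F"),
    (34, "nurse", "Uppsala", "F"),
    (51, "welder", "Kiruna", "M"),
    (51, "welder", "Kiruna", "M"),
    (67, "organ builder", "Arjeplog", "F"),
]


def group_sizes(rows):
    """Count how many records share each combination of quasi-identifiers."""
    return Counter(rows)


def share_within(rows, max_group_size):
    """Fraction of records whose group has at most max_group_size members."""
    sizes = group_sizes(rows)
    hits = sum(1 for row in rows if sizes[row] <= max_group_size)
    return hits / len(rows)


if __name__ == "__main__":
    print(share_within(records, 1))  # share of records that are unique
    print(share_within(records, 2))  # share narrowed down to 2 or fewer
```

In this toy data only the organ builder is unique, while the welders are narrowed down to a group of two. Run against real population tables, the same counting gives figures of the kind quoted above.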
To single out one individual within such a group of 256, only 8 bits of information are needed, since 2^8 = 256. That is, a unique set of 8 likes/dislikes or a unique set of approximately 3.5 star ratings (on a five-point scale, where each rating carries about 2.3 bits).
In fact – the attributes age, occupation, municipality and gender are not even needed if a unique set of 10 star ratings or 21 likes/dislikes is used.
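The arithmetic behind the 8-bit figure can be sketched as follows. We assume a five-point star scale and that every answer is independent and equally likely, which is the best case for the attacker; real ratings carry less information per answer, so more of them would be needed. The function names are our own.

```python
import math


def bits_needed(group_size: int) -> float:
    """Bits of information needed to single out one member of a group."""
    return math.log2(group_size)


def answers_needed(group_size: int, options_per_answer: int) -> float:
    """Number of answers needed if each answer has the given number of
    equally likely options (e.g. 2 for like/dislike, 5 for a star rating)."""
    return bits_needed(group_size) / math.log2(options_per_answer)


if __name__ == "__main__":
    print(bits_needed(256))                   # 8.0
    print(answers_needed(256, 2))             # 8.0 likes/dislikes
    print(round(answers_needed(256, 5), 2))   # 3.45 star ratings
```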
How quasi-identifiers can be used to identify an individual
Quasi-identifiers cannot be used to directly identify an individual (then they would be identifiers). Instead, they can be used to find the same individual in another dataset, one where the individual is identified.
If the set of attributes is unique to an individual and the same set of attributes is present elsewhere, the quasi-identifiers can be used to link the two records together across the two datasets and thereby establish the identity.
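A linkage of this kind takes very little code. The sketch below joins a made-up "anonymised" dataset with a made-up identified one on the four attributes from the example above. All names, records and field names are invented for the illustration.

```python
QUASI_IDENTIFIERS = ("age", "occupation", "municipality", "gender")

# Made-up "anonymised" dataset: names removed, quasi-identifiers kept.
anonymised = [
    {"age": 67, "occupation": "organ builder", "municipality": "Arjeplog",
     "gender": "F", "sensitive": "record 1"},
    {"age": 34, "occupation": "nurse", "municipality": "Uppsala",
     "gender": "F", "sensitive": "record 2"},
]

# Made-up second dataset where the same attributes appear next to a name.
identified = [
    {"name": "Person A", "age": 67, "occupation": "organ builder",
     "municipality": "Arjeplog", "gender": "F"},
    {"name": "Person B", "age": 34, "occupation": "nurse",
     "municipality": "Uppsala", "gender": "F"},
    {"name": "Person C", "age": 34, "occupation": "nurse",
     "municipality": "Uppsala", "gender": "F"},
]


def qi_key(record):
    """Tuple of quasi-identifier values used to match records."""
    return tuple(record[attr] for attr in QUASI_IDENTIFIERS)


def link(anonymised_rows, identified_rows):
    """Return (name, anonymised_row) pairs where the quasi-identifiers
    point to exactly one person in the identified dataset."""
    by_key = {}
    for person in identified_rows:
        by_key.setdefault(qi_key(person), []).append(person)

    matches = []
    for row in anonymised_rows:
        candidates = by_key.get(qi_key(row), [])
        if len(candidates) == 1:  # unique combination: identity established
            matches.append((candidates[0]["name"], row))
    return matches


if __name__ == "__main__":
    for name, row in link(anonymised, identified):
        print(name, "->", row["sensitive"])
```

Here the organ builder is re-identified because her combination of attributes is unique, while the nurse is only narrowed down to two candidates. A handful of additional attributes, such as the ratings discussed earlier, would be enough to tell those two apart.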