Anonymization vs Pseudonymization
The term anonymization has often been used casually to describe what is now more appropriately termed pseudonymization.
Data masking in general should be regarded as pseudonymization.
Anonymization is the process of transforming data such that the re-identification of the original data is impossible.
At first glance, it's tempting to think "Well, isn't that what data masking is for?". The short answer is "Maybe", but this is less likely as the data sample increases in size and complexity.
Key factors to consider:
Size of data sample.
Is the data sample size small or trivial (one row) or large (hundreds of millions of rows)?
Number of associated identifiers.
Are only a few identifiers present or are there dozens? E.g. Phone numbers, postal codes, birth dates, occupation, etc.
Another factor is the availability of statistical data relating to identifiers (such as name frequencies, demographics, public records, etc.) and of actual identifiable personal information. Given that such data sources have become prevalent and accessible, it should be assumed that they are available to anyone determined to re-identify data.
As an example, consider that the data to be anonymized consists of only two columns, Given Name and Family Name, that there are only 100 rows, and that each name is distinct (or nearly so). This would be easy to anonymize: just mask each instance of the original name with a randomly chosen name. The data sample is so small and varied that no meaningful statistical pattern can be discerned. Furthermore, there are no other identifiers (e.g. social insurance number, phone number, or even partial values of such fields) to cross-reference to assist in re-identification. It would be safe to consider this data anonymized.
Conversely, suppose the data consists of 100 million rows of names. If the data is masked consistently, such that all instances of "Smith" are masked to "Baker", then a statistical analysis of the frequency of masked names can be used as an initial hint of the original names. If other identifiers are present, then those can be used to cross-reference and further refine the likely range of original identities. E.g. if the Occupation is known, if the Birth Date is present and, although masked, is known to have been masked within +/- 6 months, and if the Postal Code/ZIP is known to have been preserved so that the masked data remains useful to the intended user, then these identifiers further narrow the range of possible original identities.
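The frequency analysis described above can be sketched in a few lines of Python. The names and counts below are purely illustrative, not drawn from any real dataset; the point is that consistent masking preserves the frequency distribution, so ranking masked values by frequency and aligning them against public name-frequency statistics suggests the likely originals.

```python
from collections import Counter

# Consistently (deterministically) masked column: every "Smith" became
# the same masked value, so the frequency distribution is preserved.
masked_column = ["Baker"] * 500 + ["Quill"] * 300 + ["Tove"] * 120

# Publicly available name-frequency statistics (illustrative values).
public_frequencies = {"Smith": 500, "Jones": 300, "Lee": 120}

# Rank both sides by frequency and align them: the most common masked
# value most likely corresponds to the most common real surname.
masked_ranked = [name for name, _ in Counter(masked_column).most_common()]
public_ranked = sorted(public_frequencies, key=public_frequencies.get, reverse=True)

guesses = dict(zip(masked_ranked, public_ranked))
print(guesses)  # {'Baker': 'Smith', 'Quill': 'Jones', 'Tove': 'Lee'}
```

With only three names the alignment is exact; on real data the attacker obtains a ranked list of candidates rather than certainties, which is then refined using the other identifiers mentioned above.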
Therefore, the more identifiers that are present, and the more assumptions that can be made about them (such as statistical distributions, the masking ranges used, e.g. +/- 6 months for Birth Dates, or which fields were partially preserved, e.g. account or phone prefixes), the more likely it becomes that the original data can be re-identified. This does not necessarily mean it would be easy, but a determined and skilled analyst may be able to re-identify at least some of the data.
There have been many reports in the media of anonymized data having been re-identified. A recurring theme in these reports is that more than just a few unmasked (or poorly masked) identifiers were present to enable such statistical or cross-referencing analyses. Re-identification is often not achieved with 100% confidence, although in some cases it may come close.
Pseudonymization generally means replacing original sensitive data with apparently anonymous or unrelated masked data.
The masked data may appear randomized, realistic, and anonymous, but it may or may not actually withstand re-identification techniques such as those described above, which can recover at least some of the original data.
In addition to masking all identifiers as required by applicable data laws and regulations (e.g. GDPR and HIPAA) here are some additional suggested practices:
Non-deterministic masking means that a random value is chosen to mask each original value. E.g. The first instance of "Smith" could get masked to "Baker", but the next instance of "Smith" could get masked to "Jones", and the next to "Bunting", and so on. Furthermore, the next execution of the masking project would generate a different set of values altogether. Therefore, a statistical analysis is useless on an identifier (column) that has been masked in non-deterministic mode.
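Non-deterministic masking can be sketched as follows. The replacement pool and the use of `random.choice` are illustrative only; this is not DataVeil's internal implementation.

```python
import random

replacements = ["Baker", "Jones", "Bunting", "Parr"]  # illustrative pool

def mask_non_deterministic(value: str) -> str:
    # Every call picks a fresh random replacement, so two instances of
    # "Smith" will usually be masked to different values, and a re-run
    # of the masking project produces a different result set entirely.
    return random.choice(replacements)

masked = [mask_non_deterministic(n) for n in ["Smith", "Smith", "Smith"]]
# e.g. ['Jones', 'Baker', 'Jones'] -- frequency analysis learns nothing,
# because masked frequencies no longer mirror the original frequencies.
```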
Unfortunately, end-user requirements often demand consistently masked data, in which case deterministic mode masking will be required. However, please consider carefully which columns need to be masked consistently and which can be masked using non-deterministic mode.
If deterministic mode is required then consider using automatically generated secret deterministic seeds instead of fixed value seeds.
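One generic way to achieve deterministic masking under a secret, automatically generated seed is to key the mapping on an HMAC of the original value. This is a sketch of the general idea only; the function names and the replacement pool are hypothetical, and DataVeil's actual seed mechanism is not shown here.

```python
import hmac
import hashlib
import secrets

# A secret seed generated fresh for the masking project rather than a
# fixed value: without it, an attacker cannot reproduce the mapping by
# hashing candidate names themselves.
seed = secrets.token_bytes(32)

replacements = ["Baker", "Jones", "Bunting", "Parr"]  # illustrative pool

def mask_deterministic(value: str) -> str:
    # HMAC of the original value under the secret seed: the same input
    # always maps to the same output within this run (consistency is
    # preserved), but the mapping cannot be derived without the seed.
    digest = hmac.new(seed, value.encode(), hashlib.sha256).digest()
    return replacements[int.from_bytes(digest[:4], "big") % len(replacements)]

same = mask_deterministic("Smith") == mask_deterministic("Smith")  # True
```

Note that deterministic masking, even with a secret seed, still preserves value frequencies, so the statistical analysis described earlier remains possible; the secret seed only prevents the mapping itself from being reconstructed.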
The more of your data that you mask using non-deterministic mode, the more secure and the closer to anonymous your masked data will be.
Ordinarily, when one value in a foreign key relationship is masked, then the other(s) in the relationship are masked to the same value in order to maintain referential integrity.
DataVeil provides the ability to temporarily disable foreign key relationships during masking so that relationships between parent and child rows can be reassigned to different rows.
Care must be taken to ensure that altering such relationships does not unacceptably diminish the usefulness of the masked data or violate referential integrity constraints.
Please refer to the Relationship Obfuscation section in the Dependencies chapter for further details.
Limit Access to Masked Data
Although data masking can be used to protect sensitive data through pseudonymization, some risks generally remain, as discussed above.
An additional means of mitigating that risk is to share the masked data only with its intended users. In other words, if the masked data is intended for internal testing purposes, share it only with that group; do not publish the masked data on the internet for access by the general public.