Sunday, October 08, 2017

I need Data, but I mustn't have Data!

[ Tech ]
That's a strange title, but it describes a fairly common dilemma in software development these days: developers need access to realistic data to test their general algorithms, and to exceptional data (such as incorrect, "illegal", or otherwise impermissible values), but, on the other hand, they cannot be given access to really "real" data because of privacy and security concerns.

Normally the answer to the privacy and security concerns is to encrypt the data with an algorithm that can be reversed, so that "Fred Bloggs" is stored as "&*@#$G KJF^^S" but is decrypted back to "Fred Bloggs" every time. However, a lot of the software will still have to deal with the "Fred Bloggs" version for functions like mail composition, and for checking that someone isn't trying to attack the system (entering their name as "Fred ';drop table patient;", for example).

For development there's a very lucky freebie: "real" real data doesn't have to be used - as long as the data appears to be real, that's fine. So we can take "real" real data and transform it into something that developers can use ("fake" real data?), provided that the transformation cannot be reversed.

So here's a suggestion of how to anonymise your data [Please think hard about the way you implement this and, if you realise any flaws, please tell me!].

1. Determine the fields that are to be anonymised. Obviously things like names, addresses, phone numbers, and identifiers such as Social Security numbers have to be altered, but other, less obvious items can also be used to link back to individuals. For example, if you leave the claim number of a claim for medical treatment intact, then the recipient of that treatment could be located (the fact that such activity would be illegal doesn't matter - it's still possible).

2. Make some tables to store these data. For example, one table for the person, one for their address(es), another for phone(s), etc., all linked so that a phone number can be traced back to the correct person. Each table must also have a unique surrogate key (just an integer). A rough sketch of such metadata records appears after this list.

3. Each item in (2) above must also include a reference to the table(s) and field(s) where that piece of data is located in the real database.

4. When you have collected all the data that will need to be altered, make a complete copy of this metadata. This copy will be altered to be anonymous, and then used to replace the occurrences of the original data. Having both available allows a comparison to verify that the anonymisation has worked.

5. There are a number of types of data. Determine an anonymisation algorithm for each type.
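
As a rough illustration only, the metadata records described in (2) and (3) might look something like this if held as Python structures; all of the class and field names here are hypothetical, not taken from any particular project.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SourceRef:
    # Where this piece of data lives in the real database (see item 3).
    table: str
    column: str
    row_key: int

@dataclass
class PersonMeta:
    key: int                 # unique surrogate key (item 2)
    first_name: str
    last_name: str
    gender: str
    national_id: str
    country: str
    refs: List[SourceRef] = field(default_factory=list)

@dataclass
class PhoneMeta:
    key: int
    person_key: int          # links the phone number back to PersonMeta.key
    number: str
    refs: List[SourceRef] = field(default_factory=list)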

Overall plan. 
The way that I have successfully done this has been to use the data itself to supply substitute values. For example, for first names I chose a minimal offset equal to the number of instances of the most common first name. Then, each time I wanted to alter a first name, I obtained a random number, added the minimal offset to it, and used the first name from the record that many rows further down the table. If the search went past the end of the table I simply continued from the top. One note here: for some fields you need to ensure that the replacement values come from a record having the same gender as the record receiving them - someone called Bill Smith is unlikely to be undergoing a hysterectomy, for example (not impossible, but unlikely, and exceptions like this, while technically possible, will be very rare in real life).
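
Here is a minimal sketch of that substitution in Python, assuming the metadata is simply held in memory as a list of records like the PersonMeta sketch above; the function name and the exact wrap-around handling are my own illustration, not the original code.

import random

def replacement_first_name(records, index, min_offset):
    # Pick a first name for records[index] from another record of the same
    # gender, a random distance further down the table, wrapping around at
    # the end of the table.
    target = records[index]
    offset = min_offset + random.randint(1, len(records) - 1)
    for step in range(len(records)):
        candidate = records[(index + offset + step) % len(records)]
        if candidate is not target and candidate.gender == target.gender:
            return candidate.first_name
    return target.first_name  # no other record of this gender was found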

Now for some suggestions on dealing with particular types of data:
  • First Names. As mentioned above, select names of the same gender as the receiving record (this means that gender must be recorded in the metadata, but not altered).
  • Names. You may have to take national origin into account for both first and last names so that the test data "looks" real. Having someone with an obviously Indian name assigned an English birthplace is not unreasonable; Alaska, however, despite being nearer, is far less likely.
  • National IDs. In the USA the format is 999-99-9999; in the UK it's XX999999X. In either case, replace the three sections separately, picking the replacements from three different source records (see the sketch after this list).
  • Addresses. These can be complicated! The postal code has to match the country and the local area, and the street name has to match the country. For example, 90453 is the Postleitzahl for part of Nürnberg in Germany, but is (at the time of writing) not an assigned value in the US zip code system, so simply grabbing a number that looks correct can trigger warnings later when software tries to use it. Similarly, unless the street name and city are taken together from the same source record, you may need some sort of street lookup to make sure that the street actually exists in that city. House numbers, of course, can be changed, but should not be altered wildly if address checking is part of the software being tested.
  • References to other items. These need to remain contextually correct, but the reference identifiers must change. For example, as pointed out above, assigning a claim for a female-only procedure to a patient tagged with another gender should raise errors that you probably don't want to see. On the other hand, leaving the ID code of the claim unaltered offers a viewer an easy way of locating the original claimant, making all your work pointless!
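
A similar sketch for the National ID idea, assuming US-style IDs and the same in-memory records as before (again, only an illustration):

import random

def replacement_national_id(records, index, min_offset):
    # Build a new US-style ID (999-99-9999) by taking each of the three
    # sections from an independently chosen source record.
    parts = []
    for section in range(3):
        offset = min_offset + random.randint(1, len(records) - 1)
        donor = records[(index + offset) % len(records)]
        parts.append(donor.national_id.split("-")[section])
    return "-".join(parts)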

Example

Lastly, a description of how I actually did a project like this.
  1. This was for a system handling medical claims arising from auto insurance, so HIPAA was involved from the start. Also, overseas drivers were a possibility, so names, addresses, etc., had to be handled with care.
  2. The metadata included 
    1. Surrogate key
    2. First name
    3. Middle name(s)
    4. Last name
    5. Gender
    6. National ID (e.g. SSN)
    7. Address Street Number
    8. Address Street Name
    9. Address City
    10. Address County
    11. Address State
    12. Address Postal Code
    13. Address Country
    14. Claim Insurer (multiple, in child table)
    15. Claim Identifier (multiple, in same child table)
    16. Some other fields as well
  3. The metadata items were mapped back to their sources, so that a single record in the metadata would map to a varying number of fields in data tables, depending on the field in the metadata table. For example, National ID would only be mapped to one field in one table, whereas First Name and Last Name might appear in several other tables. 
  4. In addition, text within the claims tables might contain the name of the individual, so this also had to be searched and possibly altered (see the sketch after this list).
  5. The data above was extracted from all the basic tables of the database. The surrogate key here was available for linking an individual record (which is what this was) with a table of claim records.
  6. Once the metadata was assembled it was retained as a Source Copy, and copied as a Destination Copy, which would store the alterations.  
  7. Finally the data in the Destination Copy would be used to alter the data in the database that was to be used for development and sales purposes.
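
As a rough illustration of item 4, a simple whole-word search and replace in Python could look like the sketch below; this is only an example, not the original code.

import re

def replace_names_in_text(text, old_first, old_last, new_first, new_last):
    # Replace whole-word occurrences of the original names in free text
    # (e.g. claim notes) with the anonymised replacements.
    text = re.sub(rf"\b{re.escape(old_first)}\b", new_first, text)
    text = re.sub(rf"\b{re.escape(old_last)}\b", new_last, text)
    return text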

The pseudocode below uses the following variables:
  • StartRecord - Place in the metadata table to start. In the example it starts at the beginning.
  • RunLength - Number of records to process in this run.
  • RecordNr - Variable holding the surrogate key of the metadata table record being processed.
  • Offset - Number of records to skip down the metadata table before starting to search for an appropriate record from which to use data.


StartRecord = 0
RunLength = SystemDate.seconds                // batch size, taken from the clock
Wait(random number of seconds, seeded by part of the system time)
Offset = SystemDate.seconds                   // search offset, read again after the wait
For RecordNr from (StartRecord + 1) to (StartRecord + RunLength)
Begin
      ProcessRecord(RecordNr, Offset)
End
// Save RecordNr as StartRecord for the next batch.

Within ProcessRecord the code would search for the first record in the Source Table, after the record to be altered, where the Gender and the Address Country matched the record to be altered. The search would start a random number of records forward, as determined by the Offset value. One part of the located record was used to update the Destination Table, and the field was marked as changed in a log table. If the search reached the end of the Source Table without finding a Country/Gender match then it would resume at the start of the table, but only up to the record before the Source Record of the individual being altered.
The alterations made are described above: if the National ID was being altered then three progressive searches would be made, each returning one part of the ID. In this way the identity value was scrambled but still composed of valid parts.
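
A minimal Python sketch of what ProcessRecord might do for a single field, assuming the Source and Destination Copies are lists of records held in memory (the names here are hypothetical):

def process_record(source, destination, record_nr, offset, field_name):
    # Find the first record after (record_nr + offset), wrapping around at
    # the end of the Source Copy, whose gender and country match the record
    # being altered, and copy the named field into the Destination Copy.
    target = source[record_nr]
    n = len(source)
    for step in range(n):
        candidate = source[(record_nr + offset + step) % n]
        if candidate is target:
            continue
        if candidate.gender == target.gender and candidate.country == target.country:
            setattr(destination[record_nr], field_name, getattr(candidate, field_name))
            return True   # the caller records the change in the log table
    return False          # no match found; left for manual alteration later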

After processing one batch of records, the next batch would be processed using different basic values obtained from the system time.
Once this was completed for one small group of fields, the process was restarted for the next group of fields.

Finally, the log would be examined to locate any records with fields that had not been altered. These were altered by hand.

The randomising agent was the system time, which would not recur. Because no precise record of when the operation took place was kept, it would not be reasonable to expect someone to be able to work backwards from the transformed Destination Data to the Source Data. Yet the transformed data itself, when viewed, appeared entirely normal - to the extent that it caused a major panic when demonstrators first used it and believed that they were showing real data!

So, as I pointed out above, this process does take some time, but it yields very real-looking data that you can then use for demonstrations and for development purposes. As I also asked above, if you notice something missing or a flaw in the logic, please tell me (the comments would be a good place!), as leaving a known flaw for others to adopt in ignorance would be bad.

TTFN
