CompTIA CASP+ CAS-004 – Data Security (Domain 1) Part 3
February 11, 2023

5. Deidentification (OBJ 1.4)

In dealing with data security, it’s important to protect the sensitive data and information that you collect from your users or customers. One way to do this is to use a process known as deidentification, which allows you to maintain the data you collected while removing the ability for your organization and others to uniquely identify an individual. Deidentification allows you to remove identifying information from data before it’s distributed to people within or external to your organization. For example, my company keeps a record of how many of our students pass or fail a certain exam. That way, we know how effective our training is at helping students earn their certifications. Now, for my use case, I don’t need to know that John Smith has failed and Jane Doe has passed. Instead, I need to know overall how many students have passed and how many have failed.

This is one form of deidentification, because I could strip out all of the students’ names and instead rely on a simple label like student one or student two. This is a very simple form of deidentification, but there are several other ways to conduct this, too. When I’m talking about deidentification, I’m talking about any method or technology that removes identifying information from data before it’s distributed. The real benefit of deidentification is the ability to take data that may be protected by privacy regulations, like medical data or personally identifiable information, and make it usable again for other purposes by removing the personally identifiable pieces of information from within that data. This doesn’t violate anybody’s privacy because we have first deidentified that data before distributing it.

Now, deidentification is often implemented as part of your database design, but there are other methods that can be used, too, such as data masking, tokenization, and aggregation and banding. Data masking is a type of deidentification method that substitutes a generic or placeholder label for real data while preserving the structure or format of that original data. For example, if you gave me a list of all of your credit cards, I could substitute in a generic or placeholder label, such as 16 ones if it was a Visa card or 16 twos if it was a Mastercard. This works because Visa and Mastercard numbers are 16 digits in length. So by replacing your actual credit card numbers with a series of ones or a series of twos before I stored them in my database, I’m removing all of your sensitive data by masking it. Nobody would be able to identify those credit cards as yours because they don’t have the credit card number itself, just a series of ones or twos.

Another example of data masking might be if I have a database with all of my customers in it, and one of the fields is their Social Security number. We all know that a Social Security number in the United States contains exactly nine digits in the format 123-45-6789, so I could configure my database to store your real Social Security number. But if any of my employees pulled up a customer record, it would instead display the Social Security number for every customer as 123-45-6789, because my employees don’t have a valid business reason to view my customers’ Social Security numbers.
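To make that concrete, here’s a minimal Python sketch of display-layer masking, under my own assumptions: the record layout and function names are hypothetical, and the placeholder follows the 123-45-6789 example above.

```python
# Sketch of display-layer data masking (record and field names are
# hypothetical). The real SSN stays in storage; any employee-facing
# view shows a fixed placeholder that preserves the NNN-NN-NNNN format.

SSN_PLACEHOLDER = "123-45-6789"

customer_record = {"name": "Jane Doe", "ssn": "987-65-4321"}

def masked_view(record: dict) -> dict:
    """Return a copy of the record that is safe to show to employees."""
    view = dict(record)
    view["ssn"] = SSN_PLACEHOLDER  # same format, no real data
    return view

print(masked_view(customer_record))
# The stored record is untouched; only the displayed view is masked.
```

The key design point is that masking happens at the presentation layer, so the underlying record keeps its real value for processes that legitimately need it.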

This form of data masking deidentifies your Social Security number within your customer record when it’s viewed by an employee. Remember, with data masking, the data maintains its original format and structure, but it doesn’t actually display any personal information; instead, it shows a generic or placeholder value. The next deidentification mechanism we have is known as tokenization. Tokenization is a deidentification method where a unique token is substituted in for the real data. When tokenization is utilized, this allows the data to be deidentified, but it is also a reversible function, so you can properly reidentify a record for other business cases as needed.

So again, let’s say I collected all of my students’ Social Security numbers. Instead of replacing them with a generic value like 123-45-6789, I might create a random nine-digit student ID number, save it into the Social Security field of my student records, and use that to reference all of my students. These new student numbers are tokens, and they can refer to a specific database record that contains the student’s Social Security number in a separate table or database. Or maybe I have a master list offline in my safe that has all the student IDs in one column and all the Social Security numbers in a second column. Now I can match the token (the student ID) back to the true data (the Social Security number). Therefore, tokenization is considered a reversible process. Tokenization does have its benefits, though.
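The student ID scenario can be sketched in a few lines of Python. This is a simplified illustration, not a production design: the vault here is an in-memory dictionary standing in for that separate, protected table or offline master list.

```python
import secrets

# Sketch of tokenization (names are hypothetical). Each student gets a
# random nine-digit ID; the token-to-SSN mapping lives in a separate,
# protected store, which is what makes the process reversible.

token_vault = {}  # student_id -> real SSN (keep this store separate!)

def tokenize(ssn: str) -> str:
    """Replace an SSN with a random nine-digit student ID token."""
    student_id = "".join(secrets.choice("0123456789") for _ in range(9))
    token_vault[student_id] = ssn
    return student_id

def detokenize(student_id: str) -> str:
    """Reverse the process for an authorized business need."""
    return token_vault[student_id]

student_id = tokenize("987-65-4321")
assert detokenize(student_id) == "987-65-4321"  # reversible by design
```

Note the contrast with masking: because the vault preserves the mapping, an authorized user can always get back to the original value.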

If anyone on my staff went in to look at your student record, they’re not going to see your Social Security number. They’ll only see your student ID. Then, if they have a real business case where they need to see your Social Security number, we can use the table that contains the token and the protected data to reverse the process and retrieve that Social Security number. The next deidentification method we have is known as aggregation and banding. Aggregation and banding occurs when you deidentify people by gathering the data and generalizing it to protect the individuals involved. For example, we could use aggregation and banding when we’re conducting a medical trial. Instead of identifying the participants by person or subject number in that study, we would instead report the results as an aggregated group.

For example, we might report that 90 out of 100 people who participated in this trial didn’t have any side effects from that medication. By aggregating this group from a single participant up to a group of 100 people, we can’t easily identify whether or not a single person did or did not have that side effect. Therefore, they are properly deidentified. Even if we knew that you didn’t have side effects, we still aren’t going to know which number you were in the study, only that you are one of the 90 people who didn’t have side effects. Now, all of these methods of deidentification are really a form of obfuscation of the personal data that’s going to be stored in the database.
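The aggregation step itself is simple to sketch. The trial data below is invented purely for illustration; the point is that only group counts, never individual records, leave the system.

```python
from collections import Counter

# Sketch of aggregation and banding: individual trial records are
# collapsed into group counts, so no single participant is identifiable
# in the published result. The data here is entirely made up.

participants = [
    {"subject": i, "side_effects": i % 10 == 0}  # every 10th subject reacts
    for i in range(1, 101)
]

counts = Counter(
    "had side effects" if p["side_effects"] else "no side effects"
    for p in participants
)

print(counts)  # only the aggregate is reported, e.g. 90 with no side effects
```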

But if you want to remove the data completely from the record, you would instead perform a scrubbing of your database. Data scrubbing refers to the process of amending or removing data in a database. For example, if I went through my entire database and overwrote every student’s Social Security number with the value 123-45-6789, that would be considered data scrubbing. Alternatively, you could scrub an entire record by removing or deleting it from your database, but then you’re going to lose all of the data, not just the protected or private data, like you would if you scrubbed a single field inside the record. Deidentification is also known by the term anonymization. Anonymization is a data processing technique that removes or modifies personally identifiable information so that the resulting data set cannot be associated with any one individual.
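The field-level scrubbing described above might look like this in Python. The records and helper name are hypothetical; the takeaway is that overwriting one field preserves the rest of the record, while deleting records would not.

```python
# Sketch of data scrubbing: overwrite the sensitive field in every
# record, rather than deleting whole records (which loses all the data,
# not just the protected data). Records here are hypothetical.

students = [
    {"name": "Student 1", "ssn": "111-11-1111", "passed": True},
    {"name": "Student 2", "ssn": "222-22-2222", "passed": False},
]

def scrub_field(records: list, field: str, placeholder: str = "123-45-6789"):
    """Amend the given field in place across every record."""
    for record in records:
        record[field] = placeholder

scrub_field(students, "ssn")
# The pass/fail data survives; only the protected field was scrubbed.
```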

The challenge with deidentification and anonymization is ensuring that the data remains anonymous. Sometimes people and organizations attempt to reidentify people from an anonymized data set. But wait, how are they able to do that? Well, it comes down to the quality of your anonymized data set, especially if you’re using techniques like aggregation and banding. Let me give you a real-world example from my own company. Recently, we did a survey of our employees to see how our company was doing from the employees’ perspective. So we sent out a survey to all of our employees, and we told them not to include their names because we wanted it to be anonymous.

This way, our employees would feel comfortable really sharing how they felt, and we could get some honest feedback. So we asked a bunch of questions. We asked, how do you like working for Dion Training? Do you feel your pay is competitive for your position? Do you enjoy your job? Do you enjoy helping our students? We had ten or 15 of these types of questions. Then we asked a few demographic questions at the end of the survey. How old are you? What’s your gender? Are you married or are you single? Basic stuff like that. Okay, this all seems pretty innocuous, right? It’s pretty generic and likely the same type of thing you’ve been asked at your own organization when you did a survey. We didn’t ask for an employee ID, we didn’t ask for a Social Security number, and we didn’t ask for their name.

Nothing that is directly personally identifiable. So a week goes by, and we get back all the results from the survey. The leadership team runs through the results, and we take them through a randomizer first so we don’t know who submitted what, and then we start reading. The first one gives us an overall grade of five stars from that employee. The next one, four and a half stars. The third one, one star. Uh-oh, it looks like we have an upset employee. So I read the comments, and they say they think Jason is the worst boss ever. Now, I’m a bit concerned here, because I think that my staff really loves their jobs and they like me. So now I want to figure out, who is this person who thinks I’m a one-star boss? Am I able to reidentify them? Well, I look at the last few questions, which put people into different bands based on age, gender, or marital status.

This person who left a one-star rating is a married woman between the ages of 35 and 40 years old. Now, if I were a huge company with thousands of employees, this might be 50 or 100 people, but we are a relatively small company. We usually have between ten and 20 employees at any given time. At the time of this survey, we only had one person who matched the description of a married woman between 35 and 40 years old. So guess who thinks Jason is the worst boss ever? None other than Tamara. For those of you who don’t know who Tamara is, she likes to think she’s quite funny, and she’s often really disruptive at our staff meetings, too.

Tamara was actually the first employee hired at Dion Training, and she often thinks surveys are really silly and just likes to throw out ridiculous comments like this one. After all, she figures she is never going to get fired because Tamara, well, she’s my wife. And so the point here is that even though you do aggregation and banding, if you don’t have a large enough group, it’s not going to actually give you the anonymization you want. It can be really easy to reidentify people within the data set. Now, I made up the end of that as a little story to prove my point, because we really don’t ask those demographic questions.

The reason is that I value the anonymization of data, and if I ask those types of questions, like age, gender, or marital status, at the end of my survey, I’m going to be able to reidentify my employees, because we have such a small data pool across my staff. Even if I ask the age question using ten-year bands like 20 to 30 years old, 30 to 40 years old, and 40 to 50 years old, and then ask whether you’re male or female and whether you’re married or not, I’m going to be able to identify pretty much every person on my staff just using those details. That eliminates all the benefits of deidentification, because we could easily reidentify each employee.

Remember, when it comes to reidentification, this is an attack that combines deidentified data sets with other data sources to discover how secure the deidentification method actually is. So if we used that system in our company, it would not be secure. But if I had used that same system in my last position, where I had 400 employees when I was serving as an IT director, it would have been far more secure, because there would have been many more people who might have been a married woman between 30 and 40 years old. So keep that in mind when you’re building out your deidentification systems, and think things through, because just because something works at a large company, it doesn’t always work at a small company. And the same can hold true in reverse.
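One practical way to check for this risk before releasing survey data is to count how many respondents share each demographic combination; any combination matching only one person can likely be reidentified. This idea is commonly formalized as k-anonymity. Below is a hedged sketch with invented data and hypothetical field names:

```python
from collections import Counter

# Sketch of a reidentification risk check (a simple k-anonymity test).
# Count how many respondents share each demographic combination; any
# group smaller than k is a reidentification risk. Data is invented.

responses = [
    {"age_band": "30-40", "gender": "F", "married": True},
    {"age_band": "30-40", "gender": "M", "married": True},
    {"age_band": "30-40", "gender": "M", "married": True},
    {"age_band": "20-30", "gender": "M", "married": False},
]

def risky_groups(rows: list, k: int = 2) -> list:
    """Return demographic combinations shared by fewer than k people."""
    groups = Counter((r["age_band"], r["gender"], r["married"]) for r in rows)
    return [combo for combo, size in groups.items() if size < k]

print(risky_groups(responses))
# Two combinations each match exactly one person, so those respondents
# could be reidentified, just like in the survey story above.
```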

6. Data Encryption (OBJ 1.4)

When it comes to data security, one of the most tried and true methods is to use data encryption. Now, in this lesson, we’re not going to do a full review of all the different encryption types, because you should already know that from your Security+ studies. Instead, we’re going to focus our efforts on the concepts surrounding encryption, as opposed to the specific technical implementations and algorithms like DES or AES. We’re going to first talk about the two different types of data in your system, unencrypted and encrypted, and then the three different data states that information may pass through during its creation, usage, and storage. First, it’s important to remember that data in your systems can exist as either unencrypted or encrypted data.

Now, unencrypted data is any data that remains in an easily viewable or accessible format. This is also known as cleartext or plaintext data. This data is stored, transmitted, or processed in an unprotected format that anyone can view and read. For example, if I’m transmitting my username and password over the network in an unencrypted format, such as when I log into a telnet server, that data is considered open and available to anybody who happens to be capturing the packets as they cross that network. To secure your data in any of its states, such as when the data is at rest, in motion, or in processing, you need to encrypt that data. Now, data encryption is a security method where information is encoded and can only be accessed or decrypted by a user with the correct security key. There are many different ways to encrypt and decrypt data.

But for now, just remember that encrypted data is scrambled up and unreadable to anybody without the proper encryption or decryption key. This scrambled-up data is known simply as encrypted data, or ciphertext. When it comes to encryption, remember it is a form of risk mitigation for the access controls used in your system. If an attacker is able to circumvent your access controls and gain access to a file, but that file is encrypted, well, guess what? That attacker still can’t read it. This is what makes encryption a great risk mitigation to use in order to protect the confidentiality of your data, even if other things fail. There are three different data states that data and information can continually move between. When we talk about a data state, we’re talking about the location of the data within the processing system.

Data can really exist in only one of three places: you can have data at rest, data in motion, or data in processing. First, we have data at rest. Data at rest is any data that is stored in memory, on a hard drive, or on a storage device. For example, if I have data simply sitting on an external hard drive, that data could be vulnerable. To protect that data, I might want to use BitLocker to perform a full disk encryption of that external hard drive, or I might use file-level encryption to protect just the specific files and folders that I need to protect. Either way, if I encrypt the critical data that’s stored on that hard drive, I can ensure that no one can read the contents of those files unless they have the decryption key, and therefore I’m ensuring confidentiality. Now, there are many different types of encryption used to support the confidentiality of data at rest.
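To illustrate the core idea that stored ciphertext is unreadable without the key, here is a deliberately simplified toy example using a one-time-pad XOR. This is NOT what BitLocker or AES does under the hood and is not a production design; real data-at-rest protection uses vetted tools and algorithms like those named above.

```python
import secrets

# Toy illustration of data-at-rest encryption (a one-time pad, NOT AES).
# Real systems use tools like BitLocker or AES file encryption; the point
# here is only that ciphertext is unreadable without the decryption key.

plaintext = b"Jane Doe: PASSED"

key = secrets.token_bytes(len(plaintext))            # random key, same length
ciphertext = bytes(p ^ k for p, k in zip(plaintext, key))
recovered = bytes(c ^ k for c, k in zip(ciphertext, key))

assert recovered == plaintext   # the correct key recovers the data
print(ciphertext)               # the stored form looks like random bytes
```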

This includes full disk encryption, folder encryption, file encryption, and database encryption. The second data state we have is known as data in transit, or data in motion. Data in motion is any data that’s currently moving from one computer to another over the network, or from one part of a computer system to another part within the same machine. This could be from the hard disk to the memory, or from the memory to the processor. All of these are examples of data in transit or data in motion. Now, let’s take another example. Let’s say I want to access my bank’s web server and I need to log in there.

I need to send them data, like my username and password, in order to be authenticated by their systems. So to secure the communication path between my laptop and their web server, I need to rely on some form of transport encryption protocol. In the example of logging into a bank account over their website, I’m going to rely on TLS (Transport Layer Security) or its now-deprecated predecessor, SSL (Secure Sockets Layer). Both of these protect the transport layer of a web application, though modern sites use TLS. Now, if I’m connecting my laptop back to my corporate network over a VPN connection, I might use something like IPsec or the Layer 2 Tunneling Protocol. If I’m trying to secure the wireless connection between my laptop and my home wireless access point, I might use WPA2 with AES encryption as my algorithm.
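In code, the client side of that TLS protection is often just a few lines. The sketch below uses Python’s standard-library ssl module to build a client context with its secure defaults; no actual network connection is made, and the hostname in the comment is a placeholder.

```python
import ssl

# A client-side TLS context using Python's stdlib ssl module. No network
# connection is made here; this just shows the secure defaults a client
# applies before wrapping a TCP socket to a server, such as a bank's site.

context = ssl.create_default_context()

print(context.verify_mode == ssl.CERT_REQUIRED)  # server cert is required
print(context.check_hostname)                    # hostname is verified
# context.wrap_socket(sock, server_hostname="bank.example.com") would
# then perform the TLS handshake over an existing TCP socket.
```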

Regardless of the exact technology in use, our goal is always the same: we want to protect the data and maintain its confidentiality as it’s moving from one system, like my laptop, to another system, like the bank’s website, by adding a layer of encryption to it. The third data state we have is known as data in use, or data in processing. Data in use is any data that has been read into memory or is currently inside the processor and is being worked on or manipulated. This is active data that is non-persistent in its digital state. Typically, it’s being held in Random Access Memory (RAM) inside your system, or it’s in the CPU’s caches or registers. This is data that is currently being utilized by the computer and its central processing unit.

So remember, the data state for any piece of data can be either at rest, in motion, or in processing at any given time, and these data states change as the data is created, processed, and stored within a given system or network. Let me give you a quick example to tie it all together. Let’s pretend I have a file on my hard drive with a list of all my students who have passed their certification exams. Now, I want to maintain the confidentiality of that file, so I might encrypt it using the AES algorithm with a long, strong symmetric encryption key. This file is now protected when it’s at rest, stored in this encrypted ciphertext format. But there is one person on my team who needs to read the contents of that file to determine if Jane Doe has really passed her certification exam or not. So I enter the decryption key, and that file is converted back to plaintext.

Now, I find the line that has Jane Doe’s name and status, and I want to send this securely to a team member. I could save this line to a file, encrypt that file, and then send the file over, but that’s a bit cumbersome. So instead, I’m going to open a direct message with them using Slack. Now, I’m logged into the Slack website using the HTTPS protocol, so the data being sent from my computer to the Slack server is encrypted using TLS, ensuring that the data is protected as it becomes data in transit or data in motion. Once that data is received by my team member, they could put it into a file and encrypt it, making it data at rest again. Or they can simply delete the message, since they now have the information they needed. All the while, when I accessed that file on my computer or began to send that information over the network, that data moved from data at rest to data in transit.

And between those two states, the data was also in use, because my CPU had to perform operations on that data as part of the encryption and decryption process, as well as the transmitting and receiving process. To help protect that data while it’s being processed as data in use, AMD and Intel processors include secure processing mechanisms with encryption and integrity checks as appropriate. So, as you can see, data is not just in one state. It moves constantly between these different states, from data at rest to data in transit to data in use, and you must consider how it’s going to be protected during each of these data states.
