Data Classification: Definition, Types, Examples & Vendors

In general terms, data classification is the process of organizing data into categories so that it is easier to protect and to use. The purpose of classification is to make your data easy to locate and retrieve without having to inspect it again each time. Three main fields rely heavily on data classification:

  • Data security;
  • Risk management;
  • Compliance.

As the definition suggests, data classification makes data easy to find and track through tagging (i.e. labels stored in metadata properties). The same process also includes finding and deleting duplicate data, which saves both storage costs and backup time. The entire process may sound complicated, but an organization’s leaders still have to understand it properly in order to make correct data-related decisions.
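As a rough illustration of the duplicate-finding part, the sketch below groups files by a hash of their contents. It assumes the files are already loaded as bytes; the function name `find_duplicates` and the sample file names are hypothetical, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for name, content in files.items():
        # Identical contents always produce the same digest.
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    # Keep only the groups that contain more than one file.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical sample data: two of the three files share the same content.
sample = {
    "report.docx": b"quarterly numbers",
    "report_copy.docx": b"quarterly numbers",
    "notes.txt": b"meeting notes",
}
duplicates = find_duplicates(sample)
print(duplicates)
```

Each group that comes back is a candidate for keeping one copy and deleting the rest, ideally after a human or policy check.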

Data classification types

Data classification uses a variety of labels to describe a piece of information based on its type, integrity, access permissions, and content. It’s common to apply different security measures depending on the classification results, with the data’s importance and/or confidentiality being one of the deciding parameters.

As for data classification methods, three are generally considered to be the industry standard:

  • Context-based classification looks at indirect indicators of the information’s sensitivity, including location, creator, application, etc.
  • User-defined classification relies entirely on manual selection for each document, so it depends heavily on the end user’s discretion and knowledge to flag documents with the appropriate sensitivity.
  • Content-based classification automatically inspects files’ contents to determine their importance.

There’s no universally right or wrong choice among these types of data classification: each of them can fit your company really well or not fit at all.
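To make the context-based approach a little more concrete, here is a minimal sketch that assigns a sensitivity level from indirect indicators only, without opening the file. The `FileContext` fields mirror the indicators named above (location, creator, application); the specific paths, creator name, and the "high"/"medium"/"low" levels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileContext:
    location: str     # where the file is stored
    creator: str      # user or system that produced it
    application: str  # application that produced it


def classify_by_context(ctx: FileContext) -> str:
    """Return a sensitivity level using only contextual indicators."""
    # Hypothetical rules: finance folders and HR exports are treated as sensitive.
    if ctx.location.startswith("/finance/") or ctx.creator == "hr-system":
        return "high"
    if ctx.location.startswith("/public/"):
        return "low"
    # Anything the rules do not recognise gets a cautious middle level.
    return "medium"


print(classify_by_context(FileContext("/finance/q3.xlsx", "alice", "excel")))
```

A real rule set would be driven by your own classification policy rather than hard-coded conditions.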

Here’s an example of a company applying one of the three confidentiality labels to their data:

  • Public data;
  • Private data;
  • Restricted data.

In this example, public data needs little to no protection and can be open to everyone, while restricted data covers the most sensitive files you have and calls for the strictest security measures you can apply.
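One way to picture such a three-tier scheme is as an ordered set of labels, each tied to handling rules. The sketch below is only an illustration: the label names come from the example above, while the handling values (`encrypt_at_rest`, `access`) are assumptions and not a recommendation.

```python
from enum import IntEnum


class Label(IntEnum):
    # A higher value means more sensitive data.
    PUBLIC = 1
    PRIVATE = 2
    RESTRICTED = 3


# Hypothetical handling rules attached to each label.
HANDLING = {
    Label.PUBLIC: {"encrypt_at_rest": False, "access": "everyone"},
    Label.PRIVATE: {"encrypt_at_rest": True, "access": "employees"},
    Label.RESTRICTED: {"encrypt_at_rest": True, "access": "named individuals"},
}


def strictest(labels: list[Label]) -> Label:
    """When several labels apply to one file, the most sensitive one wins."""
    return max(labels)


label = strictest([Label.PUBLIC, Label.RESTRICTED, Label.PRIVATE])
print(label.name, HANDLING[label])
```

Ordering the labels makes it easy to resolve conflicts, for instance when one rule marks a document as public and another marks it as restricted.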

Other data classification examples with additional levels also exist, but a three-tier scheme like this one is often used as the groundwork on which companies build their own classification framework. And, of course, performing the classification is just one part of the process: you have to follow up with the appropriate security measures and/or solutions to call the entire classification process successful and to protect your most important data. Such follow-on actions usually include moving, archiving, encrypting, or outright deleting the data.

Data classification process

If you’re wondering how to classify data, you’ll need some sort of a base to work from. Data classification can become complicated and cumbersome very quickly, and while automating the process helps a lot, the company itself must still perform a variety of operations for the entire process to work properly, including:

  • Defining the criteria and/or categories that will be used throughout the data classification process;
  • Implementing security measures based on the results of the classification process;
  • Maintaining proper data classification protocols by outlining the responsibilities of the company’s employees.

This process should provide the company with an operational data classification framework to work with. Each category should also include additional information about security considerations, data types, and the rules that apply to the operations that can be performed on that data (storage, retrieval, transmission, and so on).

Data classification and compliance

Some compliance regulations also require a company, in one way or another, to implement data classification. A good example of such a regulation is GDPR: if your company handles the personal data of people in the EU in any way, you have to know what that data is, where it is stored, and how to protect it with appropriate security measures.

Compliance regulations like GDPR also often demand much stricter security measures for specific data categories. For example, GDPR prohibits processing data that reveals philosophical beliefs, racial or ethnic origin, or political opinions unless one of its listed exceptions (such as the data subject’s explicit consent) applies. A properly performed classification procedure should alleviate a lot of the risk that comes with such specific categories, lessening the chances of a company running into compliance issues and ultimately paying for any mistakes.

Data classification general steps

Specific companies may approach data classification differently, but if you don’t know where to begin, you can always use the following general three-step recommendation.

  1. Understand where your data is located and which regulations your organization has to comply with. This is a good first step in a process as complicated as data classification.
  2. Create a classification policy. This is the top priority for any company that doesn’t have one, since it’s the set of rules that your classification process will rely on.
  3. Begin the classification process. It can safely start as soon as you know where your data is and have a policy for what you should do with it.

It would be unfair to say that data classification only makes everything easier to find, and that’s it. Today’s enterprises often operate with massive amounts of data, and working out what is where in such a big data lake is much easier and faster if the data in question is already classified.

5 steps to data classification

It’s incredibly hard to have a proper sensitive data handling system without a correct data classification framework in place. However, there are also many examples of companies that can’t find the right approach to their data classification system and either make it too complicated or render it useless. Here are five general steps that you should follow for a successful classification system:

  1. Risk assessment. You need a clear understanding of all confidentiality and privacy requirements before you begin.
  2. Classification policy development. A policy that is comprehensive without being overcomplicated is another big step towards a decent data classification system.
  3. Data categorization. Understanding your data types and how important they might be is also strongly recommended before starting.
  4. Data location discovery, identification and classification. The main part of the process is the actual data discovery, followed by identification and classification.
  5. Security measures and maintenance. Applying appropriate security measures and updating them when necessary is the last significant part of the data classification system.

Data classification examples

One of the most common data classification examples relies on RegEx (regular expressions), a popular way of analyzing strings by searching them for pre-set patterns. For example, if we want to find all of the VISA credit card numbers in a database, the pattern would look like this:

\b(?<![:$._’-])(4\d{3}[ -]\d{4}[ -]\d{4}[ -]\d{4}\b|4\d{12}(?:\d{3})?)\b

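It might look complicated, but it’s really not that hard to understand. This pattern looks for a number that starts with the digit 4 and is either 16 digits split into four groups of four by spaces or “-” symbols, or an unbroken run of 13 or 16 digits. The check at the start of the pattern skips matches that directly follow certain punctuation characters, such as “:” or “$”.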
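For readers who want to try this out, here is a minimal Python sketch. It deliberately uses a simplified pattern rather than the exact one above (16 digits only, with a consistent optional separator), and it adds a Luhn checksum to weed out random digit runs. The checksum step is an addition that is not part of the original pattern; the function names and the sample text are hypothetical.

```python
import re

# Simplified stand-in for the pattern above: 16 digits starting with 4,
# optionally split into groups of four by the same separator (space or hyphen).
VISA_PATTERN = re.compile(r"(?<!\d)4\d{3}([ -]?)\d{4}\1\d{4}\1\d{4}(?!\d)")


def luhn_valid(number: str) -> bool:
    """Luhn checksum, the standard sanity check for payment card numbers."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_visa_numbers(text: str) -> list[str]:
    """Return pattern matches that also pass the Luhn check."""
    return [m.group(0) for m in VISA_PATTERN.finditer(text) if luhn_valid(m.group(0))]


# 4111 1111 1111 1111 is a widely used test card number, not a real card.
text = "Paid with 4111-1111-1111-1111, ref 4111-1111-1111-1112, order 41111111."
print(find_visa_numbers(text))
```

The second number in the sample text matches the pattern but fails the checksum, which shows why pattern matching alone tends to produce false positives.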

RegEx can also search for many other data types, but some of them need additional clarification or validation. For example, RegEx can find email addresses but can’t tell the difference between a personal email and a business one. In this case, the results of the identification process have to be checked against some sort of dictionary.

That’s not to say that RegEx is the only option when it comes to data classification examples. There are many more sophisticated solutions that can look into metadata, permissions, user activity, and so on, depending on what you need to find.

Machine learning is also handy in this case, making it possible for an algorithm to recognise legal documents after it has been trained on a set of such documents.
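The email case could be sketched as follows: a pattern finds the addresses, and a small dictionary of free mail domains decides whether each one looks personal or business. The email pattern is intentionally loose, and the `PERSONAL_DOMAINS` set is a hypothetical, incomplete list used only for illustration.

```python
import re

# Loose email pattern; real-world address validation is more involved.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)")

# Hypothetical dictionary of consumer mail providers.
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com"}


def classify_emails(text: str) -> dict[str, str]:
    """Map every address found in the text to 'personal' or 'business'."""
    result = {}
    for match in EMAIL_PATTERN.finditer(text):
        domain = match.group(1).lower()
        result[match.group(0)] = "personal" if domain in PERSONAL_DOMAINS else "business"
    return result


text = "Contact jane.doe@gmail.com or sales@example-corp.com."
print(classify_emails(text))
```

The pattern does the finding, while the dictionary supplies the meaning that the pattern alone cannot.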

Data classification policy

One of the prime purposes of a data classification policy is to define who is responsible for the process in question. It can be someone responsible for the data correctness, the information creators, or subject matter experts.

Your classification policy is basically your data classification standard: it specifies how classification is done in the first place. A policy should also define further specifics of the data classification process, including the time periods between subsequent classifications, what types of data are classified, how the data is classified (the tool that performs data classification), and so on. It’s also important to remember that a classification policy remains a part of the general information security policy, the one that specifies the means of protecting sensitive data in the first place.

A few questions to consider when forming a data classification standard:

  • Who is responsible for the data being accurate and complete?
  • Who is the creator/owner of this information?
  • Is this information subject to any compliance regulations? What are the consequences of non-compliance in that case?
  • Which part of the organization has the most information about the context and/or content of this specific data?
  • What is the storage location of this data?

Data classification methods of scaling

At some point you won’t be working with just a basic set of rules anymore and this is where various data classification methods come in to help you make the scanning more efficient.

One way of doing it is to leverage the metadata of already scanned files to further increase the accuracy of the subsequent content search. If you can filter out the files that you are not interested in based on metadata, then you are saving precious time by not even sending them for content classification.
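A metadata pre-filter of this kind might look like the sketch below. The `FileMeta` fields, the skipped extensions, and the size limit are all assumptions chosen for illustration; which files are safe to skip depends entirely on your own environment.

```python
from dataclasses import dataclass


@dataclass
class FileMeta:
    path: str
    extension: str
    size_bytes: int


# Hypothetical rules: skip file types and sizes that the content scanner
# is not expected to get anything useful out of.
SKIP_EXTENSIONS = {".iso", ".mp4", ".exe"}
MAX_SIZE_BYTES = 50 * 1024 * 1024  # 50 MiB


def needs_content_scan(meta: FileMeta) -> bool:
    """Decide from metadata alone whether a file is worth a content scan."""
    if meta.extension.lower() in SKIP_EXTENSIONS:
        return False
    return meta.size_bytes <= MAX_SIZE_BYTES


files = [
    FileMeta("docs/contract.pdf", ".pdf", 120_000),
    FileMeta("media/intro.mp4", ".mp4", 9_000_000),
]
to_scan = [f.path for f in files if needs_content_scan(f)]
print(to_scan)
```

Only the paths that survive the filter are handed to the slower content-based classifier.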

Another tip is to scan incrementally instead of scanning everything in one go. This gives you faster, more agile feedback on whether your rules and logic are accurate.
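One simple reading of incremental scanning is to process files in small batches and review the results after each batch before continuing. The sketch below shows only the batching part; the `batches` helper and the file paths are hypothetical, and the actual scan is left as a placeholder.

```python
from itertools import islice
from typing import Iterable, Iterator


def batches(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


paths = [f"share/file_{i}.txt" for i in range(5)]
for number, batch in enumerate(batches(paths, 2), start=1):
    # Placeholder: run the classification rules on this batch, then review
    # the results before moving on to the next batch.
    print(f"batch {number}: {batch}")
```

If the rules turn out to be wrong, you find out after the first batch rather than after a full scan of the repository.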

More complicated scaling techniques involve various modern technologies like machine learning, comprehensive audit, permission logging, and so on.

Myths surrounding data classification

Surprisingly enough, quite a lot of people think that data classification is unnecessary or more trouble than it’s worth. Here are the top three myths about data classification and why they’re wrong:

  1. It’s extremely complicated. Data classification projects do get overcomplicated, but a lot of the time the blame for that lies with the creators of the scheme. You don’t want a large number of categories, because that just makes classification harder for everyone. This is why the general recommendation is to start with only three categories and to add more only after careful consideration of whether you really need them.
  2. It’s just more bureaucracy with no reason for it. Quite the contrary, actually: data classification is one of the ways to make data protection that much simpler. It also allows for better resource allocation, helps with prioritizing protection measures, and spreads understanding of exactly which parts of your data matter most.
  3. It takes a long time for the data classification efforts to become valuable. Automated classification usually begins paying off from the first day by bringing order to your data, whether it works on a context or a content basis. At the same time, well-ordered data makes it easier to implement a correct policy, and the automation effort as a whole improves security in a lot of ways.

Data classification vendors

The global data classification market is quite vast and includes a lot of companies, such as:

  • Dataguise (US),
  • Titus (Canada),
  • Google (US),
  • Innovative Routines International (IRI),
  • SoftWorks AI (US),
  • AWS (US),
  • Clearswift (UK),
  • PKWARE (US),
  • Microsoft (US),
  • OpenText (Canada),
  • Cipherpoint (Australia),
  • MinerEye (Israel),
  • IBM (US),
  • Boldon James (England),
  • Forcepoint (US),
  • Varonis (US),
  • Informatica (US),
  • Spirion (US),
  • Janusnet (Australia),
  • Digital Guardian (US),
  • Seclore (US),
  • Netwrix Corporation (US),
  • GTB Technologies (US),
  • Sienna Group (US),
  • Expert TechSource (India),
  • Symantec (US).

All of these companies have adopted different growth strategies, including M&A, collaborations, product launches, partnerships, and other means of expanding their influence over the market. As prominent examples, we’ll look into Google and IBM a bit further.

Google is one of the leaders in the data classification market, relying heavily on organic growth and launching a lot of next-gen products. One of the company’s prime investment targets is its R&D department, which develops new solutions for cloud, search, machine learning, advertising, and so on. For example, Google’s Data Loss Prevention (DLP) API was recently upgraded and now includes redaction, tokenization and masking features. The DLP API itself provides access to a comprehensive platform that specializes in powerful and intelligent data inspection and classification.

IBM is also a large data classification vendor, focusing its efforts on platform scaling, automation improvements, AI-powered products and cloud infrastructure expansion. As a part of its growth strategy, IBM introduced PowerAI, which simplifies the development experience for both scientists and developers by preparing and classifying data while significantly reducing AI system training time, from weeks to mere hours.

And, of course, there’s also Cipherpoint, whose cp.Discover solution discovers and classifies information with different confidentiality types. cp.Discover applies three main methods of classifying data: machine learning, metadata matching and pattern matching. It also prides itself on working with unstructured data repositories, which are often ignored by other database-focused data discovery and classification tools.
