Overcoming the Challenges of Getting Clean Data
Clean data is crucial to good decision making with analytics and AI, but it’s not always easy to attain. Here are some common causes of dirty data.
Algorithms are only as powerful as the data they analyze. That’s why “dirty data” (data that’s inaccurate, incomplete, or inconsistent) is a major concern for people analytics and data science teams.
It doesn’t matter how sophisticated machine learning algorithms or analytics tools are—if the information they’re processing isn’t of high enough quality, they will not generate valuable insights or predictions. Clean people data is crucial to ensuring that AI-assisted decision-making is accurate, evidence-based, and capable of providing a return on your investment.
But achieving clean and consistent data is no simple matter. Especially for large organizations that generate a high volume of people data, starting to assess it for accuracy and consistency can feel like an insurmountable task.
For example, when government contractor NCI decided to start working towards a more data-driven style of decision-making, they estimated they would need to hire 10 people to fix their data and produce meaningful analysis. But with Visier, they found a way to avoid this and still get their data to a valuable state (we’ll explain how later in this article).
Clearly, clean data is important—but the first step in cleaning it is to understand what causes the issues in the first place.
What causes dirty data?
Data may seem objective and impartial, like something separate from the day-to-day of our working lives. But data is always created by, and refers to, real people. Because the real world is never free of errors and inconsistencies, data is almost never perfectly clean, either.
Here are a few of the most common ways errors and inconsistencies find their way into your data:
Human error and obfuscation
Typos are one human error that often leads to dirty data. At the data entry stage, most source systems rely on individuals typing out information so that it can be added to a database and analyzed. At this stage, some inaccuracies, such as misspelled employee names, are inevitable.
Intentional misinformation may also make its way into your data. For example, an employee may have misreported or omitted information to hide a failing or exaggerate a success. While obfuscated data of course differs from unintentional errors, both should be considered part of the margin of error to be expected in people-centric data.
Some level of inconsistency here is inevitable. Even attempts to automate the data-entry process and reduce the margin of error, such as through Optical Character Recognition (OCR) technology, will not be completely error-free. Instead, organizations should have a plan in place to identify inconsistencies within their data, then work towards cleaning it.
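One way to start identifying likely typos is to compare entered names against a trusted roster and flag near matches for a person to review. The sketch below is only an illustration under stated assumptions: the function name, the roster, and the 0.85 similarity cutoff are hypothetical, and it uses Python's standard difflib module rather than any particular vendor's method.

```python
from difflib import get_close_matches

def find_probable_typos(entered_names, roster, cutoff=0.85):
    """Map each entered name that is not in the roster to its closest roster name.

    Names with no sufficiently similar roster entry are left out, since they
    may be new people rather than misspellings. The cutoff is an assumption
    that should be tuned against real data.
    """
    roster_set = set(roster)
    suggestions = {}
    for name in entered_names:
        if name in roster_set:
            continue  # exact match, nothing to flag
        close = get_close_matches(name, roster, n=1, cutoff=cutoff)
        if close:
            suggestions[name] = close[0]
    return suggestions

if __name__ == "__main__":
    roster = ["John Smith", "Jane Doe"]
    entered = ["Jonh Smith", "Jane Doe", "Alex Kim"]
    print(find_probable_typos(entered, roster))
```

The output is a list of suggestions, not corrections. Because some near matches are genuinely different people, a human should confirm each one before the record is changed.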
Inconsistent values, or the use of multiple labels to refer to the same characteristics, stem from a lack of standardized information representation within organizations or from processing information in multiple languages or systems of record.
For example, at the same company, different people may refer to the Vice President of Marketing as the VP of Marketing, V-P, Marketing, or Vice President, Marketing. At a multinational organization, Male, Female, 男, 女, Masculin, and Femme may all be used to refer to an employee’s gender, depending on the location where the data was collected.
As humans, we can easily infer which of these terms refer to the same attributes, but a machine may lack the contextual awareness to do so. To analyze data that is inconsistently labeled, either your organization must define a standard set of values and relabel the data accordingly, or your system must be able to reconcile all of these possible values into standard ones itself. This is why Visier uses analytical concepts (often shortened to “concepts”) to standardize how values are represented.
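The relabeling approach described above can be sketched in a few lines. This is a minimal illustration, not a description of how Visier's concepts work: the mapping table, function names, and choice of standard values are all assumptions made for the example.

```python
import unicodedata

# Hypothetical mapping from observed labels to one standard value.
# Keys are stored in normalized, case-folded form.
CANONICAL_GENDER = {
    "male": "Male",
    "masculin": "Male",
    "男": "Male",
    "female": "Female",
    "femme": "Female",
    "女": "Female",
}

def normalize_label(raw, mapping):
    """Return the standard value for a raw label, or None if it is unknown."""
    key = unicodedata.normalize("NFKC", raw).strip().casefold()
    return mapping.get(key)

def standardize(values, mapping):
    """Return (standardized labels, set of unmapped raw values for review)."""
    cleaned, unknown = [], set()
    for value in values:
        label = normalize_label(value, mapping)
        if label is None:
            unknown.add(value)
        cleaned.append(label)
    return cleaned, unknown

if __name__ == "__main__":
    labels, needs_review = standardize(["Male", " femme ", "女", "M"], CANONICAL_GENDER)
    print(labels)
    print(needs_review)
```

Unknown values are collected rather than guessed at, so that someone can decide whether to extend the mapping or correct the source record.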
Data does not live in isolation. Data points typically exist in relation to one another and, just like your organization, those relationships change over time. These references between records matter for sophisticated techniques like Organizational Network Analysis (ONA) and for predictions whose business impact often extends well beyond HR.
For instance, employees nearly always report to other employees. But what happens if an employee changes managers or leaves the company entirely? What about if the department an employee is associated with changes its name, restructures, or is closed down? It’s not enough to simply update an employee’s record. To keep your data clean, all the data points affected by that change must be updated, too.
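A simple check for this kind of broken reference is to look for employees whose manager field points at a record that no longer exists or is no longer active. The sketch below assumes a hypothetical record layout (dictionaries with "id", "manager_id", and an optional "active" flag); real systems of record will differ.

```python
def find_dangling_managers(employees):
    """Return sorted IDs of active employees whose manager_id matches no active employee.

    Records without an "active" key are treated as active (an assumption
    made for this example).
    """
    active_ids = {e["id"] for e in employees if e.get("active", True)}
    return sorted(
        e["id"]
        for e in employees
        if e.get("active", True)
        and e.get("manager_id") is not None
        and e["manager_id"] not in active_ids
    )

if __name__ == "__main__":
    staff = [
        {"id": "E1", "manager_id": None},
        {"id": "E2", "manager_id": "E1"},
        {"id": "E3", "manager_id": "E9"},                   # E9 does not exist
        {"id": "E4", "manager_id": "E5"},                   # E5 has left
        {"id": "E5", "manager_id": "E1", "active": False},
    ]
    print(find_dangling_managers(staff))
```

The same pattern applies to other references, such as a department code on an employee record that no longer appears in the list of departments.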
It’s common for an organization’s data to come from multiple sources. But to work towards a clean dataset that can be productively analyzed, organizations need to unify the data from all of their systems into a cohesive picture of everything happening within the organization.
While disparate systems don’t always present an obstacle, one common challenge arises when data is duplicated across two systems, for example when a company is changing its Human Resource Information System (HRIS) or merging with another company that uses a different one. Analysts and data science teams may then have to handle two records for the same employee, or even the same unique identifier in use in both systems, and are left with the difficult task of establishing which record should be used, and when.
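As a first pass at this problem, a team might compare the identifiers that appear in both systems and separate the records that agree from the ones that conflict. The following is a hedged sketch: the function name, the record layout (dictionaries keyed by employee ID), and the compared fields are assumptions chosen for illustration.

```python
def compare_systems(old_records, new_records, fields=("name", "hire_date")):
    """Classify employee IDs that appear in both systems.

    Returns (matching_ids, conflicting_ids): IDs whose compared fields agree,
    and IDs whose fields differ and so may refer to different people or to
    an out-of-date copy of the same person.
    """
    matching, conflicting = [], []
    for emp_id in sorted(old_records.keys() & new_records.keys()):
        old, new = old_records[emp_id], new_records[emp_id]
        if all(old.get(f) == new.get(f) for f in fields):
            matching.append(emp_id)
        else:
            conflicting.append(emp_id)
    return matching, conflicting

if __name__ == "__main__":
    legacy_hris = {
        "1001": {"name": "Ana Ruiz", "hire_date": "2019-03-04"},
        "1002": {"name": "Sam Lee", "hire_date": "2020-07-15"},
    }
    new_hris = {
        "1001": {"name": "Ana Ruiz", "hire_date": "2019-03-04"},
        "1002": {"name": "Priya Nair", "hire_date": "2021-01-11"},
        "1003": {"name": "Omar Haddad", "hire_date": "2022-05-02"},
    }
    print(compare_systems(legacy_hris, new_hris))
```

Matching IDs are candidates for merging into a single record, while conflicting IDs need a decision about which system is authoritative for that employee.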
In practice: how NCI approached cleaning their data
NCI is a leading provider of artificial intelligence services to U.S. defense, intelligence, health, and civilian government agencies, employing 2,000 people across 100 locations. But due to dirty data, NCI found it difficult to make progress on their goals, such as reducing days-to-fill for open positions.
“You ask five different people what the days-to-fill is, and you get a, ‘Well, what do you want it to be?’ answer,” says Clay Worley, SVP and Chief Human Resources Officer at NCI.
Because no one could even agree on what the current state of these metrics was, trying to put together a meaningful strategy for improving them felt impossible. At first, NCI projected that for the data to be usable, they’d need to hire 10 new people just to assess and clean it!
But Visier allowed them to assess their data’s cleanliness much more easily. With Visier, NCI was able to break the data down by business unit and even by specific program, making it much easier to identify outliers and inconsistencies that could be affecting the data’s accuracy. As a result, NCI now works with the cleaner data they need, and they didn’t have to add headcount to get there.
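To make the idea of breaking a metric down by group concrete, here is a small sketch that groups days-to-fill values by business unit and flags values far from their unit's median. It does not represent NCI's data or Visier's method: the records are invented, and the median-absolute-deviation rule and threshold of 3 are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median

def flag_outliers_by_unit(records, threshold=3.0):
    """Return (unit, value) pairs that sit far from their unit's median.

    A value is flagged when its distance from the unit median is more than
    `threshold` times the unit's median absolute deviation (MAD).
    """
    groups = defaultdict(list)
    for unit, days in records:
        groups[unit].append(days)

    flagged = []
    for unit, values in groups.items():
        mid = median(values)
        mad = median(abs(v - mid) for v in values)
        if mad == 0:
            continue  # no spread within the unit to measure against
        flagged.extend((unit, v) for v in values if abs(v - mid) / mad > threshold)
    return flagged

if __name__ == "__main__":
    # Invented (business unit, days-to-fill) records for illustration only.
    days_to_fill = [
        ("Health", 30), ("Health", 32), ("Health", 35), ("Health", 31), ("Health", 300),
        ("Defense", 40), ("Defense", 42), ("Defense", 41),
    ]
    print(flag_outliers_by_unit(days_to_fill))
```

A flagged value is not necessarily wrong. It is a prompt to check whether the record reflects a genuine exception or a data entry problem.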
Working towards cleaner data
There are many techniques and tools, including sophisticated people analytics solutions like Visier, that can identify dirty data. These solutions require varying degrees of technical knowledge, but in many cases finding the right one is simple: it’s the one your team is most likely to actually use.
To work towards data cleanliness, your organization’s people must understand how powerful clean, reliable data can be, and prioritize the cleaning of records that affect the most important outcomes. While some margin of error is unavoidable, the goal should be to put cleaner data into your systems in the first place, by providing teams and employees with systems for managing it that are sophisticated, intuitive, and accessible.
Especially in HR, data always comes back to people. When people understand the importance of data cleanliness, and are given the tools they need to support it, organizations are able to make better business decisions and reach their goals.