Overcoming the Challenges of Cleaning Dirty Data
Clean data is critical to making good decisions with analytics and AI, but it’s not easy to attain. Read on to learn the most common causes of dirty data.
Algorithms are only as powerful as the data they analyze. That’s why “dirty data” (data that’s inaccurate, incomplete, or inconsistent) is a major concern for people analytics and data science teams.
It doesn’t matter how sophisticated machine learning algorithms or analytics tools are—if the information they’re processing isn’t of high enough quality, they will not generate valuable insights or predictions. Clean people data is crucial to ensuring that AI-assisted decision-making is accurate, evidence-based, and capable of providing a return on your investment.
But achieving clean and consistent data is no simple matter. Especially for large organizations that generate a high volume of people data, starting to assess it for accuracy and consistency can feel like an insurmountable task.
For example, when government contractor Empower AI (formerly NCI) decided to start working towards a more data-driven style of decision-making, they estimated they would need to hire 10 people to fix their data and produce meaningful analysis. But with Visier, they found a way to avoid this and still get their data to a valuable state.
Clearly, clean data is important—but the first step in cleaning it is to understand what causes the issues in the first place.
What causes dirty data?
Data may seem objective and impartial, like something separate from the day-to-day of our working lives. But data is always created by, and refers to, real people. Because the real world is never free of errors and inconsistencies, data is almost never perfectly clean, either.
Here are a few of the most common ways errors and inconsistencies find their way into your data:
Human error and obfuscation
Typos are one human error that often leads to dirty data. At the data entry stage, most source systems rely on individuals typing out information so that it can be added to a database and analyzed. Some inaccuracies, such as misspelled employee names, are inevitable at this stage.
Intentional misinformation may also make its way into your data. For example, an employee may have misreported or omitted information to hide a failure or exaggerate a success. While obfuscated data differs from unintentional errors, both should be considered part of the margin of error to be expected in people-centric data.
Some level of inconsistency here is inevitable. Even attempts to automate the data-entry process and reduce the margin of error, such as through Optical Character Recognition (OCR) technology, will not be completely error-free. Instead, organizations should have a plan to identify inconsistencies within their data, and then work towards cleaning it.
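As one illustration of what identifying inconsistencies can look like for misspelled names, here is a minimal sketch that compares entered names against a trusted roster using Python's standard `difflib` module. The function name, the sample names, and the 0.85 similarity cutoff are assumptions made for this example, not a recommended setting or a description of any particular product.

```python
from difflib import get_close_matches


def flag_probable_typos(entered_names, reference_names, cutoff=0.85):
    """Map each entered name that is missing from the reference roster
    to the closest roster name, if one is similar enough.

    The cutoff is an illustrative assumption; tune it on your own data.
    """
    reference = set(reference_names)
    suggestions = {}
    for name in entered_names:
        if name in reference:
            continue  # exact match, nothing to flag
        matches = get_close_matches(name, reference_names, n=1, cutoff=cutoff)
        if matches:
            suggestions[name] = matches[0]
    return suggestions


# Hypothetical sample data
roster = ["Priya Natarajan", "Miguel Fernandez", "Sarah O'Connor"]
entered = ["Priya Natarajan", "Migeul Fernandez", "Sara O'Connor", "Tom Becker"]

result = flag_probable_typos(entered, roster)
for wrong, suggested in result.items():
    print(f"{wrong!r} may be a typo of {suggested!r}")
```

Names with no close match, like "Tom Becker" here, are left alone. A suggestion is only a candidate for human review, since two different people can have very similar names.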
Inconsistent values
Inconsistent values, or the use of multiple labels to refer to the same characteristics, stem from a lack of standardized information representation within organizations or from processing information in multiple languages or systems of record.
For example, at the same company, different people may refer to the Vice President of Marketing as the VP of Marketing, V-P, Marketing, or Vice President, Marketing. At a multinational organization, Male, Female, 男, 女, Masculin, and Femme may all be used to refer to an employee’s gender, depending on the location where the data was collected.
As humans, we can easily infer which of these terms refer to the same attributes, but a machine may lack the contextual awareness to do so. To analyze data that is inconsistently labeled, either your organization must define a standard set of values and relabel the data accordingly, or your system must be able to reconcile all of these possible values into standard ones itself. This is why Visier uses analytical concepts (or simply “concepts”) to standardize representation and more.
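The first option, defining a standard set of values and relabeling, can be sketched with a simple lookup table. This is a generic illustration built from the labels mentioned above. It does not show how Visier's analytical concepts work, and the table and function names are hypothetical.

```python
import unicodedata

# Hypothetical lookup tables: every known variant (lowercased) -> standard value
GENDER_SYNONYMS = {
    "male": "Male", "男": "Male", "masculin": "Male",
    "female": "Female", "女": "Female", "femme": "Female",
}
TITLE_SYNONYMS = {
    "vice president of marketing": "Vice President of Marketing",
    "vp of marketing": "Vice President of Marketing",
    "v-p, marketing": "Vice President of Marketing",
    "vice president, marketing": "Vice President of Marketing",
}


def standardize(raw_value, synonyms):
    """Return the standard label for raw_value, or None if it is unknown.

    Unknown values are returned as None so they can be reviewed and added
    to the table rather than silently guessed.
    """
    key = unicodedata.normalize("NFKC", raw_value).strip().casefold()
    return synonyms.get(key)


print(standardize(" Femme ", GENDER_SYNONYMS))
print(standardize("V-P, Marketing", TITLE_SYNONYMS))
print(standardize("CMO", TITLE_SYNONYMS))
```

Returning `None` for unmapped values is a deliberate choice in this sketch: it surfaces new variants for a person to classify instead of letting them pass through as yet another label.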
Broken references
Data does not live in isolation. Data points typically exist in relation to one another and, just like your organization, those relationships change over time. These references are important to sophisticated techniques like Organizational Network Analysis (ONA) and predictions that often have business impact well outside of HR practice.
For instance, employees nearly always report to other employees. But what happens if an employee changes managers or leaves the company entirely? What if the department an employee is associated with changes its name, restructures, or is closed down? It’s not enough to simply update an employee’s record. To keep your data clean, all the data points affected by that change must be updated, too.
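A basic check for this kind of problem is to look for reporting relationships that point at a record that no longer exists or is no longer active. The sketch below assumes a simplified record layout with `id`, `manager_id`, and `active` fields, which are hypothetical and will differ from the fields in a real source system.

```python
def find_broken_manager_refs(employees):
    """Return (employee_id, reason) pairs for manager references that
    point to a missing record or to someone who is no longer active."""
    by_id = {e["id"]: e for e in employees}
    broken = []
    for e in employees:
        manager_id = e.get("manager_id")
        if manager_id is None:
            continue  # top of the hierarchy, no reference to check
        manager = by_id.get(manager_id)
        if manager is None:
            broken.append((e["id"], "manager record missing"))
        elif not manager["active"]:
            broken.append((e["id"], "manager no longer active"))
    return broken


# Hypothetical sample data: E9 was never loaded, E5 has left the company
employees = [
    {"id": "E1", "manager_id": None, "active": True},
    {"id": "E2", "manager_id": "E1", "active": True},
    {"id": "E3", "manager_id": "E9", "active": True},
    {"id": "E4", "manager_id": "E5", "active": True},
    {"id": "E5", "manager_id": "E1", "active": False},
]

broken_refs = find_broken_manager_refs(employees)
for employee_id, reason in broken_refs:
    print(employee_id, reason)
```

The same pattern extends to other references, such as department codes, by swapping in the relevant lookup.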
Disparate systems
It’s common for an organization’s data to come from multiple sources. But to work towards a clean dataset that can be productively analyzed, organizations need to unify the data from all their systems into a cohesive picture of what is happening across the organization.
While disparate systems don’t always present an obstacle, one common challenge arises when data is duplicated across two systems, for example when a company is changing its Human Resource Information System (HRIS) or merging with another company that uses a different one. Analysts and data science teams may need to handle two records for the same employee, or even the same unique identifier in use for different people across both systems, and be left with the difficult task of trying to establish which should be used, and when.
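Both situations can be detected before the data is merged. The following sketch compares two exports, each represented as a mapping from employee ID to work email. Using email as the signal that two records describe the same person is an assumption made for illustration; real matching usually needs more than one attribute.

```python
def compare_systems(system_a, system_b):
    """Compare two {employee_id: email} exports.

    Returns (id_collisions, duplicates):
      id_collisions - IDs present in both systems but attached to different emails
      duplicates    - (id_in_a, id_in_b) pairs that share an email under different IDs
    """
    id_collisions = sorted(
        emp_id
        for emp_id in system_a.keys() & system_b.keys()
        if system_a[emp_id].lower() != system_b[emp_id].lower()
    )
    ids_by_email_a = {email.lower(): emp_id for emp_id, email in system_a.items()}
    duplicates = sorted(
        (ids_by_email_a[email.lower()], id_b)
        for id_b, email in system_b.items()
        if email.lower() in ids_by_email_a and ids_by_email_a[email.lower()] != id_b
    )
    return id_collisions, duplicates


# Hypothetical exports from an old and a new HRIS
old_hris = {
    "1001": "a.lee@example.com",
    "1002": "b.khan@example.com",
    "1003": "c.diaz@example.com",
}
new_hris = {
    "1001": "d.olsen@example.com",  # same ID, different person
    "2044": "B.Khan@example.com",   # same person, different ID
    "1003": "c.diaz@example.com",   # consistent in both systems
}

collisions, duplicates = compare_systems(old_hris, new_hris)
print("ID collisions:", collisions)
print("Probable duplicates:", duplicates)
```

A check like this only surfaces the conflicts. Deciding which record to keep, and from what date, still depends on knowing which system is authoritative for each period.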
Working towards cleaner data
There are many techniques and tools, including sophisticated people analytics solutions like Visier, that can identify dirty data. These solutions require varying degrees of technical knowledge, but in many cases, finding the right one is simple: it’s the one your team is most likely to actually use.
To work towards data cleanliness, your organization’s people must understand how powerful clean, reliable data can be, and prioritize the cleaning of records that affect the most important outcomes. While some margin of error is unavoidable, the goal should be to put cleaner data into your systems in the first place, by providing teams and employees with systems for managing it that are sophisticated, intuitive, and accessible.
Especially in HR, data always comes back to people. When people understand the importance of data cleanliness, and are given the tools they need to support it, organizations are able to make better business decisions and reach their goals.