Chronic data quality problems?
“What’s wrong with our data?” or “We’ve been through several data cleanses but still we are getting problems” or “We hired on a data car to fix this problem, but it still happens – what are we doing wrong?” I’ve heard it in many organisations, where HR people ask “can we hire a consultant to fix this, or as I heard one Chief People Officer (who shall remain nameless) say more despairingly “how come our HR data is such a mess?”. There are many possible reasons – to name a few –
- Line managers fail to advise HR of changes e.g. employee surname, change of address, marital status, promotion or transfer, left the company, etc – expecting that HR should somehow mind-read what happens in the lives of hundreds or thousands of employees
- Line managers advise HR of changes, but the system is not updated to reflect the changes
- Maybe you’re new into the situation, and finding that there’s a legacy of data mismanagement that you have to deal with
- The HR data cleanses that happen regularly were only partially accurate, and there are still many data items out of date
- Many of the current employees were TUPE’d in but the new data was either incorrectly or not fully captured
And there might be many other reasons. But suppose none of these things was ever a problem, and the data still isn’t right? What then? Well, it might be. something deeper, something more structural. If you truly want to get a long term solution, it helps if you understand some of fundamental data and data quality concepts – and in this piece I’ll explore the issue of data normalisation.
“How come our HR data is such a mess?” – Chief People Officer
This might at first strike you as a more technical topic than most Compensation and HR professionals are likely to encounter or to require knowledge of, but if you have chronic data quality problems in your HRIS, the reason might be that the database tables were not set up using referential integrity (to keep the bad data out) and data normalisation in the first place. I’ve seen this happen even in a company that make their money from selling HR software, and the paying clients are often left scratching their heads, simply puzzled by what’s going wrong. HRIS people probably need to know this stuff though – especially if buying a new HR system.
Database normalisation – or data normalisation – helps to organise the contents of the tables in your HR database or in your data warehouse. It’s a fundamental part of successful database design, and if it’s ignored, your HR database can be inaccurate, perform slowly and inefficiently, and may often fail to reduce the data you expect. I’ll go into some detail of database normalisation – so if this helps you in your work, read on and nerd out.
Think a new HR system would solve your problems?
You might have a database system that your company bought from a mainstream vendor, or you might have to create a database to hold the data. How you create a database is really a whole other topic, but the database tables (which are 2-dimensional structures using rows and columns) need to be set up using referential integrity to enforce the rules on what constitutes valid data. This is done to keep illogical, impossible or bad data out.
If you’re more accustomed to spreadsheets like Excel, know that a database table looks like a spreadsheet, but also know that there are many tables in the database system. You might have one table called EMPLOYEE that contains employee first name, employee last name, home address, phone number and so on, with each bit of data living in a cell at the intersection of a row and column. You might have another table called POSITION that contains job title, job family, grade, salary minima and maxima, and so on. These tables, EMPLOYEE and POSITION, would be linked in a relationship where one POSITION might be performed by many EMPLOYEEs.
Simply buying an off-the-shelf database system might not be the answer to your data problems, especially if these tables are not joined in a sensible and logical way. It helps to get under the surface with this, so that you can understand how the database tables are in relationship to each other. This is done by determining whether they are linked using one-to-one or one-to-many relationships. Take time with the vendor and have them explain how it works in a way you can understand clearly.

What is database normalisation?
When you normalise your database, you’ll be doing several things. These are –
- organising the data into logical tables so that you group related data in one single table
- Indirectly related data is put into different tables which are connected with a logical relationship between parent and child tables
- Minimise or eliminate the amount of duplicate data stored in your database
- organising the data so that a change made in one place carries forward to update other tables without your having to make all the updates manually
- creating the database so you can access and manipulate the data rapidly without compromising the integrity of the data elsewhere in the database system
This results in a better grouping of data. It also improves faster searching because indexes can be created faster.
You’ll need to understand a couple of terms –
- Entity – Entity is a real-life object e.g. an EMPLOYEE or DEPARTMENT or JOB ROLE – the data associated with each object is stored in the table.
- Attribute – Attributes are the characteristics of the entity, that give some information about the entity. In the database table, Attributes are usually each column
- Primary Key A primary key is a database table column (or combination of columns) designated to uniquely identify all table records. It must contain a unique value for each row of data. The Primary Key is important and we’ll see this mentioned throughout this post.
From First Normal Form (1NF) to Third Normal Form (3NF)
First Normal Form (1NF)
“The Key…”
An entity that does not have any repeating columns or data groups can be termed as the First Normal Form. In the First Normal Form, every column is unique. An entity that is 1NF and one of its attributes is defined as the primary key and the remaining attributes are dependent on the primary key. All the columns depend on the Primary Key.
This table shows how a typical Employees and Department table would have looked if it was in First Normal Form (1NF) –
Emp_No |
First_Name |
Last_Name |
Dept_Name |
Dept_City |
Dept_Country |
2003 |
Donnie |
Brasco |
Accounts |
Naples |
Italy |
2004 |
Vito |
Corleone |
Marketing |
Las Vegas |
USA |
2005 |
Nicky |
Santoro |
Sales |
Las Vegas |
USA |
2006 |
Ginger |
McKenna |
Engineering |
Munich |
Germany |
2007 |
Lester |
Diamond |
HR |
Naples |
Italy |
The columns of both Employees and Department tables are bundled together into one. There’s now no need for us to have connecting columns, e.g. Dept_No, because all the data is available in one place.
But a table like this with all the required columns in it would be difficult to manage. It would also be difficult to perform any operations on it, and it’s inefficient in terms of storage.
Second Normal Form (2NF)
“…the Whole Key…”
Next, let’s take a look at what the Employees and Department tables look like in 2NF. All the columns depend on the Whole Primary Key – which may be comprised of more than one field.
EMPLOYEE TABLE
DEPARTMENTS TABLE
Dept_No |
Dept_Name |
Dept_City |
Dept_Country |
01 |
Accounts |
Naples |
Italy |
02 |
Sales |
Las Vegas |
USA |
03 |
Marketing |
Las Vegas |
USA |
04 |
Engineering |
Munich |
Germany |
05 |
HR |
Naples |
Italy |
EMPDEPT TABLE
EmpDept_ID |
Emp_No |
Dept_No |
1 |
2003 |
01 |
2 |
2004 |
02 |
3 |
2005 |
03 |
4 |
2006 |
04 |
5 |
2007 |
05 |
Now that we’ve split the 1NF table into three tables, we can tell more clearly that the EMPLOYEE table is now an entity about all the employees in the business. The attributes (or columns) now describe the properties relevant to each employee. The Primary Key for EMPLOYEE is Emp_No.
Likewise, the Departments table is now an entity about all the departments in the business. It’s columns describe the properties of each department, and the Primary Key for the DEPARTMENTS table is Dept_No.
In the third table, you’ll see that the Primary Keys of both tables are combined. In this table the Primary Keys for the EMPLOYEES and DEPARTMENTS tables are termed Foreign Keys.
If the user wants to get an output like the one shown in 1NF, the SQL query has to join all of the three tables, using the Primary Keys as follows –
SELECT Emp_No, First_Name, Last_Name Dept_No, Dept_Name, Dept_City, Dept_Country
FROM Employees A, Departments B, EmpDept C
WHERE A.Emp_No = C.Emp_No
AND B.Dept_No = C.Dept_No
WITH UR;
|
Third Normal Form (3NF)
“…and nothing but the Key”
The following conditions have to be satisfied for us to say that the database is in 3NF –
- It is in second normal form
- There is no transitive functional dependency
What do we mean by transitive functional dependency? It means we have the following relationships in the table: A is functionally dependent on B, and B is functionally dependent on C. In this case, C is transitively dependent on A via B. The columns depend on primary key and nothing else. Hence “The Key, the Whole Key, and Nothing but the Key”.
Consider another example –
JOB_DETAILS
Job_ID |
Contract_ID |
Contract_Type |
Salary_Band |
3074 |
1 |
FT Perm |
14 |
3075 |
2 |
FT Fixed term |
16 |
3076 |
3 |
PT Perm |
20 |
3077 |
4 |
PT Fixed Term |
8 |
3078 |
1 |
FT Perm |
8 |
In the above table, Job_ID determines Contract_ID and Contract_ID determines Contract_Type. Therefore Job_Id determines Contract_Type and we have transitive functional dependency, which means that this structure doesn’t satisfy 3NF.
To bring this table to third normal form, we split the table into two as follows:
SALARY_BAND
Job_ID |
Contract_ID |
Salary_Band |
3074 |
1 |
14 |
3075 |
2 |
16 |
3076 |
3 |
20 |
3077 |
4 |
8 |
3078 |
1 |
8 |
and…..
CONTRACT_TYPE
Contract_Type |
Contract_Type |
1 |
FT Perm |
2 |
PT Perm |
3 |
FT Fixed Term |
4 |
PT Fixed Term |
Now all the non-key attributes are fully functional dependent only on the primary key. In the table JOB_DETAILS, both the Job_ID and Salary_Band are only dependent on Job_ID. In the table SALARY_BAND, Contract_Type is only dependent on Contract_ID.
There are higher normal forms which include the Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF) and Fifth Normal Form (5NF) – however this short article probably suffices for most cases in Reward and HR. Database normalisation is really a design technique for re-designing database table schemas to reduce redundant data, and data dependencies by breaking the data down into smaller and more logical, relevant tables.
These higher normal forms are regarded as a bit theoretical, and 3NF is the most widely used Normal Form in the production of practical database solutions.
OK, but I work in HR and you’re bamboozling me with tech stuff…
Not really. That’s kind of a weird attitude born of techno-fear. There’s a real benefit in learning about database normalisation and SQL though.Why be dependent on your IT team to grasp what data you need to add value in your job? Even if you aren’t working with data on a daily basis, learning database normalisation and SQL will probably help you to become a more systematic thinker. As with any programming language, you have to be logical and structured, and you have to develop solutions to problems. Knowing this stuff can also help you in areas like project management, procedure development and communications. Database normalisation and SQL are much easier for the average person with no coding knowledge – basically it’s simpler than programming languages, and most people have a basic grasp of how tables work.
Conclusion
Learning about the underlying structures of your database system will help you to understand logic and a systematic approach to data issues. Poor quality data causes so many problems of course – it’s important to take realistic steps to sort it. If any of these problems with your HR data sounds familiar, let me know on the Contact page. I’d be happy to help you.