Database Querying
Filtering, Cleaning, Summarizing
Joining Tables
Common Table Expressions
Database Querying
Filtering, Cleaning, Summarizing
Joining Tables
Common Table Expressions
Data Visualization
This dataset is an employees database, sourced from on Github and reorganized into an SQL script by 365 Careers. The database file can be downloaded from this repository.
This script contains the code that will enable you to create your database. Open the script in your SQL client and run it.
The schema of this database is as follows:
The purpose of this task is to provide a summary of the database to an incoming data manager at the company who wants to know the state of the database. The information required include understanding the overall structure, volume and quality of the data. As a bonus, I have also decided to include insights on the employees composition within the company.
Part of the database task is to check on the quality of the data and any quality issues will be noted down. However no cleaning is intended to be done (yet) as the purpose of this task is to present the state of the database as is.
The following questions were brainstormed in order to guide the exploration of the database. This will be used as a guide to develop a high level dashboard.
• How much data is in the database?
• What is the department and setup, with bosses?
• How many active employees?
• How many employees and what are the employees title in the departments?
A quick check on duplicates was conducted. There were no duplicates found in all the table and one initial finding was that 31,579 employees have had more than 1 department.
Although no duplicates were found in the employees table, it was interesting to note that 19,683 employees share the same first name and last name with 1 or 2 other people. Out of this, 6 of them have the same first_name, last_name and birth_date. Their employee numbers and hire dates are different, so it assumed that these are different people.
In the dept_emp table, an anomaly in the to_ date column were found citing 9999-01-01. This is assumed to represent the current and active date.
To understand the magnitude of the data, total number of records were queried.
The following sections details out the approach and query used, leading to the above visualization. Challenges encountered during analysis are also noted.
Querying a list of current active employees was based on a where statement with max(to_date) aggregate. [sql link]. Joining the emp_dept, department and titles table will give you more information on the current employees.
Challenge #1: Getting the wrong total of active employees and the challenge was in figuring out why. This was able to be identified because the count of active employees in the department was queried and noted before. Lesson learnt is that it helps to identify the big picture before drilling down so that the results can be cross checked.
Solution: It was found that the max(to_date) was erroneously referred to the emp_dept table and not the title table. This causes the titles to be filtered according to the max dates in the department (and not the last recorded title from the titles table).
Wrong code:
Right code
Challenge #2: Showing total number of active employees by year. Various options were explored and the final query was found to be the simplest query.
This was addressed using a subquery in both the select and from statement.
During analysis, frequent the cross checks and drill downs were conducted, especially on individual employee records across all tables whenever there was an anomaly. Rather than copy and paste the codes over and over, it was easier to create a stored procedure and call it up. Head on here to find out how to create a stored procedure in mySQL.
During data exploration of the dept_emp table, it was found that several employee records had conflicting and questionable records. For example,
• 29 employees were found to have multiple departments with the same from_date
• 51 employees were found to have the same from_date and to_date in a department (0 days in a department).
• 21 employees were found to be in only 1 dept and have same from_date and to_date in a department
This prompted a check on the dept duration on the total 331,603 records with the following findings:
• There are records of employee working in a single department for less than 10 days
• There are overlapping dates in multiple departments for several employees.
*Analysis on salary table is ongoing.