Revolutionizing Data Quality: Unleashing the Power of SQL for Seamless Data Cleaning

A Data Cleaning Project Using SQL

Boyega
9 min readJun 29, 2023
Unleashing the Power of SQL for Seamless Data Cleaning

Data cleaning is an essential and integral step in data analysis and management. It encompasses a range of techniques and processes aimed at identifying, rectifying, and eliminating errors, inconsistencies, and inaccuracies within datasets. By undertaking data cleaning, we can enhance the overall quality and reliability of the data, resulting in more accurate and meaningful insights.

In this project, I will leverage the power of SQL to clean a dataset called “housing_data.” By employing SQL’s robust capabilities, I will address various data issues, such as standardizing SQL date format, removing unused columns, breaking columns into different columns and adjusting incorrect entries. My primary objective is to ensure that the dataset maintains its integrity, consistency, and value.

Through this project, I will demonstrate how SQL can effectively handle data cleaning tasks, allowing me to uncover hidden patterns, discover meaningful trends, and make informed decisions based on reliable data. By the end of the project, I will have transformed the “housing_data” into a pristine dataset that upholds the highest standards of data quality, ensuring its usability and trustworthiness for further analysis and insights.

View the SQL Script on GitHub

Without delay i will delve into cleaning my data

CLEANING ONE:- Standardized Date Format

One of the data cleaning tasks in this project involves standardizing the “salesDate” column to the SQL format. Currently, the dates in this column are not in the standard SQL format, which can cause inconsistencies and hinder proper data analysis.

Before importing the CSV file into MySQL, I observed that the “salesDate” column was initially in the format ‘D-MMM-YYYY(9-Apr-2013)’. To ensure compatibility and consistency with MySQL’s date format, a cell formatting step was performed to convert the dates into the ‘4/9/2013’ format.

select saledate FROM sql_data_cleaning_project.housing_data_project;

To view the “SaleDate" I imported in SQL queried

select saledate FROM sql_data_cleaning_project.housing_data_project;
image of saledate before cleaning

Next, to standardise the date format I utilized the MySQL function str_to_date() to convert the values in the "saledate" column from a string format to a date format. I use the specified format string ("%m/%d/%Y %H/%i/%s") as a template to interpret the input string and convert it to a valid date.

select saledate,str_to_date(saledate, "%m/%d/%Y %H/%i/%s")
FROM sql_data_cleaning_project.housing_data_project;

NOTE: To successfully update the “saledate” column from a string format to a date format, considering that the column was imported as TEXT data type, it is important to include the time portion (hh:mm:ss) in the string representation of the date. This is necessary to avoid any errors or warnings during the update process.

Next, I updated the “saledate” column with my query result

Set sql_safe_updates = 0;
Update sql_data_cleaning_project.housing_data_project
Set saledate = str_to_date(saledate, "%m/%d/%Y %H/%i/%s");
set sql_safe_updates = 1;

NOTE: Because I didn’t specify the WHERE clause in my Update function, I need to set the “sql_safe_updates” off before making any changes to the “saledata” column and set it on after the query as well.

Next, because I want my “saledata” column to be in DATE data and not DATETIME data, I used the CONVERT() function in SQL to facilitate the conversion of the column and updated the column with the DATE data.

Set sql_safe_updates = 0;
Update sql_data_cleaning_project.housing_data_project
set saledate = CONVERT(saledate, Date);
set sql_safe_updates = 1;
image of saledate after cleanig

As you can see, the image shows the “saledate” column in the proper DATE format.

CLEANING TWO: Populate Property Address data

Prior to importing the CSV file, I observed that some rows in the “propertyAddress” column were empty. Importing the file using the MYSQL Table Import wizard would result in truncating the entire row with empty space and moving to the next row. However, this approach would make it impossible to populate the “propertyAddress” data where it is null.

To address the issue of empty spaces in the “propertyAddress” column before importing the CSV file, I approached this by filling the blank spaces with the value "null" prior to the import process. I achieved this by using the “Fill the Blank” function in Excel.

Next, by selecting the “propertyAddress” column from the table where the property address is "null", it becomes apparent that some rows contain "null" values. This indicates that the corresponding rows have empty or missing property addresses.

SELECT PropertyAddress FROMsql_data_cleaning_project.housing_data_project
where PropertyAddress is null;

Upon ordering the entire ‘house_date’ table by ParcelID and examining the rows where UniqueID is (6993, 43071, 5805, 5806), it becomes evident that there are duplicate records with the same address and ParcelID for UniqueID (6993, 43071), as well as for UniqueID (5805, 5806).

select * FROM sql_data_cleaning_project.housing_data_project
where uniqueid IN (6993,43071,5805,5806) order by parcelid;

Stay with Me!

To address a scenario, where the customer with UniqueID 6993 has an address but the customer with UniqueID 43071 does not have an address for the same ParcelID (015 14 0 060.00), I performed a self-join operation to populate the address for UniqueID 43071 with the address from UniqueID 6993.

select * FROM sql_data_cleaning_project.housing_data_project ha
JOIN sql_data_cleaning_project.housing_data_project hb ON ha.ParcelID=hb.ParcelID AND ha.UniqueID <> hb.UniqueID

Then, I selected the ParcelID and PropertyID from two tables where the PropertyAddress is ‘null’.

To populate the PropertyAddress in the table "ha" with the corresponding PropertyAddress from table "hb" using the IFNULL() function, I executed the following query

Lastly, I updated my Table (housing_data_project), where the ‘ha.PropertyAddress is null’

To reconfirm if the query above really worked, I queried

SELECT PropertyAddress FROM sql_data_cleaning_project.housing_data_project WHERE PropertyAddress is null;

As you can see the query above shows no value, meaning where all the propertyAddress is null, the query run has populated the data.

CLEANING THREE: Breaking out Addresses into Individual Columns (Address, City, State)

In this section, the goal is to separate the “PropertyAddress” column and “OwnerAddress” column into individual components, namely Address, City, and State.

Querying

SELECT PropertyAddress FROMhousing_data_project

You will see that in the propertyAddress imaged below, I have the address and after the comma, I have the city

Next, to retrieve the individual Address and City from the PropertyAddress column using the substring(), Locate(), and Length() functions, I executed the following query

Next, to include the new columns, Address and City, in my existing table, I used the ALTER TABLE statement in SQL to include the new columns and I updated the columns as well.

To view the results after successfully altering the table and updating the column values, I used the SELECT statement

Select * From sql_data_cleaning_project.housing_data_project

Separating the OwnerAddress Column

Select OwnerAddress From housing_data_project;

Next, to separate the OwnerAddress column into its individual components (Address, City, and State), I used “substring_Index” manipulation functions in SQL.

Next, to include the new columns, OwnerAddress1, OwnerCity and OwnerState, in my existing table, I used the ALTER TABLE statement in SQL to include the new columns and I updated the columns as well.

To view the results after successfully altering the table and updating the column values for the OwnerAddress components (Address, City, and State), I used the SELECT statement.

Select * From sql_data_cleaning_project.housing_data_project

CLEANING FOUR: Change Y and N to Yes and No in “Sold as Vacant” field

In this section, I will be changing the values of the “SoldAsVacant” column from ‘Y’ and ’N’ to ‘Yes’ and ‘No’ respectively.

To properly see the distinct values and count of the “SoldAsVacant” column, I used the SELECT statement with the DISTINCT keyword and the COUNT function.

Select distinct soldasvacant, count(soldasvacant) From housing_data_project Group by 1 Order by 2;

In the image provided above, it shows that the ‘SoldasVacant’ column contains various values and I’m to ensure that the column is correctly filled with only ‘Yes’ and ‘No’ values.

Next, to properly fill the ‘SoldasVacant’ column with ‘Yes’ and ‘No’ values I used a case statement in my query

After using the CASE statement to properly fill the ‘SoldasVacant’ column with ‘Yes’ and ‘No’ values, I updated the column in my table with the corrected values using an UPDATE statement.

To confirm if the column is now properly formatted after using the UPDATE statement, I used the following SELECT query.

Select distinct soldasvacant, count(soldasvacant)From housing_data_project Group by 1 Order by 2;

CLEANING FIVE: Delete Unused Columns

In this section, I will be removing several unused columns from the table. Specifically, I will be deleting the “propertyAddress” and “OwnerAddress” columns, which were separated in the previous section. Additionally, I will eliminate other columns that are not required for my purposes.

I ran the query below to drop the unused columns;

Alter Table sql_data_cleaning_project.housing_data_project
Drop column PropertyAddress, Drop OwnerAddress, Drop TaxDistrict

Based on the image provided, it appears that the query to delete the columns from the table was successful.

CLEANING Six: Format LandUse Column into Proper Case

In this section, I will be converting the string values in the “LandUse” column into the proper case. This means that the first letter of each word in the string will be capitalized, while the remaining letters will be in lowercase.

First, I ran a query to view what the column looks like

Select Landuse From sql_data_cleaning_project.housing_data_project;

Next, I utilized the CONCAT(), UCASE(), and LOWER() functions to modify the case format of the string.

Select
CONCAT(UCASE(SUBSTRING(`LandUse`, 1, 1)), LOWER(SUBSTRING(`LandUse`, 2)))
From sql_data_cleaning_project.housing_data_project;

Next, I proceed to Update the column with the result of the query above

Set sql_safe_updates = 0;
UPDATE housing_data_project
Set LandUse = CONCAT(UCASE(SUBSTRING(`LandUse`, 1, 1)), LOWER(SUBSTRING(`LandUse`, 2)));
set sql_safe_updates = 1;

To confirm if the column is now properly formatted, I used the query

Select Landuse From sql_data_cleaning_project.housing_data_project;

If you have any questions or would like to contribute further insights, please feel free to like, share, comment and reach out.

Thank you for reading.

--

--

Boyega
Boyega

Written by Boyega

Data Scientist, Technical Writer and a Content Creator. I simplify complex Data Science/ML, Analyst & Statistics topics through articles & videos.

No responses yet