Introducing Decide

Adaobi Adibe
3 min read · Jan 22, 2021

Put simply, Decide automates data cleaning. If you would like to join our beta, sign up here!

If you have ever cleaned data, you know how painful it can be. A few months ago, I was cleaning data for a personal project of mine. Long story short, it took me 3 weeks. After processing the nightmare I went through, I decided (pun intended) that this is something I never want to experience again.

I searched for solutions, but none of them really eased the burden of cleaning data. I still found myself cleaning the data manually; the only difference was that instead of coding, I was clicking a whole bunch of buttons.

So, I created Decide.

How does it work?

We clean your data in 3 easy steps. Below, I illustrate how Decide works using a very dirty but realistic dataset that I created based on issues I have faced when cleaning datasets in the past.

Issues include:

  • Null values, null values and more null values.
  • Inaccurate data
  • Inconsistent formatting
  • Multiple variations of the same thing
  • Duplicates
  • Anomalous data
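To make a couple of these issues concrete, here is a minimal Python sketch that counts null values and exact duplicate rows in a tiny made-up table. The rows, the column names and the `profile` helper are all hypothetical illustrations, not how Decide works internally.

```python
# Hypothetical sample rows; None stands in for a null value.
rows = [
    {"Email": "a@example.com", "Age": 24, "Ethnicity": "Asian"},
    {"Email": "b@example.com", "Age": None, "Ethnicity": None},
    {"Email": "a@example.com", "Age": 24, "Ethnicity": "Asian"},  # exact duplicate of row 0
]


def profile(rows):
    """Count null cells per column and exact duplicate rows."""
    null_counts = {}
    seen = set()
    duplicates = 0
    for row in rows:
        for column, value in row.items():
            if value is None:
                null_counts[column] = null_counts.get(column, 0) + 1
        # Sort by column name only, so values of mixed types are never compared.
        key = tuple(sorted(row.items(), key=lambda item: item[0]))
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"nulls": null_counts, "duplicates": duplicates}


report = profile(rows)
```

Here `report` holds one null each for "Age" and "Ethnicity", plus one duplicate row.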

Step 1 Upload your data

Sample of dirty data. In the Ethnicity column there are 15 unique values! There are unrealistic ages such as 244 and 2, a number of null values, and the iPhone’s operating system is mislabelled as Android in rows 7, 10, 11, 12 and 13.

Step 2 Label your data

Let Decide know what type of data each column should contain. For example, “Ethnicity” should be a “Text” column whose values do not need to be unique, whereas “Email” should be a “Text” column that contains only unique values.

Label the data type for each column
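As a rough illustration of what such labels make possible, the sketch below checks each value against a type and a uniqueness flag. The `labels` dictionary, the sample rows and the `check_labels` helper are assumptions made for this example only; Decide's own labelling format is not shown here.

```python
# Hypothetical labels: each column gets a type and a uniqueness flag.
labels = {
    "Ethnicity": {"type": str, "unique": False},
    "Email": {"type": str, "unique": True},
}

rows = [
    {"Ethnicity": "Asian", "Email": "a@example.com"},
    {"Ethnicity": "Asian", "Email": "b@example.com"},
    {"Ethnicity": "White", "Email": "a@example.com"},  # repeated email
]


def check_labels(rows, labels):
    """Return (row index, column, problem) for every label violation."""
    problems = []
    seen = {column: set() for column in labels}
    for index, row in enumerate(rows):
        for column, label in labels.items():
            value = row.get(column)
            if value is None:
                continue  # nulls are a separate issue
            if not isinstance(value, label["type"]):
                problems.append((index, column, "wrong type"))
                continue
            if label["unique"] and value in seen[column]:
                problems.append((index, column, "not unique"))
            seen[column].add(value)
    return problems


problems = check_labels(rows, labels)
```

The repeated "Asian" is fine because "Ethnicity" is not labelled unique, while the repeated email in row 2 is reported.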

Step 3 Label your relationships

Here is where the magic happens.

Label the relationships between the columns using either the options from our dropdown menu or free-form text
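One way to picture a relationship is as a rule where one column determines another, such as a device determining its operating system. The sketch below is only an assumption about what such a rule could look like in code: the "Device" column name, the `DEVICE_TO_OS` table and the `apply_relationship` helper are hypothetical.

```python
# Hypothetical relationship: the value in "Device" determines "Operating System".
DEVICE_TO_OS = {"iPhone": "iOS"}

rows = [
    {"Device": "iPhone", "Operating System": "Android"},  # mislabelled
    {"Device": "iPhone", "Operating System": "iOS"},
    {"Device": "Pixel", "Operating System": "Android"},  # no rule, left alone
]


def apply_relationship(rows, source, target, mapping):
    """Return corrected copies of the rows plus the indices that were changed."""
    cleaned, changed = [], []
    for index, row in enumerate(rows):
        fixed = dict(row)  # copy, so the original rows stay untouched
        expected = mapping.get(row.get(source))
        if expected is not None and row.get(target) != expected:
            fixed[target] = expected
            changed.append(index)
        cleaned.append(fixed)
    return cleaned, changed


cleaned, changed = apply_relationship(rows, "Device", "Operating System", DEVICE_TO_OS)
```

Only row 0 is changed, and the original rows are kept so the correction can be reviewed later.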

Step 4 Magic!

Ta-da! Your data is cleaned.

The data in the “Ethnicity *” column has been standardised and now has only 6 unique values. Previously incorrect values, such as the “Age” in row 0, have now been corrected to 24. Null values have been filled and the “Operating System” column has the correct values. Rows with errors that cannot be fixed have been dropped: in this case, the rows with duplicates and the row with a birthdate of 6/12/1898.
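For a feel of what standardising a column involves, here is a small sketch of one common approach: normalise whitespace and casing, then map each variant to a canonical label. The `ALIASES` table and the `standardise` helper are hypothetical, and this is not a description of how Decide does it.

```python
# Hypothetical alias table mapping normalised variants to one canonical label.
ALIASES = {
    "white": "White",
    "asian": "Asian",
    "mixed": "Mixed",
    "mixed race": "Mixed",
}


def standardise(value, aliases):
    """Map a raw value to its canonical label; unknown values are returned unchanged."""
    if value is None:
        return None
    key = " ".join(value.split()).lower()  # collapse whitespace, ignore case
    return aliases.get(key, value)


raw = ["White", " white", "WHITE", "asian ", "Mixed  race", "Other"]
cleaned = [standardise(value, ALIASES) for value in raw]
```

Six distinct raw spellings collapse to four labels, and anything the table does not recognise is left as it was.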

Step 5 Understand your data

Don’t fret. We recognise that the purpose of cleaning data isn’t solely to clean data, so we always show what happens behind the scenes.

In this instance, we show you the index of both the original and newly cleaned column (before any errors or duplicates were dropped), along with the original and new values.
The title of each column is the “standardised” version of the dirty column’s original values. Each row contains the former value, with its index in the original dirty dataset in brackets.
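A before-and-after view like this boils down to a change log. As a minimal sketch, assuming the original and cleaned columns are lists of the same length, the hypothetical `diff_column` helper below records the index, old value and new value of every changed cell.

```python
def diff_column(original, cleaned):
    """List (index, old value, new value) for every cell that changed."""
    return [
        (index, old, new)
        for index, (old, new) in enumerate(zip(original, cleaned))
        if old != new
    ]


# Hypothetical "Age" column before and after cleaning.
original_ages = [244, 31, None]
cleaned_ages = [24, 31, 29]
changes = diff_column(original_ages, cleaned_ages)
```

Untouched cells, such as the 31 here, never appear in the log, which keeps it short enough to read.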

Step 6 Just in case, you decide

If we are unable to clean all your data with 100% confidence, we flag the items in question for you to inspect.

In the case of the duplicate entry, there is no other relevant data to help us determine which entry is correct. It is near impossible for anyone to figure out which row is the right one, so we let you decide (haha, get it?).

We highlight rows with errors that we are unable to clean ourselves
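To show why some duplicates need a human, the sketch below groups rows by a key column and flags any group whose rows disagree, rather than guessing which one is right. Treating "Email" as the key, the sample rows and the `flag_conflicts` helper are assumptions made for illustration.

```python
from collections import defaultdict

rows = [
    {"Email": "a@example.com", "Age": 24},
    {"Email": "b@example.com", "Age": 31},
    {"Email": "a@example.com", "Age": 42},  # same email, conflicting age
]


def flag_conflicts(rows, key):
    """Return indices of rows that share a key value but disagree elsewhere."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[row[key]].append(index)
    flagged = []
    for indices in groups.values():
        first = rows[indices[0]]
        if any(rows[i] != first for i in indices[1:]):
            flagged.extend(indices)
    return sorted(flagged)


flagged = flag_conflicts(rows, "Email")
```

Rows 0 and 2 are both flagged, since nothing in the data says whether 24 or 42 is the correct age.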

And that’s it. A really easy, low-effort way to clean data.

Sign up for our beta test today!
