Movie Recommendation Data Cleaning With Great Expectations

Zhangbeilei
Oct 28, 2020


I am using great_expectations to clean a dataset for movie recommendation. The goal is to delete “under-qualified” records from the dataset. There are a few problems with the current data.

Why the data is “dirty”

Let’s take a look at the dataset. There are 10 columns in the data:

  1. adult
  2. genres
  3. id
  4. original_language
  5. original_title
  6. overview
  7. popularity
  8. production_companies
  9. vote_average
  10. vote_count

In this article, we will filter the data based on three of these columns:

3. id — ensure the ID is unique for further training

4. original_language — remove movies without a language label

10. vote_count — remove movies with a vote count of less than 5

Quick Start with Great Expectations

The installation of great_expectations is straightforward, as with any Python package:

pip install great_expectations

And in your Python file:

import great_expectations as ge

You need to prepare a CSV file as your data:

movie_data = ge.read_csv("movies_metadata2.csv")
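
ge.read_csv returns a Great Expectations dataset that subclasses a pandas DataFrame, so the usual pandas methods still work for a quick sanity check. A minimal sketch (the column names here are the ones from the list above):

# The GE dataset behaves like a pandas DataFrame,
# so we can peek at the raw data before writing expectations.
print(movie_data.shape)  # (rows, columns)
print(movie_data[['id', 'original_language', 'vote_count']].head())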

Step 1 — Get Unique Movies by Unique ID

Movie IDs should be unique; a duplicated ID doesn’t make sense. So we run the following code to check:

movie_data.expect_compound_columns_to_be_unique(['id'])

Unfortunately, we got 59 repeated movie IDs. The expectation only flags these rows, so we still have to remove them ourselves (see the sketch after the output below).

{
  "exception_info": null,
  "meta": {},
  "success": false,
  "result": {
    "element_count": 45466,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 59,
    "unexpected_percent": 0.12976729864074252,
    "unexpected_percent_nonmissing": 0.12976729864074252,
    "partial_unexpected_list": [
      {"id": "105045"},
      {"id": "132641"},
      {"id": "22649"},
      {"id": "105045"},
      {"id": "84198"},
      {"id": "10991"},
      {"id": "110428"},
      {"id": "15028"},
      {"id": "12600"},
      {"id": "109962"},
      {"id": "4912"},
      {"id": "5511"},
      {"id": "23305"},
      {"id": "5511"},
      {"id": "23305"},
      {"id": "69234"},
      {"id": "14788"},
      {"id": "77221"},
      {"id": "13209"},
      {"id": "14788"}
    ]
  }
}
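
Since the GE dataset subclasses a pandas DataFrame, one way to actually drop the repeated IDs is a plain pandas call. A minimal sketch, assuming we keep the first occurrence of each ID:

# Keep the first row for each movie ID and drop the rest.
# drop_duplicates is plain pandas; it works here because the
# GE dataset is a pandas.DataFrame subclass.
movie_data = movie_data.drop_duplicates(subset=['id'], keep='first')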

Step 2 — Delete movies with vote_count under 5

A rating score is important for a movie. Imagine a movie that only 3 people rated 10.0, and another with the same average rated by 100,000 people. Can we say these two movies are equally good?

The answer is “not sure”, because we have less confidence in movies that are rated by few people. It’s better to train the model only on movies with confident ratings.

So, we decide to delete movies that are rated by fewer than 5 people:

movie_data.expect_column_values_to_be_between('vote_count', min_value=5)

It turns out around 32% of the movies are under-qualified in this case:

{
  "exception_info": null,
  "meta": {},
  "success": false,
  "result": {
    "element_count": 45466,
    "missing_count": 6,
    "missing_percent": 0.013196674438041614,
    "unexpected_count": 14562,
    "unexpected_percent": 32.028328861127,
    "unexpected_percent_nonmissing": 32.03255609326881,
    "partial_unexpected_list": [
      4.0, 2.0, 3.0, 1.0, 2.0, 3.0, 2.0, 0.0, 3.0, 4.0,
      2.0, 0.0, 2.0, 4.0, 3.0, 0.0, 2.0, 4.0, 1.0, 0.0
    ]
  }
}
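
Again, the expectation flags the rows but does not remove them. One way to drop them is to re-run the expectation with result_format="COMPLETE", which makes Great Expectations return the index of every unexpected row, and then filter with pandas. A sketch, assuming the 0.12-era API where the returned result object exposes a result dict:

# Ask for the complete result, which includes unexpected_index_list.
check = movie_data.expect_column_values_to_be_between(
    'vote_count', min_value=5, result_format="COMPLETE")
bad_rows = check.result["unexpected_index_list"]

# Drop the under-voted movies by index. Note that rows with a null
# vote_count are counted as "missing", not "unexpected", so drop
# them separately if needed.
movie_data = movie_data.drop(index=bad_rows)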

Step 3 — Delete movies without a language label

It’s odd for a movie in the dataset to have no language defined. How can we know whether the user can understand the movie? Imagine you are a big fan of action movies but only speak English. How would you feel if the system recommended a French action movie?

Let’s remove movies without a language label to ensure the recommendation won’t go too wrong:

movie_data.expect_column_values_to_not_be_null('original_language')

And we found that only 11 movies (about 0.02%) are not qualified:

{
  "exception_info": null,
  "meta": {},
  "success": false,
  "result": {
    "element_count": 45466,
    "unexpected_count": 11,
    "unexpected_percent": 0.024193903136409626,
    "partial_unexpected_list": []
  }
}
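
Removing these rows is a one-liner, since the null check maps directly onto pandas. A minimal sketch:

# Drop the rows whose original_language is null.
movie_data = movie_data.dropna(subset=['original_language'])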

Step 4 — Save the expectation config for future use

What if we need the same pipeline for future data filtering?

We can save the config to a JSON file by running the following code:

>>> movie_data.get_expectation_suite()
{
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {
        "column": "original_title"
      },
      "meta": {}
    }
  ],
  "meta": {
    "great_expectations_version": "0.12.6"
  },
  "expectation_suite_name": "default",
  "data_asset_type": "Dataset"
}

Let’s save the JSON to a file:

import json

# Serialize the current expectation suite to JSON on disk.
with open("my_expectation_file.json", "w") as my_file:
    my_file.write(
        json.dumps(movie_data.get_expectation_suite().to_json_dict())
    )

Step 5 — Use this JSON file for a new batch of data

For future use, we can run the following code; it should give the same result as the code above:

# Load the saved suite (file() is Python 2 only; use open() instead).
with open("my_expectation_file.json") as my_file:
    my_expectation_suite = json.load(my_file)

movie_data = ge.read_csv(
    "movies_metadata2.csv",
    expectation_suite=my_expectation_suite)
movie_data.validate()
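
validate() does not modify the data; it returns a validation report with an overall success flag and one entry per expectation, so capturing the return value lets you see which checks failed. A sketch of how you might inspect it (the attribute names are from the 0.12-era API, so treat them as an assumption if your version differs):

report = movie_data.validate()
print(report.success)  # True only if every expectation passed

# List the expectations that failed on the new batch.
for r in report.results:
    if not r.success:
        print(r.expectation_config.expectation_type)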

Conclusion

Great Expectations is a handy tool for quick and reusable data cleaning. I like the name of the package: we always have some expectations about our dataset, sometimes based on mathematical considerations, sometimes on common sense. Great Expectations can cover most of your data constraints.

But Great Expectations always depends on your expectations about the data; you need to specify all the constraints yourself. That is its main limitation!

For more detail, you can visit the official GitHub repository: https://github.com/great-expectations/great_expectations
