Movie Recommendation Data Cleaning With Great Expectations
I am using great_expectations to clean a dataset for movie recommendation. The goal is to remove "under-qualified" records from the dataset, because the current data has several problems.
Why the data is “dirty”
Let's take a look at the dataset. There are 10 columns:
- adult
- genres
- id
- original_language
- original_title
- overview
- popularity
- production_companies
- vote_average
- vote_count

In this article, we will filter the data based on three of them:
- id — ensure the ID is unique for further training
- original_language — remove movies without a language label
- vote_count — remove movies with a vote count of less than 5
Quick Start with Great Expectations
The installation of great_expectations is straightforward, like any Python package:
pip install great_expectations
And in your Python file:
import great_expectations as ge
You need to prepare a CSV file as your data:
movie_data = ge.read_csv("movies_metadata2.csv")
Step 1 — Get Unique Movies by Unique ID
We expect each movie ID to be unique, otherwise the record is ambiguous, so we run the following code to check:
movie_data.expect_compound_columns_to_be_unique(['id'])
Unfortunately, we got 59 rows with repeated movie IDs. Note that the expectation only flags them in the result below; we still have to drop the duplicates ourselves before training.
{
"exception_info": null,
"meta": {},
"success": false,
"result": {
"element_count": 45466,
"missing_count": 0,
"missing_percent": 0.0,
"unexpected_count": 59,
"unexpected_percent": 0.12976729864074252,
"unexpected_percent_nonmissing": 0.12976729864074252,
"partial_unexpected_list": [
{
"id": "105045"
},
{
"id": "132641"
},
{
"id": "22649"
},
{
"id": "105045"
},
{
"id": "84198"
},
{
"id": "10991"
},
{
"id": "110428"
},
{
"id": "15028"
},
{
"id": "12600"
},
{
"id": "109962"
},
{
"id": "4912"
},
{
"id": "5511"
},
{
"id": "23305"
},
{
"id": "5511"
},
{
"id": "23305"
},
{
"id": "69234"
},
{
"id": "14788"
},
{
"id": "77221"
},
{
"id": "13209"
},
{
"id": "14788"
}
]
}
}
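The expectation reports the duplicates but does not delete them. The removal itself can be done with plain pandas. Here is a minimal sketch on a toy frame (the titles are made up; the IDs are taken from the partial list above) — real cleaning would run on the full `movies_metadata2.csv`:

```python
import pandas as pd

# Toy stand-in for movies_metadata2.csv; "105045" appears twice,
# matching the duplicates flagged by the expectation above
movies = pd.DataFrame({
    "id": ["105045", "132641", "105045", "22649"],
    "original_title": ["Movie A", "Movie B", "Movie A (dup)", "Movie C"],
})

# Keep the first occurrence of each id and drop the rest
deduped = movies.drop_duplicates(subset="id", keep="first")
print(len(deduped))  # 3 unique movies remain
```

Running the same `drop_duplicates` on the real dataset removes the 59 flagged rows.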
Step 2 — Delete movies with vote_count under 5
A rating score is important for a movie. Imagine a movie rated 10.0 by only 3 people, and another rated 10.0 by 100,000 people. Can we say these two movies are equally good?
The answer is "not sure", because we have less confidence in movies that are rated by only a few people. It's better to train the model on movies with more trustworthy ratings.
So, we decide to delete movies that are rated by fewer than 5 people:
movie_data.expect_column_values_to_be_between('vote_count', min_value=5)
It turns out around 32% of the movies are under-qualified in this case:
{
"exception_info": null,
"meta": {},
"success": false,
"result": {
"element_count": 45466,
"missing_count": 6,
"missing_percent": 0.013196674438041614,
"unexpected_count": 14562,
"unexpected_percent": 32.028328861127,
"unexpected_percent_nonmissing": 32.03255609326881,
"partial_unexpected_list": [
4.0,
2.0,
3.0,
1.0,
2.0,
3.0,
2.0,
0.0,
3.0,
4.0,
2.0,
0.0,
2.0,
4.0,
3.0,
0.0,
2.0,
4.0,
1.0,
0.0
]
}
}
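Again, the expectation only counts the offending rows. A minimal pandas sketch of the actual filter, on a toy frame with made-up values (note that a `NaN` vote count fails the `>= 5` comparison, so the 6 missing values in the result above would be dropped as well):

```python
import pandas as pd

# Toy stand-in for the dataset; values are hypothetical
movies = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "vote_count": [3.0, 100.0, None, 5.0],
})

# Keep movies rated by at least 5 people; NaN >= 5 is False,
# so rows with a missing vote_count are removed too
qualified = movies[movies["vote_count"] >= 5]
print(len(qualified))  # 2 movies remain
```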
Step 3 — Delete movies without a language label
It's kind of weird if a movie's language is not defined in the dataset. How can we know whether the user can understand the movie? Imagine you are a big fan of action movies but you only speak English. How would you feel if the system recommended a French action movie?
Let's remove movies without a language label to ensure the recommendation won't go too wrong:
movie_data.expect_column_values_to_not_be_null('original_language')
And we found around 0.02% of the movies (11 rows) are not qualified:
{
"exception_info": null,
"meta": {},
"success": false,
"result": {
"element_count": 45466,
"unexpected_count": 11,
"unexpected_percent": 0.024193903136409626,
"partial_unexpected_list": []
}
}
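As before, the deletion itself is a one-liner in pandas. A minimal sketch on a toy frame with made-up values:

```python
import pandas as pd

# Toy stand-in for the dataset; values are hypothetical
movies = pd.DataFrame({
    "id": ["1", "2", "3"],
    "original_language": ["en", None, "fr"],
})

# Drop rows whose language label is missing
labelled = movies.dropna(subset=["original_language"])
print(len(labelled))  # 2 movies remain
```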
Step 4 — Save the expectation config for future use
What if we need the same pipeline for future data filtering?
We can save the config to a JSON by running the following code!
>>> movie_data.get_expectation_suite()
{
"expectations": [
{
"expectation_type": "expect_column_values_to_not_be_null",
"kwargs": {
"column": "original_title"
},
"meta": {}
}
],
"meta": {
"great_expectations_version": "0.12.6"
},
"expectation_suite_name": "default",
"data_asset_type": "Dataset"
}
Let's save the JSON to a file:
import json

with open("my_expectation_file.json", "w") as my_file:
    my_file.write(
        json.dumps(movie_data.get_expectation_suite().to_json_dict())
    )

Step 5 — Use this JSON file for a new batch of data
For future use, we can run the following code; it should give the same result as the code we mentioned above:
with open("my_expectation_file.json") as my_file:
    my_expectation_suite = json.load(my_file)

movie_data = ge.read_csv(
    "movies_metadata2.csv",
    expectation_suite=my_expectation_suite)
movie_data.validate()
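The `validate()` call returns a result with one entry per expectation in the suite, each carrying a `success` flag, in the same shape as the JSON outputs shown earlier. A sketch of picking out the failed checks, using a hypothetical, trimmed-down result dict:

```python
# Hypothetical validation result, trimmed to the fields we need;
# a real validate() call returns the same structure with more detail
validation_result = {
    "success": False,
    "results": [
        {"success": True,
         "expectation_config": {
             "expectation_type": "expect_column_values_to_be_unique"}},
        {"success": False,
         "expectation_config": {
             "expectation_type": "expect_column_values_to_not_be_null"}},
    ],
}

# Collect the expectation types that failed on this batch
failed = [r["expectation_config"]["expectation_type"]
          for r in validation_result["results"]
          if not r["success"]]
print(failed)  # ['expect_column_values_to_not_be_null']
```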
Conclusion

Great Expectations is a handy tool for quick and reusable data cleaning. I like the name of the package: we always have some expectations about our dataset, sometimes based on mathematical considerations, sometimes on common sense. Great Expectations can cover most of your data constraints.
But Great Expectations always depends on your own expectations about the data; you need to specify every constraint yourself. That is its main limitation.
For more detail, you can visit the official GitHub repository.