How I Used Pandas and Map Reduce to Sift Through 7 Gigs of Steam Reviews

Jacob Perkins
4 min read · May 26, 2022


I’m a Data Science grad student, and this was my final project for my big data class. I was tasked with finding a large dataset, picking it apart, and returning interesting results. Easy, right?

Finding a dataset that was both interesting and large was a bit tricky for me. I spent a long time aimlessly looking through Kaggle until, luckily, I stumbled upon this Steam reviews dataset.

My Initial Stumbling

Anyone familiar with Steam reviews will know about the ones that use checkboxes as formatting. Look here for an example. Initially, I wanted to identify all of the reviews that contained this checkbox format, then build a rolling monthly window counting them over time. That task, unfortunately, proved too difficult for my four-day final deadline.

My first barrier was that pandas could identify the unique checkbox character (☑) and return the rows that contained it, but the MapReduce side, which was Hortonworks running on an AWS (Amazon Web Services) Ubuntu server, couldn’t use pandas. In hindsight, I probably could have locally created a new dataset containing only the desired reviews, but my finals brain was in panic mode haha.
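For context, the local pandas check I’m describing is only a few lines. Here’s a rough sketch; the file name and the `review` column are assumptions about how the dataset is laid out:

```python
import pandas as pd

# Read the reviews in chunks so the 7 GB file doesn't have to fit in memory at once.
chunks = pd.read_csv("steam_reviews.csv", chunksize=500_000)

# Keep only the rows whose review text contains the checkbox character.
checkbox_rows = pd.concat(
    chunk[chunk["review"].str.contains("☑", na=False)] for chunk in chunks
)

print(f"Found {len(checkbox_rows)} checkbox-style reviews")
```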

My second issue was that pandas, again, could handle reviews that contained commas, but my inflexible MapReduce job struggled to make sense of them. So, for the sake of time, I opted to remove the reviews column altogether using pandas on my local machine. This removal brought my total file size from 7 gigs down to about 3 gigs. Surprisingly, pandas was able to strip all of the review text away and write the condensed file in about 2–3 minutes.
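The stripping step itself was nothing fancy. Here’s a sketch of how it could be done chunk by chunk (the file and column names are again placeholders):

```python
import pandas as pd

# Stream the full dataset in chunks, drop the free-text review column,
# and append the slimmed-down rows to a new CSV.
reader = pd.read_csv("steam_reviews.csv", chunksize=500_000)

for i, chunk in enumerate(reader):
    chunk.drop(columns=["review"]).to_csv(
        "steam_reviews_no_text.csv",
        mode="w" if i == 0 else "a",
        header=(i == 0),
        index=False,
    )
```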

Since my initial curiosity didn’t pan out, I had to come up with new lines of inquiry. I ended up with four questions to address, all of which would be answered with MapReduce jobs running on AWS clusters.

The First Problem

Problem one was to split the reviews up by game and, for each game, find the review with the most activity (likes and laughs) and the user with the most games owned.

The mapper for this problem was pretty simple: collect the reviews and send only the columns of interest, in CSV format. The key for each value was the game under review.
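In Hadoop streaming terms, a mapper along those lines can be tiny. This is an illustrative sketch rather than the project’s exact code; the column order and the `app_name` header check are assumptions:

```python
#!/usr/bin/env python3
"""Mapper: emit the game name as the key, plus the fields the reducer needs."""
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    # Assumed column order: game, votes_helpful, votes_funny, games_owned, early_access.
    if len(fields) < 5 or fields[0] == "app_name":
        continue  # skip malformed rows and the header
    game, helpful, funny, owned, early_access = fields[:5]
    print(f"{game}\t{helpful},{funny},{owned},{early_access}")
```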

The reducer was a bit more interesting; here’s what it had to do.

For each review under the current key (game), it would check whether that review had higher activity or more games owned than the current maximums, check the early access flag, and add to the running counts.

Once the key changed to a new game, the reducer would write the current values to the output, reset the running variables for the new key, and begin the process again.
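A minimal streaming reducer with that logic might look like the sketch below; it matches the made-up field layout from the mapper sketch above, not the project’s exact columns:

```python
#!/usr/bin/env python3
"""Reducer: per game, track max review activity, max games owned,
the early access count, and the total review count."""
import sys

def emit(game, total, max_owned, max_activity, early):
    print(f"{game}: Total reviews: {total}, Most owned games: {max_owned}, "
          f"Highest review activity: {max_activity}, Number of Early Access: {early}")

current_game = None
total = early = max_owned = max_activity = 0

for line in sys.stdin:
    game, value = line.rstrip("\n").split("\t")
    helpful, funny, owned, early_access = value.split(",")

    if game != current_game:
        # Key changed: write out the finished game and reset the running values.
        if current_game is not None:
            emit(current_game, total, max_owned, max_activity, early)
        current_game = game
        total = early = max_owned = max_activity = 0

    total += 1
    early += early_access in ("1", "True", "true")
    max_owned = max(max_owned, int(float(owned)))
    max_activity = max(max_activity, int(float(helpful)) + int(float(funny)))

if current_game is not None:
    emit(current_game, total, max_owned, max_activity, early)
```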

After a successful job, many output files were produced by the AWS cluster with data looking like this:

*****

Trailmakers: Total reviews: 6026, Most owned games: 2093, Highest review activity: 2449, Number of Early Access: 1297

*****

To bundle them all up nice and neat, I created a small program that merged the output files into a single text file.
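Nothing fancy is needed for that; a merge script along these lines would do the trick (the directory and file names here are placeholders):

```python
import glob

# Stitch every part-* file from the MapReduce output directory
# into one combined text file.
with open("combined_results.txt", "w") as out:
    for path in sorted(glob.glob("mapreduce_output/part-*")):
        with open(path) as part:
            out.write(part.read())
```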

Looking ahead, if I wanted to dig deeper into the data, I could write these files in CSV format and use pandas to explore further.

Something Else That’s Interesting

Another interesting problem was to use MapReduce to find the earliest and latest review for each game, and from those compute a review period.

The mapper was, again, pretty simple. The only things it had to do were key the reviews by game and convert each review’s timestamp into a datetime.
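A sketch of that kind of mapper, with a made-up column layout, might look like this:

```python
#!/usr/bin/env python3
"""Mapper: key each review by game and convert its Unix timestamp to a readable datetime."""
import sys
from datetime import datetime

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    # Assumed column order: game, timestamp_created (purely illustrative).
    if len(fields) < 2 or fields[0] == "app_name":
        continue  # skip malformed rows and the header
    game, timestamp = fields[0], fields[1]
    try:
        created = datetime.fromtimestamp(int(float(timestamp)))
    except ValueError:
        continue  # skip rows with a malformed timestamp
    print(f"{game}\t{created:%Y-%m-%d %H:%M:%S}")
```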

The reducer took the string passed by the mapper, converted it back into a datetime, and checked it against the current earliest and latest review dates for that game.
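Here’s a minimal sketch of such a reducer, assuming the timestamp string format emitted by the mapper sketch above:

```python
#!/usr/bin/env python3
"""Reducer: track the earliest and latest review datetime for each game."""
import sys
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
current_game, earliest, latest = None, None, None

def emit(game, earliest, latest):
    print(f"{game}: {earliest:{FMT}} - {latest:{FMT}}")

for line in sys.stdin:
    game, stamp = line.rstrip("\n").split("\t")
    when = datetime.strptime(stamp, FMT)

    if game != current_game:
        # Key changed: write out the finished game and start fresh.
        if current_game is not None:
            emit(current_game, earliest, latest)
        current_game, earliest, latest = game, when, when
    else:
        earliest = min(earliest, when)
        latest = max(latest, when)

if current_game is not None:
    emit(current_game, earliest, latest)
```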

Like the last problem, the AWS cluster created many output files that had data looking like this:

****

Ancestors Legacy: 2018-05-22 16:57:54 - 2021-01-23 04:29:20

****

Once again, I used that bundling program to wrap all the files into one text document. Another missed opportunity would have been to make this a CSV file and investigate which game had the longest review period. Perhaps another time!
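For what it’s worth, that follow-up would only take a few lines of pandas if the output were CSV; the file and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical follow-up: if the reducer wrote CSV rows of game,earliest,latest,
# ranking games by review period is a quick pandas exercise.
periods = pd.read_csv("review_periods.csv", parse_dates=["earliest", "latest"])
periods["review_period"] = periods["latest"] - periods["earliest"]
print(periods.sort_values("review_period", ascending=False).head(10))
```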

Final Thoughts

I had never worked with AWS before this course and wasn’t even aware that clusters could be leveraged in such a powerful way. After playing around with these tools, I’ve seen the massive potential lying there.

Working with MapReduce was a lot of fun but equally frustrating. My biggest problem, netting me many cluster failures, was library inconsistencies between my local machine and the AWS cluster. There were lengthy spells of trial and error to identify what was and wasn’t acceptable on the cluster.

If I were to tackle this problem again, I would like to see if I could use Apache Spark to make sense of the data with a lot less effort. Thanks for listening to my struggles and triumphs!
