Mining-Yelp-Dataset-with-Spark

Big Data Mining using Apache Spark, data source: https://www.yelp.com/dataset

Contributors

Yuying Yang

This Repository

File                            Description
data_exploration.ipynb          exploratory data analysis on the Yelp dataset
frequent_itemset_mining.ipynb   mining frequent itemsets using the SON and A-Priori algorithms
similar_businesses.py           detecting similar businesses using the MinHash and LSH algorithms
hybrid_recommender_system.py    combination of different types of recommendation techniques

Data Exploration

We performed an exploratory data analysis on the dataset; here are some interesting findings:

Frequent Itemset Mining

High level design

[Figure: SON algorithm]
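To make the design more concrete, here is a minimal pure-Python sketch of how SON can wrap A-Priori. It is not the notebook's code: the function names (`apriori`, `son`), the `num_chunks` parameter, and the toy baskets are all hypothetical, and the chunks merely stand in for the Spark partitions that the real job would process in parallel. It also assumes each basket is a collection of items such as the businesses reviewed by one user, which the source does not state.

```python
from collections import Counter
from itertools import combinations


def apriori(baskets, support):
    """Return every itemset (as a sorted tuple) found in at least `support` baskets."""
    baskets = [frozenset(b) for b in baskets]
    item_counts = Counter(item for b in baskets for item in b)
    current = {(item,) for item, n in item_counts.items() if n >= support}
    frequent = set(current)
    k = 2
    while current:
        # Monotonicity: a k-itemset can only be frequent if all of its
        # (k-1)-subsets are frequent, so build candidates from survivors only.
        items = sorted({i for itemset in current for i in itemset})
        candidates = [
            c for c in combinations(items, k)
            if all(sub in current for sub in combinations(c, k - 1))
        ]
        counts = Counter()
        for b in baskets:
            for c in candidates:
                if b.issuperset(c):
                    counts[c] += 1
        current = {c for c, n in counts.items() if n >= support}
        frequent |= current
        k += 1
    return frequent


def son(baskets, support, num_chunks=2):
    """Two-pass SON: local A-Priori per chunk, then an exact global count."""
    total = len(baskets)
    chunks = [baskets[i::num_chunks] for i in range(num_chunks)]

    # Pass 1: run A-Priori on each chunk with a proportionally scaled threshold.
    candidates = set()
    for chunk in chunks:
        if chunk:
            local_support = (support * len(chunk) + total - 1) // total  # integer ceil
            candidates |= apriori(chunk, local_support)

    # Pass 2: count every candidate over all baskets and keep the truly frequent ones.
    counts = Counter()
    for b in baskets:
        items = set(b)
        for c in candidates:
            if items.issuperset(c):
                counts[c] += 1
    frequent = [c for c, n in counts.items() if n >= support]
    return sorted(frequent, key=lambda c: (len(c), c))


# Toy baskets (made up for illustration only).
baskets = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"], ["d"]]
print(son(baskets, support=3))
```

The first pass can produce false positives (itemsets frequent in one chunk only), which is why the second pass recounts every candidate against the full data. It cannot produce false negatives, because an itemset that is globally frequent must reach the scaled threshold in at least one chunk.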

Conclusion

Not surprisingly, in almost all frequent itemsets the restaurants are geographically close to each other or serve similar food (and may have similar business names), e.g. Ramen Sora, Sushi House Goyemon, Monta Ramen.

Similar Businesses

High level design

[Figure: similar items]

First we use MinHash to generate a signature for each business, then apply LSH to find all candidate pairs, and finally do a full pass over the candidates to eliminate all false positives.
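The three steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in similar_businesses.py: it assumes each business is represented by a non-empty set of integer user ids, that similarity means Jaccard similarity, and that hash functions have the form (a*x + b) mod p. The function names, the band and row counts, and the threshold are all hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def make_hash_funcs(n, prime=2_147_483_647, seed=42):
    """Build n hash functions of the form (a*x + b) % prime (an assumed family)."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(n)]
    return [lambda x, a=a, b=b: (a * x + b) % prime for a, b in params]


def minhash_signature(user_ids, hash_funcs):
    """Signature = minimum hash value of the set under each hash function."""
    return [min(h(u) for u in user_ids) for h in hash_funcs]


def lsh_candidates(signatures, bands, rows):
    """Pairs whose signatures agree on every row of at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for biz, sig in signatures.items():
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(biz)
        for members in buckets.values():
            candidates.update(combinations(sorted(members), 2))
    return candidates


def jaccard(a, b):
    return len(a & b) / len(a | b)


def similar_businesses(business_users, threshold=0.5, bands=20, rows=2):
    """MinHash -> LSH -> exact check that removes false positives."""
    hash_funcs = make_hash_funcs(bands * rows)
    signatures = {
        biz: minhash_signature(users, hash_funcs)
        for biz, users in business_users.items()
    }
    result = {}
    for a, b in lsh_candidates(signatures, bands, rows):
        sim = jaccard(business_users[a], business_users[b])  # exact similarity
        if sim >= threshold:
            result[(a, b)] = sim
    return result


# Made-up data: b1 and b2 share all reviewers, b3 shares none.
business_users = {"b1": {1, 2, 3, 4}, "b2": {1, 2, 3, 4}, "b3": {100, 200}}
print(similar_businesses(business_users))
```

Because every candidate pair is verified against the exact similarity, any pair that is returned truly meets the threshold. Recall, by contrast, depends on the choice of hash functions and on the number of bands and rows.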

Results

We spent quite some time designing hash functions and, surprisingly, achieved precision = 1.0 and recall = 1.0.

Hybrid Recommender System

High level design

[Figure: hybrid recommender]

Each candidate is rated by 3 recommenders: content-based filtering, model-based collaborative filtering (CF), and user-based CF. A linear combination of the 3 scores is then computed, which becomes the item's predicted rating.
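As a rough illustration of the combination step, the sketch below blends three scores with fixed weights. The weights shown are placeholders rather than the values used in hybrid_recommender_system.py, the function name is hypothetical, and clipping the result to the 1 to 5 rating range is an added assumption.

```python
def hybrid_score(content, model_cf, user_cf, weights=(0.2, 0.5, 0.3), lo=1.0, hi=5.0):
    """Linear combination of three recommender scores (weights are placeholders)."""
    w_content, w_model, w_user = weights
    raw = w_content * content + w_model * model_cf + w_user * user_cf
    # Assumption: keep the prediction inside the valid rating range.
    return min(hi, max(lo, raw))


# Made-up scores from the three recommenders for one (user, business) pair.
print(hybrid_score(content=3.5, model_cf=4.2, user_cf=4.0))
```

In practice the weights would normally be tuned on held-out data, for example by minimizing RMSE on a validation split.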

Results

The ratings range from 1 to 5, and the error distribution on testing data looks like:

[Figure: error distribution on testing data]

About 98% of the prediction errors are less than 1.0, and the overall RMSE is 0.9782, which is much better than that of any individual recommender system.
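For reference, the two metrics quoted above can be computed as in the following sketch. The function name and the sample ratings are made up for illustration; they are not the project's evaluation code or data.

```python
from math import sqrt


def error_summary(predicted, actual, tolerance=1.0):
    """Return (RMSE, fraction of absolute errors strictly below `tolerance`)."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    rmse = sqrt(sum(e * e for e in errors) / len(errors))
    within = sum(e < tolerance for e in errors) / len(errors)
    return rmse, within


# Made-up predictions and ground-truth ratings.
predicted = [3.0, 4.0, 5.0, 2.0]
actual = [3.0, 5.0, 3.0, 2.0]
rmse, within = error_summary(predicted, actual)
print(f"RMSE={rmse:.4f}, share of errors < 1.0: {within:.0%}")
```

RMSE squares each error before averaging, so a small share of large misses can dominate it even when most predictions are close.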

Dependencies