Mining-Yelp-Dataset-with-Spark

Big Data Mining using Apache Spark, data source: https://www.yelp.com/dataset

Contributors


Yuying	Yang

This Repository

File	Description
`data_exploration.ipynb`	Exploratory data analysis of the Yelp dataset
`frequent_itemset_mining.ipynb`	Mining frequent itemsets using the SON and A-Priori algorithms
`similar_businesses.py`	Detecting similar businesses using the MinHash and LSH algorithms
`hybrid_recommender_system.py`	Combination of different types of recommendation techniques

Data Exploration
Frequent Itemset Mining
Similar Businesses
Hybrid Recommender System

Data Exploration

We performed an exploratory data analysis on the dataset; here are some interesting findings:

Frequent Itemset Mining

High level design

son-algorithm
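As a rough illustration of the SON scheme with A-Priori as the in-chunk miner, here is a minimal sketch in plain Python. It is not the notebook's code: the function names `apriori` and `son`, the list-based chunking (standing in for Spark partitions), and the scaled-down local threshold are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
import math


def apriori(baskets, support):
    """Return every itemset (as a sorted tuple) found in at least `support` baskets."""
    baskets = [frozenset(b) for b in baskets]
    # Pass 1: count single items and keep the frequent ones.
    counts = Counter(item for b in baskets for item in b)
    frequent = {(item,) for item, n in counts.items() if n >= support}
    result = set(frequent)
    k = 2
    while frequent:
        # Candidates of size k: every (k-1)-subset must already be frequent.
        items = sorted({i for itemset in frequent for i in itemset})
        candidates = [
            c for c in combinations(items, k)
            if all(sub in frequent for sub in combinations(c, k - 1))
        ]
        counts = Counter()
        for b in baskets:
            for c in candidates:
                if b.issuperset(c):
                    counts[c] += 1
        frequent = {c for c, n in counts.items() if n >= support}
        result |= frequent
        k += 1
    return result


def son(baskets, support, num_chunks=2):
    """Two-phase SON: local candidates per chunk, then one global counting pass."""
    if not baskets:
        return set()
    chunk_size = math.ceil(len(baskets) / num_chunks)
    chunks = [baskets[i:i + chunk_size] for i in range(0, len(baskets), chunk_size)]

    # Phase 1: mine each chunk with a threshold scaled to the chunk's share
    # of the data (assumed scaling rule; rounding down avoids false negatives).
    candidates = set()
    for chunk in chunks:
        local_support = math.floor(support * len(chunk) / len(baskets))
        candidates |= apriori(chunk, max(1, local_support))

    # Phase 2: count every candidate over the full data to drop false positives.
    counts = Counter()
    for b in baskets:
        b = set(b)
        for c in candidates:
            if b.issuperset(c):
                counts[c] += 1
    return {c for c, n in counts.items() if n >= support}
```

Calling `son(baskets, support=3)` on a list of baskets (each a list of item names) returns the same itemsets as `apriori(baskets, 3)` on the whole list. In a Spark version, phase 1 would presumably run once per partition and phase 2 as a second distributed count.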

Conclusion

Not surprisingly, in almost all frequent itemsets the restaurants are geographically close to each other or serve similar food (and may have similar business names), e.g. Ramen Sora, Sushi House Goyemon, Monta Ramen.

Similar Businesses

High level design

similar-items

First we use MinHash to generate a signature for each business, then apply LSH to find all candidate pairs, and finally do a full pass over the candidates to eliminate all false positives.
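The three steps above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the code in `similar_businesses.py`: it assumes each business is represented by a non-empty set of integer user IDs, uses a hypothetical hash family `(a * x + b) % PRIME`, and picks the band and row counts arbitrarily.

```python
import random
from collections import defaultdict
from itertools import combinations

PRIME = 2_147_483_647  # large prime for the hash family (assumed choice)


def make_hash_funcs(n, seed=0):
    """Draw n (a, b) coefficient pairs for hashes of the form (a * x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash_signature(user_ids, hash_funcs):
    """Signature = minimum hash value over the set, one entry per hash function."""
    return tuple(min((a * u + b) % PRIME for u in user_ids) for a, b in hash_funcs)


def lsh_candidates(signatures, bands, rows):
    """Pairs whose signatures agree on every row of at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for biz, sig in signatures.items():
            buckets[sig[band * rows:(band + 1) * rows]].append(biz)
        for members in buckets.values():
            candidates.update(combinations(sorted(members), 2))
    return candidates


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)


def similar_pairs(business_users, threshold, bands=20, rows=2, seed=0):
    """MinHash, then LSH, then an exact check that removes false positives."""
    hash_funcs = make_hash_funcs(bands * rows, seed)
    signatures = {
        biz: minhash_signature(users, hash_funcs)
        for biz, users in business_users.items()
    }
    result = {}
    for x, y in lsh_candidates(signatures, bands, rows):
        sim = jaccard(business_users[x], business_users[y])
        if sim >= threshold:
            result[(x, y)] = sim
    return result
```

The final exact check is what guarantees that no false positive survives, so precision is 1.0 by construction. Recall depends on the hash functions and on the band and row counts, since a truly similar pair is lost whenever it never shares a bucket.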

Results

We spent quite some time designing the hash functions and, surprisingly, achieved precision = 1.0 and recall = 1.0.

Hybrid Recommender System

High level design

hybrid-recommender

Each candidate is rated by 3 recommenders: content-based filtering, model-based collaborative filtering (CF), and user-based CF. A linear combination of the 3 scores is then computed, and it becomes the item's predicted rating.
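A linear combination of this kind might look like the sketch below. The function name `hybrid_score` and the weights are hypothetical placeholders, not the values used in `hybrid_recommender_system.py`, and clipping the result to the 1 to 5 rating range is an added assumption.

```python
def hybrid_score(content, model_cf, user_cf, weights=(0.3, 0.4, 0.3)):
    """Blend three recommender scores into one predicted rating.

    `weights` are illustrative placeholders; in practice they would be
    tuned on held-out data.
    """
    w_content, w_model, w_user = weights
    score = w_content * content + w_model * model_cf + w_user * user_cf
    # Keep the prediction inside the valid rating range (assumed step).
    return min(5.0, max(1.0, score))
```

For example, `hybrid_score(3.0, 5.0, 4.0)` weighs the model-based score most heavily with the default placeholder weights.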

Results

The ratings range from 1 to 5, and the error distribution on testing data looks like:

Error distribution on testing data

About 98% of the prediction errors are less than 1.0, and the overall RMSE is 0.9782, which is much better than that of any individual recommender.
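For reference, the two metrics quoted above can be computed along these lines. This is a generic sketch with hypothetical function names, not the project's evaluation code.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def fraction_within(predicted, actual, tolerance=1.0):
    """Share of predictions whose absolute error is strictly below `tolerance`."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) < tolerance)
    return hits / len(actual)
```

Both functions take the predicted and true ratings in the same order, for example two lists built from the test set.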

Dependencies

Spark 2.4
Python 3.6
JDK 1.8
