Recommender System
This is a detailed reproduction of ref.
Sunny Summary #
3 steps:
1. preprocessing.py: preprocessing to extract author, average sentence length, average word length, punctuation profile, sentiment scores, and part-of-speech profiles/tags (POS tags are used only in the code, not taken into the csv).
2. TFIDF.py: content-wise k-means clustering (on TF-IDF scores) to get 3 levels/degrees of clustering/classification results.
3. knn.py: kNN search on the results of steps 1 and 2 to get recommendations (k=15 by default); a rough sketch follows below.
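As a rough illustration of the kNN step, here is a minimal sketch using scikit-learn; the feature assembly and the recommend helper are assumptions for illustration, not the actual knn.py:
# Hedged sketch of the kNN recommendation step (not the actual knn.py).
# 'features' is assumed to be one row per book, built from the preprocessing
# output (style/sentiment/POS columns) plus the k-means cluster labels.
from sklearn.neighbors import NearestNeighbors

def recommend(features, book_names, book_index, k=15):
    # The queried book is its own nearest neighbour, so ask for k+1 and drop it.
    knn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    _, indices = knn.kneighbors(features[book_index:book_index + 1])
    return [book_names[i] for i in indices[0] if i != book_index][:k]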
Preprocessing #
pip2 install nltk
pip2 install twython # optional: without it you only get a warning, not an error.
git clone https://github.com/SunnyBingoMe/gutenberg.git
cd gutenberg/data
Download the books txt data (e.g. the 404M 3k data on Google Drive) and unzip it. Then set this folder as txt_path in preprocessing.py.
WARN: run rename 's/:/-/g' *; rename 's/,/\./g' *; rename "s/\"/'/g" *; in the txt files folder. The : char in the filenames will cause problems for the Spark TF-IDF path system. The , may introduce issues for csv files, as may " occurring together with ' in the same files.
Download nltk data (into folder ~/nltk_data by default):
import nltk
nltk.download('punkt')
nltk.download('vader_lexicon')
nltk.download('averaged_perceptron_tagger')
Run:
python2 preprocessing.py
A single i7 CPU thread will take 4 hours for 3k books. This script will generate output_POS.txt (csv-like format):
head -1 output_POS.txt
# gives:
# book_name|total_words|avg_sentence_size|!|#|''|%|$|&|(|+|*|-|,|/|.|;|:|=|<|?|>|@|[|]|_|^|`|{|~|neg|neu|pos|compound|ID|Title|Author|CC|CD|DT|EX|FW|IN|JJ|JJR|JJS|LS|MD|NN|NNP|NNPS|NNS|PDT|PRP|PRP$|RB|RBR|RBS|RP|VB|VBD|VBG|VBP|VBN|WDT|VBZ|WRB|WP$|WP|
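For orientation, here is a minimal sketch of the kind of per-book features preprocessing.py computes with NLTK; the extract_features function and the exact punctuation handling are illustrative, not the script's actual code:
import string
from collections import Counter
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def extract_features(text):
    sentences = nltk.sent_tokenize(text)
    words = [t for t in nltk.word_tokenize(text) if t not in string.punctuation]
    avg_sentence_size = float(len(words)) / max(len(sentences), 1)
    avg_word_length = sum(len(w) for w in words) / float(max(len(words), 1))
    # Punctuation profile: count of each punctuation character in the text.
    punctuation_counts = Counter(c for c in text if c in string.punctuation)
    # Sentiment scores (neg / neu / pos / compound), as in the output header.
    sentiment = SentimentIntensityAnalyzer().polarity_scores(text)
    # Part-of-speech tag counts (the CC, CD, DT, ... columns of output_POS.txt).
    pos_counts = Counter(tag for _, tag in nltk.pos_tag(words))
    return {"total_words": len(words),
            "avg_sentence_size": avg_sentence_size,
            "avg_word_length": avg_word_length,
            "punctuation": punctuation_counts,
            "sentiment": sentiment,
            "pos": pos_counts}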
TF-IDF #
(term frequency and inverse document frequency)
For py3-spark users: as the file is written in py2, comment out the print statements in TFIDF_Kmeans.py. The output is in HDFS format, so we will get /path/to/out/file/tfidf_3k_output.txt/ as a folder.
rm -r /mnt/nfsMountPoint/datasets/gutenberg_data/tfidf_3k_output.txt/
# I have hard-coded the dir and path, so:
/path/to/spark/spark-submit --master spark://sparkmaster.dmml.stream:7077 TFIDF_Kmeans.py
# for original version:
/path/to/spark/spark-submit --master spark://sparkmaster.dmml.stream:7077 TFIDF_Kmeans.py /path/to/data/Gutenberg_2G_3k/txt/ /path/to/out/file/tfidf_3k_output.txt
OBS: the cluster number k is also hard-coded.
A single node takes 4 minutes (by removing --master) for 3k books. A multi-node Spark setup takes 12 minutes.
cd /path/to/out/file/tfidf_3k_output.txt/
cat part-* | sed 's/file:\/mnt\/nfsMountPoint\/datasets\/gutenberg_data\/Gutenberg_2G_3k\/txt\///' |sed 's/^(//' |sed 's/ \([0-9]\+\))$/\1/' > tfidf_3k_output_noHeader.csv
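For reference, a minimal sketch of a TF-IDF plus k-means pipeline with PySpark's ml package; the paths, the feature dimensionality, and k=8 below are placeholders, not the values hard-coded in TFIDF_Kmeans.py:
# Hedged sketch of the TF-IDF + k-means step; paths and k are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("tfidf-kmeans-sketch").getOrCreate()
# One row per book: (file path, full text).
books = spark.sparkContext.wholeTextFiles("/path/to/data/Gutenberg_2G_3k/txt/")
df = spark.createDataFrame(books, ["path", "text"])
tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 18).transform(tokens)
tfidf = IDF(inputCol="tf", outputCol="features").fit(tf).transform(tf)
# Cluster the TF-IDF vectors; the real script hard-codes its own k.
model = KMeans(k=8, seed=1, featuresCol="features").fit(tfidf)
model.transform(tfidf).select("path", "prediction").write.csv("/path/to/out/file/tfidf_3k_sketch")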
Theoretical Notes #
Ref 1 (theory basics): Building a Recommendation System with Python Machine Learning & AI
Simple approaches #
Popularity-based.
Correlation-based: Pearson’s R correlation of user ratings/comments (a basic form of collaborative filtering); see the sketch below.
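A minimal sketch of the correlation-based idea with pandas, using toy ratings (the data and column names are illustrative only):
import pandas as pd
# Toy ratings: one row per (user, item, rating).
ratings = pd.DataFrame({"user": ["a", "a", "a", "b", "b", "c", "c", "c"],
                        "item": ["x", "y", "z", "x", "y", "x", "y", "z"],
                        "rating": [5, 4, 1, 4, 5, 2, 1, 5]})
# Users as rows, items as columns, ratings as values.
matrix = ratings.pivot_table(index="user", columns="item", values="rating")
# Pearson's R of every item's ratings against item "x": items most similar to "x".
similar_to_x = matrix.corrwith(matrix["x"]).drop("x").sort_values(ascending=False)
print(similar_to_x)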
Machine learning approaches #
User-profile classification based.
Collaborative filtering model based.
Content based.
Ref 2 (theory details and in action): Machine Learning & AI Foundations: Recommendations
See Also #
MovieLens dataset.
Recipe Recommendation in Spark
Yelp in Kafka, Spark, Flask