Recommender System
This is a detailed reproduction of ref.
Sunny Summary #
3 steps:
1. preprocessing.py: preprocessing to extract author, average sentence length, average word length, punctuation profile, sentiment scores, and part-of-speech profiles/tags (POS tags are used only in the code, not taken into the csv).
2. TFIDF.py: content-wise k-means clustering (on TF-IDF scores) to get 3 levels/degrees of clustering/classification results.
3. knn.py: kNN search on the results of steps 1 and 2 to get recommendations (k=15 by default); a rough sketch follows below.
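As a rough illustration of the kNN step, here is a minimal sketch using scikit-learn; the feature assembly and the recommend helper are assumptions for illustration, not the actual knn.py:
# Hedged sketch of the kNN recommendation step (not the actual knn.py).
# 'features' is assumed to be one row per book, built from the preprocessing
# output (style/sentiment/POS columns) plus the k-means cluster labels.
from sklearn.neighbors import NearestNeighbors

def recommend(features, book_names, book_index, k=15):
    # The queried book is its own nearest neighbour, so ask for k+1 and drop it.
    knn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    _, indices = knn.kneighbors(features[book_index:book_index + 1])
    return [book_names[i] for i in indices[0] if i != book_index][:k]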
Preprocessing #
pip2 install nltk
pip2 install twython # optional: without it you only get a warning, not an error.
git clone https://github.com/SunnyBingoMe/gutenberg.git
cd gutenberg/data
Download the books txt data (e.g. the 404M 3k data on Google Drive) and unzip it. Then set this folder as txt_path in preprocessing.py.
WARN: run rename 's/:/-/g' *; rename 's/,/\./g' *; rename "s/\"/'/g" *; in the txt files folder. The : char in the filenames will cause problems for the Spark TF-IDF path system. The , may introduce issues for csv files, as may " occurring together with ' in the same files.
Download nltk data (into folder ~/nltk_data by default):
import nltk
nltk.download('punkt')
nltk.download('vader_lexicon')
nltk.download('averaged_perceptron_tagger')
Run:
python2 preprocessing.py
A single i7 CPU thread will take 4 hours for 3k books. This script will generate output_POS.txt (csv-like format):
head -1 output_POS.txt
# gives:
# book_name|total_words|avg_sentence_size|!|#|''|%|$|&|(|+|*|-|,|/|.|;|:|=|<|?|>|@|[|]|_|^|`|{|~|neg|neu|pos|compound|ID|Title|Author|CC|CD|DT|EX|FW|IN|JJ|JJR|JJS|LS|MD|NN|NNP|NNPS|NNS|PDT|PRP|PRP$|RB|RBR|RBS|RP|VB|VBD|VBG|VBP|VBN|WDT|VBZ|WRB|WP$|WP|
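For orientation, here is a minimal sketch of the kind of per-book features preprocessing.py computes with NLTK; the extract_features function and the exact punctuation handling are illustrative, not the script's actual code:
import string
from collections import Counter
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def extract_features(text):
    sentences = nltk.sent_tokenize(text)
    words = [t for t in nltk.word_tokenize(text) if t not in string.punctuation]
    avg_sentence_size = float(len(words)) / max(len(sentences), 1)
    avg_word_length = sum(len(w) for w in words) / float(max(len(words), 1))
    # Punctuation profile: count of each punctuation character in the text.
    punctuation_counts = Counter(c for c in text if c in string.punctuation)
    # Sentiment scores (neg / neu / pos / compound), as in the output header.
    sentiment = SentimentIntensityAnalyzer().polarity_scores(text)
    # Part-of-speech tag counts (the CC, CD, DT, ... columns of output_POS.txt).
    pos_counts = Counter(tag for _, tag in nltk.pos_tag(words))
    return {"total_words": len(words),
            "avg_sentence_size": avg_sentence_size,
            "avg_word_length": avg_word_length,
            "punctuation": punctuation_counts,
            "sentiment": sentiment,
            "pos": pos_counts}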
TF-IDF #
(term frequency and inverse document frequency)
For py3-spark users: as the file is written in py2, comment out the print statements in TFIDF_Kmeans.py. The output is in HDFS format, so we will get /path/to/out/file/tfidf_3k_output.txt/ as a folder.
rm -r /mnt/nfsMountPoint/datasets/gutenberg_data/tfidf_3k_output.txt/
# I have hard-coded the dir and path, so:
/path/to/spark/spark-submit --master spark://sparkmaster.dmml.stream:7077 TFIDF_Kmeans.py
# for original version:
/path/to/spark/spark-submit --master spark://sparkmaster.dmml.stream:7077 TFIDF_Kmeans.py /path/to/data/Gutenberg_2G_3k/txt/ /path/to/out/file/tfidf_3k_output.txt
OBS: the cluster number k is also hard-coded.
A single node takes 4 minutes (by removing --master) for 3k books. A multi-node Spark setup takes 12 minutes.
cd /path/to/out/file/tfidf_3k_output.txt/
cat part-* | sed 's/file:\/mnt\/nfsMountPoint\/datasets\/gutenberg_data\/Gutenberg_2G_3k\/txt\///' |sed 's/^(//' |sed 's/ \([0-9]\+\))$/\1/' > tfidf_3k_output_noHeader.csv
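For reference, a minimal sketch of a TF-IDF plus k-means pipeline with PySpark's ml package; the paths, the feature dimensionality, and k=8 below are placeholders, not the values hard-coded in TFIDF_Kmeans.py:
# Hedged sketch of the TF-IDF + k-means step; paths and k are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("tfidf-kmeans-sketch").getOrCreate()
# One row per book: (file path, full text).
books = spark.sparkContext.wholeTextFiles("/path/to/data/Gutenberg_2G_3k/txt/")
df = spark.createDataFrame(books, ["path", "text"])
tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 18).transform(tokens)
tfidf = IDF(inputCol="tf", outputCol="features").fit(tf).transform(tf)
# Cluster the TF-IDF vectors; the real script hard-codes its own k.
model = KMeans(k=8, seed=1, featuresCol="features").fit(tfidf)
model.transform(tfidf).select("path", "prediction").write.csv("/path/to/out/file/tfidf_3k_sketch")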
Theoretical Notes #
Ref 1 (theory basics): Building a Recommendation System with Python Machine Learning & AI
Simple approaches #
Popularity-based.
Correlation-based: Pearson’s R correlation of user ratings/comments (a basic form of collaborative filtering); see the sketch below.
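A minimal sketch of the correlation-based idea with pandas, using toy ratings (the data and column names are illustrative only):
import pandas as pd
# Toy ratings: one row per (user, item, rating).
ratings = pd.DataFrame({"user": ["a", "a", "a", "b", "b", "c", "c", "c"],
                        "item": ["x", "y", "z", "x", "y", "x", "y", "z"],
                        "rating": [5, 4, 1, 4, 5, 2, 1, 5]})
# Users as rows, items as columns, ratings as values.
matrix = ratings.pivot_table(index="user", columns="item", values="rating")
# Pearson's R of every item's ratings against item "x": items most similar to "x".
similar_to_x = matrix.corrwith(matrix["x"]).drop("x").sort_values(ascending=False)
print(similar_to_x)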
Machine learning approaches #
User-profile classification based.
Collaborative filtering model based.
Content based.
Ref 2 (theory details and in action): Machine Learning & AI Foundations: Recommendations
See Also #
MovieLens dataset.
Recipe Recommendation in Spark
Yelp in Kafka, Spark, Flask