Recommender System
2018-01-20.
Category & Tags: Recommender System, Recommendation System, Literature, Book, Text, News
This is a detailed reproduction of ref.
Sunny Summary
3 steps:
1. preprocessing.py: preprocessing to extract author, average sentence length, average word length, punctuation profile, sentiment scores, and part-of-speech profiles/tags (only in code, not written to the csv).
2. TFIDF.py: content-wise k-means clustering (on TFIDF scores) to get 3 levels/degrees of clustering/classification results.
3. knn.py: knn search on the results of steps 1 and 2 to get recommendations (k=15 by default).
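To make the style features of the first step more concrete, here is a minimal sketch of how average sentence length, average word length and a punctuation profile could be computed for one book. It is not the code from preprocessing.py: the function name stylometric_features, the regex-based sentence split and the choice of punctuation marks are assumptions for illustration only, and the real script relies on nltk.

```python
import re


def stylometric_features(text):
    """Return a few simple style features for one book's raw text.

    Hypothetical helper for illustration; not taken from preprocessing.py.
    """
    # Naive sentence split: break on whitespace that follows '.', '!' or '?'.
    # A real tokenizer (e.g. from nltk) handles abbreviations much better.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    # Words are runs of letters, optionally with one inner apostrophe part.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

    n_words = max(len(words), 1)          # avoid division by zero
    n_sentences = max(len(sentences), 1)

    return {
        # average number of words per sentence
        "avg_sentence_len": len(words) / float(n_sentences),
        # average number of characters per word
        "avg_word_len": sum(len(w) for w in words) / float(n_words),
        # occurrences of each punctuation mark per word
        "punct_profile": {p: text.count(p) / float(n_words) for p in ",;:!?"},
    }


if __name__ == "__main__":
    sample = "Call me Ishmael. It was a dark, stormy night!"
    print(stylometric_features(sample))
```

Normalising the punctuation counts by the number of words keeps long and short books comparable, which matters once these values are fed into a distance-based search such as knn.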
Preprocessing #
pip2 install nltk
pip2 install twython # optional ? got warning, not error.
git clone https://github.com/SunnyBingoMe/gutenberg.git
cd gutenberg/data
Download the books txt data (e.g. the 404M 3k data on Google Drive) and unzip it. Then set the unzipped folder as the txt_path in preprocessing.py.
WARN: run rename 's/:/-/g' *; rename 's/,/\./g' *; rename "s/\"/'/g" *; in the txt files folder. The : char in filenames causes problems for the Spark TFIDF path system. The , may cause issues in the csv files, as may " occurring together with ' in the same filenames.
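If the Perl rename utility is not available (for example on a system where rename is a different tool), the same three substitutions can be sketched in Python. The helper names sanitize_name and sanitize_folder are hypothetical and not part of the repository; the collision check is an added precaution, not something the rename commands do.

```python
import os


def sanitize_name(name):
    """Apply the three rename rules: ':' -> '-', ',' -> '.', '"' -> "'"."""
    return name.replace(":", "-").replace(",", ".").replace('"', "'")


def sanitize_folder(folder):
    """Rename every entry in `folder` whose name changes under sanitize_name.

    Returns a list of (old_name, new_name) pairs that were actually renamed.
    """
    renamed = []
    for old in sorted(os.listdir(folder)):
        new = sanitize_name(old)
        if new == old:
            continue  # nothing to fix in this name
        target = os.path.join(folder, new)
        if os.path.exists(target):
            # Assumption: skip rather than overwrite an existing file.
            continue
        os.rename(os.path.join(folder, old), target)
        renamed.append((old, new))
    return renamed


if __name__ == "__main__":
    # Only show the name mapping here; call sanitize_folder(txt_path)
    # on the real txt folder to perform the renames.
    print(sanitize_name('Moby Dick: or, the "Whale".txt'))
```

Running sanitize_folder on the folder used as txt_path should leave the names in the same state as the three rename commands, apart from any files skipped because the target name already exists.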