News

Recommender System

2018-01-20. Category & Tags: Recommender System, Recommendation System, Literature, Book, Text, News

This is a detailed reproduction of ref.

Sunny Summary #

The pipeline has 3 steps:

  1. preprocessing.py: preprocessing to extract author, average sentence length, average word length, punctuation profile, sentiment scores, and part-of-speech profiles/tags (only in code, not written to the csv).
  2. TFIDF.py: content-wise k-means clustering (on TF-IDF scores) to get 3 levels/degrees of clustering/classification results.
  3. knn.py: kNN search on the results of steps 1 and 2 to get recommendations (k=15 by default).
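To make step 1 more concrete, here is a minimal sketch of how the simpler style features (average sentence length, average word length, punctuation profile) could be computed. It is not the code from preprocessing.py: the function name stylometric_features, the regex-based sentence split and the dictionary keys are all assumptions made for illustration, and the original script relies on nltk, which may tokenize differently. The sketch assumes Python 3.

```python
import re
import string
from collections import Counter


def stylometric_features(text):
    """Return simple style features for one book's text (illustrative only)."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    # Assumption: the original script may use nltk's tokenizers instead.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words are runs of letters and apostrophes; digits are ignored.
    words = re.findall(r"[A-Za-z']+", text)
    # Count each punctuation character, then normalise to relative shares.
    punctuation = Counter(ch for ch in text if ch in string.punctuation)
    total_punct = sum(punctuation.values())
    return {
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "punctuation_profile": {ch: n / total_punct for ch, n in punctuation.items()},
    }


features = stylometric_features("Call me Ishmael. Some years ago, I went to sea!")
print(features)
```

Here the average sentence length is measured in words per sentence and the punctuation profile is the relative share of each punctuation character. Both are one plausible reading of the feature names above, not a confirmed definition.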

Preprocessing #

pip2 install nltk
pip2 install twython  # optional? Leaving it out gave only a warning, not an error.
git clone https://github.com/SunnyBingoMe/gutenberg.git
cd gutenberg/data

Download the books' txt data (e.g. the 404M 3k data on Google Drive) and unzip it. Then set the unzipped folder as the txt_path in preprocessing.py. WARNING: do run rename 's/:/-/g' *; rename 's/,/\./g' *; rename "s/\"/'/g" *; in the txt files folder. A : char in the filenames will cause problems for the Spark TFIDF path system. A , may introduce issues for csv files, as may " occurring together with ' in the same files.
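If the Perl-style rename command is not available (e.g. on a system where rename is a different tool), the same three substitutions can be sketched in Python. The helper names sanitize_filename and sanitize_folder are hypothetical and not part of the gutenberg repo. The sketch skips a rename instead of overwriting when the target name already exists, which is a design choice of this example and not something the rename commands above guarantee.

```python
import os


def sanitize_filename(name):
    """Apply the three substitutions: ':' -> '-', ',' -> '.', '"' -> "'"."""
    return name.replace(":", "-").replace(",", ".").replace('"', "'")


def sanitize_folder(txt_path):
    """Rename every entry in txt_path whose name changes; return the renames."""
    renamed = []
    for old_name in sorted(os.listdir(txt_path)):
        new_name = sanitize_filename(old_name)
        if new_name == old_name:
            continue  # nothing to fix in this name
        target = os.path.join(txt_path, new_name)
        if os.path.exists(target):
            # Skip rather than overwrite an existing file.
            continue
        os.rename(os.path.join(txt_path, old_name), target)
        renamed.append((old_name, new_name))
    return renamed
```

Calling sanitize_folder(txt_path) with the same folder that preprocessing.py reads returns a list of (old name, new name) pairs, which can be printed to check what was changed.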

...