This is a detailed reproduction of ref.

Sunny Summary

3 steps:

  1. preprocessing, to extract: author, average sentence length, average word length, punctuation profile, sentiment scores, and part-of-speech profiles/tags (the POS tags are used only in code and are not written to the csv).
  2. content-wise k-means clustering (on TF-IDF scores), to get: classification results at 3 levels/degrees of clustering.
  3. a KNN search on the results of steps 1 and 2, to get: recommendations (k=15 by default).
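Step 3 above can be sketched as a brute-force nearest-neighbour search over per-book feature vectors (a minimal pure-Python sketch; the book names, feature values, and the knn_recommend helper are made up for illustration):

```python
import math

def knn_recommend(books, query, k=15):
    """Return the k books whose feature vectors are closest (Euclidean) to the query."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(books.items(), key=lambda item: dist(item[1], query))
    return [name for name, _ in ranked[:k]]

# Toy feature vectors: (avg_sentence_size, avg_word_length, cluster_id)
books = {
    "Moby Dick":       (21.0, 4.6, 2),
    "Dracula":         (15.3, 4.2, 1),
    "Frankenstein":    (19.8, 4.5, 2),
    "Treasure Island": (14.1, 4.0, 1),
}

print(knn_recommend(books, query=(20.0, 4.5, 2), k=2))
```

In the actual pipeline the query vector would be the full row of step-1 style features plus the step-2 cluster assignments for a seed book.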


pip2 install nltk
pip2 install twython  # optional: without it you only get a warning, not an error.
git clone
cd gutenberg/data

Download the books' txt data (e.g. the 404M 3k-book archive on Google Drive) and unzip it. Then set this folder as the txt_path in
WARN: first run rename 's/:/-/g' *; rename 's/,/\./g' *; rename "s/\"/'/g" * in the txt-files folder. The : character in filenames breaks Spark's TF-IDF path handling; the , can introduce issues in the csv files, as can " occurring together with ' in the same filenames.

Download nltk data (into folder ~/nltk_data by default):

import nltk
nltk.download('punkt')
nltk.download('vader_lexicon')
nltk.download('averaged_perceptron_tagger')



A single i7 CPU thread takes about 4 hours for 3k books. The script generates output_POS.txt (a csv-like, pipe-delimited format).

head -1 output_POS.txt
# gives:
# book_name|total_words|avg_sentence_size|!|#|''|%|$|&|(|+|*|-|,|/|.|;|:|=|<|?|>|@|[|]|_|^|`|{|~|neg|neu|pos|compound|ID|Title|Author|CC|CD|DT|EX|FW|IN|JJ|JJR|JJS|LS|MD|NN|NNP|NNPS|NNS|PDT|PRP|PRP$|RB|RBR|RBS|RP|VB|VBD|VBG|VBP|VBN|WDT|VBZ|WRB|WP$|WP|
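A few of these columns (total_words, avg_sentence_size, and the punctuation profile) can be sketched without NLTK using plain string processing (a simplified stand-in for the actual script, with cruder sentence and word splitting than nltk's tokenizers):

```python
import string

def style_features(text):
    """Compute simplified versions of a few output_POS.txt columns:
    total word count, average sentence length (in words), average word
    length, and a punctuation-count profile."""
    sentences = [s for s in text.replace('!', '.').replace('?', '.').split('.') if s.strip()]
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]
    punct = {c: text.count(c) for c in string.punctuation if text.count(c)}
    return {
        "total_words": len(words),
        "avg_sentence_size": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "punctuation": punct,
    }

feats = style_features("Call me Ishmael. Some years ago, never mind how long, I went to sea.")
print(feats["total_words"], feats["punctuation"])
```

The real script adds the VADER sentiment scores (neg/neu/pos/compound) and the per-tag POS counts via nltk, which this sketch omits.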


TF-IDF (term frequency and inverse document frequency)
For py3-spark users: the file is written in py2, so comment out the print statements in
The output is in HDFS format, so we will get /path/to/out/file/tfidf_3k_output.txt/ as a folder.

rm -r /mnt/nfsMountPoint/datasets/gutenberg_data/tfidf_3k_output.txt/
# I have hard-coded the dir and path, so:
/path/to/spark/spark-submit --master spark://
# for original version:
/path/to/spark/spark-submit --master spark:// /path/to/data/Gutenberg_2G_3k/txt/ /path/to/out/file/tfidf_3k_output.txt

OBS: the cluster number k is also hard-coded.
A single node takes 4 minutes for 3k books (by removing the --master flag); a multi-node Spark run takes 12 minutes.

cd /path/to/out/file/tfidf_3k_output.txt/
cat part-* | sed 's/file:\/mnt\/nfsMountPoint\/datasets\/gutenberg_data\/Gutenberg_2G_3k\/txt\///' |sed 's/^(//' |sed 's/ \([0-9]\+\))$/\1/' > tfidf_3k_output_noHeader.csv
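The TF-IDF weighting that step 2 clusters on can be sketched in a few lines of plain Python (a conceptual sketch only, not the Spark job; Spark's HashingTF/IDF uses feature hashing and a smoothed IDF, which this omits):

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: tf*idf} map per document.
    tf = raw count in the document; idf = log(N / document frequency)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each doc counts a term at most once
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]

docs = [
    "the whale the sea".split(),
    "the monster the castle".split(),
    "sea monster".split(),
]
scores = tfidf(docs)
# "the" occurs in 2 of 3 docs, so it is down-weighted relative to rarer terms.
```

K-means then runs on these per-book score vectors; as noted above, the cluster count k is hard-coded in the Spark script.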

Theoretical Notes

Ref1-theory-basics: Building a Recommendation System with Python Machine Learning & AI

simple approaches

Correlation-based. Pearson's R correlation of user ratings/comments (a basic form of collaborative filtering).
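Pearson's R for two users' rating vectors can be computed directly (a minimal sketch; the users and ratings are made-up toy data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length rating lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two users who rated the same five books:
alice = [5, 3, 4, 4, 1]
bob   = [4, 2, 4, 3, 1]
print(round(pearson_r(alice, bob), 3))
```

Users with R close to 1 rate similarly, so items one of them liked can be recommended to the other; this is the basic collaborative-filtering idea.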

machine learning approaches

User-profile classification based.
Collaborative filtering model based.
Content based.
Ref2-theory-details-and-in-action: Machine Learning & AI Foundations: Recommendations

See Also

MovieLens dataset.
Recipe Recommendation in Spark
Yelp in Kafka, Spark, Flask