For a beginner like me, the following steps were useful for running Mahout's item-based recommendation algorithm.
1- I downloaded the MovieLens dataset from http://grouplens.org/datasets/movielens/. The directory contains many files; the one I thought would be useful was ratings.data (568.3 MB), containing the fields userId, movieId, rating, timestamp. Mahout's recommenders expect interactions between users and items as input, with every line of the file in the format userID,itemID,value. Here userID and itemID refer to a particular user and a particular item, and value denotes the strength of the interaction (e.g. the rating given to a movie).
(Perform the following steps while logged in as hduser, the user for the Hadoop cluster.)
2- I removed the last field, the timestamp, from the file (it is not required for this recommendation) and saved the result as a .csv file using the following command:
cut --complement -f 4 -d, ratings.data > ratings.csv (removes the 4th column, the timestamp, from the file)
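As a small sanity check of step 2, here is a sketch that runs the same column removal on a few made-up rows instead of the full file. The function name strip_timestamp and the sample rows are hypothetical. It uses cut -d, -f 1-3, which keeps the first three fields and should give the same result as the --complement form on a four-column file. The second call assumes a copy of the file that starts with a header row; if yours does, that row would presumably need to be dropped before Mahout reads the file.

```shell
#!/bin/sh
# Hypothetical helper: keep only userID,itemID,value from comma-separated input.
# Equivalent to "cut --complement -f 4 -d," when the input has four columns.
strip_timestamp() {
  cut -d, -f 1-3
}

# Made-up sample rows in the layout userId,movieId,rating,timestamp.
printf '1,31,2.5,1260759144\n1,1029,3.0,1260759179\n' | strip_timestamp

# Assumption: some copies of the ratings file begin with a header row.
# tail -n +2 skips the first line before the timestamp is removed.
printf 'userId,movieId,rating,timestamp\n2,10,4.0,835355493\n' | tail -n +2 | strip_timestamp
```

Checking a handful of lines this way (for example with head ratings.csv) before copying the file to HDFS is cheaper than finding a format problem after the job has started.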
3- I created a directory in the Hadoop file system to store the ratings file using the following command:
hadoop fs -mkdir /mahout_data/
4- I copied the ratings file to HDFS using the following command:
hadoop fs -put /home/hduser/mydata/ml-latest/ratings.csv /mahout_data/
5- Go to the Mahout directory (cd /usr/local/mahout/bin/) and issue the following command to run the recommender. (The output path should be unique, and JAVA_HOME should be properly set.)
./mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i hdfs://localhost:9000/mahout_data/ratings.csv -o hdfs://localhost:9000/ratings_test/ --numRecommendations 25
-i hdfs://localhost:9000/mahout_data/ratings.csv - denotes the input file.
-o hdfs://localhost:9000/ratings_test/ - denotes the output location.
recommenditembased - means we are creating item-based recommendations, not user-based recommendations. The two differ: a user-based recommender finds similar users and sees what they like, while an item-based recommender sees what the user likes and finds similar items. Mahout's item-based recommendation algorithm takes customer preferences by item as input and generates output recommending similar items, with a score indicating whether a customer will "like" the recommended item.
-s SIMILARITY_LOGLIKELIHOOD - selects the similarity measure. Choosing a similarity measure for a production environment requires careful testing, evaluation and research. For this example I used the Mahout similarity class name SIMILARITY_LOGLIKELIHOOD.
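Since the output path has to be unique for each run (a Hadoop job normally refuses to write into an output directory that already exists), one way to avoid clashes is to append a timestamp to it. The sketch below only prints the command rather than running it. The helper name build_mahout_cmd and the run_id suffix are my own assumptions; the flags are exactly the ones used in step 5.

```shell
#!/bin/sh
# Hypothetical helper: print the Mahout command for a given input path,
# output base path and run id. It does not run Mahout itself.
build_mahout_cmd() {
  input=$1
  output_base=$2
  run_id=$3
  printf './mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i %s -o %s_%s/ --numRecommendations 25\n' \
    "$input" "$output_base" "$run_id"
}

# A timestamp makes the output path different on every run.
run_id=$(date +%Y%m%d_%H%M%S)

build_mahout_cmd hdfs://localhost:9000/mahout_data/ratings.csv \
  hdfs://localhost:9000/ratings_test "$run_id"
```

The printed line can be pasted into the shell from /usr/local/mahout/bin/ once it looks right.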
6- The job will run for a couple of minutes; you can see the output from the Hadoop web interface as well.
7- When it finishes, you can check the output file. It contains two columns: the userID and an array of itemIDs and scores.
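To make the two-column output easier to work with, it can be flattened into one line per recommendation. The sketch below assumes each output line looks like userID, a tab, then [itemID:score,itemID:score,...]; that layout is my reading of the output, so check it against your own file first. The function name flatten_recs and the sample line are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: turn "userID<TAB>[item:score,item:score]" lines
# into "userID,itemID,score" lines, one per recommendation.
flatten_recs() {
  awk -F'\t' '{
    recs = $2
    gsub(/\[/, "", recs)            # drop the opening bracket
    gsub(/\]/, "", recs)            # drop the closing bracket
    n = split(recs, pairs, ",")     # one entry per item:score pair
    for (i = 1; i <= n; i++) {
      split(pairs[i], kv, ":")
      print $1 "," kv[1] "," kv[2]
    }
  }'
}

# Made-up sample line in the assumed output layout.
printf '3\t[1029:4.5,31:4.2]\n' | flatten_recs
```

On the cluster, the input would come from the job's result files, for example by piping the output of hadoop fs -cat on the files under the output directory into flatten_recs.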
References:
http://mahout.apache.org/users/recommender/intro-itembased-hadoop.html
http://grouplens.org/datasets/movielens/
http://info.mapr.com/rs/mapr/images/PracticalMachineLearning.pdf