Friday 31 July 2015

Run a User Recommendation on Mahout

For a beginner like me, the following steps were useful for running Mahout's recommendation algorithm.

1- I downloaded the dataset from http://grouplens.org/datasets/movielens/. The directory contains many files; the one I thought would be useful was ratings.data (568.3 MB), containing the fields userId, movieId, rating and timestamp. Mahout's recommenders expect interactions between users and items as input: every line of the input file has the format userID,itemID,value. Here userID and itemID refer to a particular user and a particular item, and value denotes the strength of the interaction (e.g. the rating given to a movie).

(Perform the following steps after logging in as hduser, the user for the Hadoop cluster.)

2- I removed the last field, the timestamp, from the file (it was not required for this recommendation) using the following command, and saved the result as a .csv file. The command drops the 4th comma-separated column:
cut --complement -f 4 -d, ratings.data > ratings.csv
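To see what this step does before running it on the full file, here is a minimal sketch on three made-up rows in the userId,movieId,rating,timestamp layout described above. The sample values and the variable names are assumptions for illustration only. It uses cut -f1-3, which keeps the first three fields and should give the same result as dropping field 4 with --complement (a GNU cut option), and then counts lines that do not have exactly three fields.

```shell
#!/bin/sh
# Made-up sample rows (userId,movieId,rating,timestamp); not taken from the real dataset.
sample=$(mktemp)
printf '%s\n' \
  '1,31,2.5,1260759144' \
  '1,1029,3.0,1260759179' \
  '2,10,4.0,835355493' > "$sample"

# Keep fields 1-3, i.e. drop the timestamp column.
trimmed=$(cut -d, -f1-3 "$sample")
echo "$trimmed"

# Sanity check: count lines that do not have exactly three comma-separated fields.
bad=$(echo "$trimmed" | awk -F, 'NF != 3 { n++ } END { print n + 0 }')
echo "malformed lines: $bad"

rm -f "$sample"
```

Running the same awk check on the real ratings.csv is a cheap way to confirm that every line matches the userID,itemID,value format before copying the file to HDFS.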

3- Next I created a directory in the Hadoop file system to store the ratings file, using the following command:
hadoop fs -mkdir /mahout_data/

4- Then I copied the prepared ratings file to HDFS using the following command:

hadoop fs -put /home/hduser/mydata/ml-latest/ratings.csv /mahout_data/

5- Go to the Mahout directory (cd /usr/local/mahout/bin/) and issue the following command to run the job. The output path should be unique (it must not already exist) and JAVA_HOME should be properly set.

./mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i hdfs://localhost:9000/mahout_data/ratings.csv -o hdfs://localhost:9000/ratings_test/ --numRecommendations 25

-i hdfs://localhost:9000/mahout_data/ratings.csv - Denotes the input file.

-o hdfs://localhost:9000/ratings_test/ - Denotes the output location.

recommenditembased - Means we are creating item-based recommendations, not user-based ones. The two differ: a user-based recommender finds similar users and sees what they like, while an item-based recommender sees what the user likes and finds similar items. Mahout's item-based recommendation algorithm takes customer preferences by item as input and generates output recommending similar items, with a score indicating whether a customer will "like" the recommended item.

Choosing a similarity measure for use in a production environment requires careful testing, evaluation and research. For this example I used the Mahout similarity class name SIMILARITY_LOGLIKELIHOOD.
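Since the output path has to be unique for every run, it can help to wrap the command in a small script. The following is only a sketch: the variable names, the timestamp suffix on the output path and the DRY_RUN switch are my own assumptions, and only the mahout arguments come from the command above. By default it just prints the command it would run.

```shell
#!/bin/sh
# Sketch only: variable names, the timestamp suffix and DRY_RUN are assumptions
# for illustration; the mahout arguments are the ones used in the command above.
NAMENODE="hdfs://localhost:9000"
INPUT="$NAMENODE/mahout_data/ratings.csv"
# Append a timestamp so each run gets an output path that does not exist yet.
OUTPUT="$NAMENODE/ratings_test_$(date +%Y%m%d%H%M%S)/"

# Print the command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# To really run the job, call this script from /usr/local/mahout/bin with DRY_RUN=0.
run ./mahout recommenditembased \
  -s SIMILARITY_LOGLIKELIHOOD \
  -i "$INPUT" \
  -o "$OUTPUT" \
  --numRecommendations 25
```

Keeping the dry run as the default means a typo in a path shows up in the printed command before a job is submitted to the cluster.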

6- The job will run for a couple of minutes; you can also see your output from the web interface.

7- You can check the output file; it contains two columns: the userID and an array of itemIDs and scores.
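As a rough illustration of reading one line of that output, here is a small sketch. It assumes the two columns are separated by a tab and that the array is written as itemID:score pairs inside square brackets; the sample line is made up, not real job output, so check it against your own result files.

```shell
#!/bin/sh
# Made-up output line, assuming the layout: userID<TAB>[itemID:score,itemID:score,...]
line=$(printf '3\t[1029:4.8,31:4.5,10:4.1]')

# First column: the user. Second column: the bracketed list of item:score pairs.
user=$(printf '%s\n' "$line" | cut -f1)
items=$(printf '%s\n' "$line" | cut -f2 | tr -d '[]' | tr ',' '\n')

echo "user $user"
# Print one recommendation per line, splitting each pair on the colon.
echo "$items" | while IFS=: read -r item score; do
  echo "  item $item score $score"
done

# The first pair in the list, taken here as the top recommendation.
top_item=$(echo "$items" | head -n 1 | cut -d: -f1)
echo "top item: $top_item"
```

The same pipeline can be fed from the real result files once they are copied out of HDFS.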


References :

http://mahout.apache.org/users/recommender/intro-itembased-hadoop.html
http://grouplens.org/datasets/movielens/
http://info.mapr.com/rs/mapr/images/PracticalMachineLearning.pdf

Mahout Installation

I wanted to try data mining on big data, so I installed Mahout for that. Here are the steps I followed for a successful installation of Mahout.

The prerequisites for installing Mahout are JDK, Maven and a Hadoop cluster.


  • sudo apt-get install maven
  • Download the latest distribution of Mahout from http://www.apache.org/dyn/closer.cgi/lucene/mahout/
  • Unzip it and copy it to the desired location:
          cp -R /home/surabhi/Documents/Documents/egap/Mahout/mahout/ /usr/local/
  • Issue ls to check the packages inside it

  • cd /usr/local/mahout/distribution
  • sudo mvn install (install Maven 3.0.1 or above for the Mahout .20 distribution, otherwise it will throw an error)
  • Your installation is complete if you see the following screen

Tuesday 7 July 2015

Resolve the installation problem of "rmongodb"

Being a new user of both R and MongoDB, I wanted to connect to a MongoDB database from R, but I had to struggle before I could successfully establish the connection.
I was using R 3.0.1, and because it was an old version I got an error whenever I tried install.packages("rmongodb"). Ultimately I had to upgrade R using the following steps, after which I was able to install the rmongodb package.
  • sudo gedit /etc/apt/sources.list
  • Add the following line, as I am using Ubuntu 14.04.2 (use the command lsb_release -c to see the codename)
    • deb http://cran.cnr.berkeley.edu/bin/linux/ubuntu trusty/
  • gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
  • sudo apt-get update
  • sudo apt-get upgrade
Now the new version is installed. Within R I issued the command sessionInfo(), which shows R version 3.2.1 (2015-06-18).

Now use the following commands to install the rmongodb package and connect to MongoDB:
  • install.packages("rmongodb")
  • library(rmongodb)
  • To connect to the local MongoDB:
    • mongo <- mongo.create()

I will keep posting as I proceed with R, Shiny and MongoDB.