Friday, January 27, 2012

Hadoop 1.0.0 single node configuration on Ubuntu

Hadoop is a framework for distributed storage and processing of large data sets across clusters of computers. It provides reliable data storage through the Hadoop Distributed File System (HDFS) and high-performance parallel data processing through the MapReduce model. You can find more information here:
http://wiki.apache.org/hadoop/
Here I am describing my own experience configuring Hadoop 1.0.0 on an Ubuntu box. I am using Ubuntu 11.04 for this setup.

Step 1: Download and install the Oracle JDK

Install JDK 1.6 or above using the following steps.
Add the repository to your apt sources:
hadooptest@hadooptest-VM$ sudo apt-get install python-software-properties
hadooptest@hadooptest-VM$ sudo add-apt-repository ppa:sun-java-community-team/sun-java6

Update the source list:
hadooptest@hadooptest-VM$ sudo apt-get update

Install sun-java6-jdk:
hadooptest@hadooptest-VM$ sudo apt-get install sun-java6-jdk
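Note: the sun-java6-jdk package shows an interactive license prompt during installation. If you want a non-interactive install (for example from a script), a trick that was widely used with these packages is to pre-accept the license via debconf before running the install command (the exact debconf key below is an assumption based on how the Sun Java packages were set up at the time):
hadooptest@hadooptest-VM$ echo "sun-java6-jdk shared/accepted-sun-dlj-v1-1 select true" | sudo /usr/bin/debconf-set-selections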

Select Sun’s Java as the default on your machine.
hadooptest@hadooptest-VM$ sudo update-java-alternatives -s java-6-sun

After the installation, check the Java version:
hadooptest@hadooptest-VM$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)
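If Hadoop later complains that it cannot find Java, point it at the JDK by setting JAVA_HOME. A minimal sketch, assuming sun-java6-jdk installed to its usual Ubuntu location (verify the path on your machine with ls /usr/lib/jvm):
hadooptest@hadooptest-VM$ export JAVA_HOME=/usr/lib/jvm/java-6-sun
hadooptest@hadooptest-VM$ export PATH=$JAVA_HOME/bin:$PATH
Add these two lines to ~/.bashrc to make the setting permanent.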
Step 2: Download and install Hadoop

Download the i386 or amd64 version of the .deb package (matching your OS architecture) from http://ftp.jaist.ac.jp/pub/apache/hadoop/common/hadoop-1.0.0/. Install Hadoop by double-clicking the file or by using the dpkg command.

hadooptest@hadooptest-VM$ sudo dpkg -i hadoop_1.0.0-1_i386.deb
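To confirm the package installed correctly, you can ask Hadoop for its version; it should report 1.0.0:
hadooptest@hadooptest-VM$ hadoop version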
Step 3: Set up Hadoop for a single node

Set up Hadoop for a single node using the following command:
hadooptest@hadooptest-VM$ sudo hadoop-setup-single-node.sh
Answer "yes" to all questions. The services will start automatically after the installation.

Step 4: Test the Hadoop configuration
hadooptest@hadooptest-VM$ sudo hadoop-validate-setup.sh --user=hdfs
If you get "teragen, terasort, teravalidate passed." near the end of the output, everything is ok.

Hadoop Tracker websites

JobTracker website: http://localhost:50030/
NameNode website: http://localhost:50070/
TaskTracker website: http://localhost:50060/
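On a headless machine you can verify that these interfaces respond with a quick fetch instead of a browser (just a convenience check; any HTML output means the server is up):
hadooptest@hadooptest-VM$ wget -qO- http://localhost:50030/ | head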

Step 5: Example MapReduce job using word count

5.1 Download the Plain Text UTF-8 encoded files of a few books from Project Gutenberg and store them in a local directory (here using /home/hadooptest/gutenberg).
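For example (this particular book URL is illustrative only; any plain-text UTF-8 files will do):
hadooptest@hadooptest-VM$ mkdir -p /home/hadooptest/gutenberg
hadooptest@hadooptest-VM$ wget -P /home/hadooptest/gutenberg http://www.gutenberg.org/cache/epub/20417/pg20417.txt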
5.2 Download the MapReduce example JAR file (hadoop-examples-0.20.203.0.jar) to any local folder (here using /home/hadooptest).
5.3 To run the MapReduce program, we need to copy these files from the local directory into an HDFS directory. For this purpose, first switch to the hdfs user:
hadooptest@hadooptest-VM$ sudo su hdfs
Copy the local files to HDFS:
hdfs@hadooptest-VM$ hadoop dfs -copyFromLocal /home/hadooptest/gutenberg /user/hdfs/gutenberg
Check the contents of the HDFS directory:
hdfs@hadooptest-VM$ hadoop dfs -ls /user/hdfs/gutenberg
5.4 Move to the folder containing the downloaded JAR file.
5.5 Run the following command to execute the program:
hdfs@hadooptest-VM:/home/hadooptest$ hadoop jar hadoop-examples-0.20.203.0.jar wordcount /user/hdfs/gutenberg /user/hdfs/gutenberg-out
Here /user/hdfs/gutenberg is the input directory and /user/hdfs/gutenberg-out is the output directory. Both must be in the HDFS file system (the JAR itself, by contrast, is a local file), and the output directory must not exist beforehand; Hadoop creates it and the job fails if it is already present.
The job will take some time depending on your system configuration. You can track its progress using the Hadoop tracker websites.

5.6 Check the result of the program:
hdfs@hadooptest-VM:/home/hadooptest$ hadoop dfs -cat /user/hdfs/gutenberg-out/part-r-00000
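If you prefer to read the result as a single local file, dfs -getmerge concatenates the job's output files into one file on local disk (the target path here is just an example):
hdfs@hadooptest-VM:/home/hadooptest$ hadoop dfs -getmerge /user/hdfs/gutenberg-out /tmp/gutenberg-wordcount.txt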

9 comments:

  1. If the JobTracker doesn't start for this particular version of Hadoop and you see permission issues, try this:
    sudo -u hdfs hadoop fs -mkdir /mapred
    sudo -u hdfs hadoop fs -chown mapred /mapred

    Taken from http://blog.cuongnv.com/2011/12/how-to-setup-hadoop-100-on-rhelcentos.html

  2. I've tried these instructions on two different Ubuntu configurations. You need to do the JobTracker fix specified by the post on Feb 3, 2012 07:22 AM. You then need to run 'hadoop job-tracker restart'.

    However, I cannot get the map reduce step to work (tried this now on two different computers - one running Ubuntu in Virtual Box, another just running Ubuntu).

    When I run hadoop-validate-setup.sh it hangs with output:

    map 0% reduce 0%

    If I go to localhost:50030 I can see the job. But it just won't run. I get the exact same result trying the grep example and PI calculation.

    What do I need to do to get this to work? It seems like it may be a permissions issue but I don't know what it could be.

  3. I keep getting an 'hdfs, mapred user not found' error. Should I create these users in the system?

    Proceed with setup? (y/n) y
    chown: invalid user: `mapred:hadoop'
    chown: invalid user: `hdfs:hadoop'
    chown: invalid user: `mapred:hadoop'
    chown: invalid user: `hdfs:hadoop'
    chown: invalid group: `root:hadoop'
    chown: invalid user: `hdfs:hadoop'
    chown: invalid user: `mapred:hadoop'
    chown: invalid group: `root:hadoop'
    chown: invalid group: `root:hadoop'
    chown: invalid group: `root:hadoop'
    Configuration setup is completed.
    chown: invalid user: `hdfs:hadoop'
    * Formatting Apache Hadoop Name Node hadoop-namenode sudo: unknown user: hdfs
    * Starting Apache Hadoop Name Node server hadoop-namenode start-stop-daemon: user 'hdfs' not found
    [fail]
    * Starting Apache Hadoop Data Node server hadoop-datanode start-stop-daemon: user 'hdfs' not found
    [fail]
    Unknown id: hdfs
    Unknown id: hdfs
    Unknown id: hdfs
    Unknown id: hdfs
    * Starting Apache Hadoop Job Tracker server hadoop-jobtracker start-stop-daemon: user 'mapred' not found
    [fail]
    * Starting Apache Hadoop Task Tracker server hadoop-tasktracker start-stop-daemon: user 'mapred' not found

    Replies
    1. Run 'hadoop-setup-single-node.sh' with sudo or as the root user. There is no need to create the users explicitly; Hadoop creates them automatically during installation.

    2. I have the same problem, even when running the script with sudo.

    3. I think your problem is with the installation. Remove the current installation completely and reinstall the package with sudo or as the root user.
