Friday, January 27, 2012

Hadoop 1.0.0 single node configuration on Ubuntu

Hadoop is a framework for distributed processing of large data sets across clusters of computers. It provides reliable data storage through the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using the MapReduce model. You can find more information at the following link:
http://wiki.apache.org/hadoop/
Here I am describing my own experience configuring Hadoop 1.0.0 on an Ubuntu box. I am using Ubuntu 11.04 for this configuration.

Step 1: Download and install Oracle JDK

Install JDK 1.6 or above using the following steps.
Add the repository to your apt-get:
hadooptest@hadooptest-VM$ sudo apt-get install python-software-properties
hadooptest@hadooptest-VM$ sudo add-apt-repository ppa:sun-java-community-team/sun-java6

Update the source list
hadooptest@hadooptest-VM$ sudo apt-get update

Install sun-java6-jdk
hadooptest@hadooptest-VM$ sudo apt-get install sun-java6-jdk

Select Sun’s Java as the default on your machine.
hadooptest@hadooptest-VM$ sudo update-java-alternatives -s java-6-sun

After the installation, check the Java version using
hadooptest@hadooptest-VM$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)
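
Hadoop locates the JDK through the JAVA_HOME environment variable. If it is not already set, you can export it in your shell profile; /usr/lib/jvm/java-6-sun is the usual install path for the sun-java6-jdk package, but verify the path on your own machine:
hadooptest@hadooptest-VM$ echo "export JAVA_HOME=/usr/lib/jvm/java-6-sun" >> ~/.bashrc
hadooptest@hadooptest-VM$ source ~/.bashrc
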
Step 2: Download and install Hadoop

Download the i386 or amd64 version (according to your OS architecture) of the .deb package from http://ftp.jaist.ac.jp/pub/apache/hadoop/common/hadoop-1.0.0/. Install Hadoop by double-clicking the file or by using the dpkg command.

hadooptest@hadooptest-VM$ sudo dpkg -i hadoop_1.0.0-1_i386.deb
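
To confirm the package installed correctly, check the reported version (the build details in your output may differ):
hadooptest@hadooptest-VM$ hadoop version
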
Step 3: Set up Hadoop for a single node

Set up Hadoop for a single node using the following command
hadooptest@hadooptest-VM$ sudo hadoop-setup-single-node.sh
Answer "yes" for all questions. Service will automatically started after the installation.

Step 4: Test the Hadoop configuration
hadooptest@hadooptest-VM$ sudo hadoop-validate-setup.sh --user=hdfs
If you see "teragen, terasort, teravalidate passed." near the end of the output, everything is OK.

Hadoop Tracker websites

JobTracker website: http://localhost:50030/
NameNode website: http://localhost:50070/
TaskTracker website: http://localhost:50060/
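
If you are working on a machine without a browser, and assuming curl is installed, you can at least confirm that each web interface responds (an HTTP status of 200 means the page is up):
hadooptest@hadooptest-VM$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50030/
hadooptest@hadooptest-VM$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/
hadooptest@hadooptest-VM$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50060/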

Step 5: Example MapReduce job using word count

5.1. Download the Plain Text UTF-8 version of a few ebooks from Project Gutenberg and store them in a local directory (here using /home/hadooptest/gutenberg)
5.2. Download the MapReduce example program JAR file (hadoop-examples-0.20.203.0.jar) to any local folder (here using /home/hadooptest)
5.3. To run the MapReduce program, we need to copy these files from the local directory into an HDFS directory. For this purpose, first switch to the hdfs user using
hadooptest@hadooptest-VM$ su hdfs
Copy the local files to HDFS using
hdfs@hadooptest-VM$ hadoop dfs -copyFromLocal /home/hadooptest/gutenberg /user/hdfs/gutenberg
Check the contents of the HDFS directory using
hdfs@hadooptest-VM$ hadoop dfs -ls /user/hdfs/gutenberg
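You can also peek at the start of one of the copied files to verify the contents arrived intact; bookname.txt below is a placeholder for one of your downloaded files:
hdfs@hadooptest-VM$ hadoop dfs -cat /user/hdfs/gutenberg/bookname.txt | head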
5.4. Move to the folder containing the downloaded JAR file.
5.5. Run the following command to execute the program. Note that the JAR path is a local file system path (here the JAR sits in the current directory), while the input and output paths are HDFS paths.
hdfs@hadooptest-VM:/home/hadooptest$ hadoop jar hadoop-examples-0.20.203.0.jar wordcount /user/hdfs/gutenberg /user/hdfs/gutenberg-out
Here /user/hdfs/gutenberg is the input directory and /user/hdfs/gutenberg-out is the output directory. Both the input and output directories must be in the HDFS file system.
The job will take some time depending on your system configuration. You can track its progress using the Hadoop tracker websites listed above.
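Note that the job will fail if the output directory already exists, so if you want to rerun the example, remove the old output first (Hadoop 1.x uses -rmr for recursive delete):
hdfs@hadooptest-VM$ hadoop dfs -rmr /user/hdfs/gutenberg-out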

5.6. Check the result of the program using
hdfs@hadooptest-VM:/home/hadooptest$ hadoop dfs -cat /user/hdfs/gutenberg-out/part-r-00000
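
If the job ran with more than one reducer, the output will be split across several part-r-* files. You can merge them all into a single local file using getmerge:
hdfs@hadooptest-VM$ hadoop dfs -getmerge /user/hdfs/gutenberg-out /home/hadooptest/gutenberg-out.txt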