Overview

  • The goal is to "quickly" build a Hadoop environment and play with it as a learning exercise
    • With the prerequisites in place, it takes about 20 minutes to finish
    • Let's get it built quickly and focus on playing with Hadoop
  • Configuration
    • JobTracker + NameNode + TaskTracker + DataNode, all on a single virtual machine
  • Main components (see the prerequisites below)

Overview diagram (this is roughly what we will build)

MyHadoopSingle.jpg

What I prepared this time (prerequisites)

One physical server with VMware vSphere Hypervisor (ESXi 4.1) installed

NEC Express5800/110Ge
OS VMware vSphere Hypervisor (ESXi 4.1)
CPU Intel Core2Quad Q9550 @2.83GHz
Memory DDR2-800 @8GB
HDD iSCSI @2TB

One virtual machine with CentOS 5.5 x86_64 installed

Virtual machine
OS CentOS 5.5 x86_64 (installed with the [Server] option selected)
CPU 2 cores
Memory 1 GB
HDD 30 GB

Building the environment

Set the server name and IP address

These depend on your own environment, but I configured mine as follows.

  • Server name: hdpstd001
# vi /etc/sysconfig/network
HOSTNAME=hdpstd001
  • IP address: 10.10.0.99
# vi /etc/sysconfig/network-scripts/ifcfg-eth0
IPADDR=10.10.0.99

Register the server name in hosts

# vi /etc/hosts
10.10.0.99    hdpstd001
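After setting the hostname and the hosts entry, it is worth confirming that they actually took effect before going further. A quick check, assuming the hdpstd001 / 10.10.0.99 values used above (the hostname from /etc/sysconfig/network is otherwise only picked up at the next boot):

```shell
# Apply the new hostname immediately (otherwise it takes effect at next boot)
hostname hdpstd001

# Confirm the hostname and that it resolves via /etc/hosts
hostname                   # should print hdpstd001
getent hosts hdpstd001     # should show 10.10.0.99
ping -c 1 hdpstd001        # should answer from 10.10.0.99
```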

Make DNS name resolution work

# vi /etc/resolv.conf
search  local.sotm.jp
nameserver      8.8.8.8
nameserver      8.8.4.4
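The yum steps later pull packages from archive.cloudera.com, so it helps to confirm that DNS and outbound HTTP work before moving on. A quick sanity check (nslookup comes from the bind-utils package on CentOS 5):

```shell
# Resolve an external name through the nameservers configured above
nslookup archive.cloudera.com

# Alternatively, confirm outbound HTTP works end to end
curl -sI http://archive.cloudera.com/ | head -1
```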

Obtaining the Java Development Kit (JDK)

  • Access the site -> SUN Developer Network
  • "Java Platform, Standard Edition" -> Download JDK
  • "Platform" -> select "Linux x64" (to match the 64-bit OS)
  • Click "Continue"
  • Download "jdk-6u22-linux-x64-rpm.bin"
  • Transfer it to /var/tmp on the server created above

Installing Java

Unpack the package

# cd /var/tmp
# sh jdk-6u22-linux-x64-rpm.bin
Unpacking...
Checksumming...
Extracting...
UnZipSFX 5.50 of 17 February 2002, by Info-ZIP (Zip-Bugs@lists.wku.edu).
inflating: jdk-6u22-linux-amd64.rpm
inflating: sun-javadb-common-10.5.3-0.2.i386.rpm
inflating: sun-javadb-core-10.5.3-0.2.i386.rpm
inflating: sun-javadb-client-10.5.3-0.2.i386.rpm
inflating: sun-javadb-demo-10.5.3-0.2.i386.rpm
inflating: sun-javadb-docs-10.5.3-0.2.i386.rpm
inflating: sun-javadb-javadoc-10.5.3-0.2.i386.rpm
Preparing...             ########################################### [100%]
1:jdk                    ########################################### [100%]
Unpacking JAR files...
rt.jar...
jsse.jar...
charsets.jar...
tools.jar...
localedata.jar...
plugin.jar...
javaws.jar...
deploy.jar...
Installing JavaDB
Preparing...             ########################################### [100%]
1:sun-javadb-common      ########################################### [ 17%]
2:sun-javadb-core        ########################################### [ 33%]
3:sun-javadb-client      ########################################### [ 50%]
4:sun-javadb-demo        ########################################### [ 67%]
5:sun-javadb-docs        ########################################### [ 83%]
6:sun-javadb-javadoc     ########################################### [100%]

Java(TM) SE Development Kit 6 successfully installed.

Product Registration is FREE and includes many benefits:
 * Notification of new versions, patches, and updates
 * Special offers on Sun products, services and training
 * Access to early releases and documentation

Product and system data will be collected. If your configuration
supports a browser, the Sun Product Registration form for
the JDK will be presented. If you do not register, none of
this information will be saved. You may also register your
JDK later by opening the register.html file (located in
the JDK installation directory) in a browser.

For more information on what data Registration collects and
how it is managed and used, see:
http://java.sun.com/javase/registration/JDKRegistrationPrivacy.html

Press Enter to continue.....
       Press Enter here
Done.
# rehash

Verify the installation

# java -version
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
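Hadoop's startup scripts look for JAVA_HOME, so it is a good idea to set it system-wide now. A sketch, assuming the Sun JDK RPM created the usual /usr/java/default symlink pointing at the installed JDK (check with `ls -l /usr/java` first):

```shell
# Export JAVA_HOME for all login shells
cat > /etc/profile.d/java.sh <<'EOF'
export JAVA_HOME=/usr/java/default
export PATH=$JAVA_HOME/bin:$PATH
EOF

# Pick it up in the current shell and confirm it points at the right JDK
. /etc/profile.d/java.sh
echo $JAVA_HOME
$JAVA_HOME/bin/java -version
```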

Installing Hadoop

Configure Cloudera's yum repository

# curl http://archive.cloudera.com/redhat/cdh/cloudera-cdh3.repo > /etc/yum.repos.d/cloudera-cdh3.repo
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dload  Upload   Total   Spent    Left  Speed
100   211  100   211    0     0    234      0 --:--:-- --:--:-- --:--:--     0

Update yum

# yum -y update yum
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * addons: ftp.iij.ad.jp
 * base: ftp.iij.ad.jp
 * extras: ftp.iij.ad.jp
 * updates: ftp.iij.ad.jp
addons                                                                               |  951 B     00:00
base                                                                                 | 2.1 kB     00:00
cloudera-cdh3                                                                        |  951 B     00:00
cloudera-cdh3/primary                                                                |  19 kB     00:00
cloudera-cdh3                                                                                         68/68
extras                                                                               | 2.1 kB     00:00
updates                                                                              | 1.9 kB     00:00
Setting up Update Process
No Packages marked for Update

Install the Hadoop package

# yum -y install hadoop-0.20-conf-pseudo
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * addons: ftp.iij.ad.jp
 * base: ftp.iij.ad.jp
 * extras: ftp.iij.ad.jp
 * updates: ftp.iij.ad.jp
Setting up Install Process
Resolving Dependencies
 -->Running transaction check
 ---> Package hadoop-0.20-conf-pseudo.noarch 0:0.20.2+737-1 set to be updated
 --> Processing Dependency: hadoop-0.20-namenode = 0.20.2+737 for package: hadoop-0.20-conf-pseudo
 --> Processing Dependency: hadoop-0.20-secondarynamenode = 0.20.2+737 for package: hadoop-0.20-conf-pseudo
 --> Processing Dependency: hadoop-0.20-tasktracker = 0.20.2+737 for package: hadoop-0.20-conf-pseudo
 --> Processing Dependency: hadoop-0.20 = 0.20.2+737 for package: hadoop-0.20-conf-pseudo
 --> Processing Dependency: hadoop-0.20-jobtracker = 0.20.2+737 for package: hadoop-0.20-conf-pseudo
 --> Processing Dependency: hadoop-0.20-datanode = 0.20.2+737 for package: hadoop-0.20-conf-pseudo
 --> Running transaction check
 ---> Package hadoop-0.20.noarch 0:0.20.2+737-1 set to be updated
 ---> Package hadoop-0.20-datanode.noarch 0:0.20.2+737-1 set to be updated
 ---> Package hadoop-0.20-jobtracker.noarch 0:0.20.2+737-1 set to be updated
 ---> Package hadoop-0.20-namenode.noarch 0:0.20.2+737-1 set to be updated
 ---> Package hadoop-0.20-secondarynamenode.noarch 0:0.20.2+737-1 set to be updated
 ---> Package hadoop-0.20-tasktracker.noarch 0:0.20.2+737-1 set to be updated
 --> Finished Dependency Resolution

Dependencies Resolved

============================================================================================================
Package                                Arch            Version                Repository              Size
============================================================================================================
Installing:
hadoop-0.20-conf-pseudo                noarch          0.20.2+737-1           cloudera-cdh3           11 k
Installing for dependencies:
hadoop-0.20                            noarch          0.20.2+737-1           cloudera-cdh3           37 M
hadoop-0.20-datanode                   noarch          0.20.2+737-1           cloudera-cdh3          4.3 k
hadoop-0.20-jobtracker                 noarch          0.20.2+737-1           cloudera-cdh3          4.4 k
hadoop-0.20-namenode                   noarch          0.20.2+737-1           cloudera-cdh3          4.4 k
hadoop-0.20-secondarynamenode          noarch          0.20.2+737-1           cloudera-cdh3          4.4 k
hadoop-0.20-tasktracker                noarch          0.20.2+737-1           cloudera-cdh3          4.4 k

Transaction Summary
============================================================================================================
Install       7 Package(s)
Upgrade       0 Package(s)

Total download size: 37 M
Downloading Packages:
(1/7): hadoop-0.20-datanode-0.20.2+737-1.noarch.rpm                                  | 4.3 kB     00:00
(2/7): hadoop-0.20-tasktracker-0.20.2+737-1.noarch.rpm                               | 4.4 kB     00:00
(3/7): hadoop-0.20-namenode-0.20.2+737-1.noarch.rpm                                  | 4.4 kB     00:00
(4/7): hadoop-0.20-secondarynamenode-0.20.2+737-1.noarch.rpm                         | 4.4 kB     00:00
(5/7): hadoop-0.20-jobtracker-0.20.2+737-1.noarch.rpm                                | 4.4 kB     00:00
(6/7): hadoop-0.20-conf-pseudo-0.20.2+737-1.noarch.rpm                               |  11 kB     00:00
(7/7): hadoop-0.20-0.20.2+737-1.noarch.rpm                                           |  37 MB     00:21
 ------------------------------------------------------------------------------------------------------------
Total                                                                       1.5 MB/s |  37 MB     00:24
Running rpm_check_debug
Running Transaction Test
Finished Transaction Test
Transaction Test Succeeded
Running Transaction
Installing     : hadoop-0.20                                                                          1/7
Installing     : hadoop-0.20-datanode                                                                 2/7
Installing     : hadoop-0.20-namenode                                                                 3/7
Installing     : hadoop-0.20-jobtracker                                                               4/7
Installing     : hadoop-0.20-secondarynamenode                                                        5/7
Installing     : hadoop-0.20-tasktracker                                                              6/7
Installing     : hadoop-0.20-conf-pseudo                                                              7/7

Installed:
hadoop-0.20-conf-pseudo.noarch 0:0.20.2+737-1

Dependency Installed:
hadoop-0.20.noarch 0:0.20.2+737-1                       hadoop-0.20-datanode.noarch 0:0.20.2+737-1
hadoop-0.20-jobtracker.noarch 0:0.20.2+737-1            hadoop-0.20-namenode.noarch 0:0.20.2+737-1
hadoop-0.20-secondarynamenode.noarch 0:0.20.2+737-1     hadoop-0.20-tasktracker.noarch 0:0.20.2+737-1

Complete!

Check the Hadoop configuration files

# /usr/sbin/alternatives --display hadoop-0.20-conf
hadoop-0.20-conf - status is auto.
 link currently points to /etc/hadoop-0.20/conf.pseudo
/etc/hadoop-0.20/conf.empty - priority 10
/etc/hadoop-0.20/conf.pseudo - priority 30
Current `best' version is /etc/hadoop-0.20/conf.pseudo.
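conf.pseudo wins here because its priority (30) beats conf.empty's (10). If you later want a configuration you can edit without touching the packaged files, a common pattern is to copy conf.pseudo and register the copy at an even higher priority (a sketch; conf.my_cluster and the priority 50 are arbitrary choices):

```shell
# Copy the pseudo-distributed config as a starting point
cp -r /etc/hadoop-0.20/conf.pseudo /etc/hadoop-0.20/conf.my_cluster

# Register it with priority 50, higher than conf.pseudo's 30, so auto mode picks it
/usr/sbin/alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf \
    /etc/hadoop-0.20/conf.my_cluster 50

# Confirm the link now points at the new directory
/usr/sbin/alternatives --display hadoop-0.20-conf
```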

Start Hadoop

# for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
Starting Hadoop datanode daemon (hadoop-datanode): starting datanode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-datanode-hdpstd001.out [  OK  ]
Starting Hadoop jobtracker daemon (hadoop-jobtracker): starting jobtracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-jobtracker-hdpstd001.out [  OK  ]
Starting Hadoop namenode daemon (hadoop-namenode): starting namenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-hdpstd001.out [  OK  ]
Starting Hadoop secondarynamenode daemon (hadoop-secondarynamenode): starting secondarynamenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-hdpstd001.out [  OK  ]
Starting Hadoop tasktracker daemon (hadoop-tasktracker): starting tasktracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-tasktracker-hdpstd001.out [  OK  ]

# chkconfig --list | grep hadoop
hadoop-0.20-datanode    0:off   1:off   2:on    3:on    4:on    5:on    6:off
hadoop-0.20-jobtracker  0:off   1:off   2:on    3:on    4:on    5:on    6:off
hadoop-0.20-namenode    0:off   1:off   2:on    3:on    4:on    5:on    6:off
hadoop-0.20-secondarynamenode   0:off   1:off   2:on    3:on    4:on    5:on    6:off
hadoop-0.20-tasktracker 0:off   1:off   2:on    3:on    4:on    5:on    6:off
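The same loop used to start the daemons works for stopping or restarting them all at once, which is handy after changing configuration:

```shell
# Stop all Hadoop daemons (mirror of the start loop above)
for service in /etc/init.d/hadoop-0.20-*; do sudo $service stop; done

# Or restart them in one go after editing the configuration
for service in /etc/init.d/hadoop-0.20-*; do sudo $service restart; done
```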

Verifying that Hadoop works

Verify that HDFS works

# hadoop fs -mkdir /foo
# hadoop fs -ls /
Found 2 items
drwxr-xr-x   - root   supergroup          0 2010-11-14 16:39 /foo
drwxr-xr-x   - mapred supergroup          0 2010-11-14 16:32 /var
# hadoop fs -rmr /foo
Deleted hdfs://localhost/foo
# hadoop fs -ls /
Found 1 items
drwxr-xr-x   - mapred supergroup          0 2010-11-14 16:32 /var
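Beyond mkdir/ls/rmr, round-tripping a real file is a good smoke test of HDFS. A sketch using /etc/hosts as convenient sample data (the /test path is arbitrary):

```shell
# Copy a local file into HDFS, read it back, then clean up
hadoop fs -mkdir /test
hadoop fs -put /etc/hosts /test/hosts.txt            # local -> HDFS
hadoop fs -cat /test/hosts.txt                       # print the file stored in HDFS
hadoop fs -get /test/hosts.txt /tmp/hosts.from-hdfs  # HDFS -> local
hadoop fs -rmr /test
```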

Verify that MapReduce works

# hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar pi 2 100000
Number of Maps  = 2
Samples per Map = 100000
Wrote input for Map #0
Wrote input for Map #1
Starting Job
10/11/14 16:40:24 INFO mapred.FileInputFormat: Total input paths to process : 2
10/11/14 16:40:25 INFO mapred.JobClient: Running job: job_201011031631_0001
10/11/14 16:40:26 INFO mapred.JobClient:  map 0% reduce 0%
10/11/14 16:40:45 INFO mapred.JobClient:  map 50% reduce 0%
10/11/14 16:40:49 INFO mapred.JobClient:  map 100% reduce 0%
10/11/14 16:41:08 INFO mapred.JobClient:  map 100% reduce 100%
10/11/14 16:41:11 INFO mapred.JobClient: Job complete: job_201011031631_0001
10/11/14 16:41:11 INFO mapred.JobClient: Counters: 23
10/11/14 16:41:11 INFO mapred.JobClient:   Job Counters
10/11/14 16:41:11 INFO mapred.JobClient:     Launched reduce tasks=1
10/11/14 16:41:11 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=35030
10/11/14 16:41:11 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
10/11/14 16:41:11 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
10/11/14 16:41:11 INFO mapred.JobClient:     Launched map tasks=2
10/11/14 16:41:11 INFO mapred.JobClient:     Data-local map tasks=2
10/11/14 16:41:11 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=20012
10/11/14 16:41:11 INFO mapred.JobClient:   FileSystemCounters
10/11/14 16:41:11 INFO mapred.JobClient:     FILE_BYTES_READ=50
10/11/14 16:41:11 INFO mapred.JobClient:     HDFS_BYTES_READ=468
10/11/14 16:41:11 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=170
10/11/14 16:41:11 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
10/11/14 16:41:11 INFO mapred.JobClient:   Map-Reduce Framework
10/11/14 16:41:11 INFO mapred.JobClient:     Reduce input groups=2
10/11/14 16:41:11 INFO mapred.JobClient:     Combine output records=0
10/11/14 16:41:11 INFO mapred.JobClient:     Map input records=2
10/11/14 16:41:11 INFO mapred.JobClient:     Reduce shuffle bytes=56
10/11/14 16:41:11 INFO mapred.JobClient:     Reduce output records=0
10/11/14 16:41:11 INFO mapred.JobClient:     Spilled Records=8
10/11/14 16:41:11 INFO mapred.JobClient:     Map output bytes=36
10/11/14 16:41:11 INFO mapred.JobClient:     Map input bytes=48
10/11/14 16:41:11 INFO mapred.JobClient:     Combine input records=0
10/11/14 16:41:11 INFO mapred.JobClient:     Map output records=4
10/11/14 16:41:11 INFO mapred.JobClient:     SPLIT_RAW_BYTES=232
10/11/14 16:41:11 INFO mapred.JobClient:     Reduce input records=4
Job Finished in 46.664 seconds
Estimated value of Pi is 3.14118000000000000000
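The same examples jar contains other jobs; wordcount is the classic second test. A sketch that counts the words in a file uploaded to HDFS (the input/output paths are arbitrary, and the output path must not already exist):

```shell
# Prepare input in HDFS, run wordcount, and inspect the result
hadoop fs -mkdir /wc-input
hadoop fs -put /etc/hosts /wc-input/
hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar wordcount /wc-input /wc-output
hadoop fs -cat '/wc-output/part-*' | head
```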

Take a look with a browser (Web UI)

Look at HDFS

  • http://hdpstd001:50070/

namenode_single.png

Look at MapReduce

  • http://hdpstd001:50030/

jobtracker_single.png
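If the pages do not load from your browser, you can check from the server itself whether the daemons are actually listening (50070 is the NameNode UI, 50030 the JobTracker UI); a firewall between you and the VM is the other usual suspect:

```shell
# Confirm the web UIs respond locally
curl -sI http://localhost:50070/ | head -1   # NameNode UI
curl -sI http://localhost:50030/ | head -1   # JobTracker UI

# Or just check the listening sockets
netstat -tln | grep -E '50070|50030'
```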
