Hadoop Initialization


Introduction

  • Build a Hadoop platform on Docker, covering the installation of Docker, Java, Scala, Hadoop, HBase, and Spark
  • The cluster has 5 machines with hostnames hadoop01, hadoop02, hadoop03, hadoop04, and hadoop05; hadoop01 is the master and the others are slaves
  • JDK 1.8
  • Scala 2.11.6
  • Hadoop 3.2.1
  • HBase 2.3.3
  • Spark 2.4.7

Preparation

Configure Docker

  • Modern Docker networks provide DNS resolution, so we can create a dedicated virtual network for the Hadoop cluster with the following command.
$ docker network create --driver=bridge hadoop
  • The command above creates a virtual bridge network named hadoop, which provides automatic DNS resolution between the containers attached to it.
  • List the Docker networks with the command below; the newly created hadoop bridge network should appear.
$ docker network ls

NETWORK ID     NAME     DRIVER    SCOPE
06548c9440f8   bridge   bridge    local
b21dba8dc351   hadoop   bridge    local
eb48a64969d1   host     host      local
3e8c9d771ec8   none     null      local
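  • Optionally, inspect the network to confirm its subnet before attaching containers; this is just a quick check, and the --format template below is one possible way to print the subnet (the value will differ on your machine).
$ docker network inspect hadoop --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}'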
  • Start a container from the ubuntu 16.04 image.
$ docker run -it --name temp-container ubuntu:16.04 /bin/bash

Install Scala and Java

  • Update the package index, then install JDK 1.8.
$ apt update
$ apt install openjdk-8-jdk
  • Verify the installation.
$ java -version

openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
  • Then install Scala.
$ apt install scala
  • Verify the installation.
$ scala

Welcome to Scala version 2.11.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_191).
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Install Vim and Network Tools

  • Install vim for editing files.
$ apt install vim
  • Install net-tools.
$ apt install net-tools

Install SSH

  • Install SSH and configure passwordless login. Since all the containers will later be started from the same image (like 5 locks and keys cast from the same mold, any key opens any lock), configuring passwordless SSH login to the container itself is enough.
  • Install the SSH server.
$ apt-get install openssh-server
  • Install the SSH client.
$ apt-get install openssh-client
  • Generate a key pair. No input is needed; just keep pressing Enter. The keys are placed in the .ssh folder under the current user's home directory.
$ cd ~
$ ssh-keygen -t rsa -P ""
  • Append the public key to the authorized_keys file.
$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
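  • Depending on the image, sshd may also insist on strict permissions for these files; if the login test below still prompts for a password, tightening them is a reasonable first thing to try.
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys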
  • Start the SSH service.
$ service ssh start
* Starting OpenBSD Secure Shell server sshd [ OK ]
  • Test passwordless login to the container itself.
$ ssh 127.0.0.1

Welcome to Ubuntu 16.04.6 LTS (GNU/Linux 4.15.0-45-generic x86_64)

* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
Last login: Tue Mar 19 07:46:14 2019 from 127.0.0.1
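  • To confirm the login really is passwordless, you can also run a non-interactive check; BatchMode makes ssh fail instead of prompting for a password.
$ ssh -o BatchMode=yes 127.0.0.1 hostname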
  • Modify the ~/.bashrc file so that the SSH service starts automatically whenever a shell starts.
  • Open ~/.bashrc with vim.
$ vim ~/.bashrc
  • Append the command at the end of the file.
if [ -f ~/.bash_aliases ]; then
    . ~/.bash_aliases
fi

# enable programmable completion features (you don't need to enable
# this, if it's already enabled in /etc/bash.bashrc and /etc/profile
# sources /etc/bash.bashrc).
#if [ -f /etc/bash_completion ] && ! shopt -oq posix; then
#    . /etc/bash_completion
#fi
service ssh start

Install Hadoop

Download and Extract Hadoop

$ wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
  • Extract it to /usr/local and rename the folder.
$ tar -zxvf hadoop-3.2.1.tar.gz -C /usr/local/
$ cd /usr/local/
$ mv hadoop-3.2.1 hadoop

Modify the Configuration Files

  • Modify the /etc/profile file and add the following environment variables to it.
  • First, open /etc/profile with vim.
$ vim /etc/profile
  • Append the following content.

JAVA_HOME is the JDK installation path; with an apt install it is the path below, and you can check it with update-alternatives --config java.

#java
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
#hadoop
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
export HDFS_DATANODE_USER=root
export HDFS_DATANODE_SECURE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export HDFS_NAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
  • Apply the environment variables.
$ source /etc/profile
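  • As a quick sanity check (assuming the paths above match your installation), confirm that the variables resolve and that the hadoop binary is on the PATH.
$ echo $JAVA_HOME
$ echo $HADOOP_HOME
$ hadoop version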
  • The following files are all under the directory /usr/local/hadoop/etc/hadoop.
  • Modify hadoop-env.sh and append the following at the end of the file.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
  • Modify core-site.xml as follows.
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://hadoop01:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop3/hadoop/tmp</value>
    </property>
</configuration>
  • Modify hdfs-site.xml as follows.
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop3/hadoop/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop3/hadoop/hdfs/data</value>
    </property>
</configuration>
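  • Hadoop generally creates these local directories itself (the name directory when the NameNode is formatted, the data directory when a DataNode starts), but if you prefer to create them up front, here is a sketch matching the paths configured above.
$ mkdir -p /home/hadoop3/hadoop/tmp
$ mkdir -p /home/hadoop3/hadoop/hdfs/name
$ mkdir -p /home/hadoop3/hadoop/hdfs/data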
  • Modify mapred-site.xml as follows.
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>
            /usr/local/hadoop/etc/hadoop,
            /usr/local/hadoop/share/hadoop/common/*,
            /usr/local/hadoop/share/hadoop/common/lib/*,
            /usr/local/hadoop/share/hadoop/hdfs/*,
            /usr/local/hadoop/share/hadoop/hdfs/lib/*,
            /usr/local/hadoop/share/hadoop/mapreduce/*,
            /usr/local/hadoop/share/hadoop/mapreduce/lib/*,
            /usr/local/hadoop/share/hadoop/yarn/*,
            /usr/local/hadoop/share/hadoop/yarn/lib/*
        </value>
    </property>
</configuration>
  • Modify yarn-site.xml as follows.
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop01</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
  • Modify the workers file to list all the nodes.
hadoop01
hadoop02
hadoop03
hadoop04
hadoop05

Start the Cluster in Docker

  • First, commit the current container as an image and check that it appears in the image list.
$ docker commit -m "hadoop" -a "hadoop" temp-container lushan123888/hadoop:3.2.1
  • Open 5 terminals and run the following commands, one per container.
$ docker run -itd --network hadoop -p 9870:9870 -p 8088:8088 \
--hostname "hadoop01" \
--name "hadoop01" \
lushan123888/hadoop:3.2.1 /bin/bash
  • The first command starts hadoop01, which acts as the master node, so its ports are exposed to make the web UIs reachable.
  • The remaining four commands are almost identical.
$ docker run -itd --network hadoop -h "hadoop02" --name "hadoop02" lushan123888/hadoop:3.2.1 /bin/bash
$ docker run -itd --network hadoop -h "hadoop03" --name "hadoop03" lushan123888/hadoop:3.2.1 /bin/bash
$ docker run -itd --network hadoop -h "hadoop04" --name "hadoop04" lushan123888/hadoop:3.2.1 /bin/bash
$ docker run -itd --network hadoop -h "hadoop05" --name "hadoop05" lushan123888/hadoop:3.2.1 /bin/bash
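  • Before moving on, it is worth confirming that all five containers are still running (a container started with /bin/bash but no TTY exits immediately, which is why -it is combined with -d above).
$ docker ps --format '{{.Names}}\t{{.Status}}'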
  • Next, enter the hadoop01 container and start the Hadoop cluster.
$ docker exec -it hadoop01 /bin/bash
  • First, format the NameNode (the command below assumes the Hadoop bin directory, /usr/local/hadoop/bin, as the working directory).
$ ./hadoop namenode -format
  • Enter Hadoop's sbin directory.
$ cd /usr/local/hadoop/sbin/
  • Start everything.
$ ./start-all.sh

Starting namenodes on [hadoop01]
hadoop01: Warning: Permanently added 'hadoop01,172.18.0.7' (ECDSA) to the list of known hosts.
Starting datanodes
hadoop05: Warning: Permanently added 'hadoop05,172.18.0.11' (ECDSA) to the list of known hosts.
hadoop02: Warning: Permanently added 'hadoop02,172.18.0.8' (ECDSA) to the list of known hosts.
hadoop03: Warning: Permanently added 'hadoop03,172.18.0.9' (ECDSA) to the list of known hosts.
hadoop04: Warning: Permanently added 'hadoop04,172.18.0.10' (ECDSA) to the list of known hosts.
hadoop03: WARNING: /usr/local/hadoop/logs does not exist. Creating.
hadoop05: WARNING: /usr/local/hadoop/logs does not exist. Creating.
hadoop02: WARNING: /usr/local/hadoop/logs does not exist. Creating.
hadoop04: WARNING: /usr/local/hadoop/logs does not exist. Creating.
Starting secondary namenodes [hadoop01]
Starting resourcemanager
Starting nodemanagers
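  • To see which daemons ended up on each node, jps (shipped with the JDK) is a convenient check; a rough sketch, run from hadoop01 (if jps is not on the non-interactive PATH, use the full path under $JAVA_HOME/bin).
$ jps
$ for h in hadoop02 hadoop03 hadoop04 hadoop05; do echo "== $h =="; ssh $h jps; done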
  • Open ports 8088 and 9870 on the host machine to see the monitoring web pages.
  • Use ./hadoop dfsadmin -report to check the status of the distributed file system.
$ ./hadoop dfsadmin -report

WARNING: Use of this script to execute dfsadmin is deprecated.
WARNING: Attempting to execute replacement "hdfs dfsadmin" instead.

Configured Capacity: 5893065379840 (5.36 TB)
Present Capacity: 5237598752768 (4.76 TB)
DFS Remaining: 5237598629888 (4.76 TB)
DFS Used: 122880 (120 KB)
DFS Used%: 0.00%
Replicated Blocks:
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
Erasure Coded Block Groups:
Low redundancy block groups: 0
Block groups with corrupt internal blocks: 0
Missing block groups: 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (5):

Name: 172.18.0.10:9866 (hadoop03.hadoop)
Hostname: hadoop03
Decommission Status : Normal
Configured Capacity: 1178613075968 (1.07 TB)
DFS Used: 24576 (24 KB)
Non DFS Used: 71199543296 (66.31 GB)
DFS Remaining: 1047519793152 (975.58 GB)
DFS Used%: 0.00%
DFS Remaining%: 88.88%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Mar 19 09:12:13 UTC 2019
Last Block Report: Tue Mar 19 09:10:46 UTC 2019
Num of Blocks: 0


Name: 172.18.0.11:9866 (hadoop02.hadoop)
Hostname: hadoop02
Decommission Status : Normal
Configured Capacity: 1178613075968 (1.07 TB)
DFS Used: 24576 (24 KB)
Non DFS Used: 71199625216 (66.31 GB)
DFS Remaining: 1047519711232 (975.58 GB)
DFS Used%: 0.00%
DFS Remaining%: 88.88%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Mar 19 09:12:13 UTC 2019
Last Block Report: Tue Mar 19 09:10:46 UTC 2019
Num of Blocks: 0


Name: 172.18.0.7:9866 (hadoop01)
Hostname: hadoop01
Decommission Status : Normal
Configured Capacity: 1178613075968 (1.07 TB)
DFS Used: 24576 (24 KB)
Non DFS Used: 71199633408 (66.31 GB)
DFS Remaining: 1047519703040 (975.58 GB)
DFS Used%: 0.00%
DFS Remaining%: 88.88%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Mar 19 09:12:13 UTC 2019
Last Block Report: Tue Mar 19 09:10:46 UTC 2019
Num of Blocks: 0


Name: 172.18.0.8:9866 (hadoop05.hadoop)
Hostname: hadoop05
Decommission Status : Normal
Configured Capacity: 1178613075968 (1.07 TB)
DFS Used: 24576 (24 KB)
Non DFS Used: 71199625216 (66.31 GB)
DFS Remaining: 1047519711232 (975.58 GB)
DFS Used%: 0.00%
DFS Remaining%: 88.88%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Mar 19 09:12:13 UTC 2019
Last Block Report: Tue Mar 19 09:10:46 UTC 2019
Num of Blocks: 0


Name: 172.18.0.9:9866 (hadoop04.hadoop)
Hostname: hadoop04
Decommission Status : Normal
Configured Capacity: 1178613075968 (1.07 TB)
DFS Used: 24576 (24 KB)
Non DFS Used: 71199625216 (66.31 GB)
DFS Remaining: 1047519711232 (975.58 GB)
DFS Used%: 0.00%
DFS Remaining%: 88.88%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Mar 19 09:12:13 UTC 2019
Last Block Report: Tue Mar 19 09:10:46 UTC 2019
Num of Blocks: 0
  • The Hadoop cluster is now up and running.

Run the Built-in WordCount Example

  • Use the license file as the text to be counted.
$ cat LICENSE.txt > file1.txt
  • Create an input folder in HDFS.
$ ./hadoop fs -mkdir /input
  • Upload file1.txt to HDFS.
$ ./hadoop fs -put ../file1.txt /input
  • List the contents of the input folder in HDFS.
$ ./hadoop fs -ls /input

Found 1 items
-rw-r--r-- 2 root supergroup 150569 2019-03-19 11:13 /input/file1.txt
  • Run the wordcount example program.
$ ./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /input /output
  • The output looks like this.
2019-03-19 11:18:23,953 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/172.18.0.7:8032
2019-03-19 11:18:24,381 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1552986653954_0001
2019-03-19 11:18:24,659 INFO input.FileInputFormat: Total input files to process : 1
2019-03-19 11:18:25,095 INFO mapreduce.JobSubmitter: number of splits:1
2019-03-19 11:18:25,129 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2019-03-19 11:18:25,208 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552986653954_0001
2019-03-19 11:18:25,210 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-03-19 11:18:25,368 INFO conf.Configuration: resource-types.xml not found
2019-03-19 11:18:25,368 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-03-19 11:18:25,797 INFO impl.YarnClientImpl: Submitted application application_1552986653954_0001
2019-03-19 11:18:25,836 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1552986653954_0001/
2019-03-19 11:18:25,837 INFO mapreduce.Job: Running job: job_1552986653954_0001
2019-03-19 11:18:33,990 INFO mapreduce.Job: Job job_1552986653954_0001 running in uber mode : false
2019-03-19 11:18:33,991 INFO mapreduce.Job: map 0% reduce 0%
2019-03-19 11:18:39,067 INFO mapreduce.Job: map 100% reduce 0%
2019-03-19 11:18:45,106 INFO mapreduce.Job: map 100% reduce 100%
2019-03-19 11:18:46,124 INFO mapreduce.Job: Job job_1552986653954_0001 completed successfully
2019-03-19 11:18:46,227 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=46852
FILE: Number of bytes written=537641
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=150665
HDFS: Number of bytes written=35324
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3129
Total time spent by all reduces in occupied slots (ms)=3171
Total time spent by all map tasks (ms)=3129
Total time spent by all reduce tasks (ms)=3171
Total vcore-milliseconds taken by all map tasks=3129
Total vcore-milliseconds taken by all reduce tasks=3171
Total megabyte-milliseconds taken by all map tasks=3204096
Total megabyte-milliseconds taken by all reduce tasks=3247104
Map-Reduce Framework
Map input records=2814
Map output records=21904
Map output bytes=234035
Map output materialized bytes=46852
Input split bytes=96
Combine input records=21904
Combine output records=2981
Reduce input groups=2981
Reduce shuffle bytes=46852
Reduce input records=2981
Reduce output records=2981
Spilled Records=5962
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=111
CPU time spent (ms)=2340
Physical memory (bytes) snapshot=651853824
Virtual memory (bytes) snapshot=5483622400
Total committed heap usage (bytes)=1197998080
Peak Map Physical memory (bytes)=340348928
Peak Map Virtual memory (bytes)=2737307648
Peak Reduce Physical memory (bytes)=311504896
Peak Reduce Virtual memory (bytes)=2746314752
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=150569
File Output Format Counters
Bytes Written=35324
  • List the contents of the /output folder in HDFS.
$ ./hadoop fs -ls /output

Found 2 items
-rw-r--r-- 2 root supergroup 0 2019-03-19 11:18 /output/_SUCCESS
-rw-r--r-- 2 root supergroup 35324 2019-03-19 11:18 /output/part-r-00000
  • View the contents of the part-r-00000 file.
$ ./hadoop fs -cat /output/part-r-00000
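  • The output has one word and its count per line, separated by a tab, so the most frequent words can be pulled out with a quick pipe like the following sketch (adjust head as needed).
$ ./hadoop fs -cat /output/part-r-00000 | sort -t$'\t' -k2 -nr | head -n 10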

Install HBase

  • Install HBase on top of the Hadoop cluster.

Download and Extract HBase

$ wget https://downloads.apache.org/hbase/2.3.3/hbase-2.3.3-bin.tar.gz
  • Extract it to /usr/local.
$ tar -zxvf hbase-2.3.3-bin.tar.gz -C /usr/local/

Modify the Configuration Files

  • Modify the /etc/profile environment file and append the following HBase environment variables.
export HBASE_HOME=/usr/local/hbase-2.3.3
export PATH=$PATH:$HBASE_HOME/bin
  • Apply the environment variables.
$ source /etc/profile
  • Use ssh hadoop02 (and so on) to enter each of the other four containers and make the same change.
  • That is, every container needs those two environment variable lines appended to its /etc/profile.
  • The following configuration is edited in the directory /usr/local/hbase-2.3.3/conf.
  • Modify hbase-env.sh and append the following.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HBASE_MANAGES_ZK=true
  • Modify hbase-site.xml as follows.
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://hadoop01:9000/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.master</name>
        <value>hadoop01:60000</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>hadoop01,hadoop02,hadoop03,hadoop04,hadoop05</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/home/hadoop/zoodata</value>
    </property>
</configuration>
  • Modify the regionservers file as follows.
hadoop01
hadoop02
hadoop03
hadoop04
hadoop05
  • Use scp to copy the configured HBase to the other 4 containers.
$ scp -r /usr/local/hbase-2.3.3 root@hadoop02:/usr/local/
$ scp -r /usr/local/hbase-2.3.3 root@hadoop03:/usr/local/
$ scp -r /usr/local/hbase-2.3.3 root@hadoop04:/usr/local/
$ scp -r /usr/local/hbase-2.3.3 root@hadoop05:/usr/local/

Start HBase

$ ./start-hbase.sh

WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hbase-2.1.3/lib/client-facing-thirdparty/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hbase-2.1.3/lib/client-facing-thirdparty/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hadoop04: running zookeeper, logging to /usr/local/hbase-2.1.3/bin/../logs/hbase-root-zookeeper-hadoop04.out
hadoop01: running zookeeper, logging to /usr/local/hbase-2.1.3/bin/../logs/hbase-root-zookeeper-hadoop01.out
hadoop05: running zookeeper, logging to /usr/local/hbase-2.1.3/bin/../logs/hbase-root-zookeeper-hadoop05.out
hadoop03: running zookeeper, logging to /usr/local/hbase-2.1.3/bin/../logs/hbase-root-zookeeper-hadoop03.out
hadoop02: running zookeeper, logging to /usr/local/hbase-2.1.3/bin/../logs/hbase-root-zookeeper-hadoop02.out
running master, logging to /usr/local/hbase-2.1.3/logs/hbase--master-hadoop01.out
WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hbase-2.1.3/lib/client-facing-thirdparty/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hadoop03: running regionserver, logging to /usr/local/hbase-2.1.3/bin/../logs/hbase-root-regionserver-hadoop03.out
hadoop01: running regionserver, logging to /usr/local/hbase-2.1.3/bin/../logs/hbase-root-regionserver-hadoop01.out
hadoop04: running regionserver, logging to /usr/local/hbase-2.1.3/bin/../logs/hbase-root-regionserver-hadoop04.out
hadoop05: running regionserver, logging to /usr/local/hbase-2.1.3/bin/../logs/hbase-root-regionserver-hadoop05.out
hadoop02: running regionserver, logging to /usr/local/hbase-2.1.3/bin/../logs/hbase-root-regionserver-hadoop02.out
  • Open the HBase shell.
$ hbase shell

hbase(main):001:0> whoami
root (auth:SIMPLE)
groups: root
Took 0.0134 seconds
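  • As a quick smoke test, the shell can create a throwaway table, write one cell, scan it, and drop it again; the table and column family names below are arbitrary examples.
$ hbase shell <<'EOF'
create 'smoke_test', 'cf'
put 'smoke_test', 'row1', 'cf:greeting', 'hello'
scan 'smoke_test'
disable 'smoke_test'
drop 'smoke_test'
EOF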

Install Spark

  • Install Spark on top of Hadoop.

Download and Extract Spark

  • Download Spark 2.4.7 (pre-built for Hadoop 2.7).
$ wget https://apache.website-solution.net/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
  • Extract it to /usr/local.
$ tar -zxvf spark-2.4.7-bin-hadoop2.7.tgz  -C /usr/local/
  • Rename the folder.
$ cd /usr/local/
$ mv spark-2.4.7-bin-hadoop2.7 spark-2.4.0

Modify the Configuration Files

  • Modify the /etc/profile environment file and append the following Spark environment variables.
export SPARK_HOME=/usr/local/spark-2.4.0
export PATH=$PATH:$SPARK_HOME/bin
  • Apply the environment variables.
$ source /etc/profile
  • Use ssh hadoop02 (and so on) to enter each of the other four containers and make the same change.
  • That is, every container needs those two environment variable lines appended to its /etc/profile.
  • The following configuration is edited in the directory /usr/local/spark-2.4.0/conf.
  • Rename the template file.
$ mv spark-env.sh.template spark-env.sh
  • Modify spark-env.sh and append the following.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SCALA_HOME=/usr/share/scala

export SPARK_MASTER_HOST=hadoop01
export SPARK_MASTER_IP=hadoop01
export SPARK_WORKER_MEMORY=4g
  • Rename the template file.
$ mv slaves.template slaves
  • Modify the slaves file as follows.
hadoop01
hadoop02
hadoop03
hadoop04
hadoop05
  • Use scp to copy the configured Spark to the other 4 containers (note that the directory was renamed to spark-2.4.0 above).
$ scp -r /usr/local/spark-2.4.0 root@hadoop02:/usr/local/
$ scp -r /usr/local/spark-2.4.0 root@hadoop03:/usr/local/
$ scp -r /usr/local/spark-2.4.0 root@hadoop04:/usr/local/
$ scp -r /usr/local/spark-2.4.0 root@hadoop05:/usr/local/

Start Spark

  • Run Spark's start-all.sh from /usr/local/spark-2.4.0/sbin; it has the same name as Hadoop's start-all.sh, so use the explicit path rather than relying on PATH.
$ ./start-all.sh

starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark-2.4.0/logs/spark--org.apache.spark.deploy.master.Master-1-hadoop01.out
hadoop04: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-2.4.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-hadoop04.out
hadoop05: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-2.4.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-hadoop05.out
hadoop01: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-2.4.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-hadoop01.out
hadoop03: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-2.4.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-hadoop03.out
hadoop02: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-2.4.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-hadoop02.out
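  • A simple way to confirm the standalone cluster works end to end is to submit the bundled SparkPi example to the master; this is a sketch, and the exact examples jar name under examples/jars may differ for your download.
$ spark-submit --master spark://hadoop01:7077 \
    --class org.apache.spark.examples.SparkPi \
    /usr/local/spark-2.4.0/examples/jars/spark-examples_2.11-2.4.7.jar 100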

Miscellaneous

Reformatting HDFS

Reference: https://blog.csdn.net/gis_101/article/details/52821946

  • Reformatting deletes all data in the cluster, so consider backing up or migrating the data before formatting.
  • First, on the master node (the NameNode), delete the contents of Hadoop's temporary storage directory tmp, the NameNode's persistent metadata directory dfs/name, and the Hadoop log directory (delete the contents of the directories, not the directories themselves).
  • Then do the same on every data node (DataNode): delete the contents of the tmp directory, the dfs/name metadata directory, and the log directory (a command sketch follows the format command below).
  • Format a new distributed file system:
$ ./hadoop namenode -format
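  • The per-node cleanup described above can be scripted; here is a rough sketch matching the directories configured earlier in this guide (run it before formatting, and only when you really intend to wipe the cluster).
$ for h in hadoop01 hadoop02 hadoop03 hadoop04 hadoop05; do ssh $h 'rm -rf /home/hadoop3/hadoop/tmp/* /home/hadoop3/hadoop/hdfs/name/* /home/hadoop3/hadoop/hdfs/data/* /usr/local/hadoop/logs/*'; done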

Notes

  • Hadoop's temporary storage directory tmp corresponds to the hadoop.tmp.dir property in core-site.xml, whose default is /tmp/hadoop-${user.name}. If hadoop.tmp.dir is not configured, Hadoop creates a directory under /tmp when formatting; for example, if Hadoop is installed and configured under the user cloud, the temporary storage directory will be /tmp/hadoop-cloud.
  • The NameNode metadata directory corresponds to the dfs.namenode.name.dir property in hdfs-site.xml, whose default is ${hadoop.tmp.dir}/dfs/name. If this property is not configured, Hadoop creates the directory itself when formatting. Crucially, before formatting you must clear the dfs/name contents on all child nodes (the DataNodes); otherwise the child-node daemons will fail to start when Hadoop boots. The reason is that every time the master NameNode is formatted, the VERSION file under dfs/name/current gets a new clusterID and namespaceID, but if a child node's dfs/name/current still exists, formatting will not recreate it; the child node's clusterID and namespaceID then no longer match the master's, and Hadoop fails to start.
