Setting Up a Hadoop Cluster with Docker


Reposted from: https://blog.csdn.net/lizongti/article/details/102756472

Table of Contents

Environment Preparation
Dependencies
Installing Docker
Standalone Mode (Without Docker)
Installation
Installing the JDK
Installing Hadoop
Configuration
Environment Variables
Setting Up Passwordless Login
Editing hadoop-env.sh
HDFS
Creating Directories
Editing core-site.xml
Editing hdfs-site.xml
Formatting HDFS
Starting HDFS
HDFS Web
HDFS Test
YARN
Editing mapred-site.xml
Editing yarn-site.xml
Starting YARN
YARN Web
YARN Test
Cluster Setup (Without Docker)
Preparation
Configuring the Master
Applying the Standalone Configuration
Editing hdfs-site.xml
Editing masters
Deleting slaves
Copying to the Slave Nodes
Editing slaves
HDFS
Clearing Directories
Formatting HDFS
Starting HDFS
Testing HDFS
YARN
Starting YARN
Testing YARN


Environment Preparation

Dependencies

CentOS 7.6

Installing Docker

See the installation guide (link).

Standalone Mode (Without Docker)

Installation

Installing the JDK

Download the JDK 1.8 tar.gz from the official site. Installing it via yum or from the rpm package leaves out some of the files that Scala 2.11 needs.

  mkdir -p /usr/lib/jvm
  tar xf jdk-8u221-linux-x64.tar.gz -C /usr/lib/jvm
  rm -rf /usr/bin/java
  ln -s /usr/lib/jvm/jdk1.8.0_221/bin/java /usr/bin/java

Edit the file

  vim /etc/profile.d/java.sh

Add

  export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_221
  export JRE_HOME=${JAVA_HOME}/jre
  export CLASSPATH=${JAVA_HOME}/lib:${JRE_HOME}/lib:$CLASSPATH
  export PATH=${JAVA_HOME}/bin:$PATH

Then make the environment variables take effect

  source /etc/profile

Run the following command to check the environment variable

  [root@vm1 bin]# echo $JAVA_HOME
  /usr/lib/jvm/jdk1.8.0_221
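As an extra sanity check (my addition, not in the original post), confirm that the symlink created above is the binary actually on the PATH:

  # 'which java' should print /usr/bin/java; 'java -version' should report 1.8.0_221
  which java
  java -version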

Installing Hadoop

To stay version-compatible with the Spark article in this series, we use Hadoop 2.7 from the official archive

  wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz

Extract

  tar xf hadoop-2.7.7.tar.gz -C /opt/

Configuration

Environment Variables

Edit the file

  vim /etc/profile.d/hadoop.sh

Add

  export HADOOP_HOME=/opt/hadoop-2.7.7
  export PATH=$PATH:$HADOOP_HOME/bin

Then make the environment variables take effect

  source /etc/profile

Setting Up Passwordless Login

Passwordless SSH login must be configured for the local machine as well; see here (link). A minimal sketch follows.
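A minimal sketch, assuming the root user and default key paths (the original post only links out for this step):

  # Generate a key pair if ~/.ssh/id_rsa does not exist yet
  ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
  # Authorize the key for the local machine; repeat for vm2 and vm3 in the cluster section
  ssh-copy-id root@vm1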

Editing hadoop-env.sh

Set JAVA_HOME inside the startup script

  vi /opt/hadoop-2.7.7/etc/hadoop/hadoop-env.sh

Use

  export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_221

to replace

  export JAVA_HOME=${JAVA_HOME}
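Equivalently, a one-line edit (a convenience sketch, not from the original post):

  # Overwrite the JAVA_HOME line in hadoop-env.sh in place
  sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_221|' /opt/hadoop-2.7.7/etc/hadoop/hadoop-env.sh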

HDFS

Creating Directories

  mkdir -p /opt/hadoop-2.7.7/hdfs/name
  mkdir -p /opt/hadoop-2.7.7/hdfs/data

Editing core-site.xml

Configure the default filesystem endpoint

  vi /opt/hadoop-2.7.7/etc/hadoop/core-site.xml

Replace

  <configuration>
  </configuration>

with the following configuration

  <configuration>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>file:/opt/hadoop-2.7.7/tmp</value>
    </property>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://vm1:9000</value>
    </property>
  </configuration>

  • hadoop.tmp.dir sets the base temporary directory for Hadoop data. If it is not configured, the default is /tmp/hadoop-root, which is wiped on every reboot.
  • fs.defaultFS sets the default filesystem URI. Without it, paths resolve to the local filesystem rather than to files on HDFS.
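To confirm that the setting is picked up once HADOOP_HOME is on the PATH, a quick check (not in the original post):

  # Should print hdfs://vm1:9000
  hdfs getconf -confKey fs.defaultFS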

Editing hdfs-site.xml

Configure the replication factor

  vi /opt/hadoop-2.7.7/etc/hadoop/hdfs-site.xml

Replace

  <configuration>
  </configuration>

with the following configuration

  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/opt/hadoop-2.7.7/hdfs/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/opt/hadoop-2.7.7/hdfs/data</value>
    </property>
  </configuration>

  • dfs.replication sets the HDFS replication factor to 1.
  • dfs.name.dir sets the namenode's storage directory, i.e. where the HDFS filesystem metadata is kept. If it is set to multiple directories, each one holds a copy of the metadata.
  • dfs.data.dir sets the datanode's storage directory, i.e. where HDFS block data is kept. It can be set to directories on multiple partitions to spread HDFS across them, as sketched below.
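As an illustration of the multi-partition case, a hedged sketch (the /data1 and /data2 mount points are hypothetical):

  <property>
    <name>dfs.data.dir</name>
    <!-- Comma-separated list; each directory sits on a different partition -->
    <value>/data1/hdfs/data,/data2/hdfs/data</value>
  </property>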

Formatting HDFS

  cd /opt/hadoop-2.7.7/bin
  hdfs namenode -format

Starting HDFS

  cd /opt/hadoop-2.7.7/sbin
  ./start-dfs.sh

Output

  Starting namenodes on [vm1]
  vm1: starting namenode, logging to /opt/hadoop-2.7.7/logs/hadoop-root-namenode-vm1.out
  localhost: starting datanode, logging to /opt/hadoop-2.7.7/logs/hadoop-root-datanode-vm1.out
  Starting secondary namenodes [0.0.0.0]
  0.0.0.0: starting secondarynamenode, logging to /opt/hadoop-2.7.7/logs/hadoop-root-secondarynamenode-vm1.out

HDFS Web

Open the HDFS web UI at http://vm1:50070
(Screenshot: HDFS web UI)

HDFS Test

Generate test data

  mkdir -p /tmp/input
  vi /tmp/input/1

Add the following content

  a
  b
  a

Then upload the file to HDFS and list it

  hadoop fs -mkdir -p /tmp/input
  hadoop fs -put /tmp/input/1 /tmp/input
  hadoop fs -ls /tmp/input
  Found 1 items
  -rw-r--r-- 1 root supergroup 6 2019-10-28 11:34 /tmp/input/1
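To verify the uploaded contents round-trip correctly (a quick extra check, not in the original post):

  # Should print the three lines written above: a, b, a
  hadoop fs -cat /tmp/input/1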

YARN

Editing mapred-site.xml

Set the job scheduler to YARN

  cp /opt/hadoop-2.7.7/etc/hadoop/mapred-site.xml.template /opt/hadoop-2.7.7/etc/hadoop/mapred-site.xml
  vi /opt/hadoop-2.7.7/etc/hadoop/mapred-site.xml

Replace

  <configuration>
  </configuration>

with the following configuration

  <configuration>
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>http://vm1:9001</value>
    </property>
  </configuration>

  • mapreduce.framework.name makes MapReduce jobs run on the YARN scheduler; local means run locally, and classic means the classic MapReduce framework.
  • mapred.job.tracker sets the IP and port of the MapReduce job tracker.

Editing yarn-site.xml

  vi /opt/hadoop-2.7.7/etc/hadoop/yarn-site.xml

Replace

  <configuration>
  </configuration>

with the following configuration

  <configuration>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>vm1</value>
    </property>
  </configuration>

  • Setting yarn.nodemanager.aux-services to mapreduce_shuffle avoids the "The auxService:mapreduce_shuffle does not exist" error.
  • yarn.resourcemanager.hostname sets the host that runs the ResourceManager.

Starting YARN

  cd /opt/hadoop-2.7.7/sbin
  ./start-yarn.sh

Output

  starting yarn daemons
  starting resourcemanager, logging to /opt/hadoop-2.7.7/logs/yarn-root-resourcemanager-vm1.out
  localhost: starting nodemanager, logging to /opt/hadoop-2.7.7/logs/yarn-root-nodemanager-vm1.out
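Once the daemons are up, one way to confirm the NodeManager registered with the ResourceManager (an extra check, not in the original post):

  # Should list one node in RUNNING state
  yarn node -list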

YARN Web

Open http://vm1:8088
(Screenshot: YARN web UI)

YARN Test

Run the command

  hadoop jar /opt/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount /tmp/input /tmp/result

Job log

  19/10/28 11:34:52 INFO client.RMProxy: Connecting to ResourceManager at vm1/192.168.1.101:8032
  19/10/28 11:34:53 INFO input.FileInputFormat: Total input paths to process : 1
  19/10/28 11:34:53 INFO mapreduce.JobSubmitter: number of splits:1
  19/10/28 11:34:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1572232474055_0002
  19/10/28 11:34:54 INFO impl.YarnClientImpl: Submitted application application_1572232474055_0002
  19/10/28 11:34:54 INFO mapreduce.Job: The url to track the job: http://vm1:8088/proxy/application_1572232474055_0002/
  19/10/28 11:34:54 INFO mapreduce.Job: Running job: job_1572232474055_0002
  19/10/28 11:35:06 INFO mapreduce.Job: Job job_1572232474055_0002 running in uber mode : false
  19/10/28 11:35:06 INFO mapreduce.Job: map 0% reduce 0%
  19/10/28 11:35:11 INFO mapreduce.Job: map 100% reduce 0%
  19/10/28 11:35:16 INFO mapreduce.Job: map 100% reduce 100%
  19/10/28 11:35:17 INFO mapreduce.Job: Job job_1572232474055_0002 completed successfully
  19/10/28 11:35:18 INFO mapreduce.Job: Counters: 49
      File System Counters
          FILE: Number of bytes read=22
          FILE: Number of bytes written=245617
          FILE: Number of read operations=0
          FILE: Number of large read operations=0
          FILE: Number of write operations=0
          HDFS: Number of bytes read=98
          HDFS: Number of bytes written=8
          HDFS: Number of read operations=6
          HDFS: Number of large read operations=0
          HDFS: Number of write operations=2
      Job Counters
          Launched map tasks=1
          Launched reduce tasks=1
          Data-local map tasks=1
          Total time spent by all maps in occupied slots (ms)=2576
          Total time spent by all reduces in occupied slots (ms)=3148
          Total time spent by all map tasks (ms)=2576
          Total time spent by all reduce tasks (ms)=3148
          Total vcore-milliseconds taken by all map tasks=2576
          Total vcore-milliseconds taken by all reduce tasks=3148
          Total megabyte-milliseconds taken by all map tasks=2637824
          Total megabyte-milliseconds taken by all reduce tasks=3223552
      Map-Reduce Framework
          Map input records=3
          Map output records=3
          Map output bytes=18
          Map output materialized bytes=22
          Input split bytes=92
          Combine input records=3
          Combine output records=2
          Reduce input groups=2
          Reduce shuffle bytes=22
          Reduce input records=2
          Reduce output records=2
          Spilled Records=4
          Shuffled Maps =1
          Failed Shuffles=0
          Merged Map outputs=1
          GC time elapsed (ms)=425
          CPU time spent (ms)=1400
          Physical memory (bytes) snapshot=432537600
          Virtual memory (bytes) snapshot=4235526144
          Total committed heap usage (bytes)=304087040
      Shuffle Errors
          BAD_ID=0
          CONNECTION=0
          IO_ERROR=0
          WRONG_LENGTH=0
          WRONG_MAP=0
          WRONG_REDUCE=0
      File Input Format Counters
          Bytes Read=6
      File Output Format Counters
          Bytes Written=8

View the result

  hadoop fs -cat /tmp/result/part-r-00000

Output

  a 2
  b 1
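Note that re-running the job fails if the output directory already exists; a hedged cleanup before a rerun:

  # Remove the previous output; -f suppresses the error if it is absent
  hadoop fs -rm -r -f /tmp/result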

Cluster Setup (Without Docker)

Preparation

  • Deploy three machines, vm1, vm2, and vm3, on the same subnet.
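Each node must be able to resolve the others' hostnames. A minimal /etc/hosts sketch, using the IP addresses that appear in the dfsadmin report later in this article:

  # Append to /etc/hosts on vm1, vm2, and vm3
  192.168.1.101 vm1
  192.168.1.102 vm2
  192.168.1.103 vm3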

Configuring the Master

Applying the Standalone Configuration

First go through exactly the same configuration steps on vm1 as in the standalone setup.

Editing hdfs-site.xml

  vi /opt/hadoop-2.7.7/etc/hadoop/hdfs-site.xml

Replace

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

with the following configuration

  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>

This sets the replication factor dfs.replication to 2.

Editing masters

  echo "vm1" > /opt/hadoop-2.7.7/etc/hadoop/masters

Deleting slaves

  rm /opt/hadoop-2.7.7/etc/hadoop/slaves

Copying to the Slave Nodes

  scp -r /opt/hadoop-2.7.7 root@vm2:/opt/
  scp -r /opt/hadoop-2.7.7 root@vm3:/opt/
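The slave nodes also need the JDK and the environment scripts; the original post assumes they are already in place, but a hedged sketch of one way to copy them over:

  # Assumes /usr/lib/jvm exists on vm2 and vm3; recreate the /usr/bin/java symlink there too
  scp -r /usr/lib/jvm/jdk1.8.0_221 root@vm2:/usr/lib/jvm/
  scp -r /usr/lib/jvm/jdk1.8.0_221 root@vm3:/usr/lib/jvm/
  scp /etc/profile.d/java.sh /etc/profile.d/hadoop.sh root@vm2:/etc/profile.d/
  scp /etc/profile.d/java.sh /etc/profile.d/hadoop.sh root@vm3:/etc/profile.d/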

Editing slaves

  cat > /opt/hadoop-2.7.7/etc/hadoop/slaves <<EOF
  vm1
  vm2
  vm3
  EOF

This writes vm1, vm2, and vm3 into the slaves file, so all three nodes will run DataNodes.

HDFS

Clearing Directories

  rm -rf /opt/hadoop-2.7.7/hdfs/data/*
  rm -rf /opt/hadoop-2.7.7/hdfs/name/*

Formatting HDFS

Re-format HDFS

  hdfs namenode -format

Starting HDFS

  cd /opt/hadoop-2.7.7/sbin
  ./start-dfs.sh

Output

  Starting namenodes on [vm1]
  vm1: starting namenode, logging to /opt/hadoop-2.7.7/logs/hadoop-root-namenode-vm1.out
  vm3: starting datanode, logging to /opt/hadoop-2.7.7/logs/hadoop-root-datanode-vm3.out
  vm2: starting datanode, logging to /opt/hadoop-2.7.7/logs/hadoop-root-datanode-vm2.out
  vm1: starting datanode, logging to /opt/hadoop-2.7.7/logs/hadoop-root-datanode-vm1.out
  Starting secondary namenodes [0.0.0.0]
  0.0.0.0: starting secondarynamenode, logging to /opt/hadoop-2.7.7/logs/hadoop-root-secondarynamenode-vm1.out

For example, the namenode log is in /opt/hadoop-2.7.7/logs/hadoop-root-namenode-vm1.log.

Check the master processes

  $ jps
  75991 DataNode
  76408 Jps
  76270 SecondaryNameNode

Check the slave processes

  $ jps
  29379 DataNode
  29494 Jps

Check the cluster status

  hdfs dfsadmin -report

Output

  Configured Capacity: 160982630400 (149.93 GB)
  Present Capacity: 101929107456 (94.93 GB)
  DFS Remaining: 101929095168 (94.93 GB)
  DFS Used: 12288 (12 KB)
  DFS Used%: 0.00%
  Under replicated blocks: 0
  Blocks with corrupt replicas: 0
  Missing blocks: 0
  Missing blocks (with replication factor 1): 0

  -------------------------------------------------
  Live datanodes (3):

  Name: 192.168.1.103:50010 (vm3)
  Hostname: vm3
  Decommission Status : Normal
  Configured Capacity: 53660876800 (49.98 GB)
  DFS Used: 4096 (4 KB)
  Non DFS Used: 10321448960 (9.61 GB)
  DFS Remaining: 43339423744 (40.36 GB)
  DFS Used%: 0.00%
  DFS Remaining%: 80.77%
  Configured Cache Capacity: 0 (0 B)
  Cache Used: 0 (0 B)
  Cache Remaining: 0 (0 B)
  Cache Used%: 100.00%
  Cache Remaining%: 0.00%
  Xceivers: 1
  Last contact: Mon Oct 28 13:42:20 CST 2019

  Name: 192.168.1.102:50010 (vm2)
  Hostname: vm2
  Decommission Status : Normal
  Configured Capacity: 53660876800 (49.98 GB)
  DFS Used: 4096 (4 KB)
  Non DFS Used: 13661077504 (12.72 GB)
  DFS Remaining: 39999795200 (37.25 GB)
  DFS Used%: 0.00%
  DFS Remaining%: 74.54%
  Configured Cache Capacity: 0 (0 B)
  Cache Used: 0 (0 B)
  Cache Remaining: 0 (0 B)
  Cache Used%: 100.00%
  Cache Remaining%: 0.00%
  Xceivers: 1
  Last contact: Mon Oct 28 13:42:21 CST 2019

  Name: 192.168.1.101:50010 (vm1)
  Hostname: vm1
  Decommission Status : Normal
  Configured Capacity: 53660876800 (49.98 GB)
  DFS Used: 4096 (4 KB)
  Non DFS Used: 35070996480 (32.66 GB)
  DFS Remaining: 18589876224 (17.31 GB)
  DFS Used%: 0.00%
  DFS Remaining%: 34.64%
  Configured Cache Capacity: 0 (0 B)
  Cache Used: 0 (0 B)
  Cache Remaining: 0 (0 B)
  Cache Used%: 100.00%
  Cache Remaining%: 0.00%
  Xceivers: 1
  Last contact: Mon Oct 28 13:42:21 CST 2019

Testing HDFS

  hadoop fs -mkdir -p /tmp/input
  hadoop fs -put /tmp/input/1 /tmp/input
  hadoop fs -ls /tmp/input
  Found 1 items
  -rw-r--r-- 1 root supergroup 6 2019-10-28 11:34 /tmp/input/1

YARN

Starting YARN

  cd /opt/hadoop-2.7.7/sbin
  ./start-yarn.sh

Output

  starting yarn daemons
  starting resourcemanager, logging to /opt/hadoop-2.7.7/logs/yarn-root-resourcemanager-vm1.out
  vm2: starting nodemanager, logging to /opt/hadoop-2.7.7/logs/yarn-root-nodemanager-vm2.out
  vm3: starting nodemanager, logging to /opt/hadoop-2.7.7/logs/yarn-root-nodemanager-vm3.out
  vm1: starting nodemanager, logging to /opt/hadoop-2.7.7/logs/yarn-root-nodemanager-vm1.out

For example, the ResourceManager log is in /opt/hadoop-2.7.7/logs/yarn-root-resourcemanager-vm1.log.

Check the master processes

  $ jps
  100464 ResourceManager
  101746 Jps
  53786 QuorumPeerMain

Check the slave processes

  $ jps
  36893 NodeManager
  37181 Jps

Testing YARN

  hadoop jar /opt/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount /tmp/input /tmp/result

The result is the same as above.
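To shut the cluster down cleanly when finished (a convenience sketch, not part of the original post):

  cd /opt/hadoop-2.7.7/sbin
  ./stop-yarn.sh
  ./stop-dfs.sh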

Further reading:

https://cloud.tencent.com/developer/article/1084166

https://www.jianshu.com/p/7ab2b6168cc9

https://www.cnblogs.com/onetwo/p/6419925.html

