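Download the Hadoop tarball
Set the Hadoop environment variables
Set the HDFS environment variables
Set the YARN environment variables
Set the MapReduce environment variables
Edit the Hadoop configuration
Configure core-site.xml
Configure hdfs-site.xml
Configure yarn-site.xml
Configure mapred-site.xml
Configure the slaves file
Distribute the configuration
Start HDFS
Format the NameNode
Start HDFS
Check that HDFS is running
Start YARN
Test a MapReduce job
Hadoop native libraries
HDFS, YARN and MapReduce parameters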
Download the Hadoop tarball
Download the hadoop-2.8.0 tarball from the official Hadoop website onto hadoop1, place it under /opt, and extract it.
$ gunzip hadoop-2.8.0.tar.gz
$ tar -xvf hadoop-2.8.0.tar
Then change the ownership of the hadoop-2.8.0 directory so that both the hdfs and yarn users can read and write it:
# chown -R hdfs:hadoop /opt/hadoop-2.8.0
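The two extraction commands above can also be combined into one step. The following is only a sketch: `unpack_tarball` is a hypothetical helper name, and the paths in the usage comment assume the tarball was saved under /opt as described.

```shell
#!/usr/bin/env bash
# unpack_tarball TARBALL DEST_DIR
# Extract a gzip-compressed tarball into DEST_DIR in a single step
# (equivalent to running gunzip followed by tar -xvf).
unpack_tarball() {
  local tarball=$1 dest=$2
  mkdir -p "$dest" && tar -xzf "$tarball" -C "$dest"
}

# Usage (as root on hadoop1, path assumed from the text above):
#   unpack_tarball /opt/hadoop-2.8.0.tar.gz /opt
```

Unlike gunzip, `tar -xzf` leaves the original .tar.gz file in place, which is handy if the archive is going to be copied to other nodes later.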
Set the Hadoop environment variables
Edit /etc/profile:
export HADOOP_HOME=/opt/hadoop-2.8.0
export HADOOP_PREFIX=/opt/hadoop-2.8.0
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/amd64/server
export PATH=${HADOOP_HOME}/bin:$PATH
Set the HDFS environment variables
Edit /opt/hadoop-2.8.0/etc/hadoop/hadoop-env.sh
#export JAVA_HOME=/usr/local/java/jdk1.8.0_121
#export HADOOP_HOME=/opt/hadoop/hadoop-2.7.3
# Maximum heap size of the Hadoop daemons (namenode, datanode, secondarynamenode, etc.); default 1000 MB
#export HADOOP_HEAPSIZE=
# Initial heap size of the namenode; defaults to the value above, set as needed
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""
# JVM startup options; empty by default
#export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
# The memory of each component can also be configured separately:
#export HADOOP_NAMENODE_OPTS=
#export HADOOP_DATANODE_OPTS=
#export HADOOP_SECONDARYNAMENODE_OPTS=
# Hadoop log directory; default is $HADOOP_HOME/logs
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER
export HADOOP_LOG_DIR=/var/log/hadoop/
Set each parameter according to your own capacity plan. Note that the namenode keeps both its block map and its namespace in the heap, so a production cluster needs a large heap size.
Also watch the total memory used by all components: in production, leave 5-15% of the memory to the Linux system (typically about 10 GB).
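The memory rule above is simple arithmetic, sketched below. The function names and the 64 GB example are illustrative assumptions; only the 5-15% range comes from the text.

```shell
#!/usr/bin/env bash
# os_reserve_mb TOTAL_MB [PERCENT]
# Memory (in MB) to leave to the operating system; PERCENT defaults to 10.
os_reserve_mb() {
  local total_mb=$1 percent=${2:-10}
  echo $(( total_mb * percent / 100 ))
}

# daemon_budget_mb TOTAL_MB [PERCENT]
# Memory (in MB) left for all Hadoop daemons and containers together.
daemon_budget_mb() {
  local total_mb=$1 percent=${2:-10}
  echo $(( total_mb - $(os_reserve_mb "$total_mb" "$percent") ))
}

# Example: a hypothetical 64 GB (65536 MB) node with 10% reserved.
os_reserve_mb 65536 10
daemon_budget_mb 65536 10
```

The sum of the heap sizes given to the namenode, datanode, nodemanager and the containers on one machine should stay within the second number.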
Set the YARN environment variables
Edit /opt/hadoop-2.8.0/etc/hadoop/yarn-env.sh
#export JAVA_HOME=/usr/local/java/jdk1.8.0_121
#JAVA_HEAP_MAX=-Xmx1000m
#YARN_HEAPSIZE=1000 # heap size of the YARN daemons
#export YARN_RESOURCEMANAGER_HEAPSIZE=1000 # heap size of the RESOURCEMANAGER only
#export YARN_TIMELINESERVER_HEAPSIZE=1000 # heap size of the TIMELINESERVER (jobhistoryServer) only
#export YARN_RESOURCEMANAGER_OPTS= # JVM options of the RESOURCEMANAGER only
#export YARN_NODEMANAGER_HEAPSIZE=1000 # heap size of the NODEMANAGER only
#export YARN_NODEMANAGER_OPTS= # JVM options of the NODEMANAGER only
export YARN_LOG_DIR=/var/log/yarn # YARN log directory
Adjust these to your environment; only the log directory is set here. In production, pay attention to the JVM options and the location of the log files.
Set the MapReduce environment variables
Edit /opt/hadoop-2.8.0/etc/hadoop/mapred-env.sh
# export JAVA_HOME=/home/y/libexec/jdk1.6.0/
#export HADOOP_JOB_HISTORYSERVER_HEAPSIZE=1000
#export HADOOP_MAPRED_ROOT_LOGGER=INFO,RFA
#export HADOOP_JOB_HISTORYSERVER_OPTS=
#export HADOOP_MAPRED_LOG_DIR="" # Where log files are stored. $HADOOP_MAPRED_HOME/logs by default.
#export HADOOP_JHS_LOGGER=INFO,RFA # Hadoop JobSummary logger.
#export HADOOP_MAPRED_PID_DIR= # The pid files are stored. /tmp by default.
#export HADOOP_MAPRED_IDENT_STRING= #A string representing this instance of hadoop. $USER by default
#export HADOOP_MAPRED_NICENESS= #The scheduling priority for daemons. Defaults to 0.
export HADOOP_MAPRED_LOG_DIR=/var/log/yarn
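Adjust these to your environment; only the log directory is set here. In production, pay attention to the JVM options and the location of the log files.
Edit the Hadoop configuration
All of the settings below can be found in the official documentation at http://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/ClusterSetup.html
Configure core-site.xml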
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop1:9000</value>
<description>HDFS address and port</description>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
<property>
<name>fs.trash.interval</name>
<value>1440</value>
<description>Enable the HDFS trash; deleted files are kept for 1440 minutes</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop-2.8.0/tmp</value>
<description>Default is /tmp/hadoop-${user.name}; changed to a persistent directory</description>
</property>
</configuration>
core-site.xml has many more parameters, but the few set above are enough to start the cluster; see the official documentation for the rest.
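To double-check what a *-site.xml file actually contains, a property can be read back from it. On a node with Hadoop installed, `hdfs getconf -confKey fs.defaultFS` does this properly. The sketch below is a rough stand-in that needs only awk; `get_prop` is a hypothetical helper, and it assumes each `<name>` and `<value>` sits on its own line, as in the listing above.

```shell
#!/usr/bin/env bash
# get_prop FILE NAME
# Print the <value> that follows <name>NAME</name> in a Hadoop XML config file.
# Assumption: <name> and <value> are each on a single line of their own.
get_prop() {
  awk -v key="$2" '
    /<name>/  { n = $0; gsub(/.*<name>|<\/name>.*/, "", n) }
    /<value>/ { v = $0; gsub(/.*<value>|<\/value>.*/, "", v)
                if (n == key) { print v; exit } }
  ' "$1"
}

# Demo against a throwaway file holding two properties from this article.
demo=$(mktemp)
cat > "$demo" <<'EOF'
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop1:9000</value>
</property>
<property>
<name>fs.trash.interval</name>
<value>1440</value>
</property>
</configuration>
EOF
get_prop "$demo" fs.defaultFS
get_prop "$demo" fs.trash.interval
rm -f "$demo"
```

This is not a real XML parser, so it is only suitable for quick sanity checks of hand-written files like the ones in this article.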
Configure hdfs-site.xml
Only the following parameters are set here:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Number of replicas of each data block; 3 is recommended in production</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop-2.8.0/namenodedir</value>
<description>Directory holding the namenode metadata; put it on RAID in production</description>
</property>
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
<description>Block size, 128 MB; set it according to the workload, larger if large files dominate</description>
</property>
<property>
<name>dfs.namenode.handler.count</name>
<value>100</value>
<description>Number of namenode threads handling RPC requests; use a larger value on large clusters</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop-2.8.0/datadir</value>
<description>Directory where the datanode stores data; in production list one path per disk, RAID is not recommended</description>
</property>
</configuration>
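The dfs.blocksize value above is given in bytes. The sketch below shows where 134217728 comes from and how many blocks a file of a given size occupies; both function names are illustrative, not Hadoop commands.

```shell
#!/usr/bin/env bash
# mb_to_bytes MB: convert mebibytes to bytes.
mb_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# blocks_for_file SIZE_BYTES BLOCK_BYTES
# Number of HDFS blocks needed for a file (rounded up; the last block may be partial).
blocks_for_file() {
  local size=$1 block=$2
  echo $(( (size + block - 1) / block ))
}

block=$(mb_to_bytes 128)
echo "dfs.blocksize = $block"
# A hypothetical 300 MB file split into 128 MB blocks:
blocks_for_file "$(mb_to_bytes 300)" "$block"
```

Every block is one entry in the namenode's block map, which is why a workload with many large files benefits from a larger block size.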
Configure yarn-site.xml
Only the following parameters are set here; see the official documentation for the rest.
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop1</value>
<description>Host running the resourcemanager</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>Auxiliary service of the nodemanager</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>32</value>
<description>Minimum size of each container, in MB</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>128</value>
<description>Maximum size of each container, in MB</description>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1024</value>
<description>Maximum memory allocated to the nodemanager, in MB</description>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/home/yarn/nm-local-dir</value>
<description>Local directory of the nodemanager</description>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>1</value>
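<description>Number of CPUs available on each nodemanager machine. The default is -1, which lets YARN detect the CPU count automatically; detection does not happen in this setup, so the effective value is 8</description>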
</property>
</configuration>
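The three memory settings above bound how many containers a node can run at once. The following is a back-of-the-envelope sketch, not how the scheduler is implemented; `max_containers` is a hypothetical name.

```shell
#!/usr/bin/env bash
# max_containers NODE_MB CONTAINER_MB
# Upper bound on simultaneously running containers of a given size on one nodemanager.
max_containers() {
  local node_mb=$1 container_mb=$2
  echo $(( node_mb / container_mb ))
}

# Values from the yarn-site.xml above: 1024 MB per nodemanager.
echo "largest containers (128 MB): $(max_containers 1024 128)"
echo "smallest containers (32 MB): $(max_containers 1024 32)"
```

With three nodemanagers (hadoop3, hadoop4 and hadoop5) the cluster as a whole offers 3 x 1024 MB to containers.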
In production, set the following.
ResourceManager parameters:
Parameter | Value | Notes |
---|---|---|
yarn.resourcemanager.address | ResourceManager host:port for clients to submit jobs. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.scheduler.address | ResourceManager host:port for ApplicationMasters to talk to Scheduler to obtain resources. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.resource-tracker.address | ResourceManager host:port for NodeManagers. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. This is the address the worker nodes report to. |
yarn.resourcemanager.admin.address | ResourceManager host:port for administrative commands. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.webapp.address | ResourceManager web-ui host:port. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. Has a default value. |
yarn.resourcemanager.hostname | ResourceManager host. | host. Single hostname that can be set in place of setting all yarn.resourcemanager*address resources. Results in default ports for ResourceManager components. |
yarn.resourcemanager.scheduler.class | ResourceManager Scheduler class. | CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler |
yarn.scheduler.minimum-allocation-mb | Minimum limit of memory to allocate to each container request at the Resource Manager. | In MBs. Minimum memory of each container. |
yarn.scheduler.maximum-allocation-mb | Maximum limit of memory to allocate to each container request at the Resource Manager. | In MBs. Maximum memory of each container. |
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path | List of permitted/excluded NodeManagers. | If necessary, use these files to control the list of allowable NodeManagers, that is, which worker nodes the RM may manage. |
NodeManager parameters:
Parameter | Value | Notes |
---|---|---|
yarn.nodemanager.resource.memory-mb | Resource i.e. available physical memory, in MB, for given NodeManager | Defines total available resources on the NodeManager to be made available to running containers. This is the most memory YARN may use on the node. |
yarn.nodemanager.vmem-pmem-ratio | Maximum ratio by which virtual memory usage of tasks may exceed physical memory | The virtual memory usage of each task may exceed its physical memory limit by this ratio. The total amount of virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio. A task that goes beyond this ratio is killed. |
yarn.nodemanager.local-dirs | Comma-separated list of paths on the local filesystem where intermediate data is written. | Multiple paths help spread disk i/o. Holds the intermediate data of running MR jobs. Default: ${hadoop.tmp.dir}/nm-local-dir |
yarn.nodemanager.log-dirs | Comma-separated list of paths on the local filesystem where logs are written. | Multiple paths help spread disk i/o. Holds the logs of MR tasks. Default: ${yarn.log.dir}/userlogs |
yarn.nodemanager.log.retain-seconds | 10800 | Default time (in seconds) to retain log files on the NodeManager. Only applicable if log-aggregation is disabled. |
yarn.nodemanager.remote-app-log-dir | /logs | HDFS directory where the application logs are moved on application completion. Need to set appropriate permissions. Only applicable if log-aggregation is enabled. |
yarn.nodemanager.aux-services | mapreduce_shuffle | Shuffle service that needs to be set for Map Reduce applications. |
Configure mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>Run MR jobs on YARN</description>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop2:10020</value>
<description>Address of the jobhistory host</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop2:19888</value>
<description>Address of the jobhistory web UI</description>
</property>
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/opt/hadoop/hadoop-2.8.0/mrHtmp</value>
<description>Directory holding the history data of MR jobs that are still running</description>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/opt/hadoop/hadoop-2.8.0/mrhHdone</value>
<description>Directory holding the history data of finished MR jobs</description>
</property>
</configuration>
Configure the slaves file
List the worker nodes in /opt/hadoop-2.8.0/etc/hadoop/slaves:
hadoop3
hadoop4
hadoop5
Distribute the configuration
Copy /etc/profile and /opt/* to the other nodes:
$ scp [email protected]:/etc/profile /etc
$ scp -r [email protected]:/opt/* /opt/
It is better to compress the files before transferring them.
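One way to follow that advice is to pack the installation once and push the archive to every host listed in the slaves file. The sketch below only prints the scp commands it would run (a dry run). The function names, the archive path, the root login and the push direction from hadoop1 are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# list_hosts FILE
# Print the hosts of a slaves-style file, skipping blank lines and # comments.
list_hosts() {
  grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# plan_push SLAVES_FILE TARBALL
# Print one scp command per host; nothing is copied (dry run).
plan_push() {
  local slaves_file=$1 tarball=$2 host
  list_hosts "$slaves_file" | while read -r host; do
    echo "scp $tarball root@$host:/opt/"
  done
}

# Usage on hadoop1 (hypothetical archive name):
#   tar -czf /tmp/hadoop-2.8.0-conf.tar.gz -C /opt hadoop-2.8.0
#   plan_push /opt/hadoop-2.8.0/etc/hadoop/slaves /tmp/hadoop-2.8.0-conf.tar.gz
```

Once the printed commands look right, they can be piped to `sh`, and the archive is then extracted under /opt on each node.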
Start HDFS
Format the NameNode
$HADOOP_HOME/bin/hdfs namenode -format
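Start HDFS
As the hdfs user, run $HADOOP_HOME/sbin/start-dfs.sh to start the whole HDFS cluster, or start the daemons one at a time:
$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode # start a single namenode
$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode # start a single datanode
Startup logs are written to $HADOOP_HOME/logs by default; the log path can be changed in hadoop-env.sh.
Check that HDFS is running
Open http://hadoop1:50070 in a browser, or run hdfs dfs -mkdir /test as a test.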
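Another quick check is to look for the daemon processes with the JDK's `jps` tool, which prints one "PID ClassName" line per Java process. The helper below is a sketch that scans such output; `has_daemon` is a hypothetical name, and the sample PIDs in the demo are made up.

```shell
#!/usr/bin/env bash
# has_daemon NAME
# Read jps-style output ("PID ClassName" per line) from stdin and
# succeed if a process with the given class name is listed.
has_daemon() {
  awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Demo with made-up output; on a real node use:  jps | has_daemon NameNode
if printf '1201 NameNode\n1388 Jps\n' | has_daemon NameNode; then
  echo "NameNode is running"
else
  echo "NameNode is NOT running"
fi
```

On hadoop3, hadoop4 and hadoop5 the same check would be made for DataNode.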
Start YARN
Start the resourcemanager on hadoop1:
yarn $ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
Start the nodemanager on hadoop3, hadoop4 and hadoop5:
yarn $ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager
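If the slaves file is configured and passwordless SSH is set up for the yarn user, running start-yarn.sh on the resourcemanager node (hadoop1) starts the whole cluster.
Then open the RM web page (port 8088 by default):
If startup fails, look for the cause in the logs under the YARN_LOG_DIR set in yarn-env.sh. Pay attention to the permissions of the directories YARN uses at startup.
Test a MapReduce job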
[[email protected] hadoop-2.8.0]$ hdfs dfs -mkdir -p /user/hdfs/input
[[email protected] hadoop-2.8.0]$ hdfs dfs -put etc/hadoop/ /user/hdfs/input
[[email protected] hadoop-2.8.0]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar grep input output 'dfs[a-z.]+'
17/06/27 04:16:45 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/hdfs/.staging/job_1498507021248_0003
java.io.IOException: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=1536, maxMemory=128
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:279)
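And it fails. The job requested 1536 MB, while the maximum is 128 MB. 1536 MB is the default amount requested for the MapReduce ApplicationMaster (yarn.app.mapreduce.am.resource.mb). But a maximum of 128 MB? My cluster clearly has 3 GB of resources. The message is misleading at first sight: the error is raised when the maximum container size cannot satisfy the minimum memory a single task asks for, and the maxMemory it reports is the container size limit, not the total memory of the cluster. With the current configuration each container uses at least 32 MB and at most 128 MB, while a map task requests 1024 MB by default.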
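The comparison behind that error can be modelled in a few lines. This is a simplified model for illustration, not the ResourceManager's actual code; `check_request` is a hypothetical name, and the numbers come from the log and the yarn-site.xml above.

```shell
#!/usr/bin/env bash
# check_request REQUESTED_MB MAX_MB
# Model of the per-container check: a request larger than
# yarn.scheduler.maximum-allocation-mb is rejected, whatever the cluster total is.
check_request() {
  local requested_mb=$1 max_mb=$2
  if [ "$requested_mb" -gt "$max_mb" ]; then
    echo "rejected: requestedMemory=$requested_mb, maxMemory=$max_mb"
    return 1
  fi
  echo "accepted: requestedMemory=$requested_mb, maxMemory=$max_mb"
}

check_request 1536 128 || true   # default ApplicationMaster request: rejected
check_request 1024 128 || true   # default map task request: rejected as well
check_request 128 128            # a request that fits into one container
```

So either the maximum container size has to grow, or the amounts requested by the job have to shrink below it.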
Now lower the resources requested by each map and reduce task.
Edit mapred-site.xml and add:
<property>
<name>mapreduce.map.memory.mb</name>
<value>128</value>
<description>Memory requested for each map task, in MB</description>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>128</value>
<description>Memory requested for each reduce task, in MB</description>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>128</value>
<description>Memory requested for the MapReduce ApplicationMaster, in MB</description>
</property>
Run it again:
[[email protected] hadoop-2.8.0]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar grep input output 'dfs[a-z.]+'
17/06/27 05:04:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
……..
17/06/27 05:04:36 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/hdfs/.staging/job_1498510574463_0006
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://hadoop1:9000/user/hdfs/grep-temp-2069706136
……..
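The resource problem is solved, which also confirms my reasoning. But this run fails with a different error: a missing directory. I tested 2.6.3 and 2.7.3 and never ran into this, so I will set it aside for now. Whether MR really works will be verified another way later. I suspect a problem with the jar.
Hadoop native libraries
You may have noticed that every hdfs command prints: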
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
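This happens because the native library cannot be loaded. Hadoop relies on some native libraries on Linux, such as zlib, to improve performance.
For more on the native libraries, see my other article:
HDFS, YARN and MapReduce parameters
I will cover the more important parameters in a separate article.
Next: setting up HDFS HA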