Personal Hadoop Cluster Deployment

Environment: CentOS 6.6 x64 (a 3-node cluster for learning)

Software: JDK 1.7 + Hadoop 2.7.3 + Hive 2.1.1

Environment preparation:

1. Install the necessary tools

yum -y install openssh wget curl tree screen nano lftp htop mysql

2. Switch to the 163 yum mirror:

cd /etc/yum.repos.d/
wget http://mirrors.163.com/.help/CentOS6-Base-163.repo
# Back up the stock repo file
mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.backup
mv CentOS6-Base-163.repo CentOS-Base.repo
# Rebuild the yum cache
yum clean all
yum makecache

3. Boot to the text console (runlevel 3):

vim /etc/inittab  # change the default runlevel from 5 (graphical) to 3 (text console)
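
If you prefer not to edit the file by hand, a one-line sed does the same thing (a sketch, assuming the stock id:5:initdefault: entry that CentOS 6 ships with):

sed -i.bak 's/^id:5:initdefault:/id:3:initdefault:/' /etc/inittab  # keeps a .bak copy of the original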

4. Configure static IPs, hostnames, and the hosts file

(1) Plan

192.168.235.138 node1
192.168.235.139 node2
192.168.235.140 node3

On each node, set the IP address, hostname, and hosts entries according to the plan above.

(2) Static IP (each node)

# Option 1: use the setup text UI
# setup
# Option 2: edit the interface configuration file; a complete example follows
# cat /etc/sysconfig/network-scripts/ifcfg-Auto_eth1
HWADDR=00:0C:29:2C:9F:4A
TYPE=Ethernet
BOOTPROTO=none
IPADDR=192.168.235.139
PREFIX=24
GATEWAY=192.168.235.1
DNS1=192.168.235.1
DEFROUTE=yes
IPV4_FAILURE_FATAL=yes
IPV6INIT=no
NAME="Auto eth1"
UUID=2753c781-4222-47bd-85e7-44877cde27dd
ONBOOT=yes
LAST_CONNECT=1491415778
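
After saving the file, restart the network service so the static address takes effect (standard on CentOS 6; the device name below is assumed to match the interface configured above):

# service network restart
# ip addr show eth1   # confirm the new address is applied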

(3) Hostname (each node)

# cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=node1      # set HOSTNAME to this node's name
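
Editing /etc/sysconfig/network only takes effect after a reboot. To apply the name to the running system as well (repeat with the correct name on each node):

# hostname node1   # set the hostname for the current session
# hostname         # verify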

(4) hosts (each node)

# cat /etc/hosts
# append the following lines to the end of the file
192.168.235.138 node1
192.168.235.139 node2
192.168.235.140 node3
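
A quick sanity check that name resolution works once the hosts file is in place (a minimal sketch, run on any node):

# for h in node1 node2 node3; do ping -c 1 $h > /dev/null && echo "$h ok"; done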

5. Disable the firewall

# service iptables stop
# service iptables status
# chkconfig iptables off

6. Create a regular user

# useradd hadoop
# passwd hadoop
# visudo
Below the line "root    ALL=(ALL)       ALL", add:
hadoop  ALL=(ALL)       ALL
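
To confirm that the new account can actually use sudo (a simple check; the password is the one set with passwd above):

# su - hadoop
$ sudo whoami   # should print "root" after entering hadoop's password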

7. Set up passwordless SSH login

Option 1: automated deployment script

# cat ssh.sh
#!/bin/bash
SERVERS="node1 node2 node3"
PASSWORD=123456
BASE_SERVER=192.168.235.138

yum -y install expect

auto_ssh_copy_id() {
    expect -c "set timeout -1;
        spawn ssh-copy-id $1;
        expect {
            *(yes/no)* {send -- yes\r;exp_continue;}
            *assword:* {send -- $2\r;exp_continue;}
            eof        {exit 0;}
        }"
}

ssh_copy_id_to_all() {
    for SERVER in $SERVERS
    do
        auto_ssh_copy_id $SERVER $PASSWORD
    done
}

ssh_copy_id_to_all
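
The script relies on ssh-copy-id, so a key pair must already exist for the user running it. A typical run on node1 might look like this (a sketch, assuming the script above is saved as ssh.sh and run as the hadoop user):

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa   # generate a key pair without a passphrase
$ sh ssh.sh                                  # push the public key to node1, node2 and node3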

Option 2: manual setup

ssh-keygen -t rsa  # generate a key pair
scp ~/.ssh/id_rsa.pub hadoop@192.168.235.139:~/  # distribute the public key to the other nodes with scp or ssh-copy-id
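
ssh-copy-id appends the key to ~/.ssh/authorized_keys on the target and fixes the permissions for you, so a short loop covers all nodes (a sketch, assuming the hadoop user and the hostnames planned above):

for h in node1 node2 node3; do ssh-copy-id hadoop@$h; done   # enter hadoop's password once per node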

Cluster Planning and Installation

1. Node roles

Plan:
node1: NameNode, DataNode, NodeManager
node2: ResourceManager, DataNode, NodeManager, JobHistory
node3: SecondaryNameNode, DataNode, NodeManager

Note how the roles are split: DataNode stores the data and NodeManager processes it, so the two should run on the same nodes; otherwise large volumes of data would have to cross the network.

This layout is only for a personal machine used for learning. A typical production layout looks more like the following.

Reference layout for a 7-node Hadoop 2.x cluster (HA: high availability):

Hostname   IP address      Processes
cloud01    192.168.2.31    namenode    zkfc
cloud02    192.168.2.32    namenode    zkfc
cloud03    192.168.2.33    resourcemanager
cloud04    192.168.2.34    resourcemanager
cloud05    192.168.2.35    journalnode    datanode    nodemanager    QuorumPeerMain
cloud06    192.168.2.36    journalnode    datanode    nodemanager    QuorumPeerMain
cloud07    192.168.2.37    journalnode    datanode    nodemanager    QuorumPeerMain

Notes:  namenode: manages the filesystem metadata
        resourcemanager: cluster resource scheduling
        datanode: stores data blocks
        nodemanager: runs computation on the data
        journalnode: shared storage for the NameNode edit log (metadata)
        zkfc: ZKFailoverController, performs NameNode failover
        QuorumPeerMain: the ZooKeeper server process

HA plus ZooKeeper removes the single point of failure and provides automatic failover.

Source: http://blog.csdn.net/shenfuli/article/details/44889757

2. Install the JDK and Hadoop

(1) Install the JDK and Hadoop

Upload the packages to the server, then create and run the following script in the directory that holds them:

#!/bin/bash

tar -zxvf jdk-7u79-linux-x64.tar.gz -C /opt/
tar -zxvf hadoop-2.7.3.tar.gz -C /usr/local/

# Quote the heredoc delimiter so $PATH, $JAVA_HOME and $HADOOP_HOME are written literally into /etc/profile
cat >> /etc/profile << 'EOF'
export JAVA_HOME=/opt/jdk1.7.0_79/
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF

source /etc/profile
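
Note that source /etc/profile inside the script only affects the script's own shell. Open a new login shell (or source the file again interactively) and verify that both tools are on the PATH:

# source /etc/profile
# java -version    # should report 1.7.0_79
# hadoop version   # should report 2.7.3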

(2) Create Hadoop's working directories in advance

These paths must match the directories referenced in the configuration files below:

  mkdir -p /usr/hadoop/tmp
  mkdir -p /usr/hadoop/dfs/data
  mkdir -p /usr/hadoop/dfs/name
  mkdir -p /usr/hadoop/namesecondary
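
These directories are needed on every node and must be writable by the hadoop user. A sketch that creates them remotely from node1 (assumes root SSH access to all nodes):

for h in node1 node2 node3; do
    ssh root@$h "mkdir -p /usr/hadoop/{tmp,dfs/data,dfs/name,namesecondary} && chown -R hadoop:hadoop /usr/hadoop"
done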

3. Configure Hadoop

The basic configuration files are:

# cd /usr/local/hadoop-2.7.3/etc/hadoop/
# ls -l | awk '{print $9}'
core-site.xml
hadoop-env.sh
hdfs-site.xml
mapred-site.xml
slaves
yarn-site.xml

Their contents are as follows:

(1) core-site.xml

    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node1:9000</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/hadoop/tmp</value>
        <description>Abase for other temporary   directories.</description>
    </property>
        <property>
        <name>fs.trash.interval</name>
        <value>1440</value>
    </property>

(2) hadoop-env.sh

export JAVA_HOME=/opt/jdk1.7.0_79/

(3) hdfs-site.xml

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///usr/hadoop/dfs/name</value>
        <description></description>
    </property>
        <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///usr/hadoop/dfs/data</value>
        <description></description>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>node3:9001</value>
        <description></description>
    </property>
    <property>
      <name>dfs.namenode.checkpoint.dir</name>
      <value>file:///usr/hadoop/namesecondary</value>
      <description></description>
    </property>        

    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>replication</description>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
        <description></description>
    </property> 

    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.datanode.max.transfer.threads</name>
        <value>4096</value>
    </property>

(4) mapred-site.xml

    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>node2:10020</value>
        <description>MapReduce JobHistory Server host:port,Default port is 10020.</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>node2:19888</value>
        <description>MapReduce JobHistory Server Web UI host:port    Default port is 19888.</description>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.staging-dir</name>
        <value>/history</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.done-dir</name>
        <value>${yarn.app.mapreduce.am.staging-dir}/history/done</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.intermediate-done-dir</name>
        <value>${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate</value>
    </property>
    <property>
        <name>mapreduce.map.log.level</name>
        <value>DEBUG</value>
    </property>
    <property>
        <name>mapreduce.reduce.log.level</name>
        <value>DEBUG</value>
    </property>

(5) slaves

node1
node2
node3

(6) yarn-site.xml

    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>  

    <!--Configurations for ResourceManager and NodeManager:-->
    <!--
    <property>
        <name>yarn.acl.enable</name>
        <value>false</value>
        <description>Enable ACLs? Defaults to false.</description>
    </property>
    <property>
        <name>yarn.admin.acl</name>
        <value>Admin ACL</value>
        <description>ACL to set admins on the cluster. ACLs are of the form comma-separated-users space comma-separated-groups. Defaults to the special value of *, which means anyone. The special value of just a space means no one has access.</description>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>false</value>
        <description>Configuration to enable or disable log aggregation</description>
    </property>
    -->

        <!--Configurations for ResourceManager:-->
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>node2:8032</value>
        <description>ResourceManager host:port for clients to submit jobs.host:port If set, overrides the hostname set in yarn.resourcemanager.hostname.</description>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>node2:8030</value>
        <description>ResourceManager host:port for ApplicationMasters to talk to Scheduler to obtain resources.host:port If set, overrides the hostname set in yarn.resourcemanager.hostname.</description>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>node2:8031</value>
        <description>ResourceManager host:port for NodeManagers:host:port If set, overrides the hostname set in yarn.resourcemanager.hostname.</description>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>node2:8033</value>
        <description>ResourceManager host:port for administrative commands.:host:port If set, overrides the hostname set in yarn.resourcemanager.hostname.</description>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>node2:8088</value>
        <description>ResourceManager web-ui host:port.host:port If set, overrides the hostname set in yarn.resourcemanager.hostname.</description>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>node2</value>
        <description>host Single hostname that can be set in place of setting all yarn.resourcemanager*address resources. Results in default ports for ResourceManager components.</description>
    </property>
    <!--
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>ResourceManager Scheduler class.</value>
        <description>CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler</description>
    </property> 

    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>Minimum limit of memory to allocate to each container request at the Resource Manager.</value>
        <description>In MBs</description>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>Maximum limit of memory to allocate to each container request at the Resource Manager.</value>
        <description>In MBs</description>
    </property>
    <property>
        <name>yarn.resourcemanager.nodes.include-path/ yarn.resourcemanager.nodes.exclude-path</name>
        <value>List of permitted/excluded NodeManagers.</value>
        <description>If necessary, use these files to control the list of allowable NodeManagers.</description>
    </property>
    -->

    <!--Configurations for NodeManager:-->
    <!--
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>Resource i.e. available physical memory, in MB, for given NodeManager</value>
        <description>Defines total available resources on the NodeManager to be made available to running containers</description>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>Maximum ratio by which virtual memory usage of tasks may exceed physical memory</value>
        <description>The virtual memory usage of each task may exceed its physical memory limit by this ratio. The total amount of virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio.</description>
    </property>
    <property>
        <name>yarn.nodemanager.local-dirs</name>
        <value>Comma-separated list of paths on the local filesystem where intermediate data is written.</value>
        <description>Multiple paths help spread disk i/o.</description>
    </property>
    <property>
        <name>yarn.nodemanager.log-dirs</name>
        <value>Comma-separated list of paths on the local filesystem where logs are written.</value>
        <description>Multiple paths help spread disk i/o.</description>
    </property>
    <property>
        <name>yarn.nodemanager.log.retain-seconds</name>
        <value>10800</value>
        <description>Default time (in seconds) to retain log files on the NodeManager Only applicable if log-aggregation is disabled.</description>
    </property>
    <property>
        <name>yarn.nodemanager.remote-app-log-dir</name>
        <value>/logs</value>
        <description>HDFS directory where the application logs are moved on application completion. Need to set appropriate permissions. Only applicable if log-aggregation is enabled.</description>
    </property>
    <property>
        <name>yarn.nodemanager.remote-app-log-dir-suffix</name>
        <value>logs</value>
        <description>Suffix appended to the remote log dir. Logs will be aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam} Only applicable if log-aggregation is enabled.</description>
    </property>
    -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
        <description>Shuffle service that needs to be set for Map Reduce applications.    </description>
    </property> 

    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
        <description></description>
    </property>
    <property>
        <name>yarn.log.server.url</name>
        <value>http://node2:19888/jobhistory/logs</value>
        <description></description>
    </property> 

Note: avoid non-ASCII (e.g. Chinese) text in the actual configuration files.

This configuration is for reference only; remove the commented-out blocks when you use it, and add or drop properties to suit your environment.
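
The same JDK, Hadoop tree, and configuration must be present on node2 and node3. One way to push everything from node1 (a sketch; assumes passwordless root SSH and the paths used above, adjust to taste):

for h in node2 node3; do
    scp -r /opt/jdk1.7.0_79 root@$h:/opt/
    scp -r /usr/local/hadoop-2.7.3 root@$h:/usr/local/
    scp /etc/profile root@$h:/etc/profile
done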

4. Start the cluster

(1) Format the NameNode (run this only once, on node1)

hadoop namenode -format  # the script lives in ${HADOOP_HOME}/bin; on Hadoop 2.x, hdfs namenode -format is the preferred equivalent

(2) Start and stop the cluster

Commonly used start/stop scripts:

# ls -l | awk '{print $9}'
start-all.sh / stop-all.sh       # start/stop all daemons
start-dfs.sh / stop-dfs.sh       # start/stop HDFS
start-yarn.sh / stop-yarn.sh     # start/stop YARN
mr-jobhistory-daemon.sh          # job history service
hadoop-daemon.sh / hadoop-daemons.sh
yarn-daemon.sh / yarn-daemons.sh
start-balancer.sh / stop-balancer.sh  # rebalance block distribution across DataNodes

Three ways to start the daemons:

Option 1: start each daemon individually (the usual approach in production)
hadoop-daemon.sh start|stop namenode|datanode|journalnode
yarn-daemon.sh start|stop resourcemanager|nodemanager

Option 2: start HDFS and YARN separately
start-dfs.sh
start-yarn.sh

Option 3: start everything at once
start-all.sh

Job history service:
mr-jobhistory-daemon.sh start historyserver
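
Once the daemons are up, verify them with jps on each node (compare against the role plan above) and through the web UIs; the ports follow the configuration above, and 50070 is the NameNode default in Hadoop 2.x:

# jps
NameNode web UI:        http://node1:50070
ResourceManager web UI: http://node2:8088
JobHistory web UI:      http://node2:19888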
