Creating a Hadoop cluster from scratch on VMware

1. Prepare the template virtual machine environment

1) Prepare a template virtual machine, hadoop100. The virtual machine requirements are as follows:

Note: the Linux environment in this article uses CentOS-7.5-x86-1804 as the example

Template virtual machine: 4 GB of memory and a 50 GB hard disk. Install the packages needed to prepare for the Hadoop installation:

[root@hadoop100 ~]# yum install -y epel-release
[root@hadoop100 ~]# yum install -y psmisc nc net-tools rsync vim lrzsz ntp libzstd openssl-static tree iotop git

Installing with yum requires that the virtual machine can access the Internet. Test network connectivity before running yum:

[root@hadoop100 ~]# ping www.baidu.com
PING www.baidu.com (14.215.177.39) 56(84) bytes of data.
64 bytes from 14.215.177.39 (14.215.177.39): icmp_seq=1 ttl=128 time=8.60 ms
64 bytes from 14.215.177.39 (14.215.177.39): icmp_seq=2 ttl=128 time=7.72 ms

2) Turn off the firewall and disable it at boot

[root@hadoop100 ~]# systemctl stop firewalld
[root@hadoop100 ~]# systemctl disable firewalld

3) Create a hadoop user and set its password

[root@hadoop100 ~]# useradd hadoop   
[root@hadoop100 ~]# passwd hadoop   

4) Give the hadoop user root privileges so that it can later run commands with root authority via sudo

[root@hadoop100 ~]# vim /etc/sudoers

In the /etc/sudoers file, find the following line (around line 91) and add a line under the root entry, as shown below:

## Allow root to run any commands anywhere
root    ALL=(ALL)     ALL
hadoop   ALL=(ALL)     NOPASSWD:ALL
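
To confirm the rule works, you can switch to the hadoop user and run a command through sudo; with NOPASSWD:ALL it should succeed without prompting for a password (a quick check, not part of the original steps):

[root@hadoop100 ~]# su - hadoop
[hadoop@hadoop100 ~]$ sudo whoami
root
[hadoop@hadoop100 ~]$ exit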

5) Create folders in the /opt directory and change their owner and group

(1) Create the module and software folders in the /opt directory

[root@hadoop100 ~]# mkdir /opt/module
[root@hadoop100 ~]# mkdir /opt/software

(2) Change the owner and group of the module and software folders to the hadoop user

[root@hadoop100 ~]# chown hadoop:hadoop /opt/module 
[root@hadoop100 ~]# chown hadoop:hadoop /opt/software

(3) View the owner and group of the module and software folders

[root@hadoop100 ~]# cd /opt/
[root@hadoop100 opt]# ll
total 12
drwxr-xr-x. 2 hadoop hadoop 4096 May 28 17:18 module
drwxr-xr-x. 2 root   root   4096 Sep  7  2017 rh
drwxr-xr-x. 2 hadoop hadoop 4096 May 28 17:18 software

6) Uninstall the OpenJDK that ships with the virtual machine

[root@hadoop100 ~]# rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps
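
If the uninstall succeeded, querying the RPM database for Java packages should return no output (a quick check):

[root@hadoop100 ~]# rpm -qa | grep -i java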

7) Restart the virtual machine

[root@hadoop100 ~]# reboot

2. Clone the virtual machines

1) Using the template machine hadoop100, clone three virtual machines: hadoop102, hadoop103, and hadoop104

2) Modify the cloned machine's IP; hadoop102 is used as the example below

(1) Modify the static IP of the cloned virtual machine

[root@hadoop100 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens33

Change to:

DEVICE=ens33
TYPE=Ethernet
ONBOOT=yes
BOOTPROTO=static
NAME="ens33"
IPADDR=192.168.1.102
PREFIX=24
GATEWAY=192.168.1.2
DNS1=192.168.1.2

(2) Check the virtual network editor in VMware: Edit -> Virtual Network Editor -> VMnet8

(3) View the IP address of Windows system adapter VMware Network Adapter VMnet8

(4) Make sure the IP address in the Linux ifcfg-ens33 file, the address in the virtual network editor, and the Windows VMnet8 adapter address are all in the same network segment.
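
Once ifcfg-ens33 has been edited, the new address only takes effect after the network is restarted (the reboot in step 4 below also applies it). A quick check, assuming the legacy network service (network-scripts) used by CentOS 7 is still enabled:

[root@hadoop100 ~]# systemctl restart network
[root@hadoop100 ~]# ip addr show ens33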

3) Modify the hostname of the cloned machine; hadoop102 is used as the example below

(1) Modify the hostname using either of the following two methods

[root@hadoop102 ~]# hostnamectl --static set-hostname hadoop102	

Or modify the /etc/hostname file

[root@hadoop102 ~]# vim /etc/hostname
hadoop102

(2) Configure the hostname mappings for the Linux clones by opening /etc/hosts

[root@hadoop102 ~]# vim /etc/hosts

Add the following

192.168.1.100 hadoop100
192.168.1.101 hadoop101
192.168.1.102 hadoop102
192.168.1.103 hadoop103
192.168.1.104 hadoop104
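
You can verify the mappings with ping once the corresponding machines are up with their static IPs configured (a quick check):

[root@hadoop102 ~]# ping -c 2 hadoop103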

4) Restart the cloned machine hadoop102

[root@hadoop102 ~]# reboot

3. Install the JDK on hadoop102

1) Uninstall existing JDK

[hadoop@hadoop102 ~]$ rpm -qa | grep -i java | xargs -n1 sudo rpm -e --nodeps

2) Use the Xshell tool to upload the downloaded JDK (jdk1.8.0_212) to the software folder under the /opt directory

3) Extract the JDK to the /opt/module directory

[hadoop@hadoop102 software]$ tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/

4) Configure JDK environment variables

(1) Create a new /etc/profile.d/my_env.sh file

[hadoop@hadoop102 ~]$ sudo vim /etc/profile.d/my_env.sh

Add the following

#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_212
export PATH=$PATH:$JAVA_HOME/bin

(2) Exit after saving

:wq

(3) Source the /etc/profile file to make the new PATH environment variable take effect

[hadoop@hadoop102 ~]$ source /etc/profile

5) Test whether the JDK is installed successfully

[hadoop@hadoop102 ~]$ java -version

If you can see the following results, the Java installation is successful.

java version "1.8.0_212"

4. Install Hadoop on hadoop102

Hadoop download address: https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/

1) Use the Xshell tool to upload hadoop-3.1.3.tar.gz to the software folder under the /opt directory

2) Extract the installation file to /opt/module

[hadoop@hadoop102 software]$ tar -zxvf hadoop-3.1.3.tar.gz -C /opt/module/

3) Check whether the extraction succeeded

[hadoop@hadoop102 software]$ ls /opt/module/
hadoop-3.1.3

4) Add Hadoop to the environment variables

(1) Get Hadoop installation path

[hadoop@hadoop102 hadoop-3.1.3]$ pwd
/opt/module/hadoop-3.1.3

(2) Open the /etc/profile.d/my_env.sh file

sudo vim /etc/profile.d/my_env.sh

Add the following at the end of the my_env.sh file (press shift+g in vim to jump to the end):

#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

(3) Exit after saving

:wq

(4) Make the modified file take effect

[hadoop@hadoop102 hadoop-3.1.3]$ source /etc/profile

5) Test whether the installation is successful

[hadoop@hadoop102 hadoop-3.1.3]$ hadoop version
Hadoop 3.1.3

5. Passwordless SSH login configuration

1) Generate public and private keys:

[hadoop@hadoop102 .ssh]$ ssh-keygen -t rsa

Then press Enter three times; two files will be generated: id_rsa (the private key) and id_rsa.pub (the public key)
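
The key pair is stored in the hadoop user's ~/.ssh directory; listing it should show the two files (a quick check, possibly alongside known_hosts):

[hadoop@hadoop102 .ssh]$ ls ~/.ssh
id_rsa  id_rsa.pub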

2) Copy the public key to the target machines to enable passwordless login

[hadoop@hadoop102 .ssh]$ ssh-copy-id hadoop102
[hadoop@hadoop102 .ssh]$ ssh-copy-id hadoop103
[hadoop@hadoop102 .ssh]$ ssh-copy-id hadoop104

Note:

You also need to configure the hadoop account on hadoop103 for passwordless login to the hadoop102, hadoop103, and hadoop104 servers.

You also need to configure the hadoop account on hadoop104 for passwordless login to the hadoop102, hadoop103, and hadoop104 servers.

You also need to configure the root account on hadoop102 for passwordless login to hadoop102, hadoop103, and hadoop104; the commands are analogous to the example sketched below.
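
For example, the same two steps for the hadoop account on hadoop103 look like this (a sketch; repeat analogously on hadoop104, and on hadoop102 as root):

[hadoop@hadoop103 ~]$ ssh-keygen -t rsa
[hadoop@hadoop103 ~]$ ssh-copy-id hadoop102
[hadoop@hadoop103 ~]$ ssh-copy-id hadoop103
[hadoop@hadoop103 ~]$ ssh-copy-id hadoop104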

6. Write cluster distribution script xsync

Purpose: synchronize files across the cluster, such as environment variable configuration files, cluster startup scripts, and so on

Script implementation

(1) Create an xsync file in the /home/hadoop/bin directory

[hadoop@hadoop102 opt]$ cd /home/hadoop
[hadoop@hadoop102 ~]$ mkdir bin
[hadoop@hadoop102 ~]$ cd bin
[hadoop@hadoop102 bin]$ vim xsync

Write the following code in this file

#!/bin/bash
#1. Check the number of arguments
if [ $# -lt 1 ]
then
  echo "Not Enough Arguments!"
  exit;
fi
#2. Traverse all machines in the cluster
for host in hadoop102 hadoop103 hadoop104
do
  echo ====================  $host  ====================
  #3. Traverse all files and directories and send them one by one
  for file in "$@"
  do
    #4. Check whether the file exists
    if [ -e "$file" ]
    then
      #5. Get the parent directory
      pdir=$(cd -P "$(dirname "$file")"; pwd)
      #6. Get the name of the current file
      fname=$(basename "$file")
      ssh $host "mkdir -p $pdir"
      rsync -av "$pdir/$fname" "$host:$pdir"
    else
      echo "$file does not exist!"
    fi
  done
done

(2) Make the xsync script executable

[hadoop@hadoop102 bin]$ chmod +x xsync

(3) Copy the script to /bin so that it can be invoked globally

[hadoop@hadoop102 bin]$ sudo cp xsync /bin/
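
For example, to distribute the script directory itself to the other two nodes (assuming the passwordless SSH from section 5 is already in place):

[hadoop@hadoop102 ~]$ xsync /home/hadoop/bin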

7. Cluster configuration

1) Cluster deployment planning

Note: NameNode and SecondaryNameNode should not be installed on the same server

Note: the ResourceManager also consumes a lot of memory and should not be placed on the same machine as the NameNode or SecondaryNameNode.

       hadoop102              hadoop103                       hadoop104
HDFS   NameNode, DataNode     DataNode                        SecondaryNameNode, DataNode
YARN   NodeManager            ResourceManager, NodeManager    NodeManager

2) Configuration file description

Hadoop configuration files come in two types: default configuration files and custom configuration files. Only when you want to override a default value do you need to edit the corresponding custom configuration file and change the property.

(1) Default configuration files:

Default file             Location inside the Hadoop jar packages
core-default.xml         hadoop-common-3.1.3.jar/core-default.xml
hdfs-default.xml         hadoop-hdfs-3.1.3.jar/hdfs-default.xml
yarn-default.xml         hadoop-yarn-common-3.1.3.jar/yarn-default.xml
mapred-default.xml       hadoop-mapreduce-client-core-3.1.3.jar/mapred-default.xml

(2) Custom configuration files:

The four custom configuration files core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml are stored under $HADOOP_HOME/etc/hadoop; users can modify their configuration according to project requirements.

(3) Common port numbers

Daemon                    Application                       Hadoop 2.x     Hadoop 3.x
NameNode port             Hadoop HDFS NameNode              8020 / 9000    9820
                          Hadoop HDFS NameNode HTTP UI      50070          9870
Secondary NameNode port   Secondary NameNode                50091          9869
                          Secondary NameNode HTTP UI        50090          9868
DataNode port             Hadoop HDFS DataNode IPC          50020          9867
                          Hadoop HDFS DataNode              50010          9866
                          Hadoop HDFS DataNode HTTP UI      50075          9864

3) Configure the cluster

(1) Core configuration file

Configure core-site.xml

[hadoop@hadoop102 ~]$ cd $HADOOP_HOME/etc/hadoop
[hadoop@hadoop102 hadoop]$ vim core-site.xml

The contents of the file are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Specify the NameNode address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop102:9820</value>
    </property>
    <!-- Specify the Hadoop data storage directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>
    </property>

    <!-- Configure hadoop as the static user for HDFS web page login -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>hadoop</value>
    </property>

    <!-- Hosts from which the hadoop (superuser) account may act as a proxy -->
    <property>
        <name>hadoop.proxyuser.hadoop.hosts</name>
        <value>*</value>
    </property>
    <!-- Groups whose users the hadoop (superuser) account may proxy -->
    <property>
        <name>hadoop.proxyuser.hadoop.groups</name>
        <value>*</value>
    </property>
    <!-- Users the hadoop (superuser) account may proxy -->
    <property>
        <name>hadoop.proxyuser.hadoop.users</name>
        <value>*</value>
    </property>
</configuration>

(2) HDFS configuration file

Configure hdfs-site.xml

[hadoop@hadoop102 hadoop]$ vim hdfs-site.xml

The contents of the file are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- NameNode web access address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop102:9870</value>
    </property>
    <!-- SecondaryNameNode web access address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop104:9868</value>
    </property>
</configuration>

(3) YARN configuration file

Configure yarn-site.xml

[hadoop@hadoop102 hadoop]$ vim yarn-site.xml

The contents of the file are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Specify mapreduce_shuffle for the MR shuffle -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Specify the ResourceManager address -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop103</value>
    </property>
    <!-- Environment variable inheritance -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
    <!-- Minimum and maximum memory YARN allows to allocate per container -->
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>512</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>4096</value>
    </property>
    <!-- Amount of physical memory the NodeManager is allowed to manage -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>4096</value>
    </property>
    <!-- Disable YARN's physical and virtual memory limit checks -->
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <!-- Enable log aggregation -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <!-- Set the log aggregation server address -->
    <property>
        <name>yarn.log.server.url</name>
        <value>http://hadoop102:19888/jobhistory/logs</value>
    </property>
    <!-- Set the log retention time to 7 days -->
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
</configuration>

(4) MapReduce configuration file

Configure mapred-site.xml

[hadoop@hadoop102 hadoop]$ vim mapred-site.xml

The contents of the file are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Specify that MapReduce programs run on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- History server address -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop102:10020</value>
    </property>

    <!-- History server web address -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hadoop102:19888</value>
    </property>
</configuration>

4) Distribute the configured Hadoop configuration files across the cluster

[hadoop@hadoop102 hadoop]$ xsync /opt/module/hadoop-3.1.3/etc/hadoop/
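
To confirm the distribution, you can inspect one of the files on another node (a quick check):

[hadoop@hadoop102 hadoop]$ ssh hadoop103 cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml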

8. Start the cluster

1) Configure workers

[hadoop@hadoop102 hadoop]$ vim /opt/module/hadoop-3.1.3/etc/hadoop/workers

Add the following content to the file:

hadoop102
hadoop103
hadoop104

Note: no space is allowed at the end of the content added in the file, and no blank line is allowed in the file.

Synchronize the configuration files to all nodes

[hadoop@hadoop102 hadoop]$ xsync /opt/module/hadoop-3.1.3/etc

2) Start the cluster

(1) If the cluster is being started for the first time, the NameNode must be formatted on the hadoop102 node. (Note: formatting the NameNode generates a new cluster ID, so the cluster IDs of the NameNode and the DataNodes will no longer match and the cluster cannot find its previous data. If the cluster reports errors while running and the NameNode really needs to be reformatted, first stop the NameNode and DataNode processes and delete the data and logs directories on all machines before formatting; a sketch of this procedure follows after the command below.)

[hadoop@hadoop102 ~]$ hdfs namenode -format
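
For reference, a minimal sketch of the reformatting procedure mentioned in the note above, assuming the default data and logs directories under the Hadoop home used throughout this article:

# stop the daemons first
[hadoop@hadoop102 hadoop-3.1.3]$ sbin/stop-dfs.sh
[hadoop@hadoop103 hadoop-3.1.3]$ sbin/stop-yarn.sh
# delete the data and logs directories on every node (repeat on hadoop103 and hadoop104)
[hadoop@hadoop102 hadoop-3.1.3]$ rm -rf data/ logs/
# then format the NameNode again
[hadoop@hadoop102 hadoop-3.1.3]$ hdfs namenode -format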

(2) Start HDFS

[hadoop@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh

(3) Start YARN on the node where the ResourceManager is configured (hadoop103)

[hadoop@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
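
According to the deployment plan in section 7, running jps on each node should now show roughly the following daemons (a quick check; the Jps process itself is also listed):

[hadoop@hadoop102 ~]$ jps    # NameNode, DataNode, NodeManager
[hadoop@hadoop103 ~]$ jps    # ResourceManager, DataNode, NodeManager
[hadoop@hadoop104 ~]$ jps    # SecondaryNameNode, DataNode, NodeManager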

(4) View the HDFS NameNode on the web

(a) Enter the following in the browser: http://hadoop102:9870

(b) View the data stored on HDFS

(5) View YARN's ResourceManager on the web

(a) Enter the following in the browser: http://hadoop103:8088

(b) View the Job information running on YARN

(6) View the JobHistory server on the web

(a) Enter the following in the browser: http://hadoop102:19888/jobhistory

9. Summary of cluster start/stop methods

1) Start/stop each service component individually

(1) Start/stop HDFS components individually

hdfs --daemon start/stop namenode/datanode/secondarynamenode

(2) Start/stop YARN components individually

yarn --daemon start/stop resourcemanager/nodemanager

2) Start/stop each module as a whole (passwordless SSH configuration is a prerequisite)

(1) Start/stop HDFS as a whole

start-dfs.sh/stop-dfs.sh

(2) Start/stop YARN as a whole

start-yarn.sh/stop-yarn.sh
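
If you start and stop the whole cluster often, the two module-level commands can be wrapped in a small helper script. The sketch below is a hypothetical myhadoop.sh placed in /home/hadoop/bin (not part of the original steps); it assumes the installation path used throughout this article and the passwordless SSH configured in section 5:

#!/bin/bash
# myhadoop.sh - start or stop the whole cluster (hypothetical helper sketch)
if [ $# -lt 1 ]
then
  echo "Usage: myhadoop.sh start|stop"
  exit 1
fi
case $1 in
"start")
  echo "---------- starting HDFS on hadoop102 ----------"
  ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
  echo "---------- starting YARN on hadoop103 ----------"
  ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
;;
"stop")
  echo "---------- stopping YARN on hadoop103 ----------"
  ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
  echo "---------- stopping HDFS on hadoop102 ----------"
  ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh"
;;
*)
  echo "Unknown argument: $1"
;;
esac

Note that a non-interactive ssh session may not source /etc/profile, so JAVA_HOME may also need to be set in hadoop-env.sh for the remote start scripts to find Java.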

10. Cluster time synchronization

1) Time server configuration (must be root)

(0) Check the ntpd service status and whether it is enabled at boot on all nodes

[hadoop@hadoop102 ~]$ sudo systemctl status ntpd
[hadoop@hadoop102 ~]$ sudo systemctl is-enabled ntpd

(1) Stop the ntpd service on all nodes and disable it at boot

[hadoop@hadoop102 ~]$ sudo systemctl stop ntpd
[hadoop@hadoop102 ~]$ sudo systemctl disable ntpd

(2) Modify the /etc/ntp.conf configuration file on hadoop102

[hadoop@hadoop102 ~]$ sudo vim /etc/ntp.conf

Make the following modifications:

a) Modification 1 (authorize all machines in the 192.168.1.0-192.168.1.255 network segment to query and synchronize time with this machine): change

#restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap

to

restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap

b) Modification 2 (the cluster is in a LAN, so do not use time servers on the public Internet): change

server 0.centos.pool.ntp.org iburst
server 1.centos.pool.ntp.org iburst
server 2.centos.pool.ntp.org iburst
server 3.centos.pool.ntp.org iburst

to

#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst

c) Addition 3 (when the node loses its network connection, it can still use its local clock as the time source to provide time synchronization for the other nodes in the cluster):

server 127.127.1.0
fudge 127.127.1.0 stratum 10

(3) Modify the /etc/sysconfig/ntpd file on hadoop102

[hadoop@hadoop102 ~]$ sudo vim /etc/sysconfig/ntpd

Add the following content (synchronize the hardware clock with the system time):

SYNC_HWCLOCK=yes

(4) Restart the ntpd service

[hadoop@hadoop102 ~]$ sudo systemctl start ntpd

(5) Enable the ntpd service at boot

[hadoop@hadoop102 ~]$ sudo systemctl enable ntpd

2) Other machine configurations (must be root)

(1) Configure other machines to synchronize with the time server once every 10 minutes

[hadoop@hadoop103 ~]$ sudo crontab -e

Add the following scheduled task:

*/10 * * * * /usr/sbin/ntpdate hadoop102

(2) Change the time on any other machine

[hadoop@hadoop103 ~]$ sudo date -s "2017-9-11 11:11:11"

(3) Ten minutes later, check whether the machine has synchronized with the time server

[hadoop@hadoop103 ~]$ sudo date

Note: when testing, you can adjust 10 minutes to 1 minute to save time.

Special note: if the cluster has Internet access, all nodes can simply use Internet time directly; in that case, start the ntpd service on all nodes and enable it at boot:

[hadoop@hadoop102 ~]$ sudo systemctl start ntpd
[hadoop@hadoop102 ~]$ sudo systemctl enable ntpd

