Recommendation system - Hadoop fully distributed (development focus)

Development focus: a fully distributed Hadoop cluster.

1. Copy hadoop100 to 101 and 102

2. ssh password-free login

3. Cluster configuration

4. Write and use the xsync distribution script (optional)

5. Start the cluster and test

1. Copy hadoop100 to 101 and 102

(1) scp (secure copy)

scp copies data from one server to another (e.g. from server1 to server2).

(2) Basic syntax

scp      -r           $pdir/$fname                   $user@$host:$pdir/$fname
command  recursive    source path/file name to copy  destination user@host:destination path/name

(3) Case practice (you can either push from hadoop100 to 101 and 102, as in step (a) below, or pull from hadoop100 while logged in to 101 and 102, as in step (b) below; choose either one)

(a) On hadoop100, push the /opt/module/jdk1.8.0_212 and /opt/module/hadoop-3.1.3 directories from hadoop100 to hadoop101 and hadoop102.

# All commands are run on hadoop100 (from /opt/module); push jdk and hadoop to 101 and 102 respectively
sudo scp -r jdk1.8.0_212/ [username]@hadoop101:/opt/module
sudo scp -r hadoop-3.1.3/ [username]@hadoop101:/opt/module

sudo scp -r jdk1.8.0_212/ [username]@hadoop102:/opt/module
sudo scp -r hadoop-3.1.3/ [username]@hadoop102:/opt/module

Note: this step may fail with a permission-denied error. In that case, log in to the target machine (hadoop101) and relax the permissions on the destination directories; do the same on hadoop102.

# On the target machine (hadoop101, then hadoop102), run in /opt
sudo chmod 777 module
sudo chmod 777 software

(b) On hadoop101 and hadoop102, pull the /opt/module/jdk1.8.0_212 and /opt/module/hadoop-3.1.3 directories over from hadoop100.

# Operate on hadoop101 (from /opt/module): pull jdk and hadoop over from hadoop100
scp -r [username]@hadoop100:/opt/module/jdk1.8.0_212 ./
scp -r [username]@hadoop100:/opt/module/hadoop-3.1.3 ./

# Operate on hadoop102 (from /opt/module): pull jdk and hadoop over from hadoop100
scp -r [username]@hadoop100:/opt/module/jdk1.8.0_212 ./
scp -r [username]@hadoop100:/opt/module/hadoop-3.1.3 ./

2. ssh password-free login

(1) Principle of password-free login

(2) Generate public and private keys

ssh-keygen -t rsa

Then press Enter three times; two files will be generated: id_rsa (private key) and id_rsa.pub (public key).

(3) Copy the public key to the target machine for password-free login

ssh-copy-id hadoop100
ssh-copy-id hadoop101
ssh-copy-id hadoop102

Notice:

The commands above only give the current user on hadoop100 password-free access. You also need to configure hadoop101 to log in to hadoop100, hadoop101, and hadoop102 without a password.

Likewise, configure hadoop102 to log in to hadoop100, hadoop101, and hadoop102 without a password, so that every machine can reach every other machine; see the sketch below.
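A minimal sketch of repeating the key setup on the other machines (same [username] and hostnames as in this guide):

# run on hadoop101, then repeat on hadoop102
ssh-keygen -t rsa
ssh-copy-id hadoop100
ssh-copy-id hadoop101
ssh-copy-id hadoop102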

(4) Explanation of file functions in the .ssh folder (~/.ssh)

known_hosts        records the public keys of hosts that ssh has accessed
id_rsa             the generated private key
id_rsa.pub         the generated public key
authorized_keys    stores the authorized public keys of the machines allowed to log in to this server without a password
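A quick verification that password-free login works (hostnames as configured above; the commands should run without prompting for a password):

# run from hadoop100; repeat from hadoop101 and hadoop102 once they are configured
ssh hadoop101 hostname
ssh hadoop102 hostname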

3. Cluster configuration

(1) Cluster deployment planning

Notice:

1. Do not install NameNode and SecondaryNameNode on the same server

2. ResourceManager also consumes a lot of memory. Do not configure it on the same machine as NameNode and SecondaryNameNode.

        hadoop100               hadoop101                        hadoop102
HDFS    NameNode, DataNode      DataNode                         SecondaryNameNode, DataNode
YARN    NodeManager             ResourceManager, NodeManager     NodeManager

(2) Configuration file description

Hadoop configuration files fall into two categories: default configuration files and custom configuration files. A user only needs to edit a custom configuration file when a default value has to be overridden; set the corresponding property there.

(1) Default configuration file:

Default configuration file    Location inside the Hadoop jar packages
core-default.xml              hadoop-common-3.1.3.jar/core-default.xml
hdfs-default.xml              hadoop-hdfs-3.1.3.jar/hdfs-default.xml
yarn-default.xml              hadoop-yarn-common-3.1.3.jar/yarn-default.xml
mapred-default.xml            hadoop-mapreduce-client-core-3.1.3.jar/mapred-default.xml

(2) Custom configuration file:

The four custom configuration files, core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml, live under $HADOOP_HOME/etc/hadoop; users modify them according to project requirements.

(3) Configure the cluster

(1) Core configuration file: configure core-site.xml

cd /opt/module/hadoop-3.1.3/etc/hadoop
vim core-site.xml
# The configuration content is as follows
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Specify the NameNode address; this is the internal communication address used by the three servers -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop100:8020</value>
    </property>

    <!-- specify hadoop data storage directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>
    </property>

    <!-- Configure the static user used for HDFS web UI login -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>[username]</value>
    </property>
</configuration>

(2) HDFS configuration file: configure hdfs-site.xml

cd /opt/module/hadoop-3.1.3/etc/hadoop
vim hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
	<!-- NameNode (nn) web UI access address -->
	<property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop100:9870</value>
    </property>
	<!-- SecondaryNameNode (2nn) web UI access address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop102:9868</value>
    </property>
</configuration>

(3) YARN configuration file: configure yarn-site.xml

cd /opt/module/hadoop-3.1.3/etc/hadoop
vim yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Specify that MapReduce uses the mapreduce_shuffle auxiliary service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <!-- Specify the ResourceManager address -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop101</value>
    </property>

    <!-- Inheritance of environment variables -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

(4) MapReduce configuration file: configure mapred-site.xml

By default, MapReduce jobs run locally; with this custom configuration, MapReduce runs under YARN's resource scheduling.

cd /opt/module/hadoop-3.1.3/etc/hadoop
vim mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
	<!-- Specify that MapReduce programs run on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

So far we have only edited these custom configuration files on hadoop100, not on 101 and 102, so the configured files still need to be distributed to 101 and 102.

(5) Configure workers

vim /opt/module/hadoop-3.1.3/etc/hadoop/workers

# File contents (the three worker nodes of this cluster, one hostname per line):
hadoop100
hadoop101
hadoop102

Note: do not add trailing spaces or blank lines in this configuration file.
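A quick way to check for stray whitespace is cat -A, which makes line endings and tabs visible:

cat -A /opt/module/hadoop-3.1.3/etc/hadoop/workers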

4. Write and use the xsync distribution script (if you prefer to modify these four configuration files by hand on every machine, you can skip this step; the script exists purely for convenience)

We want the xsync script to be usable from any path, which means it must live in a directory on the global PATH. So first print the PATH to see which directories it contains, then create the script in one of them.

View the directories on the PATH: echo $PATH

We will use the bin directory under our home directory.
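A small check, in case /home/[username]/bin is not yet on the PATH (on many distributions ~/bin is added to the PATH automatically by ~/.bash_profile once it exists; the export below is only needed if it is missing):

echo $PATH
# only if /home/[username]/bin does not appear in the output:
echo 'export PATH=$PATH:/home/[username]/bin' >> ~/.bashrc
source ~/.bashrc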

Script implementation:

Create an xsync file in the /home/[username]/bin directory

cd /home/[username]
mkdir bin
cd bin
vim xsync

Write the following code in this file

#!/bin/bash

#1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo "Not Enough Arguments!"
    exit;
fi

#2. Traverse all the machines in the cluster
for host in hadoop100 hadoop101 hadoop102
do
    echo ====================  $host  ====================
    #3. Traverse all files/directories and send them one by one
    for file in "$@"
    do
        #4. Check whether the file exists
        if [ -e "$file" ]
            then
                #5. Get the parent directory (resolving symlinks)
                pdir=$(cd -P "$(dirname "$file")"; pwd)

                #6. Get the name of the current file
                fname=$(basename "$file")
                ssh $host "mkdir -p $pdir"
                rsync -av "$pdir/$fname" $host:"$pdir"
            else
                echo "$file does not exist!"
        fi
    done
done
# Give the script xsync execute permission
chmod +x xsync

# Test the script: distribute the bin directory itself and confirm that the custom command works
xsync /home/[username]/bin

Note: although the xsync command can now distribute files, distributing files that require elevated (root) permissions to transfer is still not handled at this point.

Once the distribution script is ready, distribute the configuration files we just modified on hadoop100 to servers 101 and 102, and also distribute the modified workers file to every machine.

# Distribute the modified hadoop configuration file
xsync /opt/module/hadoop-3.1.3/etc/hadoop/

# Check whether the distribution succeeded: run this on hadoop101 and hadoop102
cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml

# Distribute modified workers
xsync /opt/module/hadoop-3.1.3/etc

5. Start the cluster and test

(1) Start the cluster

1) If the cluster is started for the first time, you need to format the NameNode on the hadoop100 node. (Note: formatting the NameNode generates a new cluster ID; if the NameNode and DataNodes end up with inconsistent cluster IDs, the cluster cannot find its past data. If an error is reported while the cluster is running and the NameNode has to be reformatted, be sure to stop the namenode and datanode processes first, and delete the data and logs directories of all machines before formatting, as sketched after the format command below.)

# format namenode
hdfs namenode -format
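If the NameNode ever has to be reformatted later (per the note above), a minimal clean-up sketch, assuming the installation paths used in this guide:

# stop the running HDFS processes first
/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh

# on every machine, delete the old data and logs directories
rm -rf /opt/module/hadoop-3.1.3/data /opt/module/hadoop-3.1.3/logs

# then format the NameNode again on hadoop100
hdfs namenode -format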

2) Start HDFS: enter the hadoop-3.1.3 directory and execute sbin/start-dfs.sh
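For example (installation path as used throughout this guide):

# on hadoop100
cd /opt/module/hadoop-3.1.3
sbin/start-dfs.sh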

Note: at this startup stage an error may be reported (the original post includes a screenshot of the error here).

The JDK is clearly configured and can be found on every machine; if you run into the same problem:

Solution:

In hadoop-env.sh, explicitly re-declare JAVA_HOME, as sketched below.
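A minimal sketch of the fix (the JDK path is the one installed earlier; the change has to reach every node, e.g. via xsync):

vim /opt/module/hadoop-3.1.3/etc/hadoop/hadoop-env.sh
# add (or uncomment and edit) this line:
export JAVA_HOME=/opt/module/jdk1.8.0_212

# distribute the change so all nodes pick it up
xsync /opt/module/hadoop-3.1.3/etc/hadoop/hadoop-env.sh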

Success!

3) View the NameNode of HDFS on the Web side

(a) Enter in the browser: http://hadoop100:9870

(b) View data information stored on HDFS

4) Start YARN on the node where the ResourceManager is configured (hadoop101): sbin/start-yarn.sh
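For example (YARN must be started on hadoop101, where the ResourceManager is configured):

# on hadoop101
cd /opt/module/hadoop-3.1.3
sbin/start-yarn.sh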

5) View the ResourceManager of YARN on the Web side

(a) Enter in the browser: http://hadoop101:8088

(b) View Job information running on YARN

After everything is up, we can compare the running processes on each machine against our deployment plan:

        hadoop100               hadoop101                        hadoop102
HDFS    NameNode, DataNode      DataNode                         SecondaryNameNode, DataNode
YARN    NodeManager             ResourceManager, NodeManager     NodeManager
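One convenient way to do the comparison is to run jps on every machine over the password-free ssh set up earlier (a sketch; the explicit jps path is used because JAVA_HOME may not be on the PATH in a non-interactive ssh session):

for host in hadoop100 hadoop101 hadoop102
do
    echo "==== $host ===="
    ssh $host /opt/module/jdk1.8.0_212/bin/jps
done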

(2) Basic cluster test

1) Upload files to the cluster

# Create a wcinput directory in the HDFS root directory
hadoop fs -mkdir /wcinput

# upload word.txt
hadoop fs -put /opt/module/hadoop-3.1.3/wcinput/word.txt  /wcinput

2) View where the file is stored: the web page at http://hadoop100:9870 only displays file metadata; the actual data blocks are stored on the DataNodes' local disks under the data directory configured by hadoop.tmp.dir (/opt/module/hadoop-3.1.3/data).
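A quick check from the command line (standard hadoop fs commands; paths as above):

hadoop fs -ls /wcinput
hadoop fs -cat /wcinput/word.txt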

3) Execute the wordcount program (wordcount again as the example, but this time running on the cluster)

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /wcinput /wcoutput

Note: another error occurred here when running wordcount in cluster mode (the error screenshot is in the original post).

The message again clearly indicates that the Java environment is not set up properly. Solution:

cd etc/hadoop
vim hadoop-env.sh

# Add a global java environment variable
export JAVA_HOME=/opt/module/jdk1.8.0_212 
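After the change, a sketch of redistributing the fix and rerunning the job (the output directory matches the command above; part-r-00000 is the standard reducer output file name):

# distribute the modified hadoop-env.sh to all nodes
xsync /opt/module/hadoop-3.1.3/etc/hadoop/hadoop-env.sh

# rerun the wordcount job (remove the output directory first if a failed attempt already created it)
hadoop fs -rm -r /wcoutput
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /wcinput /wcoutput

# view the result
hadoop fs -cat /wcoutput/part-r-00000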

 
