Development focus: fully distributed Hadoop
1. Copy hadoop100 to 101 and 102
2. ssh password-free login
3. Cluster configuration
4. Create and use the xsync distribution script (optional)
5. Start the cluster and test it
1. Copy hadoop100 to 101 and 102
(1) scp (secure copy) secure copy
scp can copy data from one server to another (e.g., from server1 to server2).
(2) Basic syntax
```bash
scp    -r        $pdir/$fname                $user@$host:$pdir/$fname
# command  recursive  path/name of file to copy  destination user@host:destination path/name
```
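As a minimal illustration of the syntax (the file name here is purely hypothetical):

```bash
# Copy a single local file to the same path on hadoop101 (test.txt is a made-up example)
scp /opt/software/test.txt [username]@hadoop101:/opt/software/
```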
(3) Case practice (you can either push from hadoop100 to 101 and 102, which is operation (a) below, or pull from hadoop100 while on 101 and 102, which is operation (b) below; in practice, pick either one)
(a) On hadoop100, push the /opt/module/jdk1.8.0_212 and /opt/module/hadoop-3.1.3 directories to hadoop101 and hadoop102.
```bash
# All operations run on hadoop100: push jdk and hadoop to 101 and 102 respectively
sudo scp -r jdk1.8.0_212/ [username]@hadoop101:/opt/module
sudo scp -r hadoop-3.1.3/ [username]@hadoop101:/opt/module
sudo scp -r jdk1.8.0_212/ [username]@hadoop102:/opt/module
sudo scp -r hadoop-3.1.3/ [username]@hadoop102:/opt/module
```
Note: this step may fail with a permission error. If so, log in to the target machine (hadoop101) and modify the permissions; do the same on hadoop102.
```bash
sudo chmod 777 module
sudo chmod 777 software
```
(b) On hadoop101 and hadoop102, pull the /opt/module/jdk1.8.0_212 and /opt/module/hadoop-3.1.3 directories from hadoop100.
```bash
# Run on hadoop101 (from /opt/module): pull jdk and hadoop from hadoop100
scp -r [username]@hadoop100:/opt/module/jdk1.8.0_212 ./
scp -r [username]@hadoop100:/opt/module/hadoop-3.1.3 ./
# Run on hadoop102 (from /opt/module): pull jdk and hadoop from hadoop100
scp -r [username]@hadoop100:/opt/module/jdk1.8.0_212 ./
scp -r [username]@hadoop100:/opt/module/hadoop-3.1.3 ./
```
2. ssh password-free login
(1) Principle of password-free login: machine A generates a key pair and copies its public key into machine B's ~/.ssh/authorized_keys; when A connects, B verifies that A holds the matching private key, so no password needs to be entered.
(2) Generate public and private keys
ssh-keygen -t rsa
Then press Enter three times; two files will be generated: id_rsa (the private key) and id_rsa.pub (the public key).
(3) Copy the public key to the target machine for password-free login
```bash
ssh-copy-id hadoop100
ssh-copy-id hadoop101
ssh-copy-id hadoop102
```
Notice:
You also need to repeat the key generation and ssh-copy-id steps on hadoop101 so that it can log in to hadoop100, hadoop101, and hadoop102 without a password.
You also need to repeat the same steps on hadoop102 so that it can log in to hadoop100, hadoop101, and hadoop102 without a password.
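As a quick sanity check (not part of the original steps), you can confirm passwordless login works from the current machine:

```bash
# Each command should print the remote hostname without prompting for a password
ssh hadoop100 hostname
ssh hadoop101 hostname
ssh hadoop102 hostname
```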
(4) Explanation of file functions in the .ssh folder (~/.ssh)
| File | Purpose |
| --- | --- |
| known_hosts | Records the public keys of hosts that ssh has accessed |
| id_rsa | Generated private key |
| id_rsa.pub | Generated public key |
| authorized_keys | Stores the authorized public keys for passwordless login to this server |
3. Cluster configuration
(1) Cluster deployment planning
Notice:
1. Do not install NameNode and SecondaryNameNode on the same server
2. ResourceManager also consumes a lot of memory. Do not configure it on the same machine as NameNode and SecondaryNameNode.
|  | hadoop100 | hadoop101 | hadoop102 |
| --- | --- | --- | --- |
| HDFS | NameNode, DataNode | DataNode | SecondaryNameNode, DataNode |
| YARN | NodeManager | ResourceManager, NodeManager | NodeManager |
(2) Configuration file description
Hadoop configuration files fall into two categories: default configuration files and custom configuration files. Only when users want to override a default value do they need to modify the custom configuration file and change the corresponding property.
(1) Default configuration file:
| Default file | Location inside the Hadoop jar packages |
| --- | --- |
| core-default.xml | hadoop-common-3.1.3.jar/core-default.xml |
| hdfs-default.xml | hadoop-hdfs-3.1.3.jar/hdfs-default.xml |
| yarn-default.xml | hadoop-yarn-common-3.1.3.jar/yarn-default.xml |
| mapred-default.xml | hadoop-mapreduce-client-core-3.1.3.jar/mapred-default.xml |
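If you want to inspect one of these default files, a minimal sketch is to extract it from its jar; the jar location below assumes a standard 3.1.3 install under /opt/module.

```bash
# Extract core-default.xml from the common jar into the current directory and view it
cd /opt/module/hadoop-3.1.3/share/hadoop/common
jar xf hadoop-common-3.1.3.jar core-default.xml
less core-default.xml
```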
(2) Custom configuration file:
The four configuration files core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml are stored under $HADOOP_HOME/etc/hadoop; users can modify them there according to project requirements.
(3) Configure the cluster
(1) Core configuration file: configure core-site.xml
```bash
cd /opt/module/hadoop-3.1.3/etc/hadoop
vim core-site.xml
```
The configuration content is as follows:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- Specify the NameNode address; this is the internal communication address among the three servers -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop100:8020</value>
    </property>
    <!-- Specify the Hadoop data storage directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>
    </property>
    <!-- Configure the static user used for HDFS web UI login -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>[username]</value>
    </property>
</configuration>
```
(2) HDFS configuration file: configure hdfs-site.xml
```bash
cd /opt/module/hadoop-3.1.3/etc/hadoop
vim hdfs-site.xml
```
```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- NameNode (nn) web UI access address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop100:9870</value>
    </property>
    <!-- SecondaryNameNode (2nn) web UI access address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop102:9868</value>
    </property>
</configuration>
```
(3) YARN configuration file: configure yarn-site.xml
```bash
cd /opt/module/hadoop-3.1.3/etc/hadoop
vim yarn-site.xml
```
```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- Specify that MapReduce uses the shuffle auxiliary service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Specify the address of the ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop101</value>
    </property>
    <!-- Inheritance of environment variables -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
```
(4) MapReduce configuration file: configure mapred-site.xml
By default MapReduce runs locally; with this custom configuration it runs under YARN's resource scheduling instead.
```bash
cd /opt/module/hadoop-3.1.3/etc/hadoop
vim mapred-site.xml
```
```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- Specify that MapReduce programs run on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```
At present we have only configured these custom files on hadoop100, not on 101 and 102, so we need to distribute the configured files to 101 and 102.
(5) Configure workers
```bash
vim /opt/module/hadoop-3.1.3/etc/hadoop/workers
```
Note: in this configuration file, do not add spaces after the hostnames or blank lines in the file.
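Based on the deployment plan above (every machine runs a DataNode and NodeManager), the workers file should list the three hostnames, one per line:

```
hadoop100
hadoop101
hadoop102
```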
4. Create and use the xsync distribution script (if you prefer to modify these four configuration files manually on each machine, you can skip this step; the script is just for convenience)
We want the xsync script to be usable from any path, which means it has to live in a directory that is on the global PATH. So we first print the PATH to see which directories it contains, and then pick one of them to create the script in.
View the directories on the PATH: echo $PATH
We will use the bin directory under our user's home directory, since it is on the PATH.
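For reference, a sketch of what the output can look like (the exact value varies by machine); on a default CentOS setup, ~/.bash_profile adds $HOME/bin to PATH, which is why we place the script there.

```bash
echo $PATH
# e.g. /usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/[username]/.local/bin:/home/[username]/bin
```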
Script implementation:
Create an xsync file in the /home/[username]/bin directory
```bash
cd /home/[username]
mkdir bin
cd bin
vim xsync
```
Write the following code in this file
```bash
#!/bin/bash

# 1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo Not Enough Arguments!
    exit;
fi

# 2. Loop over every machine in the cluster
for host in hadoop100 hadoop101 hadoop102
do
    echo ==================== $host ====================
    # 3. Loop over all files/directories and send them one by one
    for file in $@
    do
        # 4. Check whether the file exists
        if [ -e $file ]
        then
            # 5. Get the parent directory
            pdir=$(cd -P $(dirname $file); pwd)
            # 6. Get the name of the current file
            fname=$(basename $file)
            ssh $host "mkdir -p $pdir"
            rsync -av $pdir/$fname $host:$pdir
        else
            echo $file does not exist!
        fi
    done
done
```
```bash
# Give the xsync script execute permission
chmod +x xsync
# Test the script: distribute the bin directory and confirm the custom command works
xsync /home/[username]/bin
```

Note: although xsync can now distribute files, transferring files that require elevated (root) permissions has not been addressed yet.
Once the distribution script is ready, use it to distribute the configuration files we just modified on hadoop100 to servers 101 and 102, and also distribute the modified workers file to every machine.
```bash
# Distribute the modified hadoop configuration files
xsync /opt/module/hadoop-3.1.3/etc/hadoop/
# Check whether the distribution succeeded: run this on the 101 and 102 machines
cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml
# Distribute the modified workers file
xsync /opt/module/hadoop-3.1.3/etc
```
5. Start the cluster and test it
(1) Start the cluster
1) If the cluster is being started for the first time, you need to format the NameNode on the hadoop100 node. (Note: formatting the NameNode generates a new cluster id. If old data exists, the cluster ids of the NameNode and DataNodes will no longer match and the cluster will not find its previous data. If the cluster reports errors while running and the NameNode has to be reformatted, be sure to stop the namenode and datanode processes first and delete the data and logs directories on all machines before formatting.)
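If a reformat ever becomes necessary, a rough sketch of the cleanup described in the note (exact steps depend on what is currently running):

```bash
# Stop YARN (on hadoop101) and HDFS (on hadoop100) first
sbin/stop-yarn.sh
sbin/stop-dfs.sh
# Then, on every machine, remove the old data and logs directories
rm -rf /opt/module/hadoop-3.1.3/data /opt/module/hadoop-3.1.3/logs
```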
```bash
# Format the NameNode
hdfs namenode -format
```
2) Start HDFS: enter the hadoop-3.1.3 directory and execute sbin/start-dfs.sh
Note: at this startup stage an error about the Java environment was reported, even though the JDK is clearly configured and can be found on every machine. If you run into the same problem:
Solution:
In hadoop-env.sh, re-declare JAVA_HOME explicitly
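As a sketch, assuming the JDK path used earlier in this guide, the line to add looks like this; after editing, distribute hadoop-env.sh to the other machines (e.g. with xsync) and run sbin/start-dfs.sh again.

```bash
# In /opt/module/hadoop-3.1.3/etc/hadoop/hadoop-env.sh, declare JAVA_HOME explicitly
export JAVA_HOME=/opt/module/jdk1.8.0_212
```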
Success!
3) View the NameNode of HDFS on the web UI
(a) Enter in the browser: http://hadoop100:9870
(b) View data information stored on HDFS
4) Start YARN on the node where the ResourceManager is configured (hadoop101): sbin/start-yarn.sh
5) View the ResourceManager of YARN on the web UI
(a) Enter in the browser: http://hadoop101:8088
(b) View Job information running on YARN
After everything is set up, we can compare the running processes with our planned deployment:
|  | hadoop100 | hadoop101 | hadoop102 |
| --- | --- | --- | --- |
| HDFS | NameNode, DataNode | DataNode | SecondaryNameNode, DataNode |
| YARN | NodeManager | ResourceManager, NodeManager | NodeManager |
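A quick way to compare against the plan is to run jps on each machine; a sketch of the expected processes (the exact output order varies):

```bash
# On hadoop100
jps   # NameNode, DataNode, NodeManager, Jps
# On hadoop101
jps   # ResourceManager, NodeManager, DataNode, Jps
# On hadoop102
jps   # SecondaryNameNode, DataNode, NodeManager, Jps
```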
(2) Basic cluster test
1) Upload files to the cluster
```bash
# Create a wcinput directory in the HDFS root
hadoop fs -mkdir /wcinput
# Upload word.txt
hadoop fs -put /opt/module/hadoop-3.1.3/wcinput/word.txt /wcinput
```
This is just what the web page displays; the actual data blocks are stored on disk under the hadoop.tmp.dir path configured earlier (/opt/module/hadoop-3.1.3/data).
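As an extra check from the command line (not part of the original steps), you can list and print the uploaded file:

```bash
hadoop fs -ls /wcinput
hadoop fs -cat /wcinput/word.txt
```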
3) Run the wordcount program (we take wordcount as the example again, but this time it runs on the cluster rather than locally)
```bash
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /wcinput /wcoutput
```
Note: another error occurred here when running wordcount on the cluster; the error message is shown below.
This prompt again clearly indicates that the Java environment is not set up properly, so the fix is the same:
```bash
cd etc/hadoop
vim hadoop-env.sh
# Add the global Java environment variable
export JAVA_HOME=/opt/module/jdk1.8.0_212
```
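Once the job succeeds, a minimal way to check the result (assuming the /wcoutput path used above; the reducer output is normally named part-r-00000):

```bash
hadoop fs -ls /wcoutput
hadoop fs -cat /wcoutput/part-r-00000
```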