Chapter 1 foreword (must read)
1.1 a brief overview of the Hadoop ecosystem
Note: Hadoop itself is just a platform for storing data, and MapReduce is the computing framework on top of it that requires programmers to write programs to process that data. Around them sits a whole ecosystem: HBase, Sqoop, Spark and other tools run on Hadoop to make use of the data stored there. HBase is a distributed database that writes its data to Hadoop and reads it back through Hadoop (OpenTSDB, used later in this article, stores time-series data in it). ZooKeeper is used to coordinate these services. In short, the Hadoop ecosystem involves many components and looks complicated to learn, but it is not difficult: read the official documentation and get the concepts straight.
1.2 purpose of this article
If you want to build a multi-node Hadoop cluster on one computer, the traditional way is to run multiple virtual machines. However, that approach consumes a lot of resources, and the number of virtual machines a laptop can run at the same time is very limited. This is where Docker comes in. A Docker container can be thought of as a lightweight virtual machine that uses far fewer resources; in day-to-day use it feels much like working with VMware or VirtualBox.
This article can't make you proficient in Docker, but it can let you get started quickly and build this cluster. The premise is that you know something about Hadoop and Linux.
Chapter 2 preparation
2.1 prepare Docker host
Currently, Docker only runs on 64-bit Linux with kernel version 3.10 or above. The Linux system on which Docker is installed is called the Docker host. If your system does not meet the requirements, first install a qualifying virtual machine and run Docker inside it. My laptop runs Windows, so the demonstration uses a 64-bit CentOS 7.3 virtual machine. Since all 10 nodes must run inside this one virtual machine, do not allocate too few resources to it, or you will run into problems. Docker saves a lot of resources compared with virtual machines, but Hadoop still needs what Hadoop needs. I allocated 2 CPU cores and 4 GB of memory, which is still far less than 10 separate virtual machines would require. Since only one Linux system is needed, installing Linux as a dual-boot system on the laptop would also work.
2.2 preparation of relevant software
JDK: jdk-8u181-linux-x64.tar.gz
Hadoop: hadoop-2.8.0.tar.gz
Download and extract the JDK and Hadoop in advance and put them on the Docker host for later use.
Chapter 3 installing Docker
Docker is now split into a community edition, Docker CE, and an enterprise edition, Docker EE. Docker CE is free, while Docker EE is paid. Docker CE comes in two channels: Docker CE Edge, which releases a version every month, and Docker CE Stable, which releases a version every three months. Stable is the stable channel, so only Docker CE Stable is covered below.
3.1 install yum-utils
yum install -y yum-utils
3.2 add yum source for Docker CE
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
3.3 update yum package index
yum makecache fast
3.4 installing Docker CE
yum install docker-ce -y
3.5 start Docker
systemctl start docker
3.6 viewing Docker version
docker -v
The latest version is installed by default; here that is Docker version 17.03.1-ce, build c6d42e.
3.7 run Docker without sudo
For non-root users, most Docker commands must be run with sudo; for example, running the hello-world example without sudo produces a permission error.
Always typing sudo is tedious, so here is the fix: when Docker is installed, a user group named docker is created, and any user added to this group can run Docker without sudo.
Now create a user and add it to the docker group:
useradd zld
usermod -aG docker zld
Then log out of the current shell and log back in; from then on sudo is no longer needed.
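To verify, the hello-world test image mentioned above should now run without sudo (it simply pulls a tiny official image and prints a greeting):
docker run hello-world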
3.8 basic operation of docker image
There are many images on the Docker registry, which fall into two categories: official images maintained by Docker, and images uploaded by ordinary users. The place where Docker keeps these images is called Docker Hub.
1. Search image
docker search centos
An [OK] in the OFFICIAL column of the results list indicates an official image; the first result shown here is the official one.
3.9 create and run a container
docker run -it -h pseudo-distributed --name pseudo-distributed centos
Explanation:
docker run means to create a container and run it.
-it means that after the container starts, you are dropped directly into a command line inside it; at this point the container can be operated much like a virtual machine.
-h specifies the hostname of the container, analogous to a virtual machine's hostname. If not specified, Docker uses the CONTAINER ID as the hostname.
--name specifies the name of the container. If you don't specify one yourself, Docker assigns a name automatically, but a self-chosen name is more meaningful.
The hostname and the container name do not have to be the same; here I made them both pseudo-distributed.
The last parameter centos is the image name, which indicates which image the container is created with.
This process is similar to loading a system with ISO files.
3.10 exit the current container and keep it running
Shortcut: Ctrl+P followed by Ctrl+Q
3.11 enter a running container
docker attach pseudo-distributed
3.12 exit the current container and stop its operation
exit
3.13 start a stopped container
docker start -i pseudo-distributed
3.14 stop a running container
docker stop pseudo-distributed
Chapter 4 build Hadoop in pseudo-distributed mode
We first use the container created earlier to build Hadoop pseudo distribution mode for testing. After the test is successful, we can build a fully distributed cluster.
4.1 SSH
This centos container is a very minimal system; many things are missing and have to be installed by hand.
Hadoop needs SSH, but the container does not ship with it, so we install it ourselves.
Install SSH
yum -y install openssh-clients openssh-server
Generate the 3 host key files (press Enter at every prompt):
ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key
ssh-keygen -t ecdsa -f /etc/ssh/ssh_host_ecdsa_key
ssh-keygen -t ed25519 -f /etc/ssh/ssh_host_ed25519_key
Start sshd:
/usr/sbin/sshd
Change the root password
Because we don't know the default password, let's reset it.
passwd root
Set up passwordless SSH login to this machine:
ssh-keygen
Press Enter at every prompt.
ssh-copy-id localhost
Enter the root password when prompted.
ssh localhost
If no password is asked for, passwordless login works.
exit
This returns you to the previous shell.
4.2 which
Running Hadoop requires the which command. Again, the container does not ship with it, so we install it ourselves.
yum -y install which
4.3 copy files into the container
Next, copy the JDK and Hadoop prepared earlier from the host into the container. Note that the copy is performed on the Docker host.
docker cp /opt/jdk1.8.0_181/ pseudo-distributed:/root/
docker cp /opt/hadoop-2.8.0/ pseudo-distributed:/root/
In the container, you can see that JDK and Hadoop have been copied in place.
4.4 configure environment variables and start sshd when the container starts
Create a new run.sh file under /etc/profile.d/ and write the following 6 lines into it:
export JAVA_HOME=/root/jdk1.8.0_181
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/root/hadoop-2.8.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
/usr/sbin/sshd
Exit the container with the exit command, restart and enter the container. The environment variables configured above will take effect and sshd will start.
4.5 hadoop pseudo distributed configuration
1. Configure hadoop-env.sh
Replace ${JAVA_HOME} in the line export JAVA_HOME=${JAVA_HOME} with the concrete path, here export JAVA_HOME=/root/jdk1.8.0_181.
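If you prefer to do this non-interactively, a possible one-liner is shown below; it assumes the stock hadoop-env.sh still contains a line starting with export JAVA_HOME=:
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/root/jdk1.8.0_181|' $HADOOP_HOME/etc/hadoop/hadoop-env.sh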
2. Configure core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
3. Configure hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
4. Configure mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
5. Configure yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
6. Start the pseudo-distributed cluster and run the wordcount example program
Prepare the test data:
Create an input folder under /root and a test.txt file inside it, write some words into the file, and then make several more copies of it. I made five copies here.
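A minimal sketch of preparing this test data (the file names and the words are just examples):
mkdir /root/input
echo "hello hadoop hello docker" > /root/input/test.txt
for i in 1 2 3 4 5; do cp /root/input/test.txt /root/input/test_$i.txt; done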
4.6 format namenode
hdfs namenode -format
4.7 start HDFS
start-dfs.sh
4.8 start YARN
start-yarn.sh
4.9 check whether relevant processes are started
jps
If there are the following five processes, it indicates that the startup is successful.
DataNode
NodeManager
NameNode
SecondaryNameNode
ResourceManager
4.10 copy test data into HDFS
hdfs dfs -put /root/input /
4.11 running wordcount sample program
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar wordcount /input /output
4.12 viewing output results
hdfs dfs -cat /output/part-r-00000
The output shows the word counts as expected; the pseudo-distributed test is complete.
Chapter 5 build a fully distributed Hadoop cluster
5.1 cluster planning
1 NameNode node
1 SecondaryNameNode node
1 ResourceManager node
1 JobHistory node
5 Slave nodes
1 Client node
The Slave node includes two roles: DataNode and NodeManager.
The Client node is the node we operate from; as far as possible, all operations are carried out on it.
There are 10 nodes and 7 roles.
5.2 package the pseudo-distributed container above into an image
In theory we would just make 10 copies of the pseudo-distributed container above and change the configuration files. However, Docker containers cannot be copied directly: the container has to be committed to an image, and that image is then used to create 10 new containers.
The command is as follows:
docker commit -a "zld" -m "the pseudo distribution mode built by Hadoop on CentOS." pseudo-distributed hadoop-centos:v1
-a indicates the author.
-m indicates the description of the image.
pseudo-distributed is the name of the container being packaged.
hadoop-centos:v1 is the name and tag of the image being created.
Note that since the packaged container was created from the centos image, the new image committed from it also contains the centos image layers.
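You can confirm that the new image exists, and see that its size reflects the centos base layers, with:
docker images
docker history hadoop-centos:v1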
5.3 creating a network
docker network create hadoop_nw
This creates a new network named hadoop_nw; the 10 Hadoop node containers will be attached to it so they can communicate with each other. There is also no need to configure hosts files: the containers can reach each other directly by container name.
docker network ls
This command lists all networks. Apart from the hadoop_nw network we just created, the others are generated automatically when Docker is installed and can be ignored in this article.
5.4 create 10 containers with the newly generated image
docker run -itd --network hadoop_nw -h namenode --name namenode -p 50070:50070 hadoop-centos:v1
docker run -itd --network hadoop_nw -h secondarynamenode --name secondarynamenode hadoop-centos:v1
docker run -itd --network hadoop_nw -h resourcemanager --name resourcemanager -p 8088:8088 hadoop-centos:v1
docker run -itd --network hadoop_nw -h jobhistory --name jobhistory -p 19888:19888 hadoop-centos:v1
docker run -itd --network hadoop_nw -h slave1 --name slave1 hadoop-centos:v1
docker run -itd --network hadoop_nw -h slave2 --name slave2 hadoop-centos:v1
docker run -itd --network hadoop_nw -h slave3 --name slave3 hadoop-centos:v1
docker run -itd --network hadoop_nw -h slave4 --name slave4 hadoop-centos:v1
docker run -itd --network hadoop_nw -h slave5 --name slave5 hadoop-centos:v1
docker run -itd --network hadoop_nw -h client --name client hadoop-centos:v1
-itd allocates a terminal but does not attach to it, so the container runs in the background.
--network specifies which network the container joins.
-p specifies a port mapping.
As you can see above, the namenode, resourcemanager and jobhistory nodes have port mappings. A port mapping maps a port of the Docker host to a port of the container, so that accessing that host port indirectly reaches the corresponding container port, much like reaching a machine on an intranet from the Internet. We will use this later to view cluster information in a browser.
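A quick way to double-check both the port mappings and the name resolution (a sketch; exact output varies):
docker port namenode          # on the Docker host, lists mappings such as 50070/tcp -> 0.0.0.0:50070
docker attach client          # enter the client container
getent hosts namenode         # inside the container, should print the namenode container's IP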
5.5 modifying Hadoop configuration files
We modify it in the client node and then copy it to other nodes.
1. Configure core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/root/hadoop-2.8.0/data</value>
  </property>
</configuration>
2. Configure hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>secondarynamenode:50090</value>
  </property>
</configuration>
3. Configure mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>jobhistory:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>jobhistory:19888</value>
  </property>
</configuration>
4. Configure yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager</value>
  </property>
</configuration>
5. Configure the slaves file
cat slaves
slave1
slave2
slave3
slave4
slave5
6. Configure passwordless SSH login between nodes
On the client node:
ssh-copy-id namenode
ssh-copy-id secondarynamenode
ssh-copy-id resourcemanager
ssh-copy-id jobhistory
ssh-copy-id slave1
ssh-copy-id slave2
ssh-copy-id slave3
ssh-copy-id slave4
ssh-copy-id slave5
On the resourcemanager node:
ssh-copy-id slave1
ssh-copy-id slave2
ssh-copy-id slave3
ssh-copy-id slave4
ssh-copy-id slave5
7. Copy the configuration to all nodes
On the client node:
scp -r $HADOOP_HOME/etc/hadoop namenode:$HADOOP_HOME/etc
scp -r $HADOOP_HOME/etc/hadoop secondarynamenode:$HADOOP_HOME/etc
scp -r $HADOOP_HOME/etc/hadoop resourcemanager:$HADOOP_HOME/etc
scp -r $HADOOP_HOME/etc/hadoop jobhistory:$HADOOP_HOME/etc
scp -r $HADOOP_HOME/etc/hadoop slave1:$HADOOP_HOME/etc
scp -r $HADOOP_HOME/etc/hadoop slave2:$HADOOP_HOME/etc
scp -r $HADOOP_HOME/etc/hadoop slave3:$HADOOP_HOME/etc
scp -r $HADOOP_HOME/etc/hadoop slave4:$HADOOP_HOME/etc
scp -r $HADOOP_HOME/etc/hadoop slave5:$HADOOP_HOME/etc
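The same copy can also be written as a loop, which is easier to extend if the node list changes (same nodes as above):
for node in namenode secondarynamenode resourcemanager jobhistory slave1 slave2 slave3 slave4 slave5; do
  scp -r $HADOOP_HOME/etc/hadoop $node:$HADOOP_HOME/etc
done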
5.6 start Hadoop cluster
On the client node:
1. Format namenode
ssh namenode "hdfs namenode -format"
2. Start HDFS cluster
start-dfs.sh
3. Start YARN cluster
ssh resourcemanager "start-yarn.sh"
4. Start JobHistory
ssh jobhistory "mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver"
5.7 viewing cluster information in browser
Since port mappings were specified for the corresponding containers, the web UIs can be reached from a browser on my Windows machine by visiting the corresponding ports of the Docker host.
HDFS http://Docker Host IP:50070/
YARN http://Docker Host IP:8088/
jobhistory http://Docker Host IP:19888/
The web UIs confirm that the cluster is healthy.
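If you prefer the command line, a quick health check can also be done from the client node (a sketch; the exact figures depend on your environment):
hdfs dfsadmin -report | head -n 20    # should report 5 live datanodes
yarn node -list                        # should list 5 running NodeManagers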
5.8 running wordcount sample program
1. Copy test data to HDFS
hdfs dfs -put /root/input /
2. Run the wordcount sample program
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar wordcount /input /output
3. View output results
hdfs dfs -cat /output/part-r-00000
Chapter 6 build a ZooKeeper cluster with Docker
6.1 creating centos container
docker run -it -h pseudo-distributed --name pseudo-distributed centos
6.2 enter the running container
docker attach pseudo-distributed
6.3 SSH
This centos container is a very minimal system; many things are missing and have to be installed by hand.
Hadoop needs SSH, but the container does not ship with it, so we install it ourselves.
Install SSH
yum -y install openssh-clients openssh-server
Generate the 3 host key files (press Enter at every prompt):
ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key
ssh-keygen -t ecdsa -f /etc/ssh/ssh_host_ecdsa_key
ssh-keygen -t ed25519 -f /etc/ssh/ssh_host_ed25519_key
Start sshd:
/usr/sbin/sshd
Change the root password
Because we don't know the default password, let's reset it.
passwd root
Set up passwordless SSH login to this machine:
ssh-keygen
Press Enter at every prompt.
ssh-copy-id localhost
Enter the root password when prompted.
ssh localhost
If no password is asked for, passwordless login works.
exit
This returns you to the previous shell.
6.4 which
Running Hadoop requires the which command. Again, the container does not ship with it, so we install it ourselves.
yum -y install which
6.5 copy files into the container
Next, copy the JDK and ZooKeeper prepared earlier from the host into the container. Note that the copy is performed on the Docker host.
docker cp /opt/jdk1.8.0_181/ pseudo-distributed:/root/
docker cp /opt/zookeeper-3.4.10/ pseudo-distributed:/root/
In the container, you can see that the JDK and zookeeper have been copied in place.
6.6 configure environment variables and start sshd when the container starts
Create a new run.sh file under /etc/profile.d/ and write the following 6 lines into it:
export JAVA_HOME=/root/jdk1.8.0_181
export PATH=$PATH:$JAVA_HOME/bin:/root/zookeeper-3.4.10/bin
export HADOOP_HOME=/root/hadoop-2.8.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
/usr/sbin/sshd
Exit the container with the exit command, restart and enter the container. The environment variables configured above will take effect and sshd will start.
6.7 packaging image
docker commit -a "zld" -m "Hadoop stay centos Pseudo distribution mode built on." pseudo-distributed zookeeper-centos:v1
6.8 creating networks
docker network create hadoop_nw
This creates a network named hadoop_nw (if it already exists from Chapter 5, this step can be skipped); the 3 ZooKeeper node containers are then attached to it so they can communicate with each other, and there is no need to configure hosts files because containers can reach each other directly by container name.
docker network ls
This command lists all networks. Apart from the hadoop_nw network we just created, the others are generated automatically when Docker is installed and can be ignored in this article.
6.9 create 3 containers with the newly generated image
docker run -itd --network hadoop_nw -h zookeeper01 --name zookeeper01 zookeeper-centos:v1
docker run -itd --network hadoop_nw -h zookeeper02 --name zookeeper02 zookeeper-centos:v1
docker run -itd --network hadoop_nw -h zookeeper03 --name zookeeper03 zookeeper-centos:v1
6.10 modify zookeeper configuration file
docker attach zookeeper01
cd /root/zookeeper-3.4.10
mkdir data
mkdir logs
vi /root/zookeeper-3.4.10/conf/zoo.cfg
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/root/zookeeper-3.4.10/data
dataLogDir=/root/zookeeper-3.4.10/logs
# the port at which the clients will connect
clientPort=2181
server.1=zookeeper01:2888:3888
server.2=zookeeper02:2888:3888
server.3=zookeeper03:2888:3888
Create the myid file; its only content is this node's server id, which is 1 on zookeeper01:
vi /root/zookeeper-3.4.10/data/myid
1
6.11 copying configuration files to other nodes
scp -r /root/zookeeper-3.4.10/ zookeeper02:/root/
scp -r /root/zookeeper-3.4.10/ zookeeper03:/root/
6.12 modify the id of other nodes
On the zookeeper02 node, set the id to 2:
vi /root/zookeeper-3.4.10/data/myid
2
On the zookeeper03 node, set the id to 3:
vi /root/zookeeper-3.4.10/data/myid
3
6.13 start zookeeper on each of the three nodes
zkServer.sh start
6.14 viewing the status of zookeeper
Run this on all three hosts; the status differs between them: one leader and two followers.
zkServer.sh status
With this, the ZooKeeper cluster installation is complete.
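As an optional sanity check, the zkCli.sh client shipped with ZooKeeper (already on the PATH via run.sh) can connect to any node:
zkCli.sh -server zookeeper01:2181
ls /
quit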
Chapter 7 build a fully distributed HBase cluster
Note: HBase runs on Hadoop, so there is no need to start new containers. In this article the HBase cluster runs on the Hadoop slave nodes.
7.1 copy the HBase directory to slave1
Note: performed on the Docker host.
Required package: hbase-1.3.0-bin.tar.gz
Download it and extract it to the /opt/ directory.
Copy hbase directory to slave1
docker cp /opt/hbase-1.3.0/ slave1:/root/
7.2 enter slave1 and change hbase configuration
Note: performed inside the slave1 container.
vi /root/hbase-1.3.0/conf/hbase-env.sh
# Edit hbase-env.sh: set JAVA_HOME and disable HBase's built-in ZooKeeper so that the external ZooKeeper cluster is used
export JAVA_HOME=/root/jdk1.8.0_181
export HBASE_MANAGES_ZK=false
# Note: either point HADOOP_CONF_DIR at Hadoop's conf directory here, or copy Hadoop's core-site.xml and hdfs-site.xml into HBase's conf directory
export HADOOP_HOME=/root/hadoop-2.8.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
# Edit hbase-site.xml. In particular, if Hadoop runs in HA mode, hbase.rootdir must be a directory under the fs.defaultFS value from core-site.xml: for example, if fs.defaultFS is hdfs://bdcluster, then hbase.rootdir should be hdfs://bdcluster/hbase. With a single NameNode, set hbase.rootdir to hdfs://hostname:port/hbase.
<configuration> <property> <name>hbase.rootdir</name> <value>hdfs://namenode:9000/hbase</value> </property> <property> <name>hbase.cluster.distributed</name> <value>true</value> </property> <property> <name>hbase.zookeeper.quorum</name> <value>zookeeper01,zookeeper02,zookeeper03</value> </property> <property> <name>hbase.zookeeper.property.dataDir</name> <value>/disk/hbase-1.3.0/tmp/zk/data</value> </property> <property> <name>hbase.master.info.port</name> <value>60010</value> </property> <property> <name>hbase.regionserver.info.port</name> <value>60012</value> </property> </configuration>
vim regionservers
slave1
slave2
slave3
slave4
slave5
7.3 copy HBase to the other machines
Note: performed inside the slave1 container.
scp -r /root/hbase-1.3.0/ root@slave2:/root/
scp -r /root/hbase-1.3.0/ root@slave3:/root/
scp -r /root/hbase-1.3.0/ root@slave4:/root/
scp -r /root/hbase-1.3.0/ root@slave5:/root/
7.4 start hbase cluster in slave1 container
start-hbase.sh
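An optional sanity check with the HBase shell on slave1 (assuming /root/hbase-1.3.0/bin is on the PATH, as configured in section 8.4, or use the full path to the script):
hbase shell
status
exit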
With this, the HBase cluster is successfully built. All that remains is to install OpenTSDB (next chapter).
Chapter 8 build OpenTSDB
Note: OpenTSDB is used to write data into HBase; here it is built on one of the slave nodes.
8.1 copy the OpenTSDB directory to slave1
Note: performed on the Docker host.
Required package: opentsdb-2.3.0.tar.gz
Download it and extract it to the /opt/ directory.
cd /opt/opentsdb-2.3.0
mkdir build
cp -r third_party ./build
./build.sh
cd build/
make install
Copy the opentsdb directory to slave1
docker cp /opt/opentsdb-2.3.0/ slave1:/root/
8.2 create the OpenTSDB tables in HBase
env COMPRESSION=NONE HBASE_HOME=/root/hbase-1.3.0 /root/opentsdb-2.3.0/src/create_table.sh
8.3 configuration
cp /root/opentsdb-2.3.0/src/opentsdb.conf /root/opentsdb-2.3.0/build/opentsdb.conf
vim /root/opentsdb-2.3.0/build/opentsdb.conf
tsd.network.port = 8000
tsd.http.staticroot = /root/opentsdb-2.3.0/build/staticroot
tsd.http.cachedir = /tmp/tsd
tsd.core.auto_create_metrics = true
tsd.storage.hbase.zk_quorum = zookeeper01:2181,zookeeper02:2181,zookeeper03:2181
tsd.http.request.enable_chunked = true
tsd.http.request.max_chunk = 67108863
8.4 configure environment variables for OpenTSDB
Add the HBase bin directory to the PATH in /etc/profile.d/run.sh, then reload it:
cat /etc/profile.d/run.sh
export PATH=$PATH:$JAVA_HOME/bin:/root/hbase-1.3.0/bin
source /etc/profile.d/run.sh
8.5 startup
tsdb tsd --config=/root/opentsdb-2.3.0/build/opentsdb.conf > /tmp/opentsdb.log 2>&1 &
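To confirm the TSD is up, you can query its HTTP API and write a test data point (a sketch; the metric name, timestamp and tag are placeholders, and curl may need to be installed with yum first):
curl http://localhost:8000/api/version
curl -X POST http://localhost:8000/api/put -H 'Content-Type: application/json' \
  -d '{"metric":"sys.test","timestamp":1530000000,"value":42,"tags":{"host":"slave1"}}'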