Building a Hadoop cluster with Docker (pseudo-distributed and fully distributed)

Chapter 1 foreword (must read)

1.1 a brief description of the Hadoop ecosystem

Note: Hadoop itself is essentially a platform for storing data, and MapReduce is a computing framework that requires programmers to write programs to process that data. Around them has grown an ecosystem: HBase, Sqoop, Shark and other tools run on top of Hadoop to make use of the data stored there. HBase is a distributed, column-oriented database that writes data to and reads data from storage on Hadoop (OpenTSDB, used later in this article, builds a time series store on top of it). ZooKeeper is also used to coordinate these components. In short, the Hadoop ecosystem involves a lot of pieces and looks complex to learn, but it is not difficult: read the official documentation and keep the concepts clear.

1.2 purpose of this article

If you want to build a multi-node Hadoop cluster on one computer, the traditional way is to use multiple virtual machines. However, that approach consumes a lot of resources, and the number of virtual machines a laptop can run at the same time is very limited. Docker can be used instead. Docker can be regarded as a lightweight virtual machine that occupies far fewer resources; in practice it can be used much like VMware or VirtualBox.
This article will not make you proficient in Docker, but it will get you started quickly and let you build this cluster. The prerequisite is that you already know something about Hadoop and Linux.

Chapter 2 preparation

2.1 prepare Docker host

Currently, Docker only runs on 64-bit Linux with kernel version 3.10 or above. The Linux system where Docker is installed is called the Docker host. If your system does not meet the requirements, you can first install a qualifying virtual machine and then use Docker inside it. My laptop runs Windows, and I use a 64-bit CentOS 7.3 virtual machine for the demonstration. Since all 10 nodes must run on this one virtual machine, the resources allocated to it cannot be too small, otherwise there will be problems. Although Docker saves a lot of resources compared with virtual machines, Hadoop still needs what it needs. I allocated 2 cores and 4 GB of memory, which is still much less than running 10 virtual machines. Because only one Linux system is required, installing a dual-boot system on the laptop would also work.
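Before installing, you can quickly confirm that the host meets these requirements; the commands below are standard CentOS utilities:
uname -r    # kernel version, should be 3.10 or newer
uname -m    # architecture, should print x86_64
cat /etc/redhat-release    # CentOS release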

2.2 preparation of relevant software

JDK: jdk-8u181-linux-x64.tar.gz
Hadoop: hadoop-2.8.0.tar.gz
Download and extract the JDK and Hadoop in advance and put them on the Docker host for later use.
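A minimal sketch of preparing them on the host, assuming the two archives were downloaded into /opt (the download location and mirror are up to you; the resulting paths match the docker cp commands used later):
cd /opt
tar -zxvf jdk-8u181-linux-x64.tar.gz     # produces /opt/jdk1.8.0_181
tar -zxvf hadoop-2.8.0.tar.gz            # produces /opt/hadoop-2.8.0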

Chapter 3 installing Docker

Docker is now split into a community edition, Docker CE, and an enterprise edition, Docker EE. Docker CE is free, while Docker EE is paid. Docker CE has two release channels: Docker CE Edge, which releases a version every month, and Docker CE Stable, which releases a version every three months. Stable means the stable release, so only Docker CE Stable is covered below.

3.1 install Yum utils

yum install -y yum-utils

3.2 add yum source for Docker CE

yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

3.3 update yum package index

yum makecache fast

3.4 installing Docker CE

yum install docker-ce -y

3.5 start Docker

systemctl start docker

3.6 viewing Docker version

docker -v
The latest version is installed by default. Here we can see docker version 17.03.1-ce, build c6d42e.
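To verify that the installation works end to end, you can run Docker's official hello-world image (this is the hello-world example referred to in the next section):
docker run hello-world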

3.7 remove sudo

For non-root users, many Docker commands must be executed with sudo. For example, running the hello-world example above without sudo fails with a permission-denied error on the Docker daemon socket.

Always typing sudo is troublesome, so here is the solution. When Docker is installed, a user group called docker is created. Add the user you work with to this group and sudo is no longer needed.
Now add a user and put it in the docker group:
useradd zld
usermod -aG docker zld
Then exit the current shell and log in again; from now on sudo is not needed.

3.8 basic operation of docker image

There are many images on the Docker registry, divided into two types: official images maintained by Docker, and images uploaded by ordinary users. The place where Docker stores these images is called Docker Hub.

1. Search image

docker search centos

[OK] in the OFFICIAL column of the search results indicates an official image; the first result we see is the official one.
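If you want to download the image before creating a container (docker run will also pull it automatically when it is missing locally), pull it explicitly:
docker pull centos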

3.9 create and run a container

docker run -it -h pseudo-distributed --name pseudo-distributed centos

Explanation:
docker run means to create a container and run it.
-it means a terminal is attached and you drop straight into the container's command line once it is running; at that point the container can be operated like a virtual machine.
-h specifies the hostname of the container, equivalent to a virtual machine's hostname. If not specified, Docker uses the CONTAINER ID as the hostname.
--name specifies the name of the container. As mentioned earlier, if you don't pick one yourself, Docker assigns a name automatically, but choosing your own is more meaningful.
The hostname and the container name do not have to match; here I made them both pseudo-distributed.
The last parameter, centos, is the image name, indicating which image the container is created from.
The process is similar to installing a system from an ISO file.

3.10 exit the current container and keep it running

Shortcut: Ctrl+p followed by Ctrl+q

3.11 enter a running container

docker attach pseudo-distributed

3.12 exit the current container and stop it

exit

3.13 start a stopped container

docker start -i pseudo-distributed

3.14 stop a running container

docker stop pseudo-distributed

Chapter 4 build Hadoop in pseudo-distributed mode

We first use the container created earlier to build Hadoop in pseudo-distributed mode for testing. Once that test succeeds, we will build the fully distributed cluster.

4.1 SSH

This centos container can be regarded as a very minimal system; many tools are missing and must be installed yourself.
Hadoop requires SSH, but the container does not come with it, so we need to install it.
Install SSH
yum -y install openssh-clients openssh-server

Generate 3 key files
ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key
Press Enter at every prompt

ssh-keygen -t ecdsa -f /etc/ssh/ssh_host_ecdsa_key
Press Enter at every prompt

ssh-keygen -t ed25519 -f /etc/ssh/ssh_host_ed25519_key
Press Enter at every prompt

Start sshd
/usr/sbin/sshd

Change the root password
Because we don't know the default password, let's reset it.
passwd root

Set up passwordless SSH login to this machine
ssh-keygen
Press Enter at every prompt

ssh-copy-id localhost
Enter the root password

ssh localhost
Passwordless login succeeded
exit
Go back to the shell.

4.2 which

The which command is required to run hadoop. Similarly, the container does not come with it, so we need to install it.
yum -y install which

4.3 copy files into the container

Next, copy the JDK and Hadoop prepared in advance from the host into the container. Note that the copy operation is performed on the Docker host.
docker cp /opt/jdk1.8.0_181/ pseudo-distributed:/root/
docker cp /opt/hadoop-2.8.0/ pseudo-distributed:/root/
Inside the container you can see that the JDK and Hadoop have been copied into place.
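You can also verify the copy from the Docker host without attaching to the container; docker exec runs a command inside a running container:
docker exec pseudo-distributed ls /root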

4.4 configure environment variables and start sshd when the container starts

Create a new run.sh file in /etc/profile.d/ and write the following 6 lines into it:
export JAVA_HOME=/root/jdk1.8.0_181
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/root/hadoop-2.8.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
/usr/sbin/sshd

Exit the container with the exit command, then start it again and re-enter it. The environment variables configured above now take effect and sshd starts automatically.
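After re-entering the container, a quick sanity check that the environment variables took effect (both binaries come from the directories copied in 4.3):
java -version
hadoop version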

4.5 hadoop pseudo distributed configuration

1. Configure hadoop-env.sh

Replace ${JAVA_HOME} in the line export JAVA_HOME=${JAVA_HOME} with the concrete path, here export JAVA_HOME=/root/jdk1.8.0_181.
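If you prefer to make this change non-interactively, a one-line sed edit works (assuming the file still contains its default export JAVA_HOME line):
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/root/jdk1.8.0_181|' $HADOOP_HOME/etc/hadoop/hadoop-env.sh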

2. Configure core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

3. Configure hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

4. Configure mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

5. Configure yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

6. Start the pseudo distributed cluster and run the wordcount sample program

Prepare test data:
Create a new input folder under the /root directory, create a test.txt file inside it, write some words in it, and then make several more copies of the file. I made five copies here.
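A minimal sketch of preparing that test data (file names and word content are arbitrary):
mkdir /root/input
echo "hello hadoop hello docker" > /root/input/test.txt
for i in 1 2 3 4 5; do cp /root/input/test.txt /root/input/test$i.txt; done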

4.6 format namenode

hdfs namenode -format

4.7 start HDFS

start-dfs.sh

4.8 start YARN

start-yarn.sh

4.9 check whether relevant processes are started

jps
If there are the following five processes, it indicates that the startup is successful.
DataNode
NodeManager
NameNode
SecondaryNameNode
ResourceManager

4.10 copy test data into HDFS

hdfs dfs -put /root/input /

4.11 running wordcount sample program

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar wordcount /input /output

4.12 viewing output results

hdfs dfs -cat /output/part-r-00000
The output is correct. The pseudo-distributed test is complete.

Chapter 5 building a fully distributed Hadoop cluster

5.1 cluster planning

1 NameNode node
1 SecondaryNameNode node
1 ResourceManager node
1 JobHistory node
5 Slave nodes
1 Client node

Each Slave node runs two roles: DataNode and NodeManager.
The Client node is the node used for day-to-day operations; as far as possible, perform all operations on this node.
That makes 10 nodes and 7 roles in total.

5.2 package the above pseudo distributed container into an image

In theory we would simply copy the pseudo-distributed container above 10 times and then change the configuration files. However, Docker containers cannot be copied directly: the container has to be committed to an image, and that image is then used to create 10 new containers.
The command is as follows:
docker commit -a "zld" -m "the pseudo distribution mode built by Hadoop on CentOS." pseudo-distributed hadoop-centos:v1
-a indicates the author.
-m is the description of the image.
pseudo-distributed is the name of the container being packaged.
hadoop-centos:v1 is the name and tag of the generated image.
Note that since the packaged container was created from the centos image, the new image also contains the centos image layers.
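You can confirm that the new image exists by listing the local images:
docker images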

5.3 creating a network

docker network create hadoop_nw
This creates a new network named hadoop_nw; the 10 Hadoop node containers will then be added to it so they can communicate with each other. There is also no need to configure a hosts file: containers can reach each other directly by container name.
docker network ls
With this command you can view all networks. Apart from the hadoop_nw network we just created, the rest are generated automatically when Docker is installed and can be ignored in this article.
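Later, if you want to see which containers have joined the network and what addresses they were given, docker network inspect shows them:
docker network inspect hadoop_nw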

5.4 create 10 containers with the newly generated image

docker run -itd --network hadoop_nw -h namenode --name namenode -p 50070:50070 hadoop-centos:v1

docker run -itd --network hadoop_nw -h secondarynamenode --name secondarynamenode hadoop-centos:v1

docker run -itd --network hadoop_nw -h resourcemanager --name resourcemanager -p 8088:8088 hadoop-centos:v1

docker run -itd --network hadoop_nw -h jobhistory --name jobhistory -p 19888:19888 hadoop-centos:v1

docker run -itd --network hadoop_nw -h slave1 --name slave1 hadoop-centos:v1

docker run -itd --network hadoop_nw -h slave2 --name slave2 hadoop-centos:v1

docker run -itd --network hadoop_nw -h slave3 --name slave3 hadoop-centos:v1

docker run -itd --network hadoop_nw -h slave4 --name slave4 hadoop-centos:v1

docker run -itd --network hadoop_nw -h slave5 --name slave5 hadoop-centos:v1

docker run -itd --network hadoop_nw -h client --name client hadoop-centos:v1

-itd means the container starts with a terminal allocated but you do not enter it (it runs in the background).
--network indicates which network to join.
-p specifies a port mapping.
As can be seen above, the three nodes namenode, resourcemanager and jobhistory have port mappings. A port mapping maps a port of the Docker host to a port of the container, so we can reach the container port indirectly by accessing that port on the Docker host, much like reaching an intranet machine from the Internet. We will use this later when viewing cluster information in a browser.

5.5 modifying Hadoop configuration files

We modify them on the client node and then copy them to the other nodes.

1. Configure core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/root/hadoop-2.8.0/data</value>
    </property>
</configuration>

2. Configure hdfs-site.xml

<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>secondarynamenode:50090</value>
    </property>
</configuration>

3. Configure mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>jobhistory:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>jobhistory:19888</value>
    </property>
</configuration>

4. Configure yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>resourcemanager</value>
    </property>
</configuration>

5. Configure the slaves file (a non-interactive way to write it is sketched after the listing below)

cat slaves
slave1
slave2
slave3
slave4
slave5
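The slaves file lives in $HADOOP_HOME/etc/hadoop/. A non-interactive sketch of writing it on the client node (same five names as above):
cat > $HADOOP_HOME/etc/hadoop/slaves <<EOF
slave1
slave2
slave3
slave4
slave5
EOF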

6. Configure passwordless SSH login between nodes

On the client node:

ssh-copy-id namenode
ssh-copy-id secondarynamenode
ssh-copy-id resourcemanager
ssh-copy-id jobhistory
ssh-copy-id slave1
ssh-copy-id slave2
ssh-copy-id slave3
ssh-copy-id slave4
ssh-copy-id slave5

On the resourcemanager node:

ssh-copy-id slave1
ssh-copy-id slave2
ssh-copy-id slave3
ssh-copy-id slave4
ssh-copy-id slave5

7. Copy the configuration to all nodes (a loop version is sketched after the commands)

On the client node:

scp -r $HADOOP_HOME/etc/hadoop namenode:$HADOOP_HOME/etc
scp -r $HADOOP_HOME/etc/hadoop secondarynamenode:$HADOOP_HOME/etc
scp -r $HADOOP_HOME/etc/hadoop resourcemanager:$HADOOP_HOME/etc
scp -r $HADOOP_HOME/etc/hadoop jobhistory:$HADOOP_HOME/etc
scp -r $HADOOP_HOME/etc/hadoop slave1:$HADOOP_HOME/etc
scp -r $HADOOP_HOME/etc/hadoop slave2:$HADOOP_HOME/etc
scp -r $HADOOP_HOME/etc/hadoop slave3:$HADOOP_HOME/etc
scp -r $HADOOP_HOME/etc/hadoop slave4:$HADOOP_HOME/etc
scp -r $HADOOP_HOME/etc/hadoop slave5:$HADOOP_HOME/etc
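The nine scp commands above can also be written as a short loop over the node names (same source and destination paths):
for node in namenode secondarynamenode resourcemanager jobhistory slave1 slave2 slave3 slave4 slave5; do
  scp -r $HADOOP_HOME/etc/hadoop $node:$HADOOP_HOME/etc
done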

5.6 start Hadoop cluster

On the client node:

1. Format namenode

ssh namenode "hdfs namenode -format"

2. Start HDFS cluster

start-dfs.sh

3. Start YARN cluster

ssh resourcemanager "start-yarn.sh"

4. Start JobHistory

ssh jobhistory "mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver"

5.7 viewing cluster information in browser

Since port mappings were specified for the corresponding containers, these pages can be opened from a browser on my Windows machine by visiting the corresponding port of the Docker host.
HDFS: http://<Docker host IP>:50070/
YARN: http://<Docker host IP>:8088/
JobHistory: http://<Docker host IP>:19888/
The web pages show that the cluster is running normally.
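If you do not know the Docker host's IP address, you can print it on the host itself (the exact interface and address depend on your virtual machine's network setup):
hostname -I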

5.8 running wordcount sample program

1. Copy test data to HDFS

hdfs dfs -put /root/input /

2. Run the wordcount sample program

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar wordcount /input /output

3. View output results

hdfs dfs -cat /output/part-r-00000

Chapter 6 building a zookeeper cluster with Docker

6.1 creating centos container

docker run -it -h pseudo-distributed --name pseudo-distributed centos
(If the pseudo-distributed container from Chapter 3 still exists, remove it first with docker rm -f pseudo-distributed, or choose a different name.)

6.2 enter the running container

docker attach  pseudo-distributed

6.3 SSH

This centos container can be regarded as a very minimal system; many tools are missing and must be installed yourself.
SSH is required (we will copy files between nodes with scp later), but the container does not come with it, so we need to install it.
Install SSH

yum -y install openssh-clients openssh-server

Generate 3 key files

ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key

Press Enter at every prompt

ssh-keygen -t ecdsa -f /etc/ssh/ssh_host_ecdsa_key

Press Enter at every prompt

ssh-keygen -t ed25519 -f /etc/ssh/ssh_host_ed25519_key

Press Enter at every prompt

Start sshd

/usr/sbin/sshd

Change the root password
Because we don't know the default password, let's reset it.

passwd root

Set up passwordless SSH login to this machine

ssh-keygen

Press Enter at every prompt

ssh-copy-id localhost

Enter the root password

ssh localhost

Passwordless login succeeded
exit
Go back to the shell.

6.4 which

The which command is required to run hadoop. Similarly, the container does not come with it, so we need to install it.

yum -y install which

6.5 copy files into the container

Next, copy the JDK and ZooKeeper prepared in advance from the host into the container. Note that the copy operation is performed on the Docker host.

docker cp /opt/jdk1.8.0_181/  pseudo-distributed:/root/
docker cp /opt/zookeeper-3.4.10/ pseudo-distributed:/root/

Inside the container you can see that the JDK and zookeeper have been copied into place.

6.6 configure environment variables and start sshd when the container starts

Create a new run.sh file in /etc/profile.d/ and write the following 6 lines into it:

export JAVA_HOME=/root/jdk1.8.0_181
export PATH=$PATH:$JAVA_HOME/bin:/root/zookeeper-3.4.10/bin
export HADOOP_HOME=/root/hadoop-2.8.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
/usr/sbin/sshd

Exit the container with the exit command, then start it again and re-enter it. The environment variables configured above now take effect and sshd starts automatically.

6.7 packaging image

docker commit -a "zld" -m "the pseudo distribution mode built on CentOS" pseudo-distributed zookeeper-centos:v1

6.8 creating networks

docker network create hadoop_nw

This creates a network named hadoop_nw (if it was already created in Chapter 5, this step can be skipped); the 3 zookeeper node containers will then be added to it so they can communicate with each other. Again there is no need to configure a hosts file: containers can reach each other directly by container name.

docker network ls

With this command you can view all networks. Apart from the hadoop_nw network we just created, the rest are generated automatically when Docker is installed and can be ignored in this article.

6.9 create 3 containers with the newly generated image

docker run -itd --network hadoop_nw -h zookeeper01 --name zookeeper01 zookeeper-centos:v1
docker run -itd --network hadoop_nw -h zookeeper02 --name zookeeper02 zookeeper-centos:v1
docker run -itd --network hadoop_nw -h zookeeper03 --name zookeeper03 zookeeper-centos:v1

6.10 modify zookeeper configuration file

docker attach zookeeper01
cd /root/zookeeper-3.4.10
mkdir data
mkdir logs
vi /root/zookeeper-3.4.10/conf/zoo.cfg
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/root/zookeeper-3.4.10/data
dataLogDir=/root/zookeeper-3.4.10/logs
# the port at which the clients will connect
clientPort=2181

server.1=zookeeper01:2888:3888
server.2=zookeeper02:2888:3888
server.3=zookeeper03:2888:3888
vi /root/zookeeper-3.4.10/data/myid
1
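If you prefer not to open vi, the myid file can also be written directly; the number must match this node's server.N entry in zoo.cfg:
echo 1 > /root/zookeeper-3.4.10/data/myid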

6.11 copying configuration files to other nodes

scp -r /root/zookeeper-3.4.10/ zookeeper02:/root/
scp -r /root/zookeeper-3.4.10/ zookeeper03:/root/

6.12 modify the id of other nodes

zookeeper02 node

vi /root/zookeeper-3.4.10/data/myid
2

zookeeper03 node

vi /root/zookeeper-3.4.10/data/myid
3
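The two edits above can also be done from zookeeper01 in one pass, assuming SSH between the containers works (they share the key pair baked into the image; otherwise you will be prompted for the root password):
ssh zookeeper02 'echo 2 > /root/zookeeper-3.4.10/data/myid'
ssh zookeeper03 'echo 3 > /root/zookeeper-3.4.10/data/myid'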

6.13 start zookeeper on each node

zkServer.sh start

6.14 viewing the status of zookeeper

Execute this on all three hosts; the status differs between them: one leader and two followers.
zkServer.sh status
At this point the zookeeper cluster installation is complete.

Chapter 7 build a fully distributed hbase cluster

Note: hbase runs on top of hadoop, so there is no need to start new containers. This article runs the hbase cluster on the hadoop slave nodes.

7.1 copy the hbase directory to slave1

Note: operate on the Docker host.
Required installation package: hbase-1.3.0-bin.tar.gz
Download the package and extract it to the /opt/ directory.
Copy the hbase directory to slave1:
docker cp /opt/hbase-1.3.0/ slave1:/root/

7.2 enter slave1 and change the hbase configuration

Note: operate inside the slave1 container
vi /root/hbase-1.3.0/conf/hbase-env.sh
# Edit hbase-env.sh: set the Java environment variable and disable the built-in ZooKeeper so that the external cluster is used

export JAVA_HOME=/root/jdk1.8.0_181
export HBASE_MANAGES_ZK=false

# Note: add hadoop's configuration path as an environment variable here, or copy hadoop's core-site.xml and hdfs-site.xml into hbase's conf directory

export HADOOP_HOME=/root/hadoop-2.8.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

# Change the hbase-site.xml configuration. In particular, if hadoop is configured in HA mode, hbase.rootdir must point at a directory under the fs.defaultFS value in core-site.xml; for example, if that value is hdfs://bdcluster, then hbase.rootdir should be hdfs://bdcluster/hbase. With a single namenode, set hbase.rootdir to hdfs://hostname:port/hbase.

<configuration>
<property> 
<name>hbase.rootdir</name> 
<value>hdfs://namenode:9000/hbase</value> 
</property>
<property> 
<name>hbase.cluster.distributed</name> 
<value>true</value> 
</property> 
<property> 
<name>hbase.zookeeper.quorum</name> 
<value>zookeeper01,zookeeper02,zookeeper03</value> 
</property> 
<property> 
<name>hbase.zookeeper.property.dataDir</name> 
<value>/disk/hbase-1.3.0/tmp/zk/data</value> 
</property>
<property>
<name>hbase.master.info.port</name>
<value>60010</value>
</property>
<property>
<name>hbase.regionserver.info.port</name>
<value>60012</value>
</property>
</configuration>

vim /root/hbase-1.3.0/conf/regionservers
slave1
slave2
slave3
slave4
slave5

7.3 copy Hbase to other machines

Note: slave1 operates in the container

scp -r /root/hbase-1.3.0/ root@slave2:/root/
scp -r /root/hbase-1.3.0/ root@slave3:/root/
scp -r /root/hbase-1.3.0/ root@slave4:/root/
scp -r /root/hbase-1.3.0/ root@slave5:/root/

7.4 start hbase cluster in slave1 container

start-hbase.sh
(Run it with its full path, /root/hbase-1.3.0/bin/start-hbase.sh, or add hbase's bin directory to PATH first.)
At this point the hbase cluster has been built successfully. All that remains is to install opentsdb.
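A quick check that the daemons came up: jps should show HMaster on the node where start-hbase.sh was run and HRegionServer on each node listed in regionservers:
jps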

Chapter 8 Building opentsdb

Note: opentsdb is used to write time series data into hbase. Here it is built on one of the slave nodes.

8.1 copy the opentsdb directory to slave1

Note: operate on the Docker host.
Required installation package: opentsdb-2.3.0.tar.gz
Download the package, extract it to the /opt/ directory, and build it:
cd /opt/opentsdb-2.3.0
mkdir build
cp -r third_party ./build
./build.sh
cd build/
make install
Copy the opentsdb directory to slave1:
docker cp /opt/opentsdb-2.3.0/ slave1:/root/

8.2 table building

env COMPRESSION=NONE HBASE_HOME=/root/hbase-1.3.0 /root/opentsdb-2.3.0/src/create_table.sh

8.3 configuration

cp /root/opentsdb-2.3.0/src/opentsdb.conf /root/opentsdb-2.3.0/build/opentsdb.conf
vim /root/opentsdb-2.3.0/build/opentsdb.conf
tsd.network.port = 8000
tsd.http.staticroot = /root/opentsdb-2.3.0/build/staticroot
tsd.http.cachedir = /tmp/tsd
tsd.core.auto_create_metrics = true
tsd.storage.hbase.zk_quorum = zookeeper01:2181,zookeeper02:2181,zookeeper03:2181
tsd.http.request.enable_chunked = true
tsd.http.request.max_chunk = 67108863

8.4 configure environment variables for opentsdb

cat /etc/profile.d/run.sh
export PATH=$PATH:$JAVA_HOME/bin:/root/hbase-1.3.0/bin
source /etc/profile.d/run.sh

8.5 startup

tsdb tsd --config=/root/opentsdb-2.3.0/build/opentsdb.conf 2>&1 > /tmp/opentsdb.log &
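Once the TSD is running, you can confirm it is listening on the configured port (8000 in the configuration above); /api/version is a standard OpenTSDB HTTP endpoint:
curl http://localhost:8000/api/version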
