Hadoop+HBase+Spark+Hive environment construction

This article comes from: https://www.cnblogs.com/cheyunhua/p/10037162.html


0. Prepare the installation package

The system image, big data software packages, and development environment installers required by this article can be downloaded from my Baidu cloud disk.
Link: System image and various big data software
Password: n2cn

1. Install Ubuntu dual system under Windows

Hadoop and other open source big data frameworks do not support Windows, so you first need to set up Linux alongside Windows as a dual-boot system. Of course, if you have a separate computer on which to install Ubuntu, a dual-boot setup is not needed.

Dual-boot installation
Please refer to the installation guide:
Step 1: Make a bootable USB flash drive
Step 2: Install the dual-boot system

2. Build Hadoop platform

Hadoop is reliable, scalable, distributed open source software developed by the Apache Software Foundation. With the Hadoop Distributed File System (HDFS) and the distributed computing framework MapReduce at its core, it allows large data sets to be processed across a cluster of servers using a simple programming model. Next, follow along to build your own Hadoop platform step by step.

2.1 update source

Run the following shell commands in a bash terminal to set the root password and switch to the root user

#Set root password
sudo passwd
#Switch to root
su root

Update source

apt-get update

Install vim compiler

apt-get install vim

Back up the original official source

cp /etc/apt/sources.list /etc/apt/sources.list.bak

Delete original official source

rm /etc/apt/sources.list

Run the following shell command to recreate the sources.list file

vim /etc/apt/sources.list

Press i to enter vim's insert mode and copy the following Tsinghua mirror entries into the sources.list file, then press Esc to exit insert mode, type :wq, and press Enter to save (or press Shift+ZZ to save).

# Source code mirrors are commented out by default to speed up apt update; uncomment them if needed
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ artful main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ artful main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ artful-updates main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ artful-updates main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ artful-backports main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ artful-backports main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ artful-security main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ artful-security main restricted universe multiverse

# Pre release software source, not recommended
# deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ artful-proposed main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ artful-proposed main restricted universe multiverse

Run the following shell command to update the package lists from the new source

apt-get update

2.2 install SSH, configure SSH and log in without password

Ubuntu installs the SSH client by default; you also need to install the SSH server:

sudo apt-get install openssh-server

After installation, modify sshd_config configuration

vim /etc/ssh/sshd_config

Set the following properties in the file: (press / to enter the search mode, press esc to exit the search mode)

PubkeyAuthentication yes
PermitRootLogin yes

Restart ssh service

sudo /etc/init.d/ssh restart

After restart, you can log in to the machine with the following command, but you need a password to log in at this time:

ssh localhost

First, exit ssh to return to the original terminal window. Then use ssh-keygen to generate a key and add it to the authorized keys:

exit                           # Exit ssh localhost just now
cd ~/.ssh/                     # If there is no such directory, please execute ssh localhost once first
ssh-keygen -t rsa              # There will be a prompt. Just press enter
cat ./id_rsa.pub >> ./authorized_keys  # Join authorization

In Linux, ~ represents the user's home directory, i.e. "/home/username"; if your user name is ubuntu, ~ stands for "/home/ubuntu". For the root user, ~ stands for /root. In addition, the text after # in a command is a comment; you only need to enter the part before it.

At this time, you can log in directly without entering a password by using the ssh localhost command.
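
As a quick check, repeat the login test:

ssh localhost   # Should now log in without asking for a password
exit            # Return to the original terminal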

2.3 installing JAVA environment

The Oracle JDK is recommended for the Java environment. First prepare the file jdk-8u162-linux-x64.tar.gz, then move it to the /usr/local directory:

mv jdk-8u162-linux-x64.tar.gz /usr/local

Unpack the archive

cd /usr/local
tar -zxvf jdk-8u162-linux-x64.tar.gz

Rename the folder to java

mv jdk1.8.0_162 java

Open the /etc/profile file with vim (the file for configuring system-wide environment variables under Linux)

vim /etc/profile

Press i to enter the editing mode, and add the following JAVA environment variables at the end of the file

export JAVA_HOME=/usr/local/java
export JRE_HOME=/usr/local/java/jre
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin

After adding the environment variables, press Esc to exit insert mode, type :wq, and press Enter to save (or press Shift+ZZ to save).

(Figure: the environment variables configured in /etc/profile)

Finally, you need to make the environment variable effective and execute the following code:

source /etc/profile

Verify that JAVA is installed successfully

echo $JAVA_HOME     # Test variable value
java -version
java
javac

If the settings are correct, java -version will output Java's version information, and java and javac will print their usage instructions.

2.4 installing Hadoop

Download the hadoop-2.7.6.tar.gz file, then move it to the /usr/local directory

mv hadoop-2.7.6.tar.gz /usr/local

Unpack the archive

cd /usr/local
tar -zxvf hadoop-2.7.6.tar.gz

Rename the folder to hadoop

mv hadoop-2.7.6 hadoop

Configure the environment variables, open the file / etc/profile, and add the following Hadoop environment variables

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin

Similarly, you need to make the environment variable effective, and execute the following code:

source /etc/profile

Enter the following command to check whether Hadoop is available. If successful, the Hadoop version information will be displayed:

hadoop version

2.5 Hadoop configuration (single-machine pseudo-distributed mode)

Hadoop can run in pseudo-distributed mode on a single node: each Hadoop daemon runs as a separate Java process, and the node acts as both NameNode and DataNode while reading files from HDFS.

Modify the configuration file core-site.xml (gedit /usr/local/hadoop/etc/hadoop/core-site.xml), changing

<configuration>
</configuration>

to the following configuration:

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

Similarly, modify the configuration file hdfs-site.xml (gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml):

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>

Modify the file hadoop-env.sh (gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh), adding the Hadoop and Java environment variables at the beginning of the file.

export JAVA_HOME=/usr/local/java
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin

Hadoop configuration file description

Hadoop's run mode is determined by its configuration files (which are read when Hadoop starts). Pseudo-distributed mode only needs fs.defaultFS and dfs.replication to run (as in the official tutorial), but if the hadoop.tmp.dir parameter is not configured, the default temporary directory is /tmp/hadoop-hadoop, and this directory may be cleaned by the system on reboot, forcing you to run the format again. So we set it, and also specify dfs.namenode.name.dir and dfs.datanode.data.dir; otherwise errors may occur in the following steps.

After configuration, execute the format of NameNode:

/usr/local/hadoop/bin/hdfs namenode -format

Start hadoop

/usr/local/hadoop/sbin/start-all.sh

After successful startup, run the jps command

source /etc/profile
jps

If the installation is successful, processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager will appear in the jps output.

(Figure: processes after Hadoop starts successfully)


After successful startup, you can visit the web interface at http://localhost:50070 to view NameNode and DataNode information and browse the files in HDFS online.
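
As an optional sanity check (a minimal sketch; the /user/root directory and the uploaded file are just examples), you can create a directory in HDFS and upload a file:

/usr/local/hadoop/bin/hdfs dfs -mkdir -p /user/root          # Create a home directory in HDFS
/usr/local/hadoop/bin/hdfs dfs -put /etc/profile /user/root  # Upload a local file as a test
/usr/local/hadoop/bin/hdfs dfs -ls /user/root                # List it to confirm the upload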

 

2.6 Hadoop configuration (cluster mode)

2.6.1 set static IP (take the master node as an example)

Edit the file /etc/network/interfaces

vim /etc/network/interfaces

Add the following configuration information after the file (eth0 is the network card name, which needs to be changed according to the actual situation)

auto eth0 #adapter name
iface eth0 inet static
address 192.168.1.2 #Static IP (can be set freely according to the actual situation)
netmask 255.255.255.0 #Subnet mask
gateway 192.168.1.1 #gateway
dns-nameservers 192.168.1.1 #DNS server address, which is the same as the gateway

Edit the file /etc/resolv.conf

vim /etc/resolv.conf

Add the following configuration information to the file

nameserver 192.168.1.1

This DNS setting will be lost after a system restart. Edit the file /etc/resolvconf/resolv.conf.d/base

vim /etc/resolvconf/resolv.conf.d/base

Add the following to permanently save the DNS configuration

nameserver 192.168.1.1

Run the following command to restart the network

/etc/init.d/networking restart

If it is invalid after restart, restart the system.

If the network card cannot be found after restart, enable the system managed network card

vim /etc/NetworkManager/NetworkManager.conf

change

managed=false

to

managed=true

Run the following command to restart the network

/etc/init.d/networking restart

If it is invalid after restart, restart the system.

2.6.2 configure hosts file (each host should be configured)

Modify host name

vim /etc/hostname

Tip: the master node is set to master, and the slave node is set to slave1, slave2, and so on.

Edit the file / etc/hosts

vim /etc/hosts

Copy the following data into each host of the cluster

192.168.1.2     master
192.168.1.11    slave1

Note: if another slave is added, also add a line for slave2

Use the following command to test from the master host; a similar command can be used to test from slave1:

ping slave1

If the ping succeeds, the network connection is working. Otherwise, check the network connection or the IP information.

2.6.3 SSH password free login node (configured on the master)

This step enables the master node to SSH into each slave node without a password.

First, generate the master node's key pair by executing the following in the master's terminal (because the host name has changed, the old key needs to be deleted and regenerated):

cd ~/.ssh               # If there is no such directory, first execute ssh localhost
rm ./id_rsa*            # Delete the previously generated public key (if any)
ssh-keygen -t rsa       # Just press enter all the time

To enable the master node to SSH locally without password, execute the following on the master node:

cat ./id_rsa.pub >> ./authorized_keys

After completion, you can execute ssh master to verify (you may need to enter yes; after success, execute exit to return to the original terminal). Then transfer the public key from the master node to the slave1 node:

scp ~/.ssh/id_rsa.pub root@slave1:/root/

scp is short for secure copy and is used to copy files between hosts under Linux. It is similar to the cp command, except that cp only copies locally. When scp runs, you will be asked for the root password on slave1.

Then, on the slave1 node, add the SSH public key to the authorized keys

mkdir /root/.ssh       # If the folder does not exist, you need to create it first. If it already exists, it will be ignored
cat /root/id_rsa.pub >> /root/.ssh/authorized_keys
rm /root/id_rsa.pub    # You can delete it after you use it

If there are other slave nodes, transfer the master's public key to each of them and add it to their authorized keys in the same way.
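
The copy-and-authorize steps can also be scripted; the sketch below assumes root logins and a hypothetical second slave named slave2 (you will be prompted for each slave's password the first time):

for node in slave1 slave2; do
  scp ~/.ssh/id_rsa.pub root@$node:/root/
  ssh root@$node "mkdir -p /root/.ssh && cat /root/id_rsa.pub >> /root/.ssh/authorized_keys && rm /root/id_rsa.pub"
done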

In this way, you can SSH to each slave node without password on the master node. You can execute the following commands on the master node to verify:

ssh root@slave1

If no password is required, the configuration is successful.

2.6.4 modify Hadoop configuration file (configured on master)

Modify the configuration file core-site.xml (gedit /usr/local/hadoop/etc/hadoop/core-site.xml), changing

<configuration>
</configuration>

to the following configuration:

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>

Similarly, modify the configuration file hdfs-site.xml (gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml):

<configuration>
        <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>master:50090</value>
        </property>
        <property>
                <name>dfs.replication</name>
                <value>2</value>
        </property>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>file:/usr/local/hadoop/tmp/dfs/name</value>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>file:/usr/local/hadoop/tmp/dfs/data</value>
        </property>
</configuration>

Modify the file mapred-site.xml (gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml; you may need to rename it first, since the default file name is mapred-site.xml.template), and change the configuration as follows:

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
        <property>
                <name>mapreduce.jobhistory.address</name>
                <value>master:10020</value>
        </property>
        <property>
                <name>mapreduce.jobhistory.webapp.address</name>
                <value>master:19888</value>
        </property>
</configuration>

Configure yarn-site.xml (gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml)

<configuration>
<property>
     <name>yarn.resourcemanager.hostname</name>
     <value>master</value>
</property>
<property>
     <name>yarn.nodemanager.resource.memory-mb</name>
     <value>10240</value>
</property>
<property>
     <name>yarn.nodemanager.aux-services</name>
     <value>mapreduce_shuffle</value>
</property>
</configuration>

Modify the file hadoop-env.sh (gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh), adding the Hadoop and Java environment variables at the beginning of the file.

export JAVA_HOME=/usr/local/java
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin

Configure slaves(gedit /usr/local/hadoop/etc/hadoop/slaves), delete the default localhost, and add a slave node:

slave1

Note: if you add another slave, add slave2

After configuration, copy the / usr/local/hadoop folder on the master to each node.

sudo rm -rf /usr/local/hadoop/tmp     # Delete Hadoop temporary files
sudo rm -rf /usr/local/hadoop/logs   # Delete log file
scp -r /usr/local/hadoop slave1:/usr/local

Note: each slave needs to configure Hadoop environment variables
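
For example, on each slave you could append the same variables used on the master to /etc/profile and then reload it; this is just a sketch that mirrors the settings from sections 2.3 and 2.4:

# Append to /etc/profile on each slave, then run: source /etc/profile
export JAVA_HOME=/usr/local/java
export JRE_HOME=/usr/local/java/jre
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin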

Start hadoop on the master node

/usr/local/hadoop/bin/hdfs namenode -format
/usr/local/hadoop/sbin/start-all.sh

After successful startup, run the jps command

source /etc/profile
jps

If the installation is successful, the master node will have a NameNode process and the slave node will have a DataNode process.

After successful startup, you can visit the web interface at http://master:50070 to view NameNode and DataNode information and browse the files in HDFS online.
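
For a quick health check of the cluster (a sketch using standard Hadoop commands), you can ask HDFS and YARN for their node reports from the master:

/usr/local/hadoop/bin/hdfs dfsadmin -report   # Lists the live DataNodes and their capacity
/usr/local/hadoop/bin/yarn node -list         # Lists the NodeManagers registered with YARN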

3. Install HBase database

HBase is a distributed, column-oriented open source database inspired by Google's paper "Bigtable: A Distributed Storage System for Structured Data". HBase stores data in tables made up of rows and columns, with the columns grouped into column families. For official information on HBase, please visit the HBase official website. HBase can run in three modes: stand-alone mode, pseudo-distributed mode, and distributed mode.
Stand-alone mode: HBase is installed and used on a single computer without distributed data storage. Pseudo-distributed mode: a small cluster is simulated on one computer. Distributed mode: multiple computers provide distributed storage in the physical sense. For learning purposes, we focus on the pseudo-distributed mode.

3.1. HBase installation

Download the hbase-2.0.0-bin.tar.gz file and move it to the /usr/local directory

mv hbase-2.0.0-bin.tar.gz /usr/local

Unpack the archive

cd /usr/local
tar -zxvf hbase-2.0.0-bin.tar.gz

Rename the folder to hbase

mv hbase-2.0.0 hbase

Add HBase's bin directory to PATH so that you do not have to change into /usr/local/hbase to start HBase, which makes HBase much more convenient to use. The rest of this tutorial still switches into the /usr/local/hbase directory, which helps beginners understand the running process; once you are proficient, you no longer need to switch.
Edit the /etc/profile file

vim /etc/profile

Add the following at the end of the / etc/profile file:

export HBASE_HOME=/usr/local/hbase
export PATH=$HBASE_HOME/bin:$PATH
export HBASE_MANAGES_ZK=true

After editing, press Esc to exit insert mode, type :wq, and press Enter to save (or press Shift+ZZ to save); finally, execute the source command so that the configuration takes effect immediately in the current terminal. The command is as follows:

source /etc/profile

Check the HBase version to confirm that the installation succeeded. The command is as follows:

hbase version

3.2. HBase pseudo-distributed mode configuration

Configure /usr/local/hbase/conf/hbase-site.xml; open and edit hbase-site.xml with the following command:

gedit /usr/local/hbase/conf/hbase-site.xml

Before starting HBase, you need to set the property hbase.rootdir, which specifies where HBase stores its data; if it is not set, hbase.rootdir defaults to /tmp/hbase-${user.name}, which means the data will be lost every time the system restarts. Here we point it at HDFS (hdfs://localhost:9000/hbase) and add the configuration as follows:

<configuration>
        <property>
                <name>hbase.rootdir</name>
                <value>hdfs://localhost:9000/hbase</value>
        </property>
        <property>
                <name>hbase.cluster.distributed</name>
                <value>true</value>
        </property>
</configuration>

Open the file (gedit /usr/local/hbase/conf/hbase-env.sh) and add the Java environment variables

export JAVA_HOME=/usr/local/java
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:/usr/local/hbase/bin

3.3 HBase cluster mode configuration

Modify the master node's configuration file hbase-site.xml (gedit /usr/local/hbase/conf/hbase-site.xml):

   <configuration>
        <property>
                <name>hbase.rootdir</name>
                <value>hdfs://master:9000/hbase</value>
        </property>
        <property>
                <name>hbase.cluster.distributed</name>
                <value>true</value>
        </property>
        <property>
                <name>hbase.zookeeper.quorum</name>
                <value>master,slave1</value>
        </property>
        <property>
                <name>hbase.temp.dir</name>
                <value>/usr/local/hbase/tmp</value>
        </property>
        <property>
                <name>hbase.zookeeper.property.dataDir</name>
                <value>/usr/local/hbase/tmp/zookeeper</value>
        </property>
        <property>
                <name>hbase.master.info.port</name>
                <value>16010</value>
        </property>
</configuration>

Note: if you add another slave, add slave2 to hbase.zookeeper.quorum

Modify the configuration file regionservers (gedit /usr/local/hbase/conf/regionservers), delete the localhost entry, and change it to:

master
slave1

If you add another slave, add slave2

Transfer HBase to the other slave nodes (the slaves do not need to download the installation package; it can be copied from the master, but each slave's environment variables still need to be configured). That is, copy the configured hbase folder to the same location on each node:

scp -r /usr/local/hbase root@slave1:/usr/local/

Note: each slave needs to configure the environment variable of HBase

3.4 test run

First switch to the HBase installation directory /usr/local/hbase, then start HBase. The commands are as follows:

/usr/local/hadoop/sbin/start-all.sh  #Start hadoop; if it is already running, skip this command
/usr/local/hbase/bin/start-hbase.sh  #Start hbase
hbase shell                          #Enter the hbase shell; if this works, HBase was installed successfully
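
Inside the hbase shell, a minimal smoke test might look like this (the table name 'test' and column family 'cf' are arbitrary examples):

create 'test', 'cf'                    # Create a table with one column family
put 'test', 'row1', 'cf:a', 'value1'   # Write a cell
scan 'test'                            # Read the table back
disable 'test'                         # A table must be disabled before it can be dropped
drop 'test'                            # Clean up
exit                                   # Leave the hbase shell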

To stop HBase, the command is as follows:

bin/stop-hbase.sh

If HBase starts successfully, the jps command will show its processes: typically HMaster, HRegionServer, and HQuorumPeer on the master node, and HRegionServer and HQuorumPeer on the slave nodes.

(Figure: master node processes)

(Figure: slave node processes)

4. Install the Spark in-memory computing engine

Apache Spark is a general-purpose engine for big data processing that provides a distributed in-memory abstraction. Spark's biggest strength is speed: it can be up to 100 times faster than Hadoop MapReduce for in-memory workloads. Spark builds on the Hadoop environment: Hadoop YARN provides Spark with a resource scheduling framework, and Hadoop HDFS provides the underlying distributed file storage.

4.1 Spark installation

Installing Spark is relatively simple: with Hadoop already installed, it only needs a little configuration. First download the spark-2.3.0-bin-hadoop2.7.tgz file and move it to the /usr/local directory

mv spark-2.3.0-bin-hadoop2.7.tgz /usr/local

decompression

cd /usr/local
tar -zxvf spark-2.3.0-bin-hadoop2.7.tgz

Rename the folder to spark

mv spark-2.3.0-bin-hadoop2.7 spark

Edit the / etc/profile file and add environment variables

vim /etc/profile

Add the following at the end of the / etc/profile file:

export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

After editing, save and exit, and then execute the source command to make the above configuration take effect immediately on the current terminal. The command is as follows:

source /etc/profile

4.2. Spark stand-alone configuration

Configure the file spark-env.sh

cd /usr/local/spark
cp ./conf/spark-env.sh.template ./conf/spark-env.sh

Edit the spark-env.sh file (vim ./conf/spark-env.sh) and add the following configuration at the top:

export JAVA_HOME=/usr/local/java
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_HDFS_HOME=/usr/local/hadoop
export SPARK_HOME=/usr/local/spark
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
SPARK_MASTER_WEBUI_PORT=8079

4.3. Spark cluster configuration

Configure the file spark-env.sh on the master

cd /usr/local/spark
cp ./conf/spark-env.sh.template ./conf/spark-env.sh

Edit the spark-env.sh file (vim ./conf/spark-env.sh) and add the following configuration at the top:

export JAVA_HOME=/usr/local/java
export SCALA_HOME=/usr/local/scala
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_HDFS_HOME=/usr/local/hadoop
export SPARK_HOME=/usr/local/spark
export SPARK_MASTER_IP=master
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_HOST=master
export SPARK_WORKER_CORES=2
export SPARK_WORKER_PORT=8901
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=2g
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
export SPARK_MASTER_WEBUI_PORT=8079

Save and refresh the configuration:

source spark-env.sh

Configure the slave list:

cp ./conf/slaves.template ./conf/slaves
gedit ./conf/slaves

Add at the end:

master
slave1

Copy the spark folder from the master to each slave; the copy command is as follows:

scp -r /usr/local/spark root@slave1:/usr/local

Note: the environment variables of Spark need to be configured on each slave

4.4 verifying Spark installation and configuration

Verify whether Spark is successfully installed by running the example provided by Spark.

cd /usr/local/spark
./sbin/start-all.sh
bin/run-example SparkPi 2>&1 | grep "Pi is"

If it runs correctly, an approximate value of π is printed (a line similar to "Pi is roughly 3.14...").

(Figure: Spark Pi calculation result)


Open http://master:8079 (cluster mode) or http://localhost:8079 (stand-alone mode) in the host's browser to see the two nodes of the Spark cluster.
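
To confirm that jobs also run through the standalone master (not only locally), you can submit the same SparkPi example to the cluster. This is a sketch that assumes the master is listening at spark://master:7077 as configured above; the examples jar path may differ slightly depending on your build:

cd /usr/local/spark
./bin/spark-submit --master spark://master:7077 \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.11-2.3.0.jar 100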

 

5. Install hive

Hive is a data warehouse tool built on Hadoop that is used to process structured data and makes querying and analyzing big data convenient. Hive was originally developed by Facebook and was later taken over by the Apache Software Foundation, which continues to develop it as the open source project Apache Hive. Hive is not a relational database, nor is it designed for online transaction processing (OLTP) with real-time queries and row-level updates. In short, Hive is an SQL layer on top of Hadoop that translates SQL into MapReduce jobs and runs them on Hadoop, so data developers and analysts can easily use SQL to run statistics and analysis over massive data sets without having to write MapReduce programs in a programming language.

5.1. Hive installation

Download the apache-hive-1.2.2-bin.tar.gz file and move it to the /usr/local directory

mv apache-hive-1.2.2-bin.tar.gz /usr/local

Unpack the archive

cd /usr/local
tar -zxvf apache-hive-1.2.2-bin.tar.gz

Rename the folder to hive

mv apache-hive-1.2.2-bin hive

Edit the / etc/profile file and configure the environment variables

vim /etc/profile

Add the following at the end of the / etc/profile file:

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

After editing, save and exit, and then execute the source command to make the above configuration take effect immediately on the current terminal. The command is as follows:

source /etc/profile

5.2. Install and configure MySQL

We use a MySQL database to store Hive's metadata instead of Hive's built-in Derby. Installing MySQL under Ubuntu is straightforward: just run the command below. During installation you may be asked to set a user name and password; remember them.

apt-get install mysql-server

Start MySQL and log in to the mysql shell

service mysql start
mysql -u root -p  #Login shell interface

Create the hive database

#This hive database corresponds to the 'hive' in localhost:3306/hive in hive-site.xml and is used to store Hive metadata
mysql> create database hive;

Set the character encoding of hive database to latin1 (important)

mysql> alter database hive character set latin1;
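
Optionally, you can create a dedicated MySQL account for Hive instead of using root; the user name 'hive' and password 'hive_password' below are placeholders, and whatever you choose must match the USERNAME and PASSWORD you later put into hive-site.xml:

mysql> CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hive_password';
mysql> GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'localhost';
mysql> FLUSH PRIVILEGES;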

5.3. Hive configuration

Configure hive-site.xml under /usr/local/hive/conf. Execute the following commands:

cd /usr/local/hive/conf
mv hive-default.xml.template hive-default.xml

The above command renames hive-default.xml.template to hive-default.xml. Then use the vim editor to create a new configuration file, hive-site.xml; the command is as follows:

cd /usr/local/hive/conf
vim hive-site.xml

In hive-site.xml, add the following configuration, where USERNAME and PASSWORD are the MySQL user name and password:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>USERNAME</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>PASSWORD</value>
    <description>password to use against metastore database</description>
  </property>
</configuration>

Then press the Esc key to exit vim's insert mode, type :wq, and press Enter to save and exit the editor. Because Hive needs the MySQL JDBC driver to connect to MySQL, you first need to download the appropriate version of the driver and then move it into /usr/local/hive/lib.

#Unpack the driver archive
tar -zxvf mysql-connector-java-5.1.47.tar.gz
#Copy the driver jar to the /usr/local/hive/lib directory
cp mysql-connector-java-5.1.47/mysql-connector-java-5.1.47-bin.jar /usr/local/hive/lib

Start Hive (start the Hadoop cluster before starting Hive).

/usr/local/hadoop/sbin/start-all.sh  #Start hadoop; if it is already running, skip this command
hive                                 #Start hive
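
Once the hive> prompt appears, a minimal smoke test could be run as follows (the table name test_tb is an arbitrary example; the statements are standard HiveQL):

hive> show databases;
hive> create table test_tb(id int, name string);
hive> show tables;
hive> drop table test_tb;
hive> exit;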

5.4. Integration of spark and Hive

Hive's computation engine is MapReduce by default. If you want to use Spark as Hive's computation engine, refer to the article "Compile Spark source code, support Hive and deploy".

6. Conclusion

This article walks through building a big data environment, with the aim of helping readers avoid common pitfalls. Later, the author will show how to use Java and Scala to develop big data applications. If you found this article useful, don't forget to give it a like O(∩_∩)O~!
