hadoop + hive: notes on deploying a stand-alone version

preface

The deployment was tested on a local Ubuntu 18 virtual machine, with reference to the official documentation.

Version used:

  • hadoop: 3.2.1
  • hive: 3.1.2

The whole process is configured with the root account.

hadoop installation configuration

Hadoop runs here as a pseudo-distributed cluster, i.e. a single machine simulating a cluster: the DataNode and NameNode are on the same node.

1. Download and install

Download the latest official release tarball and unpack it on the server.
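
For reference, a typical download-and-unpack sequence is sketched below; the mirror URL and the /root target directory are assumptions (the /root path simply matches the HADOOP_HOME used later), so adjust them to your environment.

$ cd /root
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
$ tar -xzf hadoop-3.2.1.tar.gz    # unpacks to /root/hadoop-3.2.1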

2. Environment variable configuration

The following variables are needed. I configured them directly in /etc/profile.
The variables used by Hive are also set here, so they will not be repeated in the Hive section below. A consolidated sketch of the profile additions follows the list.

  • export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
  • export PDSH_RCMD_TYPE=ssh
    - This variable is important: Hadoop connects to the DataNode via pdsh by default, and the connection fails if this is not set
  • export HIVE_HOME=/root/apache-hive-3.1.2-bin
  • export PATH=$HIVE_HOME/bin:$PATH
  • export HADOOP_HOME=/root/hadoop-3.2.1
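
A consolidated sketch of the /etc/profile additions (paths copied from the list above; adjust them to your own install locations), plus reloading the profile so the current shell picks them up:

# appended to /etc/profile
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PDSH_RCMD_TYPE=ssh
export HADOOP_HOME=/root/hadoop-3.2.1
export HIVE_HOME=/root/apache-hive-3.1.2-bin
export PATH=$HIVE_HOME/bin:$PATH

$ source /etc/profile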

3. SSH configuration

Start SSH service

service ssh start

Make sure you can SSH to this machine without a password. If not, generate an SSH key:

  $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  $ chmod 0600 ~/.ssh/authorized_keys
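
A quick way to check that passwordless login works (connecting to localhost is assumed here):

  $ ssh localhost echo ok    # should print "ok" without asking for a password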

4. Modify the configuration file

In fact you can follow the official documentation almost exactly, but since Hadoop uses the MapReduce framework by default and I want to switch to YARN, a few changes are needed. They are also easy to find online.

Mainly, these files need to be modified:

core-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://DESKTOP-61VV394:9000</value>
    </property>
    <!-- The following configuration is for Hive. Without it, the Hive command line fails to connect due to insufficient proxy-user permissions -->
	<property>
	  <name>hadoop.proxyuser.root.hosts</name>
	  <value>*</value>
	</property>
	<property>
	  <name>hadoop.proxyuser.root.groups</name>
	  <value>*</value>
	</property>
</configuration>

hadoop-env.sh:

# These XXXX_USER variables are not mentioned in the official documentation,
# but Hadoop reports an error at startup if they are not set
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PDSH_RCMD_TYPE=ssh
export HADOOP_HOME=/root/hadoop-3.2.1
export YARN_HOME=/root/hadoop-3.2.1

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
<!-- The following two settings expose the Hadoop web UI outside the virtual machine; otherwise it can only be viewed locally inside the VM -->
<property>
     <name>dfs.http.address</name>
    <value>0.0.0.0:50070</value>
</property>
<property>
     <name>dfs.secondary.http.address</name>
     <value>0.0.0.0:50090</value>
</property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
        <!-- This setting specifies YARN as the Hadoop scheduling framework -->
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
<!-- Without the following three settings, MapReduce jobs cannot find the corresponding jar path at run time -->
<property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
</configuration>

yarn-site.xml:

<?xml version="1.0"?>
<configuration>

<!-- Site specific YARN configuration properties -->
          <!-- NodeManager fetches data via shuffle -->
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
          <!-- Configure the local hostname -->
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>DESKTOP-6123456</value>
        </property>

</configuration>

5. Start hadoop

Enter the hadoop directory and format the namenode

 $ bin/hdfs namenode -format

Start hadoop

 $ sbin/start-all.sh
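
In Hadoop 3.x, start-all.sh is marked as deprecated and simply delegates to the two scripts below, so starting HDFS and YARN separately is equivalent:

 $ sbin/start-dfs.sh     # NameNode, DataNode, SecondaryNameNode
 $ sbin/start-yarn.sh    # ResourceManager, NodeManager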

After a successful start you can check the running daemons with jps. There should be five:

5233 NameNode
5462 DataNode
5994 ResourceManager
5740 SecondaryNameNode
6269 NodeManager

6. Functional test

You can test with the simplest directory commands:

$ bin/hdfs dfs -ls
$ bin/hdfs dfs -mkdir -p /user/root/test
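
Since YARN was configured above, it is also worth running one of the example jobs bundled with Hadoop to confirm that MapReduce on YARN works end to end (the jar path below assumes the 3.2.1 distribution layout):

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 2 10
# the job should finish successfully and print an estimated value of Pi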

hive installation configuration

This is also a single-node Hive deployment.

1. Download and unzip

Download the latest Hive release from the official website and unpack it; nothing special is required.

2. Mysql environment configuration

Hive uses HDFS for data storage, but it also needs a metastore for its metadata.

By default Hive uses Derby, which requires no extra configuration; I use MySQL here. The MySQL installation and configuration itself is not covered.

The main thing is to make sure a MySQL database exists and is authorized for Hive. The database I created here is named hive, and the user name and password are both hive.
Note that the corresponding MySQL connector jar (I use mysql-connector-java-5.1.45.jar) needs to be placed in Hive's lib directory. A sketch of both steps follows.
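
A minimal sketch of the MySQL preparation described above (database, user, and password all named hive, matching this setup; the exact GRANT syntax may need adjusting for your MySQL version):

$ mysql -u root -p <<'SQL'
CREATE DATABASE hive;
CREATE USER 'hive'@'%' IDENTIFIED BY 'hive';
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%';
FLUSH PRIVILEGES;
SQL
$ cp mysql-connector-java-5.1.45.jar $HIVE_HOME/lib/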

3. Modification of configuration file

hive-env.sh:

export HIVE_CONF_DIR=/root/apache-hive-3.1.2-bin/conf

hive-site.xml

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- If this is not set, a Tez-related startup error is reported when starting Hive -->
<property>
    <name>hive.server2.active.passive.ha.enable</name>
    <value>true</value>
</property>
<!-- Hive-related HDFS storage path configuration -->
 <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive</value>
</property>
<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
</property>
<property>
    <name>hive.querylog.location</name>
    <value>/user/hive/log</value>
</property>
<!-- MySQL-related configuration -->
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://{mysql server address}:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
</property>
<!-- HiveServer2 bind address: if this is not configured, the UI is only accessible from the local machine by default -->
  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>0.0.0.0</value>
  </property>
</configuration>
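
Since hive.metastore.warehouse.dir and hive.exec.scratchdir point at HDFS paths, the official Hive getting-started guide also suggests creating those directories up front and making them group-writable; a sketch under that assumption (adjust the paths if you changed the values above):

$ $HADOOP_HOME/bin/hdfs dfs -mkdir -p /tmp/hive /user/hive/warehouse
$ $HADOOP_HOME/bin/hdfs dfs -chmod g+w /tmp/hive /user/hive/warehouse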

4. Start hiveserver2

Enter the hive directory and initialize the MySQL metastore schema

./bin/schematool -dbType mysql -initSchema

Start HiveServer2

nohup $HIVE_HOME/bin/hiveserver2 &
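
HiveServer2 can take a while to come up. A quick sanity check before connecting (assuming nohup wrote its output to nohup.out in the current directory, and that the default Thrift port 10000 is in use):

$ tail -f nohup.out        # watch the startup output; Ctrl-C to stop following
$ ss -lnt | grep 10000     # the Thrift port should be listening once it is ready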

Start the command line tool that comes with hive

$HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -n root

Then you can test commands such as show tables; from this command line.

5. Test

The tests here come straight from the official documentation: loading external data and processing it with a custom MapReduce script.

  1. In the command line of the previous step, create a test table
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
  2. Download the test data and unzip it
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
  3. From the command line, load the test data
LOAD DATA LOCAL INPATH '<path>/u.data'
OVERWRITE INTO TABLE u_data;
  4. Check the row count; there should be 100,000 rows
SELECT COUNT(*) FROM u_data;
  5. On the file system, create a new Python script file

Check your Python version: the print statement here uses Python 2.x syntax.

# weekday_mapper.py: read tab-separated rows from stdin and replace
# the unix timestamp column with the ISO weekday (Python 2 print syntax)
import sys
import datetime

for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print '\t'.join([userid, movieid, rating, str(weekday)])
  6. In the command line, add the script, process the test data through it, and store the result in a new table
CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;

finish
