Preface
This deployment was tested on an Ubuntu 18 environment; for reference, see the official documents:
- hadoop: Link address
- hive: Link address
Versions used:
- hadoop: 3.2.1
- hive: 3.1.2
The whole process is configured with the root account.
Hadoop installation and configuration
Hadoop runs here as a pseudo-distributed cluster, that is, a single machine simulating a cluster: the DataNode and NameNode sit on the same node.
1. Download and install
Download the official release tarball and extract it on the server.
2. Environment variable configuration
The following variables are used; I set them directly in /etc/profile.
The variables used by Hive are also set here, so they are not repeated in the Hive section below.
- export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
- export PDSH_RCMD_TYPE=ssh
  - This variable is very important: by default Hadoop uses pdsh to connect to the DataNode, and the connection fails when it is not set.
- export HIVE_HOME=/root/apache-hive-3.1.2-bin
- export PATH=$HIVE_HOME/bin:$PATH
- export HADOOP_HOME=/root/hadoop-3.2.1
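After editing /etc/profile, reload it in the current shell so the variables take effect; a quick check (the paths are the ones configured above):

```bash
source /etc/profile
echo $HADOOP_HOME     # should print /root/hadoop-3.2.1
echo $PDSH_RCMD_TYPE  # should print ssh
```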
3. SSH configuration
Start SSH service
service ssh start
Make sure you can SSH to the local machine without a password. If not, generate an SSH key:
```bash
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
```
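To confirm passwordless login works before starting Hadoop, a quick check (assuming the local machine is reachable as localhost):

```bash
# Should print the message without asking for a password
ssh localhost echo "passwordless SSH OK"
```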
4. Modify the configuration file
In fact, following the official documents is basically enough, but since Hadoop uses its default MapReduce framework and I want to switch the scheduler to YARN, a few extra changes are needed. They are also easy to find online.
The main files to modify are:
core-site.xml:
<?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.defaultFS</name> <value>hdfs://DESKTOP-61VV394:9000</value> </property> <!-- The following configuration is for Hive If you add it or not, you will find Hive The command line permission of is not enough to connect --> <property> <name>hadoop.proxyuser.root.hosts</name> <value>*</value> </property> <property> <name>hadoop.proxyuser.root.groups</name> <value>*</value> </property> </configuration>
```bash
# hadoop-env.sh
# These XXXX_USER variables are not mentioned in the official document,
# but if they are not set, an error is reported when starting Hadoop.
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PDSH_RCMD_TYPE=ssh
export HADOOP_HOME=/root/hadoop-3.2.1
export YARN_HOME=/root/hadoop-3.2.1
```
hdfs-site.xml
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>dfs.replication</name> <value>1</value> </property> <!-- The following two configurations are for access outside the virtual machine hadoop of UI Page, otherwise it can only be viewed locally in the virtual machine --> <property> <name>dfs.http.address</name> <value>0.0.0.0:50070</value> </property> <property> <name>dfs.secondary.http.address</name> <value>0.0.0.0:50090</value> </property> </configuration>
mapred-site.xml
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <!-- Use this configuration to specify Yarn As Hadoop Scheduling framework --> <property> <name>mapreduce.framework.name</name> <value>yarn</value> </property> <!-- The following three configurations run if not specified mapreduce When the time comes, you won't find the corresponding one jar route --> <property> <name>yarn.app.mapreduce.am.env</name> <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value> </property> <property> <name>mapreduce.map.env</name> <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value> </property> <property> <name>mapreduce.reduce.env</name> <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value> </property> </configuration>
yarn-site.xml:
<?xml version="1.0"?> <configuration> <!-- Site specific YARN configuration properties --> <!-- NodeManager The way to get data is shuffle--> <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> <!-- Configure the local`hostname` --> <property> <name>yarn.resourcemanager.hostname</name> <value>DESKTOP-6123456</value> </property> <property>
5. Start hadoop
Enter the Hadoop directory and format the NameNode:
$ bin/hdfs namenode -format
Start hadoop
$ sbin/start-all.sh
After a successful start, you can check the running processes with jps. There should be five:
```
5233 NameNode
5462 DataNode
5994 ResourceManager
5740 SecondaryNameNode
6269 NodeManager
```
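As an optional extra check, the web UIs should also be reachable; a small sketch using curl (port 50070 comes from the dfs.http.address setting above, 8088 is YARN's default ResourceManager UI port):

```bash
# Both commands should print an HTTP status code such as 200
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070   # HDFS NameNode UI
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088    # YARN ResourceManager UI
```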
6. Test the basic functions
You can test with the simplest directory commands:
```bash
$ bin/hdfs dfs -ls
$ bin/hdfs dfs -mkdir -p /user/root/test
```
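To also exercise YARN, you can run one of the example jobs shipped with Hadoop; a quick sketch (the jar name assumes the 3.2.1 release used here):

```bash
# Estimates pi with 2 mappers and 10 samples each; while it runs, the job
# should show up in the YARN ResourceManager UI.
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 2 10
```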
Hive installation and configuration
This is also a single-node Hive.
1. Download and unzip
Download the latest Hive release from the official website and extract it; nothing special is required here.
2. MySQL environment configuration
Hive uses HDFS for data storage, but it also needs a metastore for its metadata.
By default, Hive uses Derby and needs no extra configuration; I use MySQL here. The MySQL installation and configuration itself is not covered.
The main thing is to make sure a MySQL database is created and authorized for Hive. The database I created here is named hive, and the user name and password are both hive.
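A minimal sketch of that step, assuming a local MySQL reachable as root (the database, user, and password are the ones mentioned above; the exact GRANT syntax may differ slightly between MySQL versions):

```bash
mysql -u root -p <<'SQL'
CREATE DATABASE hive;
CREATE USER 'hive'@'%' IDENTIFIED BY 'hive';
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%';
FLUSH PRIVILEGES;
SQL
```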
Note that the MySQL JDBC connector jar (I use mysql-connector-java-5.1.45.jar) needs to be placed in Hive's lib directory.
3. Modify the configuration files
```bash
# hive-env.sh
export HIVE_CONF_DIR=/root/apache-hive-3.1.2-bin/conf
```
hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <!-- If this configuration does not match, it will be reported when starting Tez Startup error --> <property> <name>hive.server2.active.passive.ha.enable</name> <value>true</value> </property> <!-- hive Relevant HDFS Storage path configuration --> <property> <name>hive.exec.scratchdir</name> <value>/tmp/hive</value> </property> <property> <name>hive.metastore.warehouse.dir</name> <value>/user/hive/warehouse</value> </property> <property> <name>hive.querylog.location</name> <value>/user/hive/log</value> </property> <!-- Here is mysql Related configuration --> <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://{mysql server address}: 3306 / hive? createDatabaseIfNotExist=true& characterEncoding=UTF-8& useSSL=false</value> </property> <property> <name>javax.jdo.option.ConnectionDriverName</name> <value>com.mysql.jdbc.Driver</value> </property> <property> <name>javax.jdo.option.ConnectionUserName</name> <value>hive</value> </property> <property> <name>javax.jdo.option.ConnectionPassword</name> <value>hive</value> </property> <!-- hive UI If the address is not configured, it will only be accessible on the local machine by default --> <property> <name>hive.server2.thrift.bind.host</name> <value>0.0.0.0</value> </property> </configuration>
4. Start HiveServer2
Enter the Hive directory and initialize the MySQL metastore schema:
./bin/schematool -dbType mysql -initSchema
Start HiveServer2:
nohup $HIVE_HOME/bin/hiveserver2 &
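HiveServer2 takes a while to come up; before connecting, you can check that the Thrift port (10000, the same port used in the beeline URL below) is listening, for example:

```bash
# Wait until port 10000 shows up as LISTEN
ss -lnt | grep 10000
# If it does not, check the startup log written by nohup
tail nohup.out
```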
Start the command-line tool that comes with Hive:
$HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -n root
Then you can test commands such as show tables; and so on from the command line.
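You can also run a statement non-interactively; a small sketch using beeline's -e option with the same connection string:

```bash
$HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -n root -e 'SHOW TABLES;'
```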
5. Test
The tests here come entirely from the official documents: loading external data and processing it with a custom MapReduce script.
- On the command line from the previous step, create a test table:
```sql
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```
- Download the test data and unzip it
```bash
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
```
- From the command line, load the test data
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;
- Check the row count. There should be 100,000 entries:
SELECT COUNT(*) FROM u_data;
- On the local file system, create a new Python script file (weekday_mapper.py)
Check your Python version first; the print statement here uses Python 2.x syntax.
```python
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([userid, movieid, rating, str(weekday)])
```
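Before wiring the script into Hive, you can sanity-check it locally; a quick sketch (the data path assumes the ml-100k archive extracted above):

```bash
# Feed a few raw rows through the mapper; each output line should end
# with a weekday number between 1 and 7.
head -5 ml-100k/u.data | python weekday_mapper.py
```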
- On the command line, register the script, process the test data through it into a new table, and query the result:
```sql
CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
```