Big Data Development, Hive Series Part 5: the Hive Data Query Language

Note:
Hive version 2.1.1

1. Overview of Hive SELECT (Data Query Language)

The SELECT statement is the most frequently used statement in Hive, and also the one with the most complex syntax. Much of its syntax matches that of traditional relational databases, which eases the transition from a traditional database to a Hive data warehouse.

Syntax:

[WITH CommonTableExpression (, CommonTableExpression)*]    (Note: Only available starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [ORDER BY col_list]
  [CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
 [LIMIT [offset,] rows]

Test data preparation:

drop table if exists dept;
drop table if exists emp;
drop table if exists bonus;
drop table if exists salgrade;


create table DEPT
( deptno INT,
  dname  VARCHAR(14),
  loc    VARCHAR(13)
); 

insert into DEPT(deptno, dname, loc) values ('10', 'ACCOUNTING', 'NEW YORK');

insert into DEPT(deptno, dname, loc) values ('20', 'RESEARCH', 'DALLAS');

insert into DEPT(deptno, dname, loc) values ('30', 'SALES', 'CHICAGO');

insert into DEPT(deptno, dname, loc) values ('40', 'OPERATIONS', 'BOSTON');



-- Create table
create table EMP
(
  empno    INT,
  ename    VARCHAR(10),
  job      VARCHAR(9),
  mgr      INT,
  hiredate DATE,
  sal      decimal(7,2),
  comm     decimal(7,2),
  deptno   INT
) ;

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7369', 'SMITH', 'CLERK', '7902','1980-12-17', '800', null, '20');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7499', 'ALLEN', 'SALESMAN', '7698', '1981-02-20', '1600', '300', '30');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7521', 'WARD', 'SALESMAN', '7698', '1981-02-22', '1250', '500', '30');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7566', 'JONES', 'MANAGER', '7839', '1981-04-02', '2975', null, '20');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7654', 'MARTIN', 'SALESMAN', '7698', '1981-09-28', '1250', '1400', '30');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7698', 'BLAKE', 'MANAGER', '7839', '1981-05-01', '2850', null, '30');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7782', 'CLARK', 'MANAGER', '7839', '1981-06-09', '2450', null, '10');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7788', 'SCOTT', 'ANALYST', '7566', '1987-06-13', '3000', null, '20');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7839', 'KING', 'PRESIDENT', null, '1981-11-17', '5000', null, '10');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7844', 'TURNER', 'SALESMAN', '7698', '1981-09-08', '1500', '0', '30');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7876', 'ADAMS', 'CLERK', '7788', '1987-06-13', '1100', null, '20');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7900', 'JAMES', 'CLERK', '7698', '1981-12-03', '950', null, '30');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7902', 'FORD', 'ANALYST', '7566', '1981-12-03', '3000', null, '20');

insert into EMP(empno, ename, job, mgr, hiredate, sal, comm, deptno)
values ('7934', 'MILLER', 'CLERK', '7782', '1982-01-23', '1300', null, '10');



create table SALGRADE
(
  grade INT,
  losal INT,
  hisal INT
) ;

insert into SALGRADE(grade, losal, hisal)
values ('1', '700', '1200');

insert into SALGRADE(grade, losal, hisal)
values ('2', '1201', '1400');

insert into SALGRADE(grade, losal, hisal)
values ('3', '1401', '2000');

insert into SALGRADE(grade, losal, hisal)
values ('4', '2001', '3000');

insert into SALGRADE(grade, losal, hisal)
values ('5', '3001', '9999');


create table BONUS
(
  ename VARCHAR(10),
  job   VARCHAR(9),
  sal   INT,
  comm  INT
) ;

2. Simple SELECT examples

Enable Hive's column header output:

set hive.cli.print.header=true;

Here are a few simple examples to illustrate the SELECT statement.

2.1 Table aliases

When a SELECT involves multiple tables, you can give each table an alias. In some visualization tools, such as DBeaver, an alias also lets the tool auto-complete the corresponding columns.

select ename,sal from emp;
select e.ename,e.sal from emp e;

Test Record:

hive> 
    > select ename,sal from emp;
OK
ename   sal
SMITH   800.00
ALLEN   1600.00
ADAMS   1100.00
JAMES   950.00
FORD    3000.00
MILLER  1300.00
WARD    1250.00
JONES   2975.00
MARTIN  1250.00
BLAKE   2850.00
CLARK   2450.00
SCOTT   3000.00
KING    5000.00
TURNER  1500.00
Time taken: 0.073 seconds, Fetched: 14 row(s)
hive> select e.ename,e.sal from emp e;
OK
e.ename e.sal
SMITH   800.00
ALLEN   1600.00
ADAMS   1100.00
JAMES   950.00
FORD    3000.00
MILLER  1300.00
WARD    1250.00
JONES   2975.00
MARTIN  1250.00
BLAKE   2850.00
CLARK   2450.00
SCOTT   3000.00
KING    5000.00
TURNER  1500.00
Time taken: 0.07 seconds, Fetched: 14 row(s)

2.2 Calculation of field values

In real data development you often need to perform calculations on table columns, for example with aggregate functions.

Code:

SELECT count(*), avg(sal) FROM emp;

Test Record:

hive> 
    > SELECT count(*), avg(sal) FROM emp;
Query ID = root_20201204164822_8e00e473-82d2-406c-af03-f6236729d963
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1606698967173_0088, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0088/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0088
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2020-12-04 16:48:30,047 Stage-1 map = 0%,  reduce = 0%
2020-12-04 16:48:36,234 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.25 sec
2020-12-04 16:48:41,388 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.31 sec
MapReduce Total cumulative CPU time: 6 seconds 310 msec
Ended Job = job_1606698967173_0088
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 6.31 sec   HDFS Read: 17740 HDFS Write: 114 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 310 msec
OK
_c0     _c1
14      2073.214286
Time taken: 19.467 seconds, Fetched: 1 row(s)

2.3 Field Aliases

In the example above the computed fields were given no aliases, so Hive auto-generated the names _c0 and _c1.
In addition to aliasing tables, you can also alias columns.

Code:

SELECT count(*) as emp_count, avg(sal) as avg_salary FROM emp;

Test Record:

hive> 
    > SELECT count(*) as emp_count, avg(sal) as avg_salary FROM emp;
Query ID = root_20201204165143_ce11cb13-1464-4b7c-8e65-327395c82bed
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1606698967173_0089, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0089/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0089
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2020-12-04 16:51:49,569 Stage-1 map = 0%,  reduce = 0%
2020-12-04 16:51:55,759 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.17 sec
2020-12-04 16:52:01,934 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.4 sec
MapReduce Total cumulative CPU time: 6 seconds 400 msec
Ended Job = job_1606698967173_0089
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 6.4 sec   HDFS Read: 17733 HDFS Write: 114 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 400 msec
OK
emp_count       avg_salary
14      2073.214286
Time taken: 19.514 seconds, Fetched: 1 row(s)

2.4 LIMIT

LIMIT restricts the number of rows returned; for example, to output only 10 rows, use LIMIT 10.

Code:

SELECT ename, sal FROM emp LIMIT 10;

Test Record:

hive> 
    > SELECT ename, sal FROM emp LIMIT 10;
OK
ename   sal
SMITH   800.00
ALLEN   1600.00
ADAMS   1100.00
JAMES   950.00
FORD    3000.00
MILLER  1300.00
WARD    1250.00
JONES   2975.00
MARTIN  1250.00
BLAKE   2850.00
Time taken: 0.077 seconds, Fetched: 10 row(s)

2.5 FROM subqueries

When the logic is more complex, a subquery in the FROM clause is needed.

Code:

SELECT ename,sal
from 
(select ename,sal from emp) e;

Test Record:

hive> 
    > SELECT ename,sal
    > from 
    > (select ename,sal from emp) e;
OK
ename   sal
SMITH   800.00
ALLEN   1600.00
ADAMS   1100.00
JAMES   950.00
FORD    3000.00
MILLER  1300.00
WARD    1250.00
JONES   2975.00
MARTIN  1250.00
BLAKE   2850.00
CLARK   2450.00
SCOTT   3000.00
KING    5000.00
TURNER  1500.00
Time taken: 0.069 seconds, Fetched: 14 row(s)

2.6 CASE WHEN conditions

Suppose we need to classify each salary into a level; this can be done with a CASE WHEN expression.

Code:

select ename,
       sal,
       case when sal >= 3000 then 'High SAL'
            when sal >= 2000 and sal < 3000 then 'Middle SAL'
            else 'Low SAL'
       end as sal_grade
  from emp;

Test Record:

hive> 
    > select ename,
    >        sal,
    >        case when sal >= 3000 then 'High SAL'
    >             when sal >= 2000 and sal < 3000 then 'Middle SAL'
    >             else 'Low SAL'
    >        end as sal_grade
    >   from emp;
Query ID = root_20201204165914_7229d9ea-b045-423b-a240-b04e6e8276e1
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0090, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0090/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0090
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-04 16:59:21,504 Stage-1 map = 0%,  reduce = 0%
2020-12-04 16:59:28,775 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.75 sec
MapReduce Total cumulative CPU time: 6 seconds 750 msec
Ended Job = job_1606698967173_0090
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 6.75 sec   HDFS Read: 13014 HDFS Write: 660 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 750 msec
OK
ename   sal     sal_grade
ALLEN   1600.00 Low SAL
ADAMS   1100.00 Low SAL
MILLER  1300.00 Low SAL
WARD    1250.00 Low SAL
MARTIN  1250.00 Low SAL
BLAKE   2850.00 Middle SAL
SCOTT   3000.00 High SAL
SMITH   800.00  Low SAL
JAMES   950.00  Low SAL
FORD    3000.00 High SAL
JONES   2975.00 Middle SAL
CLARK   2450.00 Middle SAL
KING    5000.00 High SAL
TURNER  1500.00 Low SAL
Time taken: 15.382 seconds, Fetched: 14 row(s)
hive> 

2.7 where filtering

Suppose we only want to see the employees in department 10 rather than everyone; we can filter with a WHERE clause.

Code:

select ename,sal from emp where deptno = '10';

Test Record:

hive> select ename,sal from emp where deptno = '10';
Query ID = root_20201204170244_78b7dc0a-5e43-4183-bd47-9379092687cc
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0091, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0091/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0091
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-04 17:02:52,315 Stage-1 map = 0%,  reduce = 0%
2020-12-04 17:02:58,494 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.73 sec
MapReduce Total cumulative CPU time: 6 seconds 730 msec
Ended Job = job_1606698967173_0091
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 6.73 sec   HDFS Read: 12664 HDFS Write: 252 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 730 msec
OK
ename   sal
MILLER  1300.00
CLARK   2450.00
KING    5000.00
Time taken: 14.775 seconds, Fetched: 3 row(s)

2.8 group by grouping

Next, suppose we want the average salary of each department; this can be done with a GROUP BY clause.

Code:

select deptno,avg(sal) from emp group by deptno;

Test Record:

hive> 
    > select deptno,avg(sal) from emp group by deptno;
Query ID = root_20201204170424_288f6ce3-a3ee-4f7c-99b9-5634269bb613
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1606698967173_0092, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0092/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0092
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2020-12-04 17:04:31,900 Stage-1 map = 0%,  reduce = 0%
2020-12-04 17:04:39,100 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.19 sec
2020-12-04 17:04:44,246 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.29 sec
MapReduce Total cumulative CPU time: 6 seconds 290 msec
Ended Job = job_1606698967173_0092
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 6.29 sec   HDFS Read: 17311 HDFS Write: 168 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 290 msec
OK
deptno  _c1
10      2916.666667
20      2175.000000
30      1566.666667
Time taken: 20.501 seconds, Fetched: 3 row(s)
hive> 

2.9 Having Clause

HAVING filters the grouped result set, returning only the groups that satisfy its condition.

Code:

select deptno,avg(sal) avg_sal from emp group by deptno having avg(sal) > 2000;

Test Record:

hive> 
    > select deptno,avg(sal) avg_sal from emp group by deptno having avg(sal) > 2000;
Query ID = root_20201204170622_ee515280-33b8-4cf7-af56-a1cdb9731d38
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1606698967173_0093, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0093/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0093
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2020-12-04 17:06:29,633 Stage-1 map = 0%,  reduce = 0%
2020-12-04 17:06:36,835 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.31 sec
2020-12-04 17:06:43,012 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.56 sec
MapReduce Total cumulative CPU time: 7 seconds 560 msec
Ended Job = job_1606698967173_0093
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 7.56 sec   HDFS Read: 17746 HDFS Write: 141 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 560 msec
OK
deptno  avg_sal
10      2916.666667
20      2175.000000
Time taken: 21.566 seconds, Fetched: 2 row(s)
hive> 

3. Joins in Hive

Hive's joins are similar to those of relational databases, but note that Hive does not support non-equi join conditions.

Syntax:

join_table:
    table_reference [INNER] JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [join_condition] (as of Hive 0.10)

table_reference:
    table_factor
  | join_table

table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )

join_condition:
    ON expression

Hive supports the following join types:

INNER JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
FULL OUTER JOIN
LEFT SEMI JOIN

MapReduce implementation of joins

Test Data:

create table t1(id int);
create table t2(id int);

insert into t1 values(1);
insert into t1 values(2);

insert into t2 values(2);
insert into t2 values(3);

3.1 Inner join -- find the intersection

select t1.id,t2.id id2
from t1
inner join t2
     on t1.id = t2.id;
-- or, equivalently:
select t1.id,t2.id id2
from t1,t2
where t1.id = t2.id;
hive> 
    > select t1.id,t2.id id2
    > from t1
    > inner join t2
    >      on t1.id = t2.id;
Query ID = root_20201204191120_3861be11-a271-4bc4-b240-a5ec96b9794e
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0114, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0114/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0114
Hadoop job information for Stage-3: number of mappers: 2; number of reducers: 0
2020-12-04 19:11:34,556 Stage-3 map = 0%,  reduce = 0%
2020-12-04 19:11:42,800 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 6.34 sec
MapReduce Total cumulative CPU time: 6 seconds 340 msec
Ended Job = job_1606698967173_0114
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 2   Cumulative CPU: 6.34 sec   HDFS Read: 12447 HDFS Write: 190 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 340 msec
OK
t1.id   id2
2       2
Time taken: 23.749 seconds, Fetched: 1 row(s)
hive> select t1.id,t2.id id2
    > from t1,t2
    > where t1.id = t2.id;
Query ID = root_20201204191146_2e3e53c9-5f29-4d17-84f6-8222f5b4c742
Total jobs = 1
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0115, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0115/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0115
Hadoop job information for Stage-3: number of mappers: 2; number of reducers: 0
2020-12-04 19:12:01,580 Stage-3 map = 0%,  reduce = 0%
2020-12-04 19:12:07,771 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 6.67 sec
MapReduce Total cumulative CPU time: 6 seconds 670 msec
Ended Job = job_1606698967173_0115
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 2   Cumulative CPU: 6.67 sec   HDFS Read: 12773 HDFS Write: 190 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 670 msec
OK
t1.id   id2
2       2
Time taken: 22.941 seconds, Fetched: 1 row(s)
hive> 

3.2 Left join -- find all of A

select t1.id,t2.id id2
from t1
left join t2
  on t1.id = t2.id;

hive> select t1.id,t2.id id2
    > from t1
    > left join t2
    >   on t1.id = t2.id;
Query ID = root_20201204191035_0e063217-a8b4-4669-8a30-5e1be3e903eb
Total jobs = 1
WARNING: Use "yarn jar" to launch YARN applications.
2020-12-04 19:10:42     Uploaded 1 File to: file:/tmp/root/e8ed4bca-fbbf-4db0-b223-33d23b3bbc3a/hive_2020-12-04_19-10-35_645_5302676112925292414-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile91--.hashtable (296 bytes)
2020-12-04 19:10:42     End of local task; Time Taken: 0.963 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0113, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0113/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0113
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2020-12-04 19:10:49,515 Stage-3 map = 0%,  reduce = 0%
2020-12-04 19:10:56,723 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.98 sec
MapReduce Total cumulative CPU time: 1 seconds 980 msec
Ended Job = job_1606698967173_0113
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 1.98 sec   HDFS Read: 6112 HDFS Write: 120 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 980 msec
OK
t1.id   id2
1       NULL
2       2
Time taken: 22.147 seconds, Fetched: 2 row(s)

3.3 Left join -- find the A-B difference set

select t1.id,t2.id id2
from t1
left join t2
  on t1.id = t2.id
where t2.id is null;
hive> select t1.id,t2.id id2
    > from t1
    > left join t2
    >   on t1.id = t2.id
    > where t2.id is null;
Query ID = root_20201204190954_8ea563bb-c5e6-4d00-8262-ed1264c1c1cc
Total jobs = 1
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0112, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0112/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0112
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2020-12-04 19:10:08,983 Stage-3 map = 0%,  reduce = 0%
2020-12-04 19:10:15,161 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 3.0 sec
MapReduce Total cumulative CPU time: 3 seconds 0 msec
Ended Job = job_1606698967173_0112
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 3.0 sec   HDFS Read: 6586 HDFS Write: 104 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 0 msec
OK
t1.id   id2
1       NULL
Time taken: 22.658 seconds, Fetched: 1 row(s)

3.4 Full outer join -- find A union B

select t1.id id1,t2.id id2
from t1
full join t2
on t1.id = t2.id;
hive> 
    > select t1.id id1,t2.id id2
    > from t1
    > full join t2
    > on t1.id = t2.id
    > ;
Query ID = root_20201204190853_888f4198-8453-4c53-b8ce-bc06c59ebc6a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1606698967173_0111, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0111/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0111
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2020-12-04 19:09:00,901 Stage-1 map = 0%,  reduce = 0%
2020-12-04 19:09:07,088 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 3.77 sec
2020-12-04 19:09:10,186 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.55 sec
2020-12-04 19:09:15,339 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.94 sec
MapReduce Total cumulative CPU time: 7 seconds 940 msec
Ended Job = job_1606698967173_0111
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 7.94 sec   HDFS Read: 17569 HDFS Write: 137 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 940 msec
OK
id1     id2
1       NULL
2       2
NULL    3
Time taken: 22.535 seconds, Fetched: 3 row(s)

3.5 Symmetric difference -- A union B minus the intersection

select t1.id id1,t2.id id2
from t1
left join t2
on t1.id = t2.id
where t2.id is null
union all
select t1.id id1,t2.id id2
from t1
right join t2
on t1.id = t2.id
where t1.id is null;
hive> 
    > select t1.id id1,t2.id id2
    > from t1
    > left join t2
    > on t1.id = t2.id
    > where t2.id is null
    > union all
    > select t1.id id1,t2.id id2
    > from t1
    > right join t2
    > on t1.id = t2.id
    > where t1.id is null;
Query ID = root_20201204190745_d1e37397-4a04-44b5-920b-cc9e3327d6ac
Total jobs = 1
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0110, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0110/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0110
Hadoop job information for Stage-2: number of mappers: 3; number of reducers: 0
2020-12-04 19:07:59,931 Stage-2 map = 0%,  reduce = 0%
2020-12-04 19:08:08,176 Stage-2 map = 67%,  reduce = 0%, Cumulative CPU 5.88 sec
2020-12-04 19:08:12,287 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 8.79 sec
MapReduce Total cumulative CPU time: 8 seconds 790 msec
Ended Job = job_1606698967173_0110
MapReduce Jobs Launched: 
Stage-Stage-2: Map: 3   Cumulative CPU: 8.79 sec   HDFS Read: 23996 HDFS Write: 295 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 790 msec
OK
_u1.id1 _u1.id2
1       NULL
NULL    3
Time taken: 27.58 seconds, Fetched: 2 row(s)

3.6 Right join -- find the B-A difference set

select t1.id,t2.id id2
from t1
right join t2
  on t1.id = t2.id
where t1.id is null;
hive> select t1.id,t2.id id2
    > from t1
    > right join t2
    >   on t1.id = t2.id
    > where t1.id is null;
Query ID = root_20201204190148_850cffa0-f440-4feb-b85f-6d014f9c6c3f
Total jobs = 1
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0105, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0105/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0105
Hadoop job information for Stage-3: number of mappers: 2; number of reducers: 0
2020-12-04 19:02:03,141 Stage-3 map = 0%,  reduce = 0%
2020-12-04 19:02:09,326 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 6.04 sec
MapReduce Total cumulative CPU time: 6 seconds 40 msec
Ended Job = job_1606698967173_0105
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 2   Cumulative CPU: 6.04 sec   HDFS Read: 13009 HDFS Write: 191 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 40 msec
OK
t1.id   id2
NULL    3
Time taken: 22.345 seconds, Fetched: 1 row(s)
hive> 

3.7 Right join -- find all of B

select t1.id,t2.id id2
from t1
right join t2
 on t1.id = t2.id;
hive> select t1.id,t2.id id2
    > from t1
    > right join t2
    >  on t1.id = t2.id;
Query ID = root_20201204190106_2d049d88-62e4-4e51-88e4-f005248cff60
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2020-12-04 19:01:12     Starting to launch local task to process map join;      maximum memory = 1908932608
2020-12-04 19:01:13     Dump the side-table for tag: 0 with group count: 2 into file: file:/tmp/root/e8ed4bca-fbbf-4db0-b223-33d23b3bbc3a/hive_2020-12-04_19-01-06_491_8753533712871347988-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile40--.hashtable
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0104, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0104/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0104
Hadoop job information for Stage-3: number of mappers: 2; number of reducers: 0
2020-12-04 19:01:20,696 Stage-3 map = 0%,  reduce = 0%
2020-12-04 19:01:26,882 Stage-3 map = 50%,  reduce = 0%, Cumulative CPU 1.97 sec
2020-12-04 19:01:27,911 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 3.95 sec
MapReduce Total cumulative CPU time: 3 seconds 950 msec
Ended Job = job_1606698967173_0104
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 2   Cumulative CPU: 3.95 sec   HDFS Read: 12061 HDFS Write: 207 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 950 msec
OK
t1.id   id2
2       2
NULL    3
Time taken: 22.507 seconds, Fetched: 2 row(s)
hive> 

3.8 left semi join

For this kind of filtering (including one-to-many cases), IN and EXISTS subqueries are commonly used; in Hive, LEFT SEMI JOIN achieves the same effect.

SELECT a.* FROM a WHERE a.key IN
(SELECT b.key FROM b WHERE b.value > 100);

is equivalent to:

SELECT a.* FROM a LEFT SEMI JOIN b ON
(a.key = b.key AND b.value > 100);
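The equivalence can be checked with any SQL engine. The sketch below uses Python's built-in sqlite3 (SQLite, not Hive, and hypothetical data); since SQLite has no LEFT SEMI JOIN keyword, the semi join is emulated with an inner join plus DISTINCT:

```python
import sqlite3

# Two tiny tables mirroring the a/b example above (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (key INTEGER, val TEXT);
    CREATE TABLE b (key INTEGER, value INTEGER);
    INSERT INTO a VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO b VALUES (1, 50), (2, 150), (2, 200), (3, 300);
""")

# IN-subquery form.
in_rows = conn.execute(
    "SELECT a.key, a.val FROM a WHERE a.key IN "
    "(SELECT b.key FROM b WHERE b.value > 100) ORDER BY a.key"
).fetchall()

# Semi-join form: emulated with an inner join plus DISTINCT, so each
# left-side row appears at most once even when it matches several b rows.
join_rows = conn.execute(
    "SELECT DISTINCT a.key, a.val FROM a JOIN b "
    "ON a.key = b.key AND b.value > 100 ORDER BY a.key"
).fetchall()

print(in_rows)
print(join_rows)
```

Both queries return the same rows of a, which is exactly the semi-join guarantee: the right table only filters, it never multiplies rows.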

4. Sorting clause

Hive provides four sorting-related clauses:

1) ORDER BY
Consistent with traditional RDBMS SQL semantics: the result set is sorted globally.
In the MapReduce implementation this requires shuffling the entire result set to a single reducer,
so if the data volume is very large that reducer can run for a very long time.
A LIMIT is usually added to bound the sorted output.

2) SORT BY
Hive-specific: each reducer sorts its own output locally; there is no global order.
With a single reducer the semantics are the same as ORDER BY; otherwise the result sets differ.

3) DISTRIBUTE BY
Controls the shuffle key of the map output (by default the hash code of the key).
Typically used to send rows with the same key to the same reducer.

4) CLUSTER BY
Shorthand for DISTRIBUTE BY ... SORT BY ... when both use the same columns, sorted in ascending order.
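The interaction of DISTRIBUTE BY and SORT BY can be sketched procedurally. The toy Python model below (not Hive's actual shuffle; the hash function and rows are made up) partitions (deptno, sal) rows by a hash of deptno across two "reducers" and then sorts each partition locally, showing that the concatenated output is not globally ordered:

```python
# Toy model of: SELECT ... FROM emp DISTRIBUTE BY deptno SORT BY sal
rows = [(20, 800), (30, 1600), (10, 5000), (30, 1250), (20, 3000)]
num_reducers = 2

# DISTRIBUTE BY deptno: all rows with the same deptno go to one reducer
# (the modulo "hash" here is purely illustrative).
partitions = [[] for _ in range(num_reducers)]
for deptno, sal in rows:
    partitions[(deptno // 10) % num_reducers].append((deptno, sal))

# SORT BY sal: each reducer sorts only its own rows.
for part in partitions:
    part.sort(key=lambda r: r[1])

print(partitions)  # sorted within each partition, not across partitions
```

Concatenating the partitions does not yield a globally sorted list, which is exactly why SORT BY alone cannot replace ORDER BY.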

5. Sampling (TABLESAMPLE)

Sampling is especially important when the data volume is so large that processing the full dataset is impractical. A sample lets you estimate and infer characteristics of the whole from the extracted data; it is a cost-effective method commonly used in scientific experiments, quality inspection, and social surveys.

Hive supports bucket-table sampling and block sampling. A bucketed table is one created with the CLUSTERED BY clause. The syntax for bucket-table sampling is:

table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])

The TABLESAMPLE clause lets a query sample data rather than read the entire table. It appears in the FROM clause and can be used on any table. Bucket numbering starts from 1. colname is the column the sample is drawn on; it can be any non-partition column, or rand() to sample on the whole row rather than a single column. Rows are hashed on colname into y buckets numbered 1 through y, and the query returns the rows belonging to bucket x. The following example returns the rows in the 3rd bucket out of 32:

SELECT *
FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;

Normally, TABLESAMPLE scans the whole table and then samples, which is obviously inefficient. However, since the bucketing columns are specified with CLUSTERED BY at table-creation time, if the column in the TABLESAMPLE clause matches the column in the CLUSTERED BY clause, TABLESAMPLE scans only the buckets it needs. If the source table above was created with CLUSTERED BY id INTO 32 BUCKETS, the following statement returns the rows in the 3rd and 19th of those clusters, because each sample bucket is composed of (32/16) = 2 of the original clusters. Why 3 and 19? Clusters map to sample buckets by a simple modulo of the hash (3 % 16 = 19 % 16), so the 3rd sample bucket is made up of the original 3rd and 19th clusters.

TABLESAMPLE(BUCKET 3 OUT OF 16 ON id)

Conversely, the following statement returns half of the 3rd cluster, since each sample bucket is composed of (32/64) = 1/2 of a cluster.

TABLESAMPLE(BUCKET 3 OUT OF 64 ON id)  
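The 3-and-19 arithmetic can be checked directly; a toy Python sketch (pure modular arithmetic on 1-based cluster numbers, not Hive's actual hash function):

```python
# A table bucketed CLUSTERED BY id INTO 32 BUCKETS, sampled with
# TABLESAMPLE(BUCKET x OUT OF y ON id): cluster c (1-based) feeds
# sample bucket ((c - 1) % y) + 1, so we keep the clusters that
# map to sample bucket x.
num_clusters = 32
x, y = 3, 16

selected = [c for c in range(1, num_clusters + 1) if (c - 1) % y == x - 1]
print(selected)
```

With 32 clusters and y = 16, the modulo maps exactly two clusters onto each sample bucket, matching the (32/16) = 2 count in the text.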

Block sampling has been available since Hive 0.8; the syntax is:

block_sample: TABLESAMPLE (n PERCENT)  

This statement samples at least n% of the data size (not the row count) as input. It requires CombineHiveInputFormat, and some special compression formats cannot be processed. If sampling fails, the input of the MapReduce job will be the entire table. Because sampling is done at the HDFS block level, the sampling granularity is the block size: for example, with a 256 MB block size, even if n% of the input is only 100 MB you will still get 256 MB of data. In the example below, 0.1% or more of the input will be used for the query:

SELECT *
FROM source TABLESAMPLE(0.1 PERCENT) s;
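The round-up effect can be quantified with simple arithmetic; the sketch below uses illustrative numbers (a 10 GB table and 256 MB blocks), not figures measured from Hive:

```python
import math

# Block sampling rounds the requested size up to whole HDFS blocks.
block_size_mb = 256
table_size_mb = 10 * 1024                  # hypothetical 10 GB table
requested_mb = table_size_mb * 0.001       # 0.1 PERCENT ~= 10.24 MB

blocks_read = max(1, math.ceil(requested_mb / block_size_mb))
actual_mb = blocks_read * block_size_mb    # what is actually read
print(actual_mb)
```

Even though 0.1 percent of the table is only about 10 MB, the sample cannot be smaller than one block, so a full 256 MB block is read.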

To draw the sample from different blocks across runs, change the seed parameter:

set hive.sample.seednumber=<INTEGER>;  

You can also specify the amount of data to read as an absolute byte length. This method has the same restrictions as PERCENT sampling, because it merely replaces the percentage with a concrete size and is still block-based. The syntax is:

block_sample: TABLESAMPLE (ByteLengthLiteral)  

ByteLengthLiteral : (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')  

In the example below, 100 MB or more of the input will be used for the query:

SELECT *  
FROM source TABLESAMPLE(100M) s; 

Hive also supports limiting the input by row count, which behaves differently from the two methods above. First, CombineHiveInputFormat is not required, so it can be used on non-native tables. Second, the row count is applied to each input split, so the total number of rows returned varies greatly with the number of splits. The syntax is:

block_sample: TABLESAMPLE (n ROWS) 

For example, the following query extracts 10 rows from each split:

SELECT * FROM source TABLESAMPLE(10 ROWS);  

References:

1.https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select

Tags: Big Data hive

Posted by sarika on Sat, 30 Apr 2022 08:55:28 +0300