Chapter 6 Simple Hive queries (where, group by, join, order by, etc.)

6.1 Basic query

Data file emp.txt:

7369	SMITH	CLERK	7902	1980-12-17	800.00		20
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.00	300.00	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.00	500.00	30
7566	JONES	MANAGER	7839	1981-4-2	2975.00		20
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.00	1400.00	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.00		30
7782	CLARK	MANAGER	7839	1981-6-9	2450.00		10
7788	SCOTT	ANALYST	7566	1987-4-19	3000.00		20
7839	KING	PRESIDENT		1981-11-17	5000.00		10
7844	TURNER	SALESMAN	7698	1981-9-8	1500.00	0.00	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.00		20
7900	JAMES	CLERK	7698	1981-12-3	950.00		30
7902	FORD	ANALYST	7566	1981-12-3	3000.00		20
7934	MILLER	CLERK	7782	1982-1-23	1300.00		10

Data file dept.txt:

10	ACCOUNTING	1700
20	RESEARCH	1800
30	SALES	1900
40	OPERATIONS	1700

6.1.1 Create department table

create table if not exists dept(
deptno int,
dname string,
loc int
)
row format delimited fields terminated by '\t';

Table created successfully.

Create employee table

create table if not exists emp(
empno int ,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
) row format delimited fields terminated by '\t';


Import Data

load data local inpath '/opt/module/datas/dept.txt' into table dept;
load data local inpath '/opt/module/datas/emp.txt' into table emp;


Full table query

  select * from emp;

Query specific columns

  select empno,ename from emp;

Remarks:

1. SQL is case insensitive.
2. SQL can be written on one line or across multiple lines.
3. Keywords cannot be abbreviated or split across lines.
4. Each clause should generally be written on its own line.
5. Use indentation to improve the readability of statements.

6.1.2 Column Aliases

1. Renames a column.
2. Convenient for calculations (when a column name is too long or does not fit the context).
3. The alias follows the column name; the keyword AS may optionally be added between the column name and the alias.

Case:
Query the employee name and department number

select ename as name,deptno dn from emp;

6.1.3 Arithmetic operators

https://issues.apache.org/jira/browse/HIVE-2810

6.1.4 Common functions

1. Find the total number of rows (count)

select count(*) as cnt from emp;

2. Find the maximum salary (max)

select max(sal) as max_sal from emp;

3. Find the minimum salary (min)

select min(sal) as min_sal from emp;

4. Find the sum of wages (sum)

select sum(sal) as sum_sal from emp;

5. Find the average salary (avg)

select avg(sal) as avg_sal from emp;

LIMIT clause
Purpose: limit the number of rows returned

 select * from emp limit 5;

6.2 where statement

1. Use the WHERE clause to filter out rows that do not meet the conditions.
2. The WHERE clause immediately follows the FROM clause.

Case:
Query all employees with salary greater than 1000

 select * from emp where sal > 1000;
 

6.2.1 Comparison Operators (Between/In/ Is Null)

1) The predicate operators are listed below; these operators can also be used in JOIN...ON and HAVING statements.

operator	meaning
A = B	TRUE if A equals B, NULL if either operand is NULL
A <=> B	like =, but returns TRUE if both operands are NULL and FALSE if only one is NULL
A <> B, A != B	TRUE if A is not equal to B
A < B, A <= B, A > B, A >= B	comparison; NULL if either operand is NULL
A [NOT] BETWEEN B AND C	TRUE if A lies in the range [B, C]
A IS [NOT] NULL	NULL test
A [NOT] IN (v1, v2, ...)	membership test
A [NOT] LIKE B	SQL wildcard match
A RLIKE B, A REGEXP B	Java regular-expression match
Case practice
(1) Query all employees whose salary is equal to 5000

hive (default)> select * from emp where sal =5000;

(2) Query the information of employees whose salary is between 500 and 1000

hive (default)> select * from emp where sal between 500 and 1000;

(3) Query the information of all employees whose comm is empty

hive (default)> select * from emp where comm is null;

(4) Query the information of employees whose salary is 1500 or 5000

hive (default)> select * from emp where sal IN (1500, 5000);

6.2.2 Like and RLike

1) Use the LIKE operator to select values similar to a pattern.
2) The selection criteria can contain characters or numbers:
% represents zero or more characters (any number of characters).
_ represents exactly one character.
3) The RLIKE clause is a Hive extension of this functionality: it lets you specify matching conditions with Java regular expressions, a more powerful language.

4) Case practice
(1) Find the information of employees whose salary starts with 2

hive (default)> select * from emp where sal LIKE '2%';
(2) Find employees whose salary has 2 as the second digit
hive (default)> select * from emp where sal LIKE '_2%';
(3) Find employees whose salary contains the digit 2
hive (default)> select * from emp where sal RLIKE '[2]';
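To make the wildcard semantics concrete, here is a small Python sketch (an illustration only, not how Hive implements matching): LIKE's % and _ correspond to the regex fragments .* and . applied as a full match, while RLIKE searches for the regex anywhere in the value.

```python
import re

def sql_like(value, pattern):
    """Approximate SQL LIKE: % -> .*, _ -> . , matched against the whole value."""
    regex = re.escape(pattern).replace("%", ".*").replace("_", ".")
    return re.fullmatch(regex, value) is not None

def sql_rlike(value, pattern):
    """RLIKE matches if the regex is found anywhere in the value."""
    return re.search(pattern, value) is not None

salaries = ["800.0", "1250.0", "2975.0", "2850.0", "2450.0"]
starts_with_2 = [s for s in salaries if sql_like(s, "2%")]    # LIKE '2%'
second_char_2 = [s for s in salaries if sql_like(s, "_2%")]   # LIKE '_2%'
contains_2    = [s for s in salaries if sql_rlike(s, "[2]")]  # RLIKE '[2]'
```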

6.2.3 Logical Operators (And/Or/Not)

operator	meaning
AND	logical and
OR	logical or
NOT	logical not

Case practice
(1) Query employees whose salary is greater than 1000 and whose department is 30

hive (default)> select * from emp where sal>1000 and deptno=30;
(2) Query employees whose salary is greater than 1000 or whose department is 30
hive (default)> select * from emp where sal>1000 or deptno=30;
(3) Query employees except those in departments 20 and 30
hive (default)> select * from emp where deptno not IN(30, 20);

6.3 Grouping

6.3.1 Group By Statement

The GROUP BY statement is often used together with aggregate functions: it groups rows by one or more columns, then applies an aggregate operation to each group.
Case practice:
(1) Calculate the average salary of each department in the emp table
hive (default)> select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;
(2) Calculate the maximum salary for each job in each department of emp
hive (default)> select t.deptno, t.job, max(t.sal) max_sal from emp t group by t.deptno, t.job;

6.3.2 Having Statement

1. The difference between HAVING and WHERE
(1) Aggregate functions cannot be used after WHERE, but they can be used after HAVING.
(2) HAVING is only used with GROUP BY statements.
2. Case practice
(1) Find the departments in which the average salary of each department is greater than 2000
Find the average salary for each department
hive (default)> select deptno, avg(sal) from emp group by deptno;
Find the departments where the average salary of each department is greater than 2000
hive (default)> select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;
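The order of operations can be mimicked outside Hive. The Python sketch below (an illustration with the emp salaries hard-coded, not Hive code) groups salaries by department first and only then applies the HAVING-style filter on the aggregated value:

```python
from collections import defaultdict

# (deptno, sal) pairs taken from the emp table above
rows = [(20, 800.0), (30, 1600.0), (30, 1250.0), (20, 2975.0), (30, 1250.0),
        (30, 2850.0), (10, 2450.0), (20, 3000.0), (10, 5000.0), (30, 1500.0),
        (20, 1100.0), (30, 950.0), (20, 3000.0), (10, 1300.0)]

groups = defaultdict(list)
for deptno, sal in rows:
    groups[deptno].append(sal)          # GROUP BY deptno

avg_sal = {d: sum(s) / len(s) for d, s in groups.items()}
result = {d: a for d, a in avg_sal.items() if a > 2000}   # HAVING avg_sal > 2000
```

Departments 10 and 20 pass the filter; department 30's average (about 1566.67) does not.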

6.4 Join Statement

6.4.1 Equivalent Join

Hive supports the usual SQL JOIN statement, but only supports equi-join, not non-equi-join.
Case practice
(1) According to the equality of the department numbers in the employee table and the department table, query the employee number, employee name and department name;
hive (default)> select e.empno, e.ename, d.deptno, d.dname from emp e join dept d on e.deptno = d.deptno;

6.4.2 Aliases for tables

1. benefit
(1) Use aliases to simplify queries.
(2) Using the table name prefix can improve the execution efficiency.
2. Case practice
Merge employee table and department table
hive (default)> select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno = d.deptno;

6.4.3 Inner join

Inner join: only rows that satisfy the join condition in both tables being joined are kept.
hive (default)> select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno = d.deptno;

6.4.4 Left Outer Join

Left outer join: all records in the table on the left side of the JOIN operator that comply with the WHERE clause are returned.
hive (default)> select e.empno, e.ename, d.deptno from emp e left join dept d on e.deptno = d.deptno;
6.4.5 Right Outer Join
Right outer join: all records in the table on the right side of the JOIN operator that comply with the WHERE clause are returned.

hive (default)> select e.empno, e.ename, d.deptno from emp e right join dept d on e.deptno = d.deptno;

6.4.6 Full outer join

Full outer join: returns all records from both tables that comply with the WHERE clause. If the specified field of either table has no matching value, NULL is used instead.

hive (default)> select e.empno, e.ename, d.deptno from emp e full join dept d on e.deptno = d.deptno;

6.4.7 Multi-table join

Note: To join n tables, at least n-1 join conditions are required. For example: to join three tables, at least two join conditions are required.
data preparation
location.txt

1700	Beijing
1800	London
1900	Tokyo

1. Create a location table

create table if not exists location(
loc int,
loc_name string
)
row format delimited fields terminated by '\t';

2. Import Data
hive (default)> load data local inpath '/opt/module/datas/location.txt' into table location;
3. Multi-table join query
hive (default)>

SELECT e.ename, d.dname, l.loc_name
FROM   emp e 
JOIN   dept d
ON     d.deptno = e.deptno 
JOIN   location l
ON     d.loc = l.loc;

In most cases, Hive starts a MapReduce job for each pair of JOIN objects. In this example, a MapReduce job is started first to join table e and table d, and then a second MapReduce job joins the output of the first job with table l.

Note: why are table d and table l (location) not joined first? Because Hive always executes joins from left to right.
Optimization: when joining 3 or more tables, if every ON clause uses the same join key, only one MapReduce job is produced.

6.4.8 Cartesian product

1. A Cartesian product occurs under the following conditions:
(1) The join condition is omitted
(2) The join condition is invalid
(3) All rows in all tables are joined with each other

2. Case practice
hive (default)> select empno, dname from emp, dept;
6.4.9 OR is not supported in join predicates

Hive joins do not currently support using the OR predicate in the ON clause.

hive (default)> select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno = d.deptno or e.ename = d.ename;   -- incorrect

6.5 Sorting

6.5.1 Global Sort (Order By)

Order By: global sort; there is only one Reducer.
1. Sort using the ORDER BY clause
ASC (ascend): ascending order (default)
DESC (descend): descending order
2. ORDER BY clause at the end of the SELECT statement
3. Case practice
(1) Query employee information in ascending order of salary
hive (default)> select * from emp order by sal;
(2) Query employee information in descending order of salary
hive (default)> select * from emp order by sal desc;

6.5.2 Sort by alias

Sort employees by twice their salary
hive (default)> select ename, sal*2 twosal from emp order by twosal;

6.5.3 Multiple Column Sorting

Sort by department and salary in ascending order
hive (default)> select ename, deptno, sal from emp order by deptno, sal;

6.5.4 Each MapReduce Internal Sort (Sort By)

Sort By: order by is very inefficient on large datasets. In many cases a global sort is not required; in those cases sort by can be used instead.
Sort by produces one sorted file per reducer. Sorting happens inside each Reducer, not over the global result set.
1. Set the number of reducers
hive (default)> set mapreduce.job.reduces=3;
2. View the configured number of reducers
hive (default)> set mapreduce.job.reduces;
3. View employee information in descending order by department number
hive (default)> select * from emp sort by deptno desc;
4. Import the query results into a file (sort by department number in descending order)
hive (default)> insert overwrite local directory '/opt/module/datas/sortby-result'
select * from emp sort by deptno desc;

6.5.5 Distribute By

Distribute By: in some cases we need to control which reducer a particular row goes to, usually for a subsequent aggregation. The distribute by clause does this. distribute by is similar to partition (custom partitioning) in MapReduce, and it is used in combination with sort by.
When testing distribute by, be sure to allocate multiple reducers; otherwise the effect of distribute by cannot be observed.
Case practice:
(1) Divide by department number first, and then sort by employee number in descending order.
hive (default)> set mapreduce.job.reduces=3;
hive (default)> insert overwrite local directory '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;

Notice:
1. The partitioning rule of distribute by is to take the hash code of the partition field modulo the number of reducers; rows with the same remainder go into the same partition.
2. Hive requires the DISTRIBUTE BY clause to be written before the SORT BY clause.
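As a sketch of rule 1 (illustrative Python, not Hive's implementation), rows are routed to reducers by taking the hash of the distribute-by field modulo the reducer count, so all rows with the same deptno land on the same reducer:

```python
NUM_REDUCERS = 3

# (empno, deptno) pairs; deptno is the distribute-by field
rows = [(7782, 10), (7369, 20), (7499, 30), (7839, 10), (7844, 30)]

partitions = {r: [] for r in range((NUM_REDUCERS))}
for empno, deptno in rows:
    # hash(deptno) == deptno for small Python ints; Hive uses the column's hash code
    partitions[hash(deptno) % NUM_REDUCERS].append(empno)
```

With 3 reducers, deptno 10 hashes to partition 1, deptno 20 to partition 2, and deptno 30 to partition 0; sort by then orders the rows inside each partition.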

6.5.6 Cluster By

When the distribute by and sort by fields are the same, cluster by can be used instead.
In addition to the function of distribute by, cluster by also has the function of sort by. However, the sorting can only be in ascending order, and the sorting rule cannot be specified as ASC or DESC.
1) The following two ways of writing are equivalent
hive (default)> select * from emp cluster by deptno;
hive (default)> select * from emp distribute by deptno sort by deptno;
Note: partitioning by department number does not mean one fixed department per partition; for example, departments 20 and 30 may both be assigned to the same partition.

6.6 Bucketing and Sampling Query

6.6.1 Bucket Table Data Storage

Partitions provide a convenient way to isolate data and optimize queries. However, not all datasets can be partitioned sensibly. For a table or partition, Hive can go further and organize the data into buckets, a more fine-grained division of the data.
Bucketing is another technique for breaking a dataset into more manageable parts.
Partitioning acts on the data storage path; bucketing acts on the data files.
1. First create a bucketed table, importing data files directly
(1) Data preparation
student.txt

1001	ss1
1002	ss2
1003	ss3
1004	ss4
1005	ss5
1006	ss6
1007	ss7
1008	ss8
1009	ss9
1010	ss10
1011	ss11
1012	ss12
1013	ss13
1014	ss14
1015	ss15
1016	ss16

(2) Create a bucket table

create table stu_buck(id int, name string)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';

(3) View the table structure
hive (default)> desc formatted stu_buck;

Num Buckets:            4     

(4) Import data into the bucket table
hive (default)> load data local inpath '/opt/module/datas/student.txt' into table stu_buck;

(5) Check whether the created bucket table is divided into 4 buckets, as shown in Figure 6-7

It was found that the table was not divided into 4 buckets. Why?
2. When creating a bucket table, data is imported through subqueries
(1) First build an ordinary stu table

create table stu(id int, name string)
row format delimited fields terminated by '\t';

(2) Import data into the ordinary stu table

load data local inpath '/opt/module/datas/student.txt' into table stu;

(3) Clear the data in the stu_buck table

truncate table stu_buck;
select * from stu_buck;

(4) Import data into the bucket table by sub-query

insert into table stu_buck
select id, name from stu;

(5) It is found that there is still only one bucket, as shown in Figure 6-8

(6) A property needs to be set
hive (default)> set hive.enforce.bucketing=true;
hive (default)> set mapreduce.job.reduces=-1;
hive (default)> insert into table stu_buck
select id, name from stu;

(7) Query bucketed data
hive (default)> select * from stu_buck;
OK
stu_buck.id     stu_buck.name
1004    ss4
1008    ss8
1012    ss12
1016    ss16
1001    ss1
1005    ss5
1009    ss9
1013    ss13
1002    ss2
1006    ss6
1010    ss10
1014    ss14
1003    ss3
1007    ss7
1011    ss11
1015    ss15

Bucket rules:
According to the results above, Hive buckets a record by hashing the value of the bucketing field and taking it modulo the number of buckets; the remainder determines which bucket the record is stored in.
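That rule can be checked against the query output above with a few lines of Python (an illustration only; for an int bucketing column the hash is the value itself):

```python
NUM_BUCKETS = 4
ids = list(range(1001, 1017))           # the 16 student ids

buckets = {b: [] for b in range(NUM_BUCKETS)}
for i in ids:
    buckets[i % NUM_BUCKETS].append(i)  # hash(id) % number of buckets

# bucket 0 holds ids divisible by 4 (1004, 1008, 1012, 1016),
# matching the first rows of the query output
```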

6.6.2 Bucket sampling query

For very large datasets, sometimes users need to use a representative query result instead of all the results. Hive can meet this requirement by sampling the table.
Query the data in the table stu_buck.
hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);
Note: tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y).
y must be a multiple or a factor of the table's total number of buckets. Hive determines the sampling ratio from y. For example, if the table has 4 buckets in total: when y=2, the data of (4/2=) 2 buckets is extracted; when y=8, the data of (4/8=) 1/2 of a bucket is extracted.
x indicates which bucket extraction starts from. If several buckets are extracted, each subsequent bucket number is the previous one plus y. For example, with 4 buckets in total, tablesample(bucket 1 out of 2) extracts (4/2=) 2 buckets of data: bucket 1 (x) and bucket 3 (x+y).
Note: the value of x must be less than or equal to the value of y; otherwise:
FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck
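A small Python sketch of the bucket-selection rule described above (illustrative only; it assumes y is a factor of the bucket count, so every sampled unit is a whole bucket):

```python
def sampled_buckets(total_buckets, x, y):
    """TABLESAMPLE(BUCKET x OUT OF y): start at bucket x (1-based), step by y."""
    assert x <= y, "Numerator should not be bigger than denominator"
    count = total_buckets // y          # how many whole buckets are sampled
    return [x + k * y for k in range(count)]

# For a table with 4 buckets, BUCKET 1 OUT OF 2 reads buckets 1 and 3
```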

6.7 Other common query functions

6.7.1 Null field assignment

1. Function description
NVL: assigns a value to data that is NULL. Format: NVL(value, default_value). If value is NULL, NVL returns default_value; otherwise it returns value. If both arguments are NULL, it returns NULL.
2. Data preparation: use the employee table
3. Query: if the employee's comm is NULL, replace it with -1
hive (default)> select comm, nvl(comm, -1) from emp;
OK

comm    _c1
NULL    -1.0
300.0   300.0
500.0   500.0
NULL    -1.0
1400.0  1400.0
NULL    -1.0
NULL    -1.0
NULL    -1.0
NULL    -1.0
0.0     0.0
NULL    -1.0
NULL    -1.0
NULL    -1.0
NULL    -1.0
4. Query: if the employee's comm is NULL, replace it with the manager's id (mgr)
hive (default)> select comm, nvl(comm,mgr) from emp;
OK
comm    _c1
NULL    7902.0
300.0   300.0
500.0   500.0
NULL    7839.0
1400.0  1400.0
NULL    7839.0
NULL    7839.0
NULL    7566.0
NULL    NULL
0.0     0.0
NULL    7788.0
NULL    7698.0
NULL    7566.0
NULL    7782.0
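NVL's behaviour is easy to state in Python (a sketch, with SQL NULL modelled as None):

```python
def nvl(value, default_value):
    """Return default_value when value is NULL (None), else return value."""
    return default_value if value is None else value
```

This matches the outputs above: nvl(None, -1) gives -1, nvl(300.0, -1) gives 300.0, and KING, whose comm and mgr are both NULL, stays NULL.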
6.7.2 CASE WHEN

1. Data preparation

name	dept_id	sex
Goku	A	male
the sea	A	male
Song Song	B	male
Sister Feng	A	Female
Sister Ting	B	Female
Tingting	B	Female

2. Requirement
Find the number of men and women in each department. The desired result is:

A     2       1
B     1       2

3. Create local emp_sex.txt and import data
[huang@bigdata3 datas]$ vi emp_sex.txt

Goku	A	male
the sea	A	male
Song Song	B	male
Sister Feng	A	Female
Sister Ting	B	Female
Tingting	B	Female

4. Create hive table and import data

create table emp_sex(
name string, 
dept_id string, 
sex string) 
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/datas/emp_sex.txt' into table emp_sex;

5. Query data on demand

select 
  dept_id,
  sum(case sex when 'male' then 1 else 0 end) male_count,
  sum(case sex when 'Female' then 1 else 0 end) female_count
from 
  emp_sex
group by
  dept_id;
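The conditional-aggregation pattern above can be paraphrased in Python (an illustration, not Hive code): the CASE expression contributes 1 or 0 to each SUM depending on the sex value.

```python
from collections import defaultdict

rows = [("Goku", "A", "male"), ("the sea", "A", "male"), ("Song Song", "B", "male"),
        ("Sister Feng", "A", "Female"), ("Sister Ting", "B", "Female"),
        ("Tingting", "B", "Female")]

counts = defaultdict(lambda: [0, 0])    # dept_id -> [male_count, female_count]
for name, dept_id, sex in rows:
    counts[dept_id][0] += 1 if sex == "male" else 0    # sum(case sex when 'male' then 1 else 0 end)
    counts[dept_id][1] += 1 if sex == "Female" else 0  # sum(case sex when 'Female' then 1 else 0 end)
```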

6.7.3 Row to Column

1. Related function description
CONCAT(string A/col, string B/col…): Returns the result of the concatenation of the input strings, supports any number of input strings;
CONCAT_WS(separator, str1, str2, …): a special form of CONCAT(). The first argument is the separator placed between the remaining arguments; it can be any string, including the same string as the remaining arguments. If the separator is NULL, the return value is also NULL. The function skips any NULL values after the separator argument. The separator is added between each pair of strings being concatenated.
COLLECT_SET(col): accepts only primitive data types; its main purpose is to deduplicate and aggregate the values of a field into an array-type field.
2. data preparation
Table 6-6 Data preparation

name	constellation	blood_type
Sun Wukong	Aries	A
the sea	Sagittarius	A
Song Song	Aries	B
Pig Bajie	Aries	A
Sister Feng	Sagittarius	A

3. Requirement
Group people with the same zodiac sign and blood type together. The desired result is:

Sagittarius,A	the sea|Sister Feng
Aries,A	Sun Wukong|Pig Bajie
Aries,B	Song Song

4. Create local constellation.txt, import data
[huang@bigdata03 datas]$ vi constellation.txt

Sun Wukong	Aries	A
the sea	Sagittarius	A
Song Song	Aries	B
Pig Bajie	Aries	A
Sister Feng	Sagittarius	A
5. Create hive table and import data

create table person_info(
name string, 
constellation string, 
blood_type string) 
row format delimited fields terminated by "\t";
load data local inpath "/opt/module/datas/constellation.txt" into table person_info;

6. Query data on demand

select
    t1.base,
    concat_ws('|', collect_set(t1.name)) name
from
    (select
        name,
        concat(constellation, ",", blood_type) base
    from
        person_info) t1
group by
    t1.base;
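In Python terms (a sketch, not Hive code), the inner query builds the concatenated grouping key, and the outer query collects the distinct names per key and joins them with |:

```python
people = [("Sun Wukong", "Aries", "A"), ("the sea", "Sagittarius", "A"),
          ("Song Song", "Aries", "B"), ("Pig Bajie", "Aries", "A"),
          ("Sister Feng", "Sagittarius", "A")]

groups = {}
for name, constellation, blood_type in people:
    base = constellation + "," + blood_type       # concat(constellation, ',', blood_type)
    if name not in groups.setdefault(base, []):   # collect_set keeps distinct values
        groups[base].append(name)

result = {base: "|".join(names) for base, names in groups.items()}  # concat_ws('|', ...)
```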

6.7.4 Column to Row

1. Function description
EXPLODE(col): Split the complex array or map structure in a column of hive into multiple rows.
LATERAL VIEW
Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias
Explanation: used together with UDTFs such as split and explode. It splits one column of data into multiple rows, and on that basis the split data can be aggregated.
2. data preparation

movie	category
Suspect Tracking	Suspense,action,science fiction,plot
Lie to Me	Suspense,gangster,action,psychology,plot
Wolf Warrior 2	war,action,disaster

3. Requirement
Expand the array data in the movie's category column. The desired result is:

Suspect Tracking	Suspense
Suspect Tracking	action
Suspect Tracking	science fiction
Suspect Tracking	plot
Lie to Me	Suspense
Lie to Me	gangster
Lie to Me	action
Lie to Me	psychology
Lie to Me	plot
Wolf Warrior 2	war
Wolf Warrior 2	action
Wolf Warrior 2	disaster

4. Create local movie.txt and import data
[huang@bigdata03 datas]$ vi movie.txt

Suspect Tracking	Suspense,action,science fiction,plot
Lie to Me	Suspense,gangster,action,psychology,plot
Wolf Warrior 2	war,action,disaster
5. Create hive table and import data

create table movie_info(
    movie string, 
    category array<string>) 
row format delimited fields terminated by "\t"
collection items terminated by ",";
load data local inpath "/opt/module/datas/movie.txt" into table movie_info;

6. Query data on demand

select
    movie,
    category_name
from 
    movie_info lateral view explode(category) table_tmp as category_name;
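explode with LATERAL VIEW behaves like a nested loop over the array column: every array element produces one output row, with the movie column repeated alongside it. A Python sketch (illustrative only):

```python
movies = [("Suspect Tracking", ["Suspense", "action", "science fiction", "plot"]),
          ("Lie to Me", ["Suspense", "gangster", "action", "psychology", "plot"]),
          ("Wolf Warrior 2", ["war", "action", "disaster"])]

# LATERAL VIEW explode(category) table_tmp as category_name
exploded = [(movie, category_name)
            for movie, categories in movies
            for category_name in categories]
```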

6.7.5 Window Functions

1.Related function description
OVER(): specifies the window of data over which the analysis function works; the window may change from row to row.
CURRENT ROW: the current row
n PRECEDING: the n rows before the current row
n FOLLOWING: the n rows after the current row
UNBOUNDED: no bound; UNBOUNDED PRECEDING means the window starts at the first row, UNBOUNDED FOLLOWING means it ends at the last row
LAG(col,n,default_val): the value of col in the nth row before the current row
LEAD(col,n,default_val): the value of col in the nth row after the current row
NTILE(n): distributes the rows of an ordered partition into n numbered groups, numbered starting from 1; for each row, NTILE returns the number of the group the row belongs to. Note: n must be of type int.
2. Data preparation: name, orderdate, cost

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

3. Requirements
(1) Query the total number of customers who made purchases in April 2017
(2) Query each customer's purchase details and monthly purchase total
(3) In the scenario above, accumulate each customer's cost by date
(4) Query each customer's previous purchase time
(5) Query the order information of the earliest 20% of the time
4. Create local business.txt, import data
[cris@hadoop102 datas]$ vi business.txt
5. Create hive table and import data

create table business(
name string, 
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath "/opt/module/datas/business.txt" into table business;

6. Query data on demand
(1) Query the total number of customers who have purchased in April 2017

select name,count(*) over () 
from business 
where substring(orderdate,1,7) = '2017-04' 
group by name;

(2) Query the customer's purchase details and monthly purchase total

select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from
 business;

(3) In the above scenario, the cost of each customer is accumulated by date

select name, orderdate, cost,
sum(cost) over() as sample1,  -- sum over all rows
sum(cost) over(partition by name) as sample2,  -- group by name, sum within each group
sum(cost) over(partition by name order by orderdate) as sample3,  -- group by name, running total within the group
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row) as sample4,  -- same as sample3: aggregation from the start point to the current row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5,  -- the previous row and the current row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1 FOLLOWING) as sample6,  -- the previous row, the current row, and the next row
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING) as sample7  -- the current row and all following rows
from business;

ROWS must follow the ORDER BY clause; it limits the sorted result, bounding the window to a fixed number of rows within the partition.
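For intuition, sample4 (start of partition to current row) is just a running total per customer. A Python sketch using jack's orders from the data above (illustrative only):

```python
orders = [("jack", "2017-01-01", 10), ("jack", "2017-01-05", 46),
          ("jack", "2017-01-08", 55), ("jack", "2017-02-03", 23),
          ("jack", "2017-04-06", 42)]

# partition by name (all rows here are jack's), order by orderdate,
# rows between unbounded preceding and current row
running, total = [], 0
for name, orderdate, cost in sorted(orders, key=lambda r: r[1]):
    total += cost
    running.append(total)
```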
(4) Query each customer's previous purchase time

select name,orderdate,cost, 
lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate ) as time1, lag(orderdate,2) over (partition by name order by orderdate) as time2 
from business;

(5) Query the order information of the first 20% of the time

select * from (
    select name,orderdate,cost, ntile(5) over(order by orderdate) sorted
    from business
) t
where sorted = 1;
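NTILE assigns the ordered rows to n groups as evenly as possible, with earlier groups taking any extra rows. A Python sketch of that rule (illustrative, not Hive's code):

```python
def ntile(n, ordered_rows):
    """Return (group_number, row) pairs; group numbering starts at 1."""
    size, extra = divmod(len(ordered_rows), n)
    result, start = [], 0
    for g in range(1, n + 1):
        end = start + size + (1 if g <= extra else 0)  # earlier groups get the extras
        result.extend((g, row) for row in ordered_rows[start:end])
        start = end
    return result

# With the 14 orders above and n=5, groups 1-4 get 3 rows and group 5 gets 2,
# so "sorted = 1" keeps the 3 earliest orders (roughly the first 20%)
```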

6.7.6 Rank

1. Function description
RANK(): equal values receive the same rank, and the following ranks are skipped (the total count is unchanged)
DENSE_RANK(): equal values receive the same rank, and no ranks are skipped (the total count decreases)
ROW_NUMBER(): numbers rows consecutively in sort order
2. Data preparation
Table 6-7 Data preparation

name	subject	score

Sun Wukong	language	87
Sun Wukong	math	95
Sun Wukong	English	68
the sea	language	94
the sea	math	56
the sea	English	84
Song Song	language	64
Song Song	math	86
Song Song	English	84
Tingting	language	65
Tingting	math	85
Tingting	English	78

3. Requirement
Rank the scores within each subject.
4. Create local score.txt and import data
[huang@bigdata03 datas]$ vi score.txt
5. Create hive table and import data

create table score(
name string,
subject string, 
score int) 
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/datas/score.txt' into table score;

6. Query data on demand

select name,
subject,
score,
rank() over(partition by subject order by score desc) rp,
dense_rank() over(partition by subject order by score desc) drp,
row_number() over(partition by subject order by score desc) rmp
from score;

name	subject	score	rp	drp	rmp
Sun Wukong	math	95	1	1	1
Song Song	math	86	2	2	2
Tingting	math	85	3	3	3
the sea	math	56	4	4	4
Song Song	English	84	1	1	1
the sea	English	84	1	1	2
Tingting	English	78	3	2	3
Sun Wukong	English	68	4	3	4
the sea	language	94	1	1	1
Sun Wukong	language	87	2	2	2
Tingting	language	65	3	3	3
Song Song	language	64	4	4	4
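The three numbering functions differ only in how ties are handled. This Python sketch reproduces the English-subject rows from the output above (illustrative only, not Hive's implementation):

```python
def rank_rows(scores):
    """Return (score, rank, dense_rank, row_number) for scores sorted descending."""
    ordered = sorted(scores, reverse=True)
    out, rp, dense = [], 0, 0
    for i, s in enumerate(ordered):
        if i == 0 or s != ordered[i - 1]:
            rp, dense = i + 1, dense + 1   # rank jumps past ties; dense_rank does not
        out.append((s, rp, dense, i + 1))  # row_number is always consecutive
    return out

english = rank_rows([84, 84, 78, 68])
```

The two tied 84s both get rank 1; rank then jumps to 3 while dense_rank continues with 2.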

Extension: how would you find the top three students in each subject?

Posted by veenasv on Tue, 03 May 2022 02:57:43 +0300