Hive SQL statements

Database operations

Create database

-- Create a database; the default path on HDFS is /user/hive/warehouse/*.db
create database mydatabase;
-- Use if not exists to check whether the database already exists (if it exists, it will not be created and no error is raised)
create database if not exists mydatabase;
-- Create a database and specify its storage path
create database mydatabase location '/mydatabase.db';
-- Create a database and use an existing folder (my) as the storage location for its contents
create database mydatabase location '/databases/my/';

Query database

-- Show all databases
show databases;
-- Fuzzy search
show databases like 'my*';
-- Show basic information
desc database mydatabase;
-- Show detailed information
desc database extended mydatabase;
-- Switch current database
use mydatabase;

 

Modify database

Only additional properties (dbproperties) can be modified; other metadata such as the database name and location cannot.

-- Add information to the database
alter database mydatabase set dbproperties('createtime'='202003');
-- View the information added above
desc database extended mydatabase;

 

Delete database

-- Delete an empty database
drop database mydatabase;
-- It is best to use if exists to check whether the database exists
drop database if exists mydatabase;
-- If the database is not empty, use the cascade keyword to force deletion
drop database mydatabase cascade;

 

Table operations

Create table

-- Create a table
create table student(id int, name string);
-- Create an external table
create external table student(id int, name string);
-- Create a table and set the field delimiter for its data (tab as an example)
create table student(id int, name string) 
row format delimited fields terminated by '\t';
-- Create a table with an array column and set both the field delimiter and the delimiter for array elements
create table student(id int, name string, hobbies array<string>)
row format delimited fields terminated by '\t'
collection items terminated by ',';

 

View table

-- View all tables in the current database
show tables;

Modify table

-- Rename a table
alter table student rename to new_student;
-- Add columns (new columns are placed before the partition fields; separate multiple columns with commas inside the parentheses)
alter table student add columns(gender string);
-- Change a column (write the old column name, the new column name, and the new column type)
alter table student change column name name2 string;
-- Replace all columns in the table (here all existing columns are replaced by a single new column)
alter table student replace columns(replace string);

 

Delete table

drop table student;

Internal table (managed table) and external table

The difference between the two

When dropped, an internal table deletes both its metadata and the underlying data, while an external table deletes only the metadata.
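As a minimal sketch of the difference (the table name and HDFS path here are hypothetical; assume /data/student already holds data files):

```sql
-- Create an external table over an existing HDFS directory (hypothetical path)
create external table ext_student(id int, name string)
row format delimited fields terminated by '\t'
location '/data/student';

-- Dropping it removes only the metadata from the metastore;
-- the files under /data/student remain on HDFS
drop table ext_student;
```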

Mutual conversion

Note: the value is case sensitive here; the contents in the quotes must be uppercase!

If they are not uppercase, the property is stored as an ordinary additional property and has no effect.

-- Convert to external table
alter table student set tblproperties('EXTERNAL'='TRUE');
-- Convert to internal table
alter table student set tblproperties('EXTERNAL'='FALSE');

 

Partition table

A partition corresponds to a separate folder on HDFS. It belongs to the metadata, but it is used like a regular column and can be used for filtering.

Create partition table

-- Create a table partitioned by the "month" field
create table student(id int, name string) 
partitioned by(month string);
-- Create secondary partition table
create table student(id int, name string)
partitioned by(month string, day string);

Add partition

-- Add a partition to the partition table
alter table student add partition(month='202003');
-- Add multiple partitions to the partition table (separated by spaces)
alter table student add partition(month='202003') partition(month='202004');

 

Add data to partition table

-- Use the keyword partition(...) to specify the partition; if the partition does not exist, it is created automatically
load data local inpath '/opt/file.txt' into table student partition(month='202003');
insert into student partition(month='202003') values(1,'abc');

 

Query partition table data

-- Find data by partition
select * from student where month='202003';
select * from student where month='202003' and day='01';

 

Delete a partition

-- Delete a partition in the partition table
alter table student drop partition(month='202003');
-- Delete multiple partitions from the partition table (separated by commas)
alter table student drop partition(month='202003'), partition(month='202004');

 

 

View partition

-- Show all partitions
show partitions student;

 

Repair partition

If data files are uploaded directly to a partition directory through HDFS but the partition has not been defined in the metastore, the newly uploaded data cannot be queried.

-- Repair command
msck repair table student;
-- You can also register the directory as a partition directly (month='20200316' as an example)
alter table student add partition(month='20200316');

 

Data operation

Data import

Load import

-- Import a local file into a Hive table
load data local inpath '/opt/student.txt' into table student;
-- Import an HDFS file into a Hive table (this moves the file into the table's folder)
load data inpath '/student.txt' into table student;
-- You can also put files directly into the table's directory on HDFS
hadoop fs -put student.txt /user/hive/warehouse/student
-- Import and overwrite all data in the table
load data local inpath '/opt/student.txt' overwrite into table student;
-- Specify the data path (a folder) with location when creating the table
create table student(id int, name string)
row format delimited fields terminated by '\t'
location '/dir';

Insert

-- Add a record directly
insert into table student values(1,'abc');
-- Insert and overwrite all data in the table
insert overwrite table student values(1,'abc');

 

Import

Only files exported by export can be imported

-- Import data via import
import table student2 from '/export/student';

Data query

Basic query

-- Query all data in the table
select * from student;
-- Query the data of the specified column in the table
select id, name from student;
-- Insert the query results into other tables
insert into student2 select * from student;
-- Create a new table with the results of the query
create table student2 as select id, name from student;
-- Display columns with aliases (the as keyword can be omitted); aliases also improve readability
select id as sid, name as sname from student;
-- Display the queried id values with 100 added
select id+100 from student;
-- Common aggregate functions (count, max, min, sum, avg)
select count(*) from student;
-- The limit clause restricts the number of rows returned
select * from student limit 3;
-- The where clause filters rows
select * from student where id = 1;

 

Floor rounding
-- Round 123.123 down; the result is 123
select floor(123.123);

 

Like and Rlike

like: select values matching a simple pattern

% represents zero or more characters (any characters).
_ represents exactly one character.

rlike: Java-style regular expression matching

-- Query students whose name begins with 'Xiao'
select * from student where name like 'Xiao%';
-- Query students whose name begins with 'Xiao' and contains exactly two characters
select * from student where name like 'Xiao_';
-- Query records whose age field contains only digits
select * from student where age rlike '\\d+';

 

Distinct deduplication

distinct pulls all the data into a single Reducer, which may cause a memory overflow; use it with caution on large data sets.

-- No matter how many people are in each age group, only one distinct value remains after deduplication
select age,count(distinct age) from mydatabase.student group by age;
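On large data sets the same deduplication can usually be rewritten with group by, which is not limited to a single Reducer; a minimal sketch:

```sql
-- Deduplicate age values with group by instead of distinct
select age from student group by age;
```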

 

Group By grouped query
-- Group by the age field and use count to show the size of each group
select age,count(*) from student group by age;
-- Group by the grade field and use avg to show the average age of each group
select grade,avg(age) from student group by grade;
-- Group by age, then by gender, to count the number of men and women in each age group
select age, gender, count(*) from student group by age, gender;

 

Having statement

where: filters the rows of the table and cannot be used with aggregate functions

having: filters the grouped results, acting as a second pass of filtering. It can use aggregate functions and may only appear in group by statements.

-- Group by the grade field and display the grades whose average age is greater than 18
select grade from student group by grade having avg(age)>18;

 

Join statement

Only equi-joins are supported; non-equi join conditions are not supported.

-- Suppose there are two tables: the dept department table and the employee table
-- Inner join (only rows that match in both tables are displayed)
-- Query rows where the employee's department id matches the department table, showing the employee name and department name
select employee.name,dept.name from employee join dept on dept.d_id=employee.d_id;
-- Left outer join (all rows of the left table are displayed)
select employee.name,dept.name from employee left join dept on dept.d_id=employee.d_id;
-- Right outer join (all rows of the right table are displayed)
select employee.name,dept.name from employee right join dept on dept.d_id=employee.d_id;
-- Full outer join (all rows are displayed; unmatched values are replaced with NULL)
select employee.name,dept.name from employee full join dept on dept.d_id=employee.d_id;

 

Common query functions

NVL null field assignment

NVL(string1, replace_with)

If string1 is NULL, the function returns replace_with; otherwise it returns the value of string1.

-- If age is null, replace it with 18
select nvl(age,18) from student;
-- The replacement parameter can also be a column: if age is null, use the id value instead
select nvl(age,id) from student;

 

Date and time functions
Date_format

Formats a time. Note: the date must be separated by hyphens "-".

select date_format('2020-03-19','yyyy-MM-dd HH:mm:ss');
-- result: 2020-03-19 00:00:00
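If the date uses a different separator, one option (a sketch, combining the regexp_replace function shown later in this document) is to normalize it first:

```sql
-- A slash-separated date can be normalized to hyphens before formatting
select date_format(regexp_replace('2020/03/19','/','-'),'yyyy-MM');
-- result: 2020-03
```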
Date_add

Adds a number of days to a date; the number of days can be negative.

select date_add('2020-03-19', 10); 
-- result: 2020-03-29
Date_sub

Subtracts a number of days from a date; the number of days can be negative.

select date_sub('2020-03-19', 10);
-- result: 2020-03-09

 

Datediff

Subtracts one date from another; the result is in days. Note the order: parameter 1 - parameter 2.

Hours, minutes and seconds do not affect the final result.

select datediff('2020-03-19', '2020-03-29');
-- result: -10
select datediff('2020-03-29', '2020-03-19');
-- result: 10
select datediff('2020-03-29 13:13:13','2020-03-19 12:12:12');
-- result: 10

 

CASE WHEN statement
-- If gender is 'male' or 'female', count 1 in the respective column; finally count the number of men and women in each age group
select
age,
sum(case gender when 'male' then 1 else 0 end) male_count,
sum(case gender when 'female' then 1 else 0 end) female_count
from student group by age;

 

IF statement
-- The following code is equivalent to the above case when
select
age,
sum(if(gender='male',1,0)) male_count,
sum(if(gender='female',1,0)) female_count
from student group by age;

 

Row to column
Concat

concat(string1/col, string2/col, ...)

Takes any number of strings (or columns; int and other types are converted automatically) and returns the concatenated result.

select concat(id,'-',name,'-',age) from student;

 

Concat_ws

concat_ws(separator, str1, str2, ...)

A variant of concat(): the first parameter is the separator placed between the remaining parameters, which must be strings (or arrays of strings).

select concat_ws('-', name, gender) from student;

 

Collect_set

collect_set(col)

Deduplicates the values of a column and returns an array; this function only accepts primitive data types.

select collect_set(age) from student;
Column to row
Explode

explode(col)

Split a complex array or map structure in a column into multiple rows

-- Split the collect_set result from above with explode
select explode(ages)
from (select collect_set(age) as ages from student ) as n1;

 

Lateral View

LATERAL VIEW udtf(expression) tableAlias AS columnAlias

Used together with UDTFs such as explode and split, it splits one column of data into multiple rows and joins the split results back to the original columns.

-- Suppose there is the following movies table, with columns movie(string) and category(array<string>)
-- movie	category
-- Suspect tracking	Suspense,action,science fiction,plot
-- Seals	action,plot,Crime
-- War wolf 2	Warfare,action,disaster
select movie, cate
from movies
lateral view explode(category) tmp_table as cate;
-- result:
-- Suspect tracking	Suspense
-- Suspect tracking	action
-- Suspect tracking	science fiction
-- Suspect tracking	plot
-- Seals	action
-- ...

 

Window function

OVER(): specifies the window of data over which the analysis function works; the window size may change from row to row.

Note: over(order by ...) also sorts the result rows (for example, after over(order by id) the output is sorted as well).

CURRENT ROW: the current row;
n PRECEDING: the n rows before the current row;
n FOLLOWING: the n rows after the current row;
UNBOUNDED: the boundary; UNBOUNDED PRECEDING means the window starts at the first row, UNBOUNDED FOLLOWING means it ends at the last row;
LAG(col,n): the value of col from the nth row before the current row;
LEAD(col,n): the value of col from the nth row after the current row;
NTILE(n): distributes the rows of an ordered partition into n numbered groups, starting from 1;
for each row, NTILE returns the number of the group the row belongs to. Note: n must be of type int.

-- Fixed format for writing these frame parameters
-- Count gender over the rows from 2 to 4 rows after the current row
select *,count(gender) over(rows between 2 following and 4 following) from student;
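A sketch of the other frame keywords, computing a running count from the start of the window up to the current row (the column alias running_count is chosen here for illustration):

```sql
-- Running count per row: the frame spans from the window start to the current row
select *,
count(*) over(order by id rows between unbounded preceding and current row) as running_count
from student;
```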

 

Suppose there is the following business table

name orderdate cost
------------------
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
------------------
Requirements:
(1) Query the customers who purchased in April 2017 and the total number of such customers
(2) Query the customers' purchase details and the monthly total purchase amount
(3) In the above scenario, accumulate cost by date
(4) In the above scenario, accumulate each user's cost month by month
(5) Query each customer's previous purchase time
(6) Query the order information of the first 20% of orders by time
  1. Query the customers and total number of people who purchased in April 2017

    select 
    name,
    count(*) over() as all_person
    from business
    where date_format(orderdate,'yyyy-MM')='2017-04'
    group by name;
  2. Query the customer's purchase details and the user's monthly total purchase amount

    select 
    name,
    orderdate,
    date_format(orderdate,'yyyy-MM') this_month,
    cost,
    sum(cost) over(distribute by name, date_format(orderdate,'yyyy-MM')) as this_user_this_month_sum
    from business;
  3. In the above scenario, cost should be accumulated by date

    select 
    name,
    orderdate,
    cost,
    sum(cost) over(distribute by name sort by orderdate)
    from business;
  4. In the above scenario, the monthly expenses of each user are accumulated respectively

    select 
    name,
    orderdate,
    cost,
    sum(cost) over(distribute by name,month(orderdate) sort by day(orderdate))
    from business;
  5. Query the customer's last purchase time

    -- The third parameter of lag: the default value used when there is no previous row; without it, NULL is returned
    select
    name,
    orderdate,
    cost,
    lag(orderdate,1,'0000-00-00') over(distribute by name sort by orderdate)
    from business;
  6. Query the order information of the first 20% of orders by time

    -- Group the rows with the ntile function to implement this query
    select * from
    (
        select
        name,
        orderdate,
        cost,
        ntile(5) over(order by orderdate) as sorted
        from business
    ) as tmp_table
    where sorted = 1;
Rank sort

This function is used in conjunction with OVER()

RANK(): equal values share the same rank, and the following rank is skipped (the total count is unchanged)
DENSE_RANK(): equal values share the same rank, and no rank is skipped (the total count decreases)
ROW_NUMBER(): numbers rows consecutively in order

Suppose there is the following score table

name  subject  score 
--------------------
Xiao Ming	language	87
Xiao Ming	mathematics	95
Xiao Ming	English	68
Little green	language	94
Little green	mathematics	56
Little green	English	84
Xiao Hong	language	64
Xiao Hong	mathematics	86
Xiao Hong	English	84
Blue	language	65
Blue	mathematics	85
Blue	English	78
--------------------

 

-- Requirement: calculate the score ranking within each subject.
select
*,
rank() over(distribute by subject sort by score desc),
dense_rank() over(distribute by subject sort by score desc),
row_number() over(distribute by subject sort by score desc)
from score;
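Taking the English scores above (84, 84, 78, 68 in descending order) as an example, the three functions would differ as follows:

```sql
-- score: 84, 84, 78, 68 (English, descending)
-- rank():       1, 1, 3, 4  (ties share a rank; the next rank is skipped)
-- dense_rank(): 1, 1, 2, 3  (ties share a rank; no rank is skipped)
-- row_number(): 1, 2, 3, 4  (strictly consecutive numbering)
```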

 

Regexp_replace regular replacement

regexp_replace(string A, string B, replace)

Replaces every part of string A that matches the Java regular expression B with replace.

Note: escape characters are needed in some cases.

-- Replace the "/" in the string with "-"
select regexp_replace('2020/03/21','/','-');
-- result:2020-03-21

 

Data sorting

Order By global sort

Globally sorts the whole table; there is only one Reducer.

-- Sort data by id in ascending order (ascending is the default, so asc can be omitted)
select * from student order by id asc;
-- Sort data by id in descending order
select * from student order by id desc;

 

Sort By internal sort

Sorts the data within each Reducer without producing a globally sorted result set

When used alone, rows are distributed randomly across the reducers' output files (which helps avoid data skew)

It is generally used in combination with distribute by

-- First set the number of reducers
set mapreduce.job.reduces=3;
-- Viewing the results directly may not show an obvious change
select * from student sort by id;
-- Export sorting results to a file
insert overwrite local directory '/opt/datas/sort-out'
select * from student sort by id;

 

Distribute By partition sort

Similar to the Partition partition in MapReduce, it is generally used in conjunction with Sort By sorting

You need to allocate multiple reducers to see the effect

Note: this statement needs to be written before the Sort By statement!

-- First set the number of reducers
set mapreduce.job.reduces=3;
-- Partition by the id value, then sort by the age value in ascending order
insert overwrite local directory '/opt/datas/dis-out'
select * from student distribute by id sort by age;

 

Cluster By sort

When the distribute by and sort by fields are the same, cluster by can be used instead

The sort can only be ascending

-- The following two expressions are equivalent
select * from student cluster by grade;
select * from student distribute by grade sort by grade;

 

Bucket sorting and sampling query

Partitioning divides the data storage path; bucketing divides the data files

Create bucket table
-- Create bucket table
create table studentbucket (id int, name string, age int)
clustered by (id) into 4 buckets
row format delimited fields terminated by '\t';
-- You can view the table structure to check the number of buckets
desc formatted studentbucket;

 

Before importing data, set some properties

-- Enable the bucketing feature
set hive.enforce.bucketing=true;
-- Set the number of reducers to -1 so it is determined automatically from the number of buckets
set mapreduce.job.reduces=-1;

 

insert data

-- Bucketing requires distributing rows into buckets, so the data must be inserted through an MR job
-- Note: the load method does not trigger an MR job
-- So insert ... select from another table is used here
insert into table studentbucket select * from student;

 

Bucket table sampling query

Sampling syntax: TABLESAMPLE(BUCKET x OUT OF y)

Note: the value of x must be less than or equal to y!

Meaning: x indicates which bucket to start sampling from;

y determines the proportion: the number of buckets sampled is (total buckets / y), taking one bucket every y buckets.

-- Sampling query
-- This extracts one bucket (4/4=1) of data, starting from bucket 1
select * from studentbucket tablesample(bucket 1 out of 4 on id);
-- This extracts 2 buckets of data starting from bucket 1 (the x=1st and the x+y=3rd bucket)
select * from studentbucket tablesample(bucket 1 out of 2 on id);

 

Data export

Insert export

-- Export query results to local
insert overwrite local directory '/opt/datas' select * from student;
-- Export query results to local, with fields separated by '\t'
insert overwrite local directory '/opt/datas'
row format delimited fields terminated by '\t'
select * from student;
-- Export query results to HDFS
insert overwrite directory '/opt/datas' select * from student;

 

Hadoop command export

# Copy HDFS files directly to the local file system
hadoop fs -get /user/hive/warehouse/student /opt/datas

 

Hive Shell command export

# Use the Linux redirection operator to export query results to a file
bin/hive -e "select * from student" > /opt/datas/student.txt;

 

Export

-- Export to HDFS via export; the metadata is saved as well
export table student to '/export/student';

 

Data deletion

Truncate delete

Clears the data in a table. Only internal tables can be truncated; external table data cannot be deleted this way.

-- use truncate Clear data in table
truncate table student;

 

Functions

System built-in function

-- View system built-in functions
show functions;
-- View the usage of a built-in function (split as an example)
desc function split;
-- View the details of a built-in function (split as an example)
desc function extended split;

 

Custom function

UDF

User-Defined-Function

One in and one out

For example: split, datediff

Inherit org.apache.hadoop.hive.ql.exec.UDF

Implement the evaluate method

UDAF

User-Defined Aggregation Function

Aggregate function: many in, one out

Examples: count/max/min

UDTF

User-Defined Table-Generating Functions

One in and many out

For example: lateral view explode()

Inherit org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;

Implement three methods: initialize, process and close

Programming steps

  1. Inherit org.apache.hadoop.hive.ql.exec.UDF

  2. You need to implement the evaluate function; The evaluate function supports overloading;

  3. Creating functions in hive's command line window

    • Add jar package

      add jar linux_jar_path
    • Create function

      create [temporary] function [dbname.]function_name AS class_name;
  4. Delete functions in hive's command line window

    drop [temporary] function [if exists] [dbname.]function_name;
  5. Notes

    UDF must have a return type, which can return null, but the return type cannot be void;

Maven dependency

<dependencies> 
    <!--https://mvnrepository.com/artifact/org.apache.hive/hive-exec --> 
    <dependency> 
      <groupId>org.apache.hive</groupId> 
      <artifactId>hive-exec</artifactId> 
      <version>1.2.1</version> 
    </dependency> 
</dependencies>
 

Tags: hive

Posted by varasurf on Thu, 05 May 2022 06:52:27 +0300