hive learning notes 11: UDTF

Welcome to my GitHub

Here you will find a categorized index of all of Xinchen's original works, including supporting source code: https://github.com/zq2599/blog_demos

hive learning notes series navigation

  1. Basic data type
  2. Complex data type
  3. Internal and external tables
  4. Partition table
  5. Bucketing
  6. HiveQL Foundation
  7. Built-in functions
  8. Sqoop
  9. Basic UDF
  10. User defined aggregate function (UDAF)
  11. UDTF

Overview of this article

  1. This article is the eleventh in the "hive learning notes" series. So far we have covered one-in-one-out UDFs and many-in-one-out UDAFs; this article covers the remaining kind, the one-in-many-out UDTF;
  2. A one-in-many-out function is called a user-defined table-generating function (UDTF);
  3. In a previous article we already used explode, a built-in UDTF of hive:
hive> select explode(address) from t3;
OK
province	guangdong
city	shenzhen
province	jiangsu
city	nanjing
Time taken: 0.081 seconds, Fetched: 4 row(s)
  4. This article contains two UDTF examples: splitting one column into multiple columns, and splitting one column into multiple rows (with multiple columns per row);
  5. Now let's get started;

Source download

  1. If you don't want to write the code yourself, you can download all the source code from GitHub; the addresses are listed in the following table:

| name | link | remarks |
|------|------|---------|
| Project Home | https://github.com/zq2599/blog_demos | The project's home page on GitHub |
| git repository address (https) | https://github.com/zq2599/blog_demos.git | Repository address of the project's source code, https protocol |
| git repository address (ssh) | git@github.com:zq2599/blog_demos.git | Repository address of the project's source code, ssh protocol |

  2. This git project contains multiple folders; the code for this chapter is in the hiveudf folder;

preparation

To verify the UDTF, we first need to prepare a table and some data:

  1. Create a new table named t16:
create table t16(
person_name  string,
string_field string
)
row format delimited 
fields terminated by '|'
stored as textfile;
  2. Create a local text file named 016.txt with the following content:
tom|1:province:guangdong
jerry|2:city:shenzhen
john|3
  3. Import the data:
load data 
local inpath '/home/hadoop/temp/202010/25/016.txt' 
overwrite into table t16;
  4. With the data ready, we can start coding;

Key points of UDTF development

  1. Inherit from the GenericUDTF class;
  2. Override the initialize method; its single parameter is of type StructObjectInspector, from which you can obtain the fields passed to the UDTF and their types;
  3. initialize returns a StructObjectInspector that declares the name and type of each column the UDTF generates;
  4. Override the process method, which holds the one-in-many-out logic: prepare the value of each column, put the values into an array, and call forward once per output row;
  5. The close method is optional; if your business logic holds resources, release them here;
  6. Next, we develop UDTFs following these points;
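The lifecycle described in these points can be illustrated with a plain-Java stand-in. Note that MiniUDTF and its methods below are simplified stand-ins written for this sketch, not the real org.apache.hadoop.hive.ql.udf.generic.GenericUDTF API:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Hive's GenericUDTF lifecycle: the framework calls
// initialize() once, process() once per input row (each call may invoke
// forward() any number of times), and close() at the end.
abstract class MiniUDTF {
    final List<String[]> output = new ArrayList<>(); // collects forwarded rows

    void forward(String[] row) { output.add(row); }  // one call = one output row

    abstract void initialize();
    abstract void process(Object[] args);
    abstract void close();
}

public class LifecycleDemo extends MiniUDTF {
    @Override void initialize() { /* declare output column names/types here */ }

    @Override void process(Object[] args) {
        // One input row may produce several output rows:
        for (String part : args[0].toString().split(",")) {
            forward(new String[]{part});
        }
    }

    @Override void close() { /* release resources here if any */ }

    public static void main(String[] args) {
        LifecycleDemo udtf = new LifecycleDemo();
        udtf.initialize();
        udtf.process(new Object[]{"a,b,c"}); // simulates one table row
        udtf.close();
        System.out.println(udtf.output.size()); // three rows were forwarded
    }
}
```

The point to notice is that the number of forward calls, not the number of process calls, determines how many rows the UDTF emits.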

Split a column into multiple columns

  • The UDTF developed next is called udf_wordsplitsinglerow; it splits its input parameter into multiple columns;
  • It will process the string_field column of each original record in table t16;
  • After processing, the single column becomes three columns, named id, key and value;
  • This is the function we are about to develop;
  • Open the hiveudf project created earlier and create a new WordSplitSingleRow.java:
package com.bolingcavalry.hiveudf.udtf;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.*;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;
import java.util.List;

/**
 * @Description: Split the specified field into multiple columns
 * @author: willzhao E-mail: zq2599@gmail.com
 * @date: 2020/11/5 14:43
 */
public class WordSplitSingleRow extends GenericUDTF {

    private PrimitiveObjectInspector stringOI = null;

    private final static String[] EMPTY_ARRAY = {"NULL", "NULL", "NULL"};

    /**
     * The logic of splitting a column into multiple columns is here
     * @param args
     * @throws HiveException
     */
    @Override
    public void process(Object[] args) throws HiveException {

        String input = stringOI.getPrimitiveJavaObject(args[0]).toString();

        // Invalid string
        if(StringUtils.isBlank(input)) {
            forward(EMPTY_ARRAY);
        } else {

            // Split string
            String[] array = input.split(":");

            // If the string array is illegal, the original string and error prompt are returned
            if(null==array || array.length<3) {
                String[] errRlt = new String[3];
                errRlt[0] = input;
                errRlt[1] = "can not split to valid array";
                errRlt[2] = "-";

                forward(errRlt);
            } else {
                forward(array);
            }
        }
    }

    /**
     * Release resources here. No resources need to be released in this example
     * @throws HiveException
     */
    @Override
    public void close() throws HiveException {

    }

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {

        List<? extends StructField> inputFields = argOIs.getAllStructFieldRefs();

        // Currently, UDTF only processes one parameter. Here, judge whether the parameter is passed in
        if (1 != inputFields.size()) {
            throw new UDFArgumentLengthException("ExplodeMap takes only one argument");
        }

        // This UDTF only handles string types
        if(!Category.PRIMITIVE.equals(inputFields.get(0).getFieldObjectInspector().getCategory())) {
            throw new UDFArgumentException("ExplodeMap takes string as a parameter");
        }

        stringOI = (PrimitiveObjectInspector)inputFields.get(0).getFieldObjectInspector();

        //Column name collection
        ArrayList<String> fieldNames = new ArrayList<String>();

        // ObjectInspector for each column's value
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();

        // Column name of the first column
        fieldNames.add("id");
        // The inspector type of the first column is string
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        // Column name of the second column
        fieldNames.add("key");
        // The inspector type of the second column is string
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        // Column name of the third column
        fieldNames.add("value");
        // The inspector type of the third column is string
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }
}
  • The key point in the above code is the process method: after obtaining the input parameter, it splits the string on the colon to get an array, then calls forward to emit one row with three columns;
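The splitting rule inside process can be exercised outside Hive with this plain-Java sketch; the splitToColumns helper is hypothetical, written only to mirror the logic of the process method above:

```java
public class SplitSingleRowDemo {

    // Mirrors process(): split on ':' into exactly three columns, falling
    // back to a NULL row for blank input and to an error row when the
    // input cannot be split into at least three parts.
    static String[] splitToColumns(String input) {
        if (input == null || input.trim().isEmpty()) {
            return new String[]{"NULL", "NULL", "NULL"};
        }
        String[] array = input.split(":");
        if (array.length < 3) {
            return new String[]{input, "can not split to valid array", "-"};
        }
        return array;
    }

    public static void main(String[] args) {
        // Same inputs as the 016.txt sample data:
        System.out.println(String.join("\t", splitToColumns("1:province:guangdong")));
        System.out.println(String.join("\t", splitToColumns("3")));
    }
}
```

Running this prints one valid three-column row and one error row, matching the hive output shown later for tom and john.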

Verify UDTF

Next, deploy WordSplitSingleRow.java as a temporary function and verify it;

  1. After coding, execute the command mvn clean package -U in the directory containing pom.xml;
  2. Get the file hiveudf-1.0-SNAPSHOT.jar from the target directory;
  3. Upload the jar to the hive server and put it in this directory: /home/hadoop/udf/
  4. In hive session mode, execute the following command to add the local jar:
add jar /home/hadoop/udf/hiveudf-1.0-SNAPSHOT.jar;
  5. Deploy the temporary function:
create temporary function udf_wordsplitsinglerow as 'com.bolingcavalry.hiveudf.udtf.WordSplitSingleRow';
  6. Run the following SQL to verify:
select udf_wordsplitsinglerow(string_field) from t16;
  7. The result is as follows; you can see that the string_field of each record is split into three fields, id, key and value:
hive> select udf_wordsplitsinglerow(string_field) from t16;
OK
id	key	value
1	province	guangdong
2	city	shenzhen
3	can not split to valid array	-
Time taken: 0.066 seconds, Fetched: 3 row(s)

Key points to note

  • It is worth noting that a UDTF cannot appear in a select statement alongside other fields; for example, the following SQL will fail:
select person_name,udf_wordsplitsinglerow(string_field) from t16;
  • The error message is as follows:
hive> select person_name,udf_wordsplitsinglerow(string_field) from t16;
FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions
  • If you want the UDTF's output together with other fields, use the lateral view syntax. The complete SQL is as follows:
select t.person_name, udtf_id, udtf_key, udtf_value
from (
    select person_name, string_field 
    from  t16
) t LATERAL VIEW udf_wordsplitsinglerow(t.string_field) v as  udtf_id, udtf_key, udtf_value;
  • The query results are as follows; both the specified field and the UDTF output are displayed:
hive> select t.person_name, udtf_id, udtf_key, udtf_value
    > from (
    >     select person_name, string_field 
    >     from  t16
    > ) t LATERAL VIEW udf_wordsplitsinglerow(t.string_field) v as  udtf_id, udtf_key, udtf_value;
OK
t.person_name	udtf_id	udtf_key	udtf_value
tom	1	province	guangdong
jerry	2	city	shenzhen
john	3	can not split to valid array	-
Time taken: 0.122 seconds, Fetched: 3 row(s)

Split a column into multiple rows (multiple columns per row)

  • We have split the string_field column into three fields (id, key and value), but the total number of rows stayed the same. The next UDTF splits string_field into multiple records, each with three fields;
  • We need to import new data into the t16 table; create a new text file 016_multi.txt as follows:
tom|1:province:guangdong,4:city:yangjiang
jerry|2:city:shenzhen
john|3
  • Execute the following command in the hive session window to overwrite the existing contents of table t16 with 016_multi.txt:
load data 
local inpath '/home/hadoop/temp/202010/25/016_multi.txt' 
overwrite into table t16;
  • The string_field of the first record is now 1:province:guangdong,4:city:yangjiang. The UDTF developed next first splits it on the comma to get the two strings 1:province:guangdong and 4:city:yangjiang, then splits each of them on the colon into id, key and value, i.e. multiple rows with multiple columns;
  • That one record is therefore expected to produce two output rows;
  • Next, create a new WordSplitMultiRow.java with the following code. Its only difference from WordSplitSingleRow is in the process method: WordSplitMultiRow's process calls forward multiple times, so it produces multiple records:
package com.bolingcavalry.hiveudf.udtf;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.*;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import java.util.ArrayList;
import java.util.List;

/**
 * @Description: Split the specified field into multiple rows, and each row has multiple columns
 * @author: willzhao E-mail: zq2599@gmail.com
 * @date: 2020/11/5 14:43
 */
public class WordSplitMultiRow extends GenericUDTF {

    private PrimitiveObjectInspector stringOI = null;


    private final static String[] EMPTY_ARRAY = {"NULL", "NULL", "NULL"};

    /**
 * The logic of splitting a column into multiple rows is here
     * @param args
     * @throws HiveException
     */
    @Override
    public void process(Object[] args) throws HiveException {
        String input = stringOI.getPrimitiveJavaObject(args[0]).toString();

        // Invalid string
        if(StringUtils.isBlank(input)) {
            forward(EMPTY_ARRAY);
        } else {

            // Separated by commas
            String[] rowArray = input.split(",");

            // Handling exceptions
            if(null==rowArray || rowArray.length<1) {
                String[] errRlt = new String[3];
                errRlt[0] = input;
                errRlt[1] = "can not split to valid row array";
                errRlt[2] = "-";

                forward(errRlt);
            } else {
                // Each element of rowArray is a string such as "id:key:value"
                for(String singleRow : rowArray) {

                    // Make sure the string is valid
                    if(StringUtils.isBlank(singleRow)) {
                        forward(EMPTY_ARRAY);
                    } else {
                        // Split string
                        String[] array = singleRow.split(":");

                        // If the string array is illegal, the original string and error prompt are returned
                        if(null==array || array.length<3) {
                            String[] errRlt = new String[3];
                            errRlt[0] = input;
                            errRlt[1] = "can not split to valid array";
                            errRlt[2] = "-";

                            forward(errRlt);
                        } else {
                            forward(array);
                        }
                    }
                }

            }
        }
    }

    /**
     * Release resources here. No resources need to be released in this example
     * @throws HiveException
     */
    @Override
    public void close() throws HiveException {

    }

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {

        List<? extends StructField> inputFields = argOIs.getAllStructFieldRefs();

        // Currently, UDTF only processes one parameter. Here, judge whether the parameter is passed in
        if (1 != inputFields.size()) {
            throw new UDFArgumentLengthException("ExplodeMap takes only one argument");
        }

        // This UDTF only handles string types
        if(!Category.PRIMITIVE.equals(inputFields.get(0).getFieldObjectInspector().getCategory())) {
            throw new UDFArgumentException("ExplodeMap takes string as a parameter");
        }

        stringOI = (PrimitiveObjectInspector)inputFields.get(0).getFieldObjectInspector();

        //Column name collection
        ArrayList<String> fieldNames = new ArrayList<String>();

        // ObjectInspector for each column's value
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();

        // Column name of the first column
        fieldNames.add("id");
        // The inspector type of the first column is string
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        // Column name of the second column
        fieldNames.add("key");
        // The inspector type of the second column is string
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        // Column name of the third column
        fieldNames.add("value");
        // The inspector type of the third column is string
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }
}
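The nested comma-then-colon split in WordSplitMultiRow's process method can be sketched in plain Java without any Hive dependencies; the splitToRows helper below is hypothetical and only mirrors the happy-path and error-row logic of the code above:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitMultiRowDemo {

    // Mirrors WordSplitMultiRow.process(): split on ',' into candidate rows,
    // then split each candidate on ':' into three columns; a candidate that
    // cannot be split into three parts becomes an error row instead.
    static List<String[]> splitToRows(String input) {
        List<String[]> rows = new ArrayList<>();
        for (String singleRow : input.split(",")) {
            String[] array = singleRow.split(":");
            if (array.length < 3) {
                rows.add(new String[]{input, "can not split to valid array", "-"});
            } else {
                rows.add(array);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // The first record of 016_multi.txt produces two rows:
        for (String[] row : splitToRows("1:province:guangdong,4:city:yangjiang")) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

One input value yields two rows here, which is exactly why the hive query over t16 later returns four rows from three records.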

Verify UDTF

Next, deploy WordSplitMultiRow.java as a temporary function and verify it;

  1. After coding, execute the command mvn clean package -U in the directory containing pom.xml;
  2. Get the file hiveudf-1.0-SNAPSHOT.jar from the target directory;
  3. Upload the jar to the hive server and put it in this directory: /home/hadoop/udf/
  4. If you are still in the same hive session, clean up the previous jar and function first:
drop temporary function if exists udf_wordsplitsinglerow;
delete jar /home/hadoop/udf/hiveudf-1.0-SNAPSHOT.jar;
  5. In hive session mode, execute the following command to add the local jar again:
add jar /home/hadoop/udf/hiveudf-1.0-SNAPSHOT.jar;
  6. Deploy the temporary function:
create temporary function udf_wordsplitmultirow as 'com.bolingcavalry.hiveudf.udtf.WordSplitMultiRow';
  7. Run the following SQL to verify:
select udf_wordsplitmultirow(string_field) from t16;
  8. The result is as follows; the string_field of each record is split into id, key and value, and the first record produced two rows:
hive> select udf_wordsplitmultirow(string_field) from t16;
OK
id	key	value
1	province	guangdong
4	city	yangjiang
2	city	shenzhen
3	can not split to valid array	-
Time taken: 0.041 seconds, Fetched: 4 row(s)
  9. Query the other fields as well using the lateral view syntax; the SQL is as follows:
select t.person_name, udtf_id, udtf_key, udtf_value
from (
    select person_name, string_field 
    from  t16
) t LATERAL VIEW udf_wordsplitmultirow(t.string_field) v as  udtf_id, udtf_key, udtf_value;
  10. The results are as follows, matching expectations:
hive> select t.person_name, udtf_id, udtf_key, udtf_value
    > from (
    >     select person_name, string_field 
    >     from  t16
    > ) t LATERAL VIEW udf_wordsplitmultirow(t.string_field) v as  udtf_id, udtf_key, udtf_value;
OK
t.person_name	udtf_id	udtf_key	udtf_value
tom	1	province	guangdong
tom	4	city	yangjiang
jerry	2	city	shenzhen
john	3	can not split to valid array	-
Time taken: 0.056 seconds, Fetched: 4 row(s)
  • So far, we have learned and practiced all three kinds of hive user-defined functions (UDF, UDAF and UDTF). I hope this material is a useful reference for your own practice;

Posted by Tonka1979 on Fri, 06 May 2022 10:32:18 +0300