Hive 自定义 UDTF 函数-蒲公英云

Hive 自定义 UDTF 函数

继承org.apache.hadoop.hive.ql.udf.generic.GenericUDTF

重写 initialize，process 和 close方法

Hive 调用 initialize 方法来确定传入参数的类型并确定 UDTF 生成表的每个字段的数据类型（即输入类型和输出类型。initialize 方法必须返回一个生成表的字段的相应的 StructObjectInspector。
初始化完成后，会调用process方法,真正的处理过程在process函数中，在process中，每一次forward()调用产生一行；如果产生多列可以将多个列的值放在一个数组中，然后将该数组传入到forward()函数。

调用close()方法，对需要清理的方法进行清理。

package com.cloudera.udtf;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;    
import java.util.ArrayList;    
public class UDTFExplode extends GenericUDTF { 
    @Override
    public void close() throws HiveException { 
    }
    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException { 
        ArrayList<String> fieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("col1");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("col2");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }
    @Override
    public void process(Object[] args) throws HiveException { 
        String input = args[0].toString();
        String[] test = input.split(";");
        for (int i = 0; i < test.length; i++) { 
            try { 
                String[] result = test[i].split(":");
                forward(result);
            } catch (Exception e) { 
                continue;
            }
        }
    }
}

打成Jar包，并上传到服务器

注册UDF函数

hive > add jar /root/hive-demo-0.0.1-SNAPSHOT.jar;
hive > create temporary function split_test as 'com.cloudera.UDTFExplode';

使用自定义UDTF函数

直接select中使用

select split_test('asd:123\;rtrt:3445\;vbvx:6787') as (col1,col2) ;
[root@cdh02 ~]# beeline 
WARNING: Use "yarn jar" to launch YARN applications.
...
Beeline version 2.1.1-cdh6.1.1 by Apache Hive
beeline> !connect jdbc//192.168.1.101:10000
Connecting to jdbc//192.168.1.101:10000
Enter username for jdbc//192.168.1.101 hive
Enter password for jdbc//192.168.1.101 
Connected to: Apache Hive (version 2.1.1-cdh6.1.1)
Driver: Hive JDBC (version 2.1.1-cdh6.1.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc//192.168.1.101:10000> create temporary function split_test as 'com.cloudera.udtf.UDTFExplode';
...
INFO  : OK
No rows affected (0.248 seconds)
0: jdbc//192.168.1.101:10000> select split_test('asd:123\;rtrt:3445\;vbvx:6787') as (col1,col2);
...
INFO  : OK
+-------+-------+
| col1  | col2  |
+-------+-------+
| asd   | 123   |
| rtrt  | 3445  |
| vbvx  | 6787  |
+-------+-------+
3 rows selected (18.492 seconds)

注意:
UDTF不可以添加其他字段使用，不可以嵌套调用，不可以和group by,cluster by,distribute by,sort by一起使用。

和lateral view一起使用

select '1', mytable.col1, mytable.col2 from dual lateral view split_test('asd:123\;rtrt:3445\;vbvx:6787') mytable as col1, col2;
0: jdbc//192.168.1.101:10000> select '1', 
. . . . . . . . . . . . . . . . . .> mytable.col1, 
. . . . . . . . . . . . . . . . . .> mytable.col2 
. . . . . . . . . . . . . . . . . .> from test 
. . . . . . . . . . . . . . . . . .> lateral view split_test('asd:123\;rtrt:3445\;vbvx:6787') 
. . . . . . . . . . . . . . . . . .> mytable 
. . . . . . . . . . . . . . . . . .> as col1, col2;
I...
INFO  : OK
+------+---------------+---------------+
| _c0  | mytable.col1  | mytable.col2  |
+------+---------------+---------------+
| 1    | asd           | 123           |
| 1    | rtrt          | 3445          |
| 1    | vbvx          | 6787          |
+------+---------------+---------------+
9 rows selected (21.693 seconds)