Hive Custom UDTF Functions
- Extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF and override the initialize, process, and close methods.
- Hive calls initialize to determine the types of the incoming arguments and the data type of each column of the table the UDTF generates (that is, the input and output types). initialize must return a StructObjectInspector describing the columns of the generated table.
- After initialization, Hive calls process, where the actual work happens. Each call to forward() inside process emits one row; to emit multiple columns, place the column values in an array and pass that array to forward().
- Finally, Hive calls close() so the UDTF can release any resources that need cleanup.
The example below splits an input string of the form key:value;key:value into a two-column table:
package com.cloudera.udtf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;

public class UDTFExplode extends GenericUDTF {

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        // Declare the output schema: two string columns named col1 and col2.
        ArrayList<String> fieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("col1");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("col2");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        // Split the input into key:value entries; each entry becomes one output row.
        String input = args[0].toString();
        String[] entries = input.split(";");
        for (String entry : entries) {
            try {
                // split(":") yields {key, value}; forward() emits them as one row.
                String[] result = entry.split(":");
                forward(result);
            } catch (Exception e) {
                // Skip malformed entries rather than failing the whole query.
                continue;
            }
        }
    }

    @Override
    public void close() throws HiveException {
        // Nothing to clean up in this example.
    }
}
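Before deploying, the process() logic can be exercised locally: GenericUDTF.setCollector() installs the callback that forward() delivers rows to, so no Hive query is needed. A minimal sketch; the driver class below is hypothetical and not part of the original post:

package com.cloudera.udtf;

import java.util.Arrays;

// Hypothetical local smoke test: captures rows emitted through forward()
// without running a Hive query. Requires hive-exec on the classpath.
public class UDTFExplodeSmokeTest {
    public static void main(String[] args) throws Exception {
        UDTFExplode udtf = new UDTFExplode();
        // forward(result) hands each emitted row to the registered Collector.
        udtf.setCollector(row -> System.out.println(Arrays.toString((Object[]) row)));
        udtf.process(new Object[]{"asd:123;rtrt:3445;vbvx:6787"});
        udtf.close();
        // Prints: [asd, 123] / [rtrt, 3445] / [vbvx, 6787]
    }
}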
- Package the class into a JAR and upload it to the server.
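A minimal sketch of the build-and-upload step, assuming a standard Maven project (the JAR name matches the session below; the project layout is an assumption). Compiling requires hive-exec (org.apache.hive:hive-exec, provided scope, versioned to match the cluster) on the classpath:
mvn clean package
scp target/hive-demo-0.0.1-SNAPSHOT.jar root@192.168.1.101:/root/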
Register the UDTF function:
hive > add jar /root/hive-demo-0.0.1-SNAPSHOT.jar;
hive > create temporary function split_test as 'com.cloudera.udtf.UDTFExplode';
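A temporary function only lives for the current session. Hive also supports permanent functions backed by a JAR on HDFS via CREATE FUNCTION ... USING JAR (a sketch; the HDFS path is an assumption):
hive > create function split_test as 'com.cloudera.udtf.UDTFExplode' using jar 'hdfs:///user/hive/udf/hive-demo-0.0.1-SNAPSHOT.jar';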
Use the custom UDTF function
It can be called directly in a SELECT:
select split_test('asd:123\;rtrt:3445\;vbvx:6787') as (col1,col2);
The semicolons inside the string literal are escaped as \; so the CLI does not treat them as statement terminators.
[root@cdh02 ~]# beeline
WARNING: Use "yarn jar" to launch YARN applications.
...
Beeline version 2.1.1-cdh6.1.1 by Apache Hive
beeline> !connect jdbc:hive2://192.168.1.101:10000
Connecting to jdbc:hive2://192.168.1.101:10000
Enter username for jdbc:hive2://192.168.1.101:10000: hive
Enter password for jdbc:hive2://192.168.1.101:10000:
Connected to: Apache Hive (version 2.1.1-cdh6.1.1)
Driver: Hive JDBC (version 2.1.1-cdh6.1.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://192.168.1.101:10000> create temporary function split_test as 'com.cloudera.udtf.UDTFExplode';
...
INFO : OK
No rows affected (0.248 seconds)
0: jdbc:hive2://192.168.1.101:10000> select split_test('asd:123\;rtrt:3445\;vbvx:6787') as (col1,col2);
...
INFO : OK
+-------+-------+
| col1 | col2 |
+-------+-------+
| asd | 123 |
| rtrt | 3445 |
| vbvx | 6787 |
+-------+-------+
3 rows selected (18.492 seconds)
Note:
A UDTF invoked directly in a SELECT cannot be combined with other columns, cannot be nested, and cannot be used together with GROUP BY, CLUSTER BY, DISTRIBUTE BY, or SORT BY. To return other columns alongside the UDTF output, use it with LATERAL VIEW, as shown below.
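For example, putting another expression next to the UDTF in the same SELECT fails with a SemanticException (an illustrative failing query, not from the original post):
-- Fails: a UDTF used directly in SELECT must be the only expression
select '1', split_test('asd:123\;rtrt:3445\;vbvx:6787') as (col1,col2);
The equivalent LATERAL VIEW query works: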
select '1', mytable.col1, mytable.col2 from test lateral view split_test('asd:123\;rtrt:3445\;vbvx:6787') mytable as col1, col2;
0: jdbc:hive2://192.168.1.101:10000> select '1',
. . . . . . . . . . . . . . . . . .> mytable.col1,
. . . . . . . . . . . . . . . . . .> mytable.col2
. . . . . . . . . . . . . . . . . .> from test
. . . . . . . . . . . . . . . . . .> lateral view split_test('asd:123\;rtrt:3445\;vbvx:6787')
. . . . . . . . . . . . . . . . . .> mytable
. . . . . . . . . . . . . . . . . .> as col1, col2;
...
INFO : OK
+------+---------------+---------------+
| _c0 | mytable.col1 | mytable.col2 |
+------+---------------+---------------+
| 1 | asd | 123 |
| 1 | rtrt | 3445 |
| 1 | vbvx | 6787 |
+------+---------------+---------------+
9 rows selected (21.693 seconds)
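In practice the UDTF usually runs over a table column rather than a string literal. A sketch, assuming test has a string column kv holding semicolon-separated key:value pairs (the column names id and kv are hypothetical):
select t.id, m.col1, m.col2
from test t
lateral view split_test(t.kv) m as col1, col2;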