hive学习笔记之七:内置函数

我就是我 2022-11-21 14:42 322阅读 0赞

欢迎访问我的GitHub

这里分类和汇总了欣宸的全部原创(含配套源码):https://github.com/zq2599/blog_demos

《hive学习笔记》系列导航

  1. 基本数据类型
  2. 复杂数据类型
  3. 内部表和外部表
  4. 分区表
  5. 分桶
  6. HiveQL基础
  7. 内置函数
  8. Sqoop
  9. 基础UDF
  10. 用户自定义聚合函数(UDAF)
  11. UDTF

本篇概览

  • 本文是《hive学习笔记》系列的第七篇,前文熟悉了HiveQL的常用语句,接下来把常用的内置函数简单过一遍,分为以下几部分:
  1. 数学
  2. 字符
  3. json处理
  4. 转换
  5. 日期
  6. 条件
  7. 聚合

准备数据

  1. 本次实战要准备两个表:学生表和住址表,字段都很简单,如下图所示,学生表有个住址ID字段,是住址表里的记录的唯一ID:
    在这里插入图片描述
  2. 先创建住址表:

    create table address (addressid int, province string, city string)
    row format delimited
    fields terminated by ‘,’;

  3. 创建address.txt文件,内容如下:

    1,guangdong,guangzhou
    2,guangdong,shenzhen
    3,shanxi,xian
    4,shanxi,hanzhong
    6,jiangshu,nanjing

  4. 加载数据到address表:

    load data
    local inpath ‘/home/hadoop/temp/202010/25/address.txt’
    into table address;

  5. 创建学生表,其addressid字段关联了address表的addressid字段:

    create table student (name string, age int, addressid int)
    row format delimited
    fields terminated by ‘,’;

  6. 创建student.txt文件,内容如下:

    tom,11,1
    jerry,12,2
    mike,13,3
    john,14,4
    mary,15,5

  7. 加载数据到student表:

    load data
    local inpath ‘/home/hadoop/temp/202010/25/student.txt’
    into table student;

  8. 至此,本次操作所需数据已准备完毕,如下所示:

    hive> select from address;
    OK
    1 guangdong guangzhou
    2 guangdong shenzhen
    3 shanxi xian
    4 shanxi hanzhong
    6 jiangshu nanjing
    Time taken: 0.043 seconds, Fetched: 5 row(s)
    hive> select
    from student;
    OK
    tom 11 1
    jerry 12 2
    mike 13 3
    john 14 4
    mary 15 5
    Time taken: 0.068 seconds, Fetched: 5 row(s)

    • 开始体验内置函数;

总览

  1. 进入hive控制台;
  2. 执行命令show functions;显示内置函数列表:

    hive> show functions;
    OK
    !
    !=
    %
    &
    *
    +
    -
    /
    <
    <=
    <=>
    <>

    =

    >

    =
    ^
    abs
    acos
    add_months
    and
    array
    array_contains
    ascii
    asin
    assert_true
    atan
    avg
    base64
    between
    bin
    case
    cbrt
    ceil
    ceiling
    coalesce
    collect_list
    collect_set
    compute_stats
    concat
    concat_ws
    context_ngrams
    conv
    corr
    cos
    count
    covar_pop
    covar_samp
    create_union
    cume_dist
    current_database
    current_date
    current_timestamp
    current_user
    date_add
    date_format
    date_sub
    datediff
    day
    dayofmonth
    decode
    degrees
    dense_rank
    div
    e
    elt
    encode
    ewah_bitmap
    ewah_bitmap_and
    ewah_bitmap_empty
    ewah_bitmap_or
    exp
    explode
    factorial
    field
    find_in_set
    first_value
    floor
    format_number
    from_unixtime
    from_utc_timestamp
    get_json_object
    greatest
    hash
    hex
    histogram_numeric
    hour
    if
    in
    in_file
    index
    initcap
    inline
    instr
    isnotnull
    isnull
    java_method
    json_tuple
    lag
    last_day
    last_value
    lcase
    lead
    least
    length
    levenshtein
    like
    ln
    locate
    log
    log10
    log2
    lower
    lpad
    ltrim
    map
    map_keys
    map_values
    matchpath
    max
    min
    minute
    month
    months_between
    named_struct
    negative
    next_day
    ngrams
    noop
    noopstreaming
    noopwithmap
    noopwithmapstreaming
    not
    ntile
    nvl
    or
    parse_url
    parse_url_tuple
    percent_rank
    percentile
    percentile_approx
    pi
    pmod
    posexplode
    positive
    pow
    power
    printf
    radians
    rand
    rank
    reflect
    reflect2
    regexp
    regexp_extract
    regexp_replace
    repeat
    reverse
    rlike
    round
    row_number
    rpad
    rtrim
    second
    sentences
    shiftleft
    shiftright
    shiftrightunsigned
    sign
    sin
    size
    sort_array
    soundex
    space
    split
    sqrt
    stack
    std
    stddev
    stddev_pop
    stddev_samp
    str_to_map
    struct
    substr
    substring
    sum
    tan
    to_date
    to_unix_timestamp
    to_utc_timestamp
    translate
    trim
    trunc
    ucase
    unbase64
    unhex
    unix_timestamp
    upper
    var_pop
    var_samp
    variance
    weekofyear
    when
    windowingtablefunction
    xpath
    xpath_boolean
    xpath_double
    xpath_float
    xpath_int
    xpath_long
    xpath_number
    xpath_short
    xpath_string
    year
    |
    ~
    Time taken: 0.003 seconds, Fetched: 216 row(s)

  3. 以lower函数为例,执行命令describe function lower;即可查看lower函数的说明:

    hive> describe function lower;
    OK
    lower(str) - Returns str with all characters changed to lowercase
    Time taken: 0.005 seconds, Fetched: 1 row(s)

    • 接下来从计算函数开始,体验常用函数;
    • 先执行以下命令,使查询结果中带有字段名:

    set hive.cli.print.header=true;

计算函数

  1. 加法+:

    hive> select name, age, age+1 as add_value from student;
    OK
    name age add_value
    tom 11 12
    jerry 12 13
    mike 13 14
    john 14 15
    mary 15 16
    Time taken: 0.098 seconds, Fetched: 5 row(s)

  2. 减法(-)、乘法(*)、除法(/)的使用与加法类似,不再赘述了;

  3. 四舍五入round:

    hive> select round(1.1), round(1.6);
    OK
    _c0 _c1
    1.0 2.0
    Time taken: 0.028 seconds, Fetched: 1 row(s)

  4. 向上取整ceil:

    hive> select ceil(1.1);
    OK
    _c0
    2
    Time taken: 0.024 seconds, Fetched: 1 row(s)

  5. 向下取整floor:

    hive> select floor(1.1);
    OK
    _c0
    1
    Time taken: 0.024 seconds, Fetched: 1 row(s)

  6. 平方pow,例如pow(2,3)表示2的三次方,等于8:

    hive> select pow(2,3);
    OK
    _c0
    8.0
    Time taken: 0.027 seconds, Fetched: 1 row(s)

  7. 取模pmod:

    hive> select pmod(10,3);
    OK
    _c0
    1
    Time taken: 0.059 seconds, Fetched: 1 row(s)

字符函数

  1. 转小写lower,转大写upper:

    hive> select lower(name), upper(name) from student;
    OK
    _c0 _c1
    tom TOM
    jerry JERRY
    mike MIKE
    john JOHN
    mary MARY
    Time taken: 0.051 seconds, Fetched: 5 row(s)

  2. 字符串长度length:

    hive> select name, length(name) from student;
    OK
    tom 3
    jerry 5
    mike 4
    john 4
    mary 4
    Time taken: 0.322 seconds, Fetched: 5 row(s)

  3. 字符串拼接concat:

    hive> select concat(“prefix_”, name) from student;
    OK
    prefix_tom
    prefix_jerry
    prefix_mike
    prefix_john
    prefix_mary
    Time taken: 0.106 seconds, Fetched: 5 row(s)

  4. 子串substr,substr(xxx,2)表示从第二位开始到右边所有,substr(xxx,2,3)表示从第二位开始取三个字符:

    hive> select substr(“0123456”,2);
    OK
    123456
    Time taken: 0.067 seconds, Fetched: 1 row(s)
    hive> select substr(“0123456”,2,3);
    OK
    123
    Time taken: 0.08 seconds, Fetched: 1 row(s)

  5. 去掉前后空格trim:

    hive> select trim(“ 123 “);
    OK
    123
    Time taken: 0.065 seconds, Fetched: 1 row(s)

json处理(get_json_object)

为了使用json处理的函数,先准备一些数据:

  1. 先创建表t15,只有一个字段用于保存字符串:

    create table t15(json_raw string)
    row format delimited;

  2. 创建t15.txt文件,内容如下:

    {
    “name”:”tom”,”age”:”10”}
    {
    “name”:”jerry”,”age”:”11”}

  3. 加载数据到t15表:

    load data
    local inpath ‘/home/hadoop/temp/202010/25/015.txt’
    into table t15;

  4. 使用get_json_object函数,解析json_raw字段,分别取出指定name和age属性:

    select
    get_json_object(json_raw, “$.name”),
    get_json_object(json_raw, “$.age”)
    from t15;

得到结果:

  1. hive> select
  2. > get_json_object(json_raw, "$.name"),
  3. > get_json_object(json_raw, "$.age")
  4. > from t15;
  5. OK
  6. tom 10
  7. jerry 11
  8. Time taken: 0.081 seconds, Fetched: 2 row(s)

日期

  1. 获取当前日期current_date:

    hive> select current_date();
    OK
    2020-11-02
    Time taken: 0.052 seconds, Fetched: 1 row(s)

  2. 获取当前时间戳current_timestamp:

    hive> select current_timestamp();
    OK
    2020-11-02 10:07:58.967
    Time taken: 0.049 seconds, Fetched: 1 row(s)

  3. 获取年份year、月份month、日期day:

    hive> select year(current_date()), month(current_date()), day(current_date());
    OK
    2020 11 2
    Time taken: 0.054 seconds, Fetched: 1 row(s)

  4. 另外,year和current_timestamp也能搭配使用:

    hive> select year(current_timestamp()), month(current_timestamp()), day(current_timestamp());
    OK
    2020 11 2
    Time taken: 0.042 seconds, Fetched: 1 row(s)

  5. 返回日期部分to_date:

    hive> select to_date(current_timestamp());
    OK
    2020-11-02
    Time taken: 0.051 seconds, Fetched: 1 row(s)

条件函数

  • 条件函数的作用和java中的switch类似,语法是case X when XX then XXX else XXXX end;
  • 示例如下,作用是判断name字段,如果等于tom就返回tom_case,如果等于jerry就返回jerry_case,其他情况都返回other_case:

    select name,
    case name when ‘tom’ then ‘tom_case’

    1. when 'jerry' then 'jerry_case'
    2. else 'other_case'

    end
    from student;

结果如下:

  1. hive> select name,
  2. > case name when 'tom' then 'tom_case'
  3. > when 'jerry' then 'jerry_case'
  4. > else 'other_case'
  5. > end
  6. > from student;
  7. OK
  8. tom tom_case
  9. jerry jerry_case
  10. mike other_case
  11. john other_case
  12. mary other_case
  13. Time taken: 0.08 seconds, Fetched: 5 row(s)

聚合函数

  1. 返回行数count:

    select count(*) from student;

触发MR,结果如下:

  1. Total MapReduce CPU Time Spent: 2 seconds 170 msec
  2. OK
  3. 5
  4. Time taken: 20.823 seconds, Fetched: 1 row(s)
  1. 分组后组内求和sum:

    select province, sum(1) from address group by province;

触发MR,结果如下:

  1. Total MapReduce CPU Time Spent: 1 seconds 870 msec
  2. OK
  3. guangdong 2
  4. jiangshu 1
  5. shanxi 2
  6. Time taken: 19.524 seconds, Fetched: 3 row(s)
  1. 分组后,组内最小值min,最大值max,平均值avg:

    select province, min(addressid), max(addressid), avg(addressid) from address group by province;

触发MR,结果如下:

  1. Total MapReduce CPU Time Spent: 1 seconds 650 msec
  2. OK
  3. guangdong 1 2 1.5
  4. jiangshu 6 6 6.0
  5. shanxi 3 4 3.5
  6. Time taken: 20.106 seconds, Fetched: 3 row(s)
  • 至此,hive常用到内置函数咱们都体验过一遍了,希望能给您提供一些参考,接下来的文章会体验一个常用工具:Sqoop

欢迎关注公众号:程序员欣宸

微信搜索「程序员欣宸」,我是欣宸,期待与您一同畅游Java世界…

发表评论

表情:
评论列表 (有 0 条评论,322人围观)

还没有评论,来说两句吧...

相关阅读