hive 求最大最小值均值及对应的一个key键(对应行)
hive 求最大最小均值就不多说了。
此次在业务上碰到一个问题,是要求最大值、最小值、平均值的同时,还要求最大值、最小值对应行的key字段(以找到对应时间)。并且key中的时间戳还是0时区,在查询时要注意对时区进行转换。
表结构大致如下:
key double
id+时间 对应的double数值
方法一:join (最常见)
比如取最小值和最小值所在行对应的key:
select *
from
(
select min(d) as min
from
(
select key,d
from default.test_202006
where key>=50104160000000 and key<50104240000000
union all
select key,d
from default.test_202006
where key>=50105000000000 and key<50105160000000
)
)t1
left join
(
select key,d
from
(
select key,d
from default.test_202006
where key>=50104160000000 and key<50104240000000
union all
select key,d
from default.test_202006
where key>=50105000000000 and key<50105160000000
)
)t2
on t1.min=t2.d
这样要达到目的,min +1 join, max +1 join, + avg
方法二:
select * from table where double in (select min(double) from table)
select *
from
(
select *
from default.test_202006
where key>=50104160000000 and key<50104240000000
union all
select *
from default.test_202006
where key>=50105000000000 and key<50105160000000
) where d in ( select min(d) as min
from
(
select *
from default.test_202006
where key>=50104160000000 and key<50104240000000
union all
select *
from default.test_202006
where key>=50105000000000 and key<50105160000000
))
方法三: row_number()
下面的语句取最大最小值:
select tt1.d as min, tt1.key as minkey, tt2.d as max, tt2.key as maxkey from
(
select *
from
(
select row_number() over(partition by s order by d) as ord, key, d
from
(
select *
from default.test_202006
where key>=50104160000000 and key<50104240000000
union all
select *
from default.test_202006
where key>=50105000000000 and key<50105160000000
)
)t1 where t1.ord=1
)tt1,
(
select *
from
(
select row_number() over(partition by s order by d desc) as ord, key, d
from
(
select *
from default.test_202006
where key>=50104160000000 and key<50104240000000
union all
select *
from default.test_202006
where key>=50105000000000 and key<50105160000000
)
)t2 where t2.ord=1
)tt2
方法四:在某同学建议下搞的骚操作
row_number两次,一次正序一次倒序。取出来再用collect处理。这个效率在测试中是最快的。最后业务上也是采用了这个方式,比其他同学join的处理快出20+s。
select t3.d[0] as min, t3.key[0] as minkey, t3.d[1] as max, t3.key[1] as maxkey, t4.avg
from
(
select collect_set(key) as key, collect_set(d) as d
from
(
select '999' as num,
row_number() over(partition by s order by d ) as ord1,
row_number() over(partition by i order by d desc) as ord2,
key, d
from
(
select *
from default.test_202006
where key>=50104160000000 and key<50104240000000
union all
select *
from default.test_202006
where key>=50105000000000 and key<50105160000000
)
)t2 where t2.ord1=1 or t2.ord2=1 group by num
)t3,
(
select avg(d) as avg
from
(
select *
from default.test_202006
where key>=50104160000000 and key<50104240000000
union all
select *
from default.test_202006
where key>=50105000000000 and key<50105160000000
)
)t4
以上方法分组也都适用。另外还有开窗函数配合FIRST_VALUE(),LAST_VALUE() 也可以尝试。
还没有评论,来说两句吧...