一个hive数仓清洗小案例:单词统计 比眉伴天荒 2022-10-01 00:53 83阅读 0赞 【需求】统计某一款产品的用户中,分别有多少用户是同一个年龄。例如25岁的有100人,60岁的有20人。 // 我已提前在hive数据仓库中建表customers,也已经把本地文件加载到了hive仓库里 **hive> desc customers;** OK id int name string age int Time taken: 0.044 seconds, Fetched: 3row(s) **hive> select \* from customers limit 15;// 分别对应id, name, age** OK 1 tom 12 2 tommy 13 3 kevin 14 4 goodman 15 5 tonykid 13 6 alice 14 7 alex 14 8 lucyclaire 13 9 woody 15 10 garyking 14 11 johnson 15 12 tracy 14 13 carl 13 14 hisonburg 58 15 laura 14 // 切割出age列,放到数组里,给每一行数组起个别名 **hive> select split(age, ' ') as AGE from customers;** OK \["12"\] \["13"\] \["14"\] \["15"\] \["13"\] \["14"\] \["14"\] \["13"\] \["15"\] \["14"\] \["15"\] \["14"\] \["13"\] \["58"\] // 炸开每一行数组后,把每个数组中的元素连接成为一个新表,把表名命名为AgeTable // 再从这个新表里查出AGE字段及其对应的出现次数 **hive> select AgeTable.AGE, count(\*) from ((select explode(split(age, ' ')) as AGE from customers) as AgeTable) group by AgeTable.AGE;** Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 8.83 sec HDFSRead: 9463 HDFS Write: 206 SUCCESS Total MapReduce CPU Time Spent: 8 seconds830 msec OK 12 3 13 6 14 7 15 5 27 1 48 3 58 5 // 获得统计结果后,需要存起来,供其他业务需求再次使用, // 所以要建个新表(也可以建一个外部表),把刚才的统计结果直接写到表里 **hive> CREATE TABLE customer\_age\_group AS select AgeTable.AGE, count(\*) AS age\_count FROM((select explode(split(age, ' ')) as AGE from customers) as AgeTable) group by AgeTable.AGE;** Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.98 sec HDFSRead: 9187 HDFS Write: 117 SUCCESS Total MapReduce CPU Time Spent: 6 seconds980 sec **hive> show tables;**// 成功将结果写入新表 OK customer\_age\_group customer\_roj\_orders customers **hive> desc customer\_age\_group;** OK age string age\_count bigint // 字段别名,hive给的默认数据类型是大整数 // 写入表时,按照统计结果的顺序写入 **hive> select \* from customer\_age\_group;** OK 12 3 13 6 14 7 15 5 27 1 48 3 58 5
还没有评论,来说两句吧...