一个hive数仓清洗小案例：单词统计

比眉伴天荒 2022-10-01 00:53 83阅读 0赞

【需求】统计某一款产品的用户中，分别有多少用户是同一个年龄。例如25岁的有100人，60岁的有20人。

// 我已提前在hive数据仓库中建表customers，也已经把本地文件加载到了hive仓库里

**hive> desc customers;**

OK

id int

name string

age int

Time taken: 0.044 seconds, Fetched: 3row(s)

**hive> select \* from customers limit 15;// 分别对应id, name, age**

OK

1 tom 12

2 tommy 13

3 kevin 14

4 goodman 15

5 tonykid 13

6 alice 14

7 alex 14

8 lucyclaire 13

9 woody 15

10 garyking 14

11 johnson 15

12 tracy 14

13 carl 13

14 hisonburg 58

15 laura 14

// 切割出age列，放到数组里，给每一行数组起个别名

**hive> select split(age, ' ') as AGE from customers;**

OK

\["12"\]

\["13"\]

\["14"\]

\["15"\]

\["13"\]

\["14"\]

\["14"\]

\["13"\]

\["15"\]

\["14"\]

\["15"\]

\["14"\]

\["13"\]

\["58"\]

// 炸开每一行数组后，把每个数组中的元素连接成为一个新表，把表名命名为AgeTable

// 再从这个新表里查出AGE字段及其对应的出现次数

**hive> select  AgeTable.AGE, count(\*) from ((select explode(split(age, ' ')) as AGE from customers) as AgeTable) group by AgeTable.AGE;**

Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 8.83 sec HDFSRead: 9463 HDFS Write: 206 SUCCESS

Total MapReduce CPU Time Spent: 8 seconds830 msec

OK

12 3

13 6

14 7

15 5

27 1

48 3

58 5

// 获得统计结果后，需要存起来，供其他业务需求再次使用，

// 所以要建个新表（也可以建一个外部表），把刚才的统计结果直接写到表里

**hive> CREATE TABLE customer\_age\_group AS select AgeTable.AGE, count(\*) AS age\_count FROM((select explode(split(age, ' ')) as AGE from customers) as AgeTable) group by AgeTable.AGE;**

Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.98 sec HDFSRead: 9187 HDFS Write: 117 SUCCESS

Total MapReduce CPU Time Spent: 6 seconds980 sec

**hive> show tables;**// 成功将结果写入新表

OK

customer\_age\_group

customer\_roj\_orders

customers

**hive> desc customer\_age\_group;**

OK

age string

age\_count bigint // 字段别名，hive给的默认数据类型是大整数

// 写入表时，按照统计结果的顺序写入

**hive> select \* from customer\_age\_group;**

OK

12 3

13 6

14 7

15 5

27 1

48 3

58 5

发表评论取消回复

表情：

评论列表（有 0 条评论，83人围观）

还没有评论，来说两句吧...

相关阅读