Hive Partitioned Tables

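Each partition of a partitioned table corresponds to its own directory on HDFS, and that directory holds all of the partition's data files. Partitioning in Hive is therefore a matter of splitting data into directories: a large dataset is divided into smaller ones according to business needs. When the WHERE clause of a query selects only the partitions it needs, Hive reads just those directories, which can make the query considerably faster.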
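
The directory-per-partition idea can be sketched as follows. This assumes the stu_part table (partitioned by month) that is created later in this article, and the directory names in the comments are the layout one would expect under the warehouse path used below rather than captured output.

```sql
-- Sketch only: assumes the stu_part table built later in this article.
-- Each partition is expected to live in its own directory, for example:
--   /user/hive/warehouse/ylj_db.db/stu_part/month=201904
--   /user/hive/warehouse/ylj_db.db/stu_part/month=201905

-- Filtering on the partition column lets Hive skip the other partition directories.
select id, name from stu_part where month = '201904';

-- EXPLAIN DEPENDENCY reports the tables and partitions the query would read.
explain dependency select id, name from stu_part where month = '201904';
```

If the filter is dropped, the dependency output should list every partition of the table, which is a quick way to check that a query is actually being pruned.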

Requirement

Logs need to be managed according to the date on which they were generated.

Data preparation

[root@hadoop102 stu_part]# pwd
/opt/module/hive-1.2.1/datas/stu_part
[root@hadoop102 stu_part]# ll 
total 16
-rw-r--r--. 1 root root 33 Apr   3 17:54 20190429.txt
-rw-r--r--. 1 root root 33 Apr   3 17:53 20190430.txt
-rw-r--r--. 1 root root 33 Apr   3 17:54 20190501.txt
-rw-r--r--. 1 root root 33 Apr   3 17:54 20190502.txt

Create the partitioned table

create table stu_part(id int,name string) partitioned by (month string) row format delimited  fields terminated  by '\t';
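
Note that month is a partition column: it is not stored in the data files, and its value comes from the partition directory name. A small sketch of how the table definition might be inspected after creation (both statements are standard Hive commands; the exact output layout depends on the Hive version):

```sql
-- Lists the regular columns (id, name) and, in a separate section,
-- the partition column (month).
describe formatted stu_part;

-- Prints the DDL Hive stored for the table, including the PARTITIONED BY clause.
show create table stu_part;
```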

Load data into the partitioned table

hive (ylj_db)> load data local inpath '/opt/module/hive-1.2.1/datas/stu_part/20190429.txt' overwrite into table stu_part partition(month='201904');
Loading data to table ylj_db.stu_part partition (month=201904)
Partition ylj_db.stu_part{month=201904} stats: [numFiles=1, numRows=0, totalSize=33, rawDataSize=0]
OK
Time taken: 0.765 seconds
hive (ylj_db)> load data local inpath '/opt/module/hive-1.2.1/datas/stu_part/20190502.txt' overwrite into table stu_part partition(month='201905');
Loading data to table ylj_db.stu_part partition (month=201905)
Partition ylj_db.stu_part{month=201905} stats: [numFiles=1, numRows=0, totalSize=33, rawDataSize=0]
OK
Time taken: 0.806 seconds
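
With the overwrite keyword, each load replaces whatever files the target partition already holds; without it, the new file is added alongside the existing ones. To see where the files ended up, the partition directories can be listed from the Hive CLI. This is a sketch that assumes the default warehouse path used elsewhere in this article:

```sql
-- One sub-directory per partition is expected under the table directory.
dfs -ls /user/hive/warehouse/ylj_db.db/stu_part;

-- The loaded file should appear inside its partition directory.
dfs -ls /user/hive/warehouse/ylj_db.db/stu_part/month=201904;
```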

Query

The queries in this section are run through Hive JDBC, which returns nicely formatted output ☺️

Query all partitions

0: jdbc:hive2://hadoop102:10000> select * from stu_part;
+--------------+----------------+-----------------+--+
| stu_part.id  | stu_part.name  | stu_part.month  |
+--------------+----------------+-----------------+--+
| 1            | zhangsan       | 201904          |
| 2            | lisi           | 201904          |
| 3            | houzi          | 201904          |
| 4            | tuzi           | 201904          |
| 1            | zhangsan       | 201905          |
| 2            | lisi           | 201905          |
| 3            | houzi          | 201905          |
| 4            | tuzi           | 201905          |
+--------------+----------------+-----------------+--+

Query a single partition

0: jdbc:hive2://hadoop102:10000> select * from stu_part where month='201904';
+--------------+----------------+-----------------+--+
| stu_part.id  | stu_part.name  | stu_part.month  |
+--------------+----------------+-----------------+--+
| 1            | zhangsan       | 201904          |
| 2            | lisi           | 201904          |
| 3            | houzi          | 201904          |
| 4            | tuzi           | 201904          |
+--------------+----------------+-----------------+--+
4 rows selected (0.503 seconds)

Query multiple partitions together

select * from stu_part where month='201904'
union
select * from stu_part where month='201905'
union
select * from stu_part where month='201906';
select * from stu_part where month='201904' or month='201905';
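
The two statements above are alternatives. Two further variants, offered as a sketch rather than as the author's method: IN is a more compact way to write the OR filter, and UNION ALL keeps duplicate rows, whereas plain UNION removes them.

```sql
-- IN on the partition column selects both partitions in a single query.
select * from stu_part where month in ('201904', '201905');

-- UNION ALL concatenates the results without de-duplicating them.
select * from stu_part where month = '201904'
union all
select * from stu_part where month = '201905';
```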

Adding and dropping partitions

Add a single partition

alter table stu_part add partition(month='201906') ;

Add multiple partitions at once

alter table stu_part add partition(month='201907') partition(month='201908');

Drop a single partition

alter table stu_part drop partition (month='201906');

Drop multiple partitions at once

alter table stu_part drop partition (month='201907'),partition (month='201908');
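
Note the difference in syntax: ADD separates multiple partition specs with whitespace, while DROP separates them with commas. Because stu_part is a managed table, dropping a partition also deletes its data directory. For scripts that may be re-run, the IF NOT EXISTS and IF EXISTS clauses are worth considering; a minimal sketch:

```sql
-- Adding a partition that already exists normally fails;
-- IF NOT EXISTS turns the statement into a no-op instead.
alter table stu_part add if not exists partition(month='201906');

-- IF EXISTS makes it explicit that a missing partition is acceptable.
alter table stu_part drop if exists partition(month='201906');
```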

List the partitions of a partitioned table

hive (ylj_db)> show partitions stu_part;
OK
partition
month=201904
month=201905
Time taken: 0.124 seconds, Fetched: 2 row(s)

Two-level partitioned table

Create the table

create table stu_part2(id int,name string) partitioned by (month string,day string) row format delimited  fields terminated  by '\t';

Load data the normal way

hive (ylj_db)> load data local inpath '/opt/module/hive-1.2.1/datas/stu_part/20190429.txt' overwrite into table stu_part2 partition(month='201904',day='29');
Loading data to table ylj_db.stu_part2 partition (month=201904, day=29)
Partition ylj_db.stu_part2{month=201904, day=29} stats: [numFiles=1, numRows=0, totalSize=33, rawDataSize=0]
OK
Time taken: 0.62 seconds

Query the data

hive (ylj_db)> select * from stu_part2 where month='201904' and day='29';
OK
stu_part2.id    stu_part2.name  stu_part2.month stu_part2.day
1       zhangsan        201904  29
2       lisi    201904  29
3       houzi   201904  29
4       tuzi    201904  29
Time taken: 0.244 seconds, Fetched: 4 row(s)
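
With two partition columns the directories nest, so the data loaded above is expected under .../stu_part2/month=201904/day=29. A query does not have to filter on both levels; the following sketch filters on the first level only and uses a partial partition spec to list the days registered for one month:

```sql
-- Filtering on month alone reads every day partition under month=201904.
select * from stu_part2 where month = '201904';

-- A partial partition spec narrows SHOW PARTITIONS to a single month.
show partitions stu_part2 partition(month='201904');
```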

Three ways to associate data with a partitioned table

Upload the data, then repair

Create the partition directory

dfs -mkdir -p /user/hive/warehouse/ylj_db.db/stu_part/month=201906;

Upload the data

dfs -put /opt/module/hive-1.2.1/datas/stu_part/20190601.txt /user/hive/warehouse/ylj_db.db/stu_part/month=201906;

Query the data

The data that was just uploaded cannot be queried yet, because the metastore has no metadata for this partition.

hive (ylj_db)> select * from stu_part where month='201906';
OK
stu_part.id     stu_part.name   stu_part.month
Time taken: 0.16 seconds
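
The empty result can be explained by comparing what HDFS holds with what the metastore knows. A sketch of that check, assuming the same warehouse path as above:

```sql
-- The directory and the uploaded file exist on HDFS...
dfs -ls /user/hive/warehouse/ylj_db.db/stu_part/month=201906;

-- ...but month=201906 should be missing from this list until the
-- partition is registered in the metastore.
show partitions stu_part;
```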

Run the repair command

hive (ylj_db)> msck repair table stu_part;
OK
Partitions not in metastore:    stu_part:month=201906
Repair: Added partition to metastore stu_part:month=201906
Time taken: 0.343 seconds, Fetched: 2 row(s)
hive (ylj_db)> select * from stu_part where month='201906';
OK
stu_part.id     stu_part.name   stu_part.month
1       zhangsan        201906
2       lisi    201906
3       houzi   201906
4       tuzi    201906
Time taken: 0.147 seconds, Fetched: 4 row(s)

Upload the data, then add the partition

Upload the data

hive (ylj_db)> dfs -mkdir -p /user/hive/warehouse/ylj_db.db/stu_part/month=201907;
hive (ylj_db)> dfs -put /opt/module/hive-1.2.1/datas/stu_part/20190701.txt /user/hive/warehouse/ylj_db.db/stu_part/month=201907;

Add the partition

alter table stu_part add partition(month='201907');

Query the data

hive (ylj_db)> select * from stu_part where month='201907';
OK
stu_part.id     stu_part.name   stu_part.month
1       zhangsan        201907
2       lisi    201907
3       houzi   201907
4       tuzi    201907
Time taken: 0.128 seconds, Fetched: 4 row(s)
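
This works because ADD PARTITION without a LOCATION clause points the partition at the default month=... directory under the table path, which is exactly where the file was uploaded. The location can also be stated explicitly, which is useful when the files live somewhere else. A sketch using the same directory as above:

```sql
-- Equivalent to the default here, but spelled out; IF NOT EXISTS makes
-- the statement safe to run when the partition is already registered.
alter table stu_part add if not exists partition(month='201907')
location '/user/hive/warehouse/ylj_db.db/stu_part/month=201907';
```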

Create the directory, then load data into the partition

Create the directory

dfs -mkdir -p /user/hive/warehouse/ylj_db.db/stu_part/month=201908;

Load the data

load data local inpath '/opt/module/hive-1.2.1/datas/stu_part/20190801.txt' into table stu_part partition(month='201908');

Query the data

hive (ylj_db)> select * from stu_part where month='201908';
OK
stu_part.id     stu_part.name   stu_part.month
1       zhangsan        201908
2       lisi    201908
3       houzi   201908
4       tuzi    201908
Time taken: 0.128 seconds, Fetched: 4 row(s)
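
In this third approach the mkdir step is likely optional, since LOAD DATA normally creates the partition directory and registers the partition in the metastore when the target partition does not exist yet. A minimal sketch, assuming a hypothetical file 20190901.txt in the same local directory and a hypothetical partition month=201909 (neither appears in the original walkthrough):

```sql
-- Hypothetical file and partition, for illustration only.
load data local inpath '/opt/module/hive-1.2.1/datas/stu_part/20190901.txt'
into table stu_part partition(month='201909');

-- The new partition should now be listed without any mkdir or repair step.
show partitions stu_part;
```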