How to Use Hive Dynamic Partition Tables

Published: 2021-10-12 14:11:02  Source: 亿速云  Author: iii  Category: Programming Languages

This article explains how to use dynamic partitions in Hive: when they are needed, which parameters must be set first, and how to insert into a partitioned table without naming every partition by hand.

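Purpose
If we insert partition data day by day, we can name the static partition explicitly in each insert. When the partition names are not known in advance, we have to use dynamic partitioning to populate the partitioned table.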

Example
Prepare the following customer data. The fields are id, name, and orderdate.

1,jack,2016/11/11
2,michael,2016/11/12
3,summer,2016/11/13
4,spring,2016/11/14
5,nero,2016/11/15
6,book,2016/12/21
7,node,2016/12/22
8,tony,2016/12/23
9,green,2016/12/24
10,andy,2016/12/25
11,kaith,2016/12/26
12,spring,2016/12/27
13,andy,2016/12/28
14,tony,2016/12/29
15,green,2016/12/30
16,andy,2016/12/31
17,kaith,2017/1/1
18,xiaoming,2017/1/2

We load this data into a table named t_temp.

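-- Staging table for the raw comma-separated customer file
create table t_temp(id int,name string,orderdate string)
row format delimited
fields terminated by ',';

-- Load the local file into the staging table
load data local inpath '/home/spark/jar/testdata/Customer.txt' into table t_temp;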
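
Before going further, it can help to confirm that the file was parsed as intended. The query below is a minimal sketch, assuming the t_temp table and the 18-row sample file shown above; the column aliases are only illustrative. If the delimiter matched, both counts should be 18 for the sample data.

```sql
-- count(*) counts every row; count(id) skips rows where id is NULL,
-- which is what a mis-parsed line would produce for an int column.
select count(*)  as total_rows,
       count(id) as rows_with_id
from t_temp;
```

A difference between the two counts would suggest that some lines did not match the declared field delimiter.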

Next, create the partitioned table t_part.

create table if not exists t_part
(id int ,name string ,orderdate string)
partitioned by (year string,month string)
row format delimited
fields terminated by ',';

With static partitioning, we might insert the data with a statement like this:

insert into t_part partition(year = '2016',month = '12')
select id,name,orderdate from t_temp
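where substring(orderdate,1,7) = '2016/12';

This approach works when there are only a few partitions. When there are many partitions, or the partition names are not known in advance, we need dynamic partitioning.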

Hive parameter configuration
Before using dynamic partitioning, a few parameters need to be configured.

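hive.exec.dynamic.partition
Default: false before Hive 0.9.0, true from 0.9.0 onward

Whether dynamic partitioning is enabled.

It must be set to true in order to use dynamic partitioning.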

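hive.exec.dynamic.partition.mode
Default: strict

The dynamic partition mode. In strict mode, at least one partition column must be given a static value. In nonstrict mode, all partition columns may be dynamic.

It usually needs to be set to nonstrict.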
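
As a rough sketch of the difference between the two modes, the statement below keeps year static and leaves only month dynamic, which strict mode accepts. It assumes the t_temp and t_part tables from the example section and dates in year/month/day form; the alias month_part is only illustrative.

```sql
set hive.exec.dynamic.partition = true;

-- Static columns come first in the partition clause, dynamic ones after them.
-- The value for the dynamic column (month) is taken from the last select column.
insert overwrite table t_part partition(year = '2016', month)
select id, name, orderdate,
       lpad(split(orderdate, '/')[1], 2, '0') as month_part
from t_temp
where split(orderdate, '/')[0] = '2016';
```

With every partition column left dynamic, as in partition(year, month), the same statement would require nonstrict mode.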

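hive.exec.max.dynamic.partitions.pernode
Default: 100

The maximum number of dynamic partitions that can be created on each node running the MR job (each mapper or reducer).

Set this according to the actual data.

For example, if the source data covers a full year, so that the day column has 365 distinct values, this parameter needs to be set higher than 365. With the default of 100 the job fails with an error.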

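hive.exec.max.dynamic.partitions
Default: 1000

The maximum total number of dynamic partitions that can be created across all nodes running the MR job.

It is sized the same way as the previous parameter.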
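
One way to size these two limits is to count the distinct partition values in the source before running the insert. The following is a sketch under assumptions: it uses the t_temp table from the example, derives year and month from orderdate in year/month/day form, and the limit values at the end are illustrative rather than recommendations. For the sample data the count is 3.

```sql
-- Number of distinct (year, month) combinations the insert would create.
select count(*) as partition_count
from (
  select distinct
         split(orderdate, '/')[0]               as y,
         lpad(split(orderdate, '/')[1], 2, '0') as m
  from t_temp
) p;

-- Raise the limits only when the count exceeds the defaults.
set hive.exec.max.dynamic.partitions.pernode = 400;
set hive.exec.max.dynamic.partitions = 2000;
```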

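hive.exec.max.created.files
Default: 100000

The maximum number of HDFS files that can be created by the whole MR job.

The default is usually enough. Raise it only if the data volume is very large and the job needs to create more than 100000 files.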

hive.error.on.empty.partition
Default: false

Whether to throw an exception when an empty partition is generated.

It usually does not need to be set.

With these parameters set, we can run the following insert statement to use dynamic partitioning:

set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
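-- The dynamic partition columns (year, month) must be the last select columns, in that order.
-- split + lpad handles both '2016/11/11' and '2017/1/1' and yields a two-digit month.
insert overwrite table t_part partition(year,month)
select id,name,orderdate,split(orderdate,'/')[0],lpad(split(orderdate,'/')[1],2,'0') from t_temp;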

The output is as follows:

Loading data to table test_neil.t_part partition (year=null, month=null)
Time taken for load dynamic partitions : 651
Loading partition {year=2016, month=12}
Loading partition {year=2017, month=01}
Loading partition {year=2016, month=11}
Time taken for adding to write entity : 1
Partition test_neil.t_part{year=2016, month=11} stats: [numFiles=1, numRows=5, totalSize=97, rawDataSize=92]
Partition test_neil.t_part{year=2016, month=12} stats: [numFiles=1, numRows=11, totalSize=210, rawDataSize=199]
Partition test_neil.t_part{year=2017, month=01} stats: [numFiles=1, numRows=2, totalSize=43, rawDataSize=41]
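
To cross-check the load, the row count per partition can be compared with the numRows values in the log. This is a minimal sketch, assuming the t_part table from the example; row_count is an illustrative alias.

```sql
-- One output row per partition; partition columns can be queried like regular columns.
select year, month, count(*) as row_count
from t_part
group by year, month
order by year, month;
```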

We can then check the table's partitions:

show partitions t_part;

The partitions are listed as follows:

partition
year=2016/month=11
year=2016/month=12
year=2017/month=01
year=__HIVE_DEFAULT_PARTITION__/month=__HIVE_DEFAULT_PARTITION__
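
The last entry, __HIVE_DEFAULT_PARTITION__, is the partition Hive uses when a dynamic partition value is NULL or an empty string. It does not appear in the load log above, so it presumably comes from a different run. The sketch below, which assumes the t_temp and t_part tables from the example and dates in year/month/day form, looks for source rows that could end up there and then drops the partition if it is not wanted.

```sql
-- Source rows whose orderdate is missing or does not have three '/'-separated parts.
-- size() of a NULL array is -1, so NULL dates are caught by the same test.
select *
from t_temp
where size(split(orderdate, '/')) < 3;

-- Drop the default partition once its rows are confirmed to be unwanted.
alter table t_part drop if exists
  partition (year = '__HIVE_DEFAULT_PARTITION__', month = '__HIVE_DEFAULT_PARTITION__');
```

Dropping a partition of a managed table also deletes its data, so the check should come first.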

That concludes this walkthrough of Hive dynamic partition tables. Combining the theory with practice is the best way to learn it, so try the steps above on your own data.
