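Queries

Sorting within each reducer (sort by)

Sort By: on large data sets, order by is very inefficient because a global sort sends every row through a single reducer. In many cases a global ordering is not needed, and sort by can be used instead.

Sort by produces one sorted file per reducer. Rows are sorted within each reducer, but the overall result set is not globally sorted.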

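1) Set the number of reducers:

hive (default)> set mapreduce.job.reduces=3;

2) Check the configured number of reducers:

hive (default)> set mapreduce.job.reduces;

3) View employee information sorted by department number in descending order:

hive (default)> select * from emp sort by deptno desc;

4) Write the query result to files (sorted by department number in descending order):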

hive (default)> insert overwrite local directory '/opt/module/hive/datas/sortby-result' select * from emp sort by deptno desc;
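The difference between sort by and a global order by can be illustrated outside Hive. The following is a minimal Python sketch, not Hive's implementation: the rows, the reducer count of 3, and the round-robin assignment of rows to reducers are all assumptions made for illustration. Each simulated reducer sorts only its own rows by deptno in descending order, so every output is sorted but their concatenation is not.

```python
# Toy model of sort by: each reducer sorts only the rows it receives.
# The rows and the round-robin assignment are hypothetical, not Hive's logic.

rows = [
    {"ename": "A", "deptno": 10},
    {"ename": "B", "deptno": 30},
    {"ename": "C", "deptno": 20},
    {"ename": "D", "deptno": 30},
    {"ename": "E", "deptno": 10},
    {"ename": "F", "deptno": 20},
]


def sort_by_desc(rows, key, num_reducers):
    """Spread rows over reducers, then sort each reducer's rows descending."""
    reducers = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        reducers[i % num_reducers].append(row)  # assumed assignment
    return [sorted(r, key=lambda x: x[key], reverse=True) for r in reducers]


outputs = sort_by_desc(rows, "deptno", 3)
for n, out in enumerate(outputs):
    print("reducer", n, [r["deptno"] for r in out])

# Each reducer output is sorted on its own ...
per_reducer_sorted = all(
    [r["deptno"] for r in out]
    == sorted((r["deptno"] for r in out), reverse=True)
    for out in outputs
)
# ... but reading the files one after another is not a global ordering.
concatenated = [r["deptno"] for out in outputs for r in out]
globally_sorted = concatenated == sorted(concatenated, reverse=True)
print(per_reducer_sorted, globally_sorted)  # True False
```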

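Partitioning (distribute by)

Distribute By: in some cases we need to control which reducer a particular row goes to, usually so that a later aggregation can be performed. The distribute by clause does this. It is similar to a partitioner (custom partitioning) in MapReduce: it partitions the rows and is used together with sort by.

When testing distribute by, be sure to allocate multiple reducers; otherwise its effect is not visible.

(1) Partition by department number, then sort by salary in descending order: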

hive (default)> set mapreduce.job.reduces=3;

hive (default)> insert overwrite local directory '/opt/module/hive/datas/distribute-result' select * from emp distribute by deptno sort by sal desc;

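Notes:

➢ distribute by assigns rows by taking the hash code of the partitioning column modulo the number of reducers; rows with the same remainder go to the same partition.

➢ Hive requires the distribute by clause to be written before the sort by clause.

➢ After the demonstration, set mapreduce.job.reduces back to -1; otherwise the MapReduce jobs that load data into the partitioned or bucketed tables below will fail.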
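The hash-and-remainder rule in the notes above can be sketched in a few lines of Python. This is only an illustration: the emp rows are made up, and the sketch assumes that the hash code of an integer key is the integer itself.

```python
# Toy model of "distribute by deptno sort by sal desc" with 3 reducers.
# Assumption: the hash code of an integer key is the integer itself.

NUM_REDUCERS = 3

emp = [  # hypothetical (ename, deptno, sal) rows
    ("A", 10, 2450.0),
    ("B", 20, 800.0),
    ("C", 30, 1600.0),
    ("D", 20, 3000.0),
    ("E", 30, 950.0),
    ("F", 10, 5000.0),
]


def reducer_for(deptno, num_reducers=NUM_REDUCERS):
    """Rows whose hash code leaves the same remainder share a reducer."""
    return deptno % num_reducers


def distribute_then_sort(rows):
    parts = {i: [] for i in range(NUM_REDUCERS)}
    for row in rows:
        parts[reducer_for(row[1])].append(row)
    # sort by sal desc is applied inside each reducer only
    return {
        i: sorted(p, key=lambda r: r[2], reverse=True)
        for i, p in parts.items()
    }


result = distribute_then_sort(emp)
for reducer, rows in result.items():
    print(reducer, rows)
```

With these rows, department 30 lands on reducer 0, department 10 on reducer 1, and department 20 on reducer 2, and each reducer's rows are ordered by salary from high to low.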

grouping sets

cube

Data cube: Hive Cube (CSDN blog)

Usage of with cube, with rollup, and grouping sets in Hive (CSDN blog)

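Functions

Single-row functions

nvl: replaces null values

nvl(A, B) returns B if A is null; otherwise it returns A.

concat_ws: joins strings or an array of strings with a given separator

Syntax: concat_ws(string A, string… | array(string))

Returns: string

Description: joins multiple strings, or all elements of an array, using the separator A.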
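The behavior of nvl and concat_ws described above can be mimicked with two small Python stand-ins. These are hypothetical helpers written for illustration, with None playing the role of NULL; how Hive's concat_ws treats NULL arguments is not modeled here.

```python
# Python stand-ins for nvl and concat_ws; None plays the role of NULL.


def nvl(a, b):
    """Return b when a is NULL (None), otherwise a."""
    return b if a is None else a


def concat_ws(sep, *args):
    """Join strings and/or lists of strings with the separator sep."""
    parts = []
    for arg in args:
        if isinstance(arg, (list, tuple)):
            parts.extend(arg)  # an array contributes all of its elements
        else:
            parts.append(arg)
    return sep.join(parts)


print(nvl(None, 0))                   # 0
print(nvl(5, 0))                      # 5
print(concat_ws("-", "a", "b", "c"))  # a-b-c
print(concat_ws(",", ["x", "y"]))     # x,y
```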

get_json_object: parses a JSON string

Syntax: get_json_object(string json_string, string path)

Returns: string

Description: parses the JSON string json_string and returns the content selected by path. If the input JSON string is invalid, NULL is returned.

hive> select get_json_object('[{"name":"大海海","sex":"男","age":"25"},{"name":"小宋宋","sex":"男","age":"47"}]','$.[0].name');

大海海

hive> select get_json_object('[{"name":"大海海","sex":"男","age":"25"},{"name":"小宋宋","sex":"男","age":"47"}]','$.[0]');

{"name":"大海海","sex":"男","age":"25"}
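To make the path lookup concrete, here is a rough Python analogue built on the standard json module. It is a sketch, not Hive's parser: instead of a path string such as $.[0].name, it takes a hypothetical list of steps (list indexes or object keys), and it returns None where Hive would return NULL.

```python
import json

doc = (
    '[{"name":"大海海","sex":"男","age":"25"},'
    '{"name":"小宋宋","sex":"男","age":"47"}]'
)


def get_json_object(json_string, steps):
    """Walk parsed JSON along steps (list indexes or object keys).

    Returns a string, or None when the JSON is invalid or a step is missing.
    The steps list is a hypothetical stand-in for a path like $.[0].name.
    """
    try:
        value = json.loads(json_string)
        for step in steps:
            value = value[step]
    except (ValueError, KeyError, IndexError, TypeError):
        return None
    if isinstance(value, str):
        return value
    # Non-string results are returned as compact JSON text.
    return json.dumps(value, ensure_ascii=False, separators=(",", ":"))


print(get_json_object(doc, [0, "name"]))  # 大海海
print(get_json_object(doc, [0]))
print(get_json_object("not json", [0]))   # None
```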

unix_timestamp: returns the timestamp of the current time or of a specified time

Syntax: unix_timestamp([string date[, string pattern]])

Returns: bigint

hive> select unix_timestamp('2022/08/08 08-08-08','yyyy/MM/dd HH-mm-ss');

1659946088

from_unixtime: converts a UNIX timestamp (the number of seconds from 1970-01-01 00:00:00 UTC to the given time) into a time string in the current time zone

Syntax: from_unixtime(bigint unixtime[, string format])

Returns: string
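The round trip between a formatted time string and epoch seconds can be checked with Python's datetime module. This is a sketch under two assumptions: the time zone is taken to be UTC, and Python's strptime directives (%Y/%m/%d) stand in for Hive's pattern letters (yyyy/MM/dd). The two function names merely mirror the Hive functions.

```python
from datetime import datetime, timezone


def unix_timestamp(text, fmt):
    """Parse text with a strptime format and return epoch seconds (UTC assumed)."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def from_unixtime(seconds, fmt="%Y-%m-%d %H:%M:%S"):
    """Format epoch seconds as a string (UTC assumed)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(fmt)


ts = unix_timestamp("2022/08/08 08-08-08", "%Y/%m/%d %H-%M-%S")
print(ts)                 # 1659946088
print(from_unixtime(ts))  # 2022-08-08 08:08:08
```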

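current_date: the current date

current_timestamp: the current date and time, precise to the millisecond

date_add: adds days to a date

Syntax: date_add(string startdate, int days)

Returns: string

Description: returns the date that is days days after the start date startdate.

date_sub: subtracts days from a date

Syntax: date_sub(string startdate, int days)

Returns: string

Description: returns the date that is days days before the start date startdate.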
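The date arithmetic above is easy to verify by hand. A minimal Python sketch, assuming dates are given as yyyy-MM-dd strings (the helper names simply mirror the Hive functions):

```python
from datetime import date, timedelta


def date_add(startdate, days):
    """Return the ISO date that is `days` days after startdate."""
    return (date.fromisoformat(startdate) + timedelta(days=days)).isoformat()


def date_sub(startdate, days):
    """Return the ISO date that is `days` days before startdate."""
    return date_add(startdate, -days)


print(date_add("2022-08-08", 2))  # 2022-08-10
print(date_sub("2022-08-08", 9))  # 2022-07-30
```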

size: the number of elements in a collection

hive> select size(friends) from test; -- 2/2: the number of elements in the friends collection of each row

map: creates a map collection

Syntax: map (key1, value1, key2, value2, …)

Description: builds a map from the given key and value pairs.

hive> select map('xiaohai',1,'dahai',2);

{"xiaohai":1,"dahai":2}

map_keys: returns the keys of a map

hive> select map_keys(map('xiaohai',1,'dahai',2));

["xiaohai","dahai"]

map_values: returns the values of a map

hive> select map_values(map('xiaohai',1,'dahai',2));

[1,2]

array: declares an array collection

Syntax: array(val1, val2, …)

Description: builds an array from the input arguments.

array_contains: checks whether an array contains a given element

hive> select array_contains(array('a','b','c','d'),'a');

true

sort_array: sorts the elements of an array

hive> select sort_array(array('a','d','c'));

["a","c","d"]

struct: declares the field values of a struct

Syntax: struct(val1, val2, val3, …)

Description: builds a struct from the input arguments. As the output shows, the fields are named col1, col2, and so on.

hive> select struct('name','age','weight');

{"col1":"name","col2":"age","col3":"weight"}

named_struct: declares the field names and values of a struct

hive> select named_struct('name','xiaosong','age',18,'weight',80);

{"name":"xiaosong","age":18,"weight":80}

Advanced aggregate functions

collect_list: collects values into a list; duplicates are kept

hive> select sex, collect_list(job) from employee group by sex;

女 ["行政","研发","行政","前台"]
男 ["销售","研发","销售","前台"]

collect_set: collects values into a set; duplicates are removed

hive> select sex, collect_set(job) from employee group by sex;

女 ["行政","研发","前台"]
男 ["销售","研发","前台"]
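The two outputs above differ only in whether repeated jobs are kept. A small Python sketch of the grouping follows; the employee rows are made up to match the results shown, and the first-seen ordering of the deduplicated values is a property of this sketch rather than something the source states about Hive.

```python
from collections import defaultdict

employee = [  # hypothetical (sex, job) rows
    ("女", "行政"), ("女", "研发"), ("女", "行政"), ("女", "前台"),
    ("男", "销售"), ("男", "研发"), ("男", "销售"), ("男", "前台"),
]


def collect(rows, dedupe):
    """Group jobs by sex; drop repeated jobs when dedupe is True."""
    groups = defaultdict(list)
    for sex, job in rows:
        if dedupe and job in groups[sex]:
            continue  # collect_set keeps one copy of each value
        groups[sex].append(job)
    return dict(groups)


print(collect(employee, dedupe=False))  # like collect_list
print(collect(employee, dedupe=True))   # like collect_set
```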

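Explode function

explode splits a complex array or map structure in a single Hive row into multiple rows.

lateral view is used together with UDTFs such as explode (typically applied to the array returned by split). It turns one row into several rows, and the expanded rows can then be aggregated. lateral view first calls the UDTF for every row of the base table; the UDTF splits each row into one or more rows, and lateral view then combines the results into a virtual table that supports a table alias.

In short, explode splits one row of a complex structure into multiple rows, and lateral view makes those rows available for aggregation, as in the following query: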

SELECT
	cate, COUNT(*) cnt
FROM movie_info
LATERAL VIEW EXPLODE(SPLIT(category, ',')) t1 AS cate
GROUP BY cate;
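What the query above computes can be traced step by step in Python. This is an illustrative sketch with made-up movie_info rows; the generator plays the role of lateral view explode(split(...)), and the Counter plays the role of group by with count(*).

```python
from collections import Counter

movie_info = [  # hypothetical (movie, category) rows
    ("m1", "suspense,action,sci-fi"),
    ("m2", "suspense,drama"),
    ("m3", "action,drama,war"),
]


def lateral_view_explode(rows):
    """Yield one (movie, cate) row per element of the split category string."""
    for movie, category in rows:
        for cate in category.split(","):
            yield movie, cate


exploded = list(lateral_view_explode(movie_info))
cnt = Counter(cate for _, cate in exploded)
for cate, n in sorted(cnt.items()):
    print(cate, n)
```

Three input rows become eight exploded rows, which are then counted per category.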

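Window functions

lag and lead

Function: get the value of a column from a row that lies a given number of rows before (lag) or after (lead) the current row.

first_value and last_value

Function: get the first value and the last value within the window.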
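These four functions can be sketched over a plain Python list that stands for one ordered partition. The sal values are hypothetical, and the sketch assumes the window spans the whole partition; with a narrower window frame, first_value and last_value would return different results.

```python
def lag(values, offset=1, default=None):
    """Value `offset` rows before each row, or default when there is none."""
    return [
        values[i - offset] if i - offset >= 0 else default
        for i in range(len(values))
    ]


def lead(values, offset=1, default=None):
    """Value `offset` rows after each row, or default when there is none."""
    return [
        values[i + offset] if i + offset < len(values) else default
        for i in range(len(values))
    ]


def first_value(values):
    """First value of the window, repeated for every row (whole partition)."""
    return [values[0]] * len(values) if values else []


def last_value(values):
    """Last value of the window, repeated for every row (whole partition)."""
    return [values[-1]] * len(values) if values else []


sal = [800, 950, 1600, 3000]  # hypothetical partition, ordered by sal
print(lag(sal))          # [None, 800, 950, 1600]
print(lead(sal))         # [950, 1600, 3000, None]
print(first_value(sal))  # [800, 800, 800, 800]
print(last_value(sal))   # [3000, 3000, 3000, 3000]
```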

Partitioned tables

create table dept_partition
(
deptno int,    -- department number
dname  string, -- department name
loc    string  -- department location
)
partitioned by (day string)
row format delimited fields terminated by '\t';

Loading data

load data local inpath '/opt/module/hive/datas/dept_20220401.log'  
into table dept_partition  
partition(day='20220401');

Inserting data

insert overwrite table dept_partition partition (day = '20220402')
select deptno, dname, loc
from dept_partition
where day = '20220401';

Querying data

select deptno, dname, loc, day
from dept_partition
where day = '20220401';

Adding a single partition

alter table dept_partition add partition(day='20220403');

Adding multiple partitions (no commas between the partition specs)

alter table dept_partition add partition(day='20220404') partition(day='20220405');

Dropping a single partition

alter table dept_partition drop partition (day='20220403');

Dropping multiple partitions (commas are required between the partition specs)

alter table dept_partition drop partition (day='20220404'), partition(day='20220405');

Two-level partitioning

create table dept_partition2(
deptno int,    -- department number
dname string,  -- department name
loc string     -- department location
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';

Dynamic partitioning

insert into table dept_partition_dynamic  
partition(loc)  
select  
deptno,  
dname,  
loc  
from dept;

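There is no need to specify a partition value; each row is routed to a partition automatically according to the value of the partition column (loc), which must come last in the select list.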
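The routing done by the dynamic-partition insert above can be sketched as a grouping step in Python. The dept rows are hypothetical, and the sketch only models which rows end up together; the key=value naming (loc=...) follows Hive's partition directory convention.

```python
from collections import defaultdict

dept = [  # hypothetical (deptno, dname, loc) rows
    (10, "ACCOUNTING", "1700"),
    (20, "RESEARCH", "1800"),
    (30, "SALES", "1900"),
    (40, "OPERATIONS", "1700"),
]


def dynamic_partition_insert(rows):
    """Group rows by the last selected column, keyed by partition name."""
    partitions = defaultdict(list)
    for deptno, dname, loc in rows:
        # The partition value comes from the data, not from the statement.
        partitions[f"loc={loc}"].append((deptno, dname))
    return dict(partitions)


partitions = dynamic_partition_insert(dept)
for name, rows in sorted(partitions.items()):
    print(name, rows)
```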

Bucketed tables

create table stu_buck( 
id int,  
name string 
) 
clustered by(id)  
into 4 buckets 
row format delimited fields terminated by '\t';

Bucketed sorted tables

create table stu_buck_sort( 
    id int,  
    name string 
) 
clustered by(id) sorted by(id)
into 4 buckets 
row format delimited fields terminated by '\t';
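The effect of clustered by(id) into 4 buckets, with and without sorted by(id), can be pictured with a short Python sketch. The bucketing rule used here (a non-negative hash of the key modulo the bucket count, with an integer hashing to itself) is an assumption that the notes do not spell out, and the student rows are made up.

```python
NUM_BUCKETS = 4

students = [  # hypothetical (id, name) rows
    (1008, "s8"), (1001, "s1"), (1004, "s4"), (1005, "s5"), (1002, "s2"),
]


def bucket_for(id_, num_buckets=NUM_BUCKETS):
    """Assumed rule: non-negative hash of the key modulo the bucket count."""
    return (id_ & 0x7FFFFFFF) % num_buckets


def write_buckets(rows, sort_within=False):
    """Place each row in its bucket; optionally sort each bucket by id."""
    buckets = {b: [] for b in range(NUM_BUCKETS)}
    for row in rows:
        buckets[bucket_for(row[0])].append(row)
    if sort_within:  # models sorted by(id)
        buckets = {b: sorted(r) for b, r in buckets.items()}
    return buckets


plain = write_buckets(students)
ordered = write_buckets(students, sort_within=True)
print([r[0] for r in plain[0]])    # [1008, 1004]
print([r[0] for r in ordered[0]])  # [1004, 1008]
```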

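Hive file formats

Choosing a suitable file format for the data in a Hive table is very helpful for query performance. The storage formats available for Hive table data include text file, orc, parquet, and sequence file.

ORC file format

ORC (Optimized Row Columnar) is a columnar storage file format introduced in Hive 0.11. ORC files improve Hive's performance when reading, writing, and processing data.

(1) Characteristics of row storage: when a query needs an entire row that matches a condition, column storage has to visit each column's cluster to find that row's values, whereas row storage only has to locate one value and the rest are stored next to it, so row storage is faster for this kind of query.

(2) Characteristics of column storage: because the values of each column are stored together, a query that needs only a few columns reads far less data; and because all values of a column share the same data type, compression algorithms can be tailored to each column.

The text file and sequence file formats mentioned earlier are row-based; orc and parquet are column-based.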
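The trade-off between the two layouts can be seen by flattening a tiny table both ways in Python. This is a conceptual sketch with made-up data and hypothetical helper names; it ignores everything a real format such as ORC adds (stripes, indexes, compression).

```python
table = [  # hypothetical rows: (id, name, sal)
    (1, "a", 100),
    (2, "b", 200),
    (3, "c", 300),
]
NROWS, NCOLS = len(table), 3

# Row layout: the values of one row are adjacent.
row_layout = [v for row in table for v in row]
# Column layout: the values of one column are adjacent.
col_layout = [row[c] for c in range(NCOLS) for row in table]


def read_row(layout, i):
    """Row layout: one contiguous slice returns the whole row."""
    return tuple(layout[i * NCOLS:(i + 1) * NCOLS])


def read_column(layout, c):
    """Column layout: one contiguous slice returns a whole column."""
    return layout[c * NROWS:(c + 1) * NROWS]


print(row_layout)                  # [1, 'a', 100, 2, 'b', 200, 3, 'c', 300]
print(col_layout)                  # [1, 2, 3, 'a', 'b', 'c', 100, 200, 300]
print(read_row(row_layout, 1))     # (2, 'b', 200)
print(read_column(col_layout, 2))  # [100, 200, 300]
```

Fetching a full row is a single contiguous read in the row layout, while fetching one column (for example sal for a sum) is a single contiguous read in the column layout.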

create table orc_table 
(column_specs) 
stored as orc 
tblproperties (property_name=property_value, ...);
