【hive】行转列—explode()/posexplode()/lateral view 函数使用场景
文章目录
- 一、lateral view函数
- 二、explode()函数
- 三、posexplode()函数
- 四、行转列使用
- 单列转多行
- 多列转多行
一、lateral view函数
-
功能:
(图片来源网络,侵删)- 用于和UDTF函数(explode,split)结合使用,把某一行数据拆分成多行数据,再将多行结果组合成一个支持别名的虚拟表。
- 主要解决select使用UDTF做查询的过程中查询只能包含单个UDTF,不能包含其它字段以及多个UDTF的情况。
-
语法:LATERAL VIEW udtf(expression) tableAlias AS columnAlias (‘,’ columnAlias)
- columnAlias是给udtf(expression)列起的别名。
- tableAlias 虚拟表的别名。
-
lateral view函数会将lUDTF生成的结果放到一个虚拟表中,然后这个虚拟表和输入行进行join来达到连接UDTF外的select字段的目的。
使用方式:
举例1:与explode() 函数连用
--测试表的数据结构,可以看到第一列是id(string类型),第二列是rule_array(array类型) > desc test_2; --结果: +-------------+----------------+----------+ | col_name | data_type | comment | +-------------+----------------+----------+ | id | string | | | rule_array | array | | +-------------+----------------+----------+ --查看测试表的内容 > select * from test_2; --结果: +------------+----------------------------------------+ | test_2.id | test_2.rule_array | +------------+----------------------------------------+ | a | ["501","502","503","501","512"] | | b | ["511","512","513","511","512","511"] | | c | ["512","513","511","512"] | +------------+----------------------------------------+ --拆分rule_array > select id, rule_explode > from test_2 > lateral view explode(rule_array) adTable AS rule_explode; --结果: +-----+---------------+ | id | rule_explode | +-----+---------------+ | a | 501 | | a | 502 | | a | 503 | | a | 501 | | a | 512 | | b | 511 | | b | 512 | | b | 513 | | b | 511 | | b | 512 | | b | 511 | | c | 512 | | c | 513 | | c | 511 | | c | 512 | +-----+---------------+
举例2:与parse_url_tuple()函数连用
- parse_url_tuple()函数
- 语法:parse_url(string urlString, string partToExtract [, string keyToExtract])
- 功能:UDTF函数,解析URL字符串,它通过一次指定多个参数,从URL中解析出多个参数值返回多列。
- 参数:partToExtract的有效值为:HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO。
- 注意:在使用 select 时不能同时返回其他字段,不能嵌套调用,不能与group by放在一起调用等。为解决此问题,可以通过测试图(Lateral View)搭配调用。
--测试表的数据结构,可以看到第一列是id(int类型),第二列是url(string类型) > desc tb_url; --结果: +-----------+------------+----------+ | col_name | data_type | comment | +-----------+------------+----------+ | id | int | | | url | string | | +-----------+------------+----------+ --查看测试表的内容 > select * from tb_url; --结果: +------------+-------------------------------------------------+ | tb_url.id | tb_url.url | +------------+-------------------------------------------------+ | 1 | http://facebook.com/path/p1.php?query=1 | | 2 | http://www.baidu.com/news/index.jsp?uuid=frank | | 3 | http://www.jd.com/index?source=baidu | +------------+-------------------------------------------------+ --提取tb_url的query部分 > select > a.id as id, > c.query as query > from tb_url a > lateral view parse_url_tuple(url,"QUERY") c as query > ; --结果: +-----+---------------+ | id | query | +-----+---------------+ | 1 | query=1 | | 2 | uuid=frank | | 3 | source=baidu | +-----+---------------+
举例3:多个lateral view连用
--提取tb_url的HOST、PATH和QUERY部分 > select > a.id as id, > b.host as host, > b.path as path, > c.query as query > from tb_url a > lateral view parse_url_tuple(url,"HOST","PATH") b as host,path > lateral view parse_url_tuple(url,"QUERY") c as query > ; --结果: +-----+----------------+------------------+---------------+ | id | host | path | query | +-----+----------------+------------------+---------------+ | 1 | facebook.com | /path/p1.php | query=1 | | 2 | www.baidu.com | /news/index.jsp | uuid=frank | | 3 | www.jd.com | /index | source=baidu | +-----+----------------+------------------+---------------+
注意:
- lateral view函数的位置在from后where条件前。
- 生成虚拟表的别名不可省略。
- from后可带多个 lateral view函数。
- 如果要拆分的字段有null值,需要使用 lateral view outer替代,避免数据缺失。
二、explode()函数
- 功能:UDTF函数,可以将一个array或者map展开
- 语法:explode(map ,array )
- explode(array):将array列表里的每个元素生成一行
- explode(map):每一对元素作为一行,key为一列,value为一列
使用方式:
举例1:直接使用
--explode(array) > select explode(array(11,22,33)) as item; --结果: +-------+ | item | +-------+ | 11 | | 22 | | 33 | +-------+ --explode(map) > select explode(map("id",11,"name","lily","age",18)) as (k,v); --结果: +-------+-------+ | k | v | +-------+-------+ | id | 11 | | name | lily | | age | 18 | +-------+-------+
举例2:与lateral view一起使用
- 语法:lateral view explode(字段) 表别名 as 列别名
- 如果字段类型是map:lateral view explode(字段) 表别名 as (key别名,value别名)
> select * from test_2; --结果: +------------+----------------------------------------+ | test_2.id | test_2.rule_array | +------------+----------------------------------------+ | a | ["501","502","503","501","512"] | | b | ["511","512","513","511","512","511"] | | c | ["512","513","511","512"] | +------------+----------------------------------------+ > select id,rule > from test_2 > lateral view explode(rule_array) tbl as rule > ; --结果: +-----+-------+ | id | rule | +-----+-------+ | a | 501 | | a | 502 | | a | 503 | | a | 501 | | a | 512 | | b | 511 | | b | 512 | | b | 513 | | b | 511 | | b | 512 | | b | 511 | | c | 512 | | c | 513 | | c | 511 | | c | 512 | +-----+-------+
举例3:使用局限性
- 不能关联原有的表中其它字段。例如:
> select id, explode(rule_array) as rule from test_2; --报错: Error: Error while compiling statement: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions (state=42000,code=10081)
- 不能与group by、cluster by 、distribute by、sort by联用。例如:
> select explode(rule_array) as rule from test_2 group by explode(rule_array); --报错: Error: Error while compiling statement: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions (state=42000,code=10081)
- 不能进行UDTF嵌套。例如:
> select explode(explode(rule_array)) from test_2 ; --报错: Error: Error while compiling statement: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions (state=42000,code=10081)
- 不允许选择其他表达式。例如:
> select explode("1,2,3,4,5") from test_2 ; --报错: Error: Error while compiling statement: FAILED: UDFArgumentException explode() takes an array or a map as a parameter (state=42000,code=40000)
三、posexplode()函数
- 功能:UDTF函数,将一个array或者map展开,可以将index(索引)和数据都取出来,使用两次posexplode()并用where语句使两次取到的index相等,从而实现对多列进行多行转换。explode()函数只能将对一列进行行转换。
- 语法:posexplode(map ,array )
使用方式:
举例1:
> desc test_2; --结果: +-------------+----------------+----------+ | col_name | data_type | comment | +-------------+----------------+----------+ | id | string | | | dt_array | array | | | rule_array | array | | +-------------+----------------+----------+ > select * from test_2; --结果: +------------+----------------------------------------------------+----------------------------------------+ | test_2.id | test_2.dt_array | test_2.rule_array | +------------+----------------------------------------------------+----------------------------------------+ | a | ["20230809","20230811","20230812","20230812","20230813"] | ["501","502","503","501","512"] | | b | ["20230809","20230811","20230812","20230812","20230813","20230809"] | ["511","512","513","511","512","511"] | | c | ["20230811","20230812","20230812","20230813"] | ["512","513","511","512"] | +------------+----------------------------------------------------+----------------------------------------+ --注:语句中b.idxb和c.idxc分别是子视图的两个索引,我们where子句中使用b.idxb=c.idxc保证顺序的一致性。 > select a.id,b.cur_day,c.rule > from test_2 a > lateral view posexplode(dt_array) b as idxb,cur_day, > lateral view posexplode(rule_array) c as idxc,rule where b.idxb=c.idxc > ; --结果: +-------+------------+---------+ | a.id | b.cur_day | c.rule | +-------+------------+---------+ | a | 20230809 | 501 | | a | 20230811 | 502 | | a | 20230812 | 503 | | a | 20230812 | 501 | | a | 20230813 | 512 | | b | 20230809 | 511 | | b | 20230811 | 512 | | b | 20230812 | 513 | | b | 20230812 | 511 | | b | 20230813 | 512 | | b | 20230809 | 511 | | c | 20230811 | 512 | | c | 20230812 | 513 | | c | 20230812 | 511 | | c | 20230813 | 512 | +-------+------------+---------+
四、行转列使用
单列转多行
举例1:使用explode()+lateral view
> select t.id,a.cur_day > from test_2 t > lateral view explode(t.dt_array) a as cur_day > ; --结果: +-------+------------+ | t.id | a.cur_day | +-------+------------+ | b | 20230809 | | b | 20230811 | | b | 20230812 | | b | 20230812 | | b | 20230813 | | b | 20230809 | | c | 20230811 | | c | 20230812 | | c | 20230812 | | c | 20230813 | | a | 20230809 | | a | 20230811 | | a | 20230812 | | a | 20230812 | | a | 20230813 | +-------+------------+
举例2:explode字段是string格式,先split()指定分隔符,如果省略split()则使用默认分隔符是英文逗号。
> desc test_3; --结果: +-----------+------------+----------+ | col_name | data_type | comment | +-----------+------------+----------+ | id | string | | | dt | string | | | rule1 | string | | +-----------+------------+----------+ > select * from test_3; --结果: +------------+----------------------------------------------------+--------------------------+ | test_3.id | test_3.dt | test_3.rule1 | +------------+----------------------------------------------------+--------------------------+ | a | 20230809,20230811,20230812,20230812,20230813 | 501,502,503,501,512 | | b | 20230809,20230811,20230812,20230812,20230813,20230809 | 511,512,513,511,512,511 | | c | 20230811,20230812,20230812,20230813 | 512,513,511,512 | +------------+----------------------------------------------------+--------------------------+ > select t.id,a.cur_day > from test_3 t > lateral view explode(split(t.dt,',')) a as cur_day > ; --结果: +-------+------------+ | t.id | a.cur_day | +-------+------------+ | a | 20230809 | | a | 20230811 | | a | 20230812 | | a | 20230812 | | a | 20230813 | | b | 20230809 | | b | 20230811 | | b | 20230812 | | b | 20230812 | | b | 20230813 | | b | 20230809 | | c | 20230811 | | c | 20230812 | | c | 20230812 | | c | 20230813 | +-------+------------+
多列转多行
举例1:使用posexplode()+lateral view
先测试使用explode(), 看看效果:
> select a.id,b.cur_day,c.rule > from test_2 a > lateral view explode(dt_array) b as cur_day, > lateral view explode(rule_array) c as rule > ; --结果: +-------+------------+---------+ | a.id | b.cur_day | c.rule | +-------+------------+---------+ | b | 20230809 | 511 | | b | 20230809 | 512 | | b | 20230809 | 513 | | b | 20230809 | 511 | | b | 20230809 | 512 | | b | 20230809 | 511 | | b | 20230811 | 511 | | b | 20230811 | 512 | | b | 20230811 | 513 | | b | 20230811 | 511 | | b | 20230811 | 512 | | b | 20230811 | 511 | | b | 20230812 | 511 | | b | 20230812 | 512 | | b | 20230812 | 513 | | b | 20230812 | 511 | | b | 20230812 | 512 | | b | 20230812 | 511 | | b | 20230812 | 511 | | b | 20230812 | 512 | | b | 20230812 | 513 | | b | 20230812 | 511 | | b | 20230812 | 512 | | b | 20230812 | 511 | | b | 20230813 | 511 | | b | 20230813 | 512 | | b | 20230813 | 513 | | b | 20230813 | 511 | | b | 20230813 | 512 | | b | 20230813 | 511 | | b | 20230809 | 511 | | b | 20230809 | 512 | | b | 20230809 | 513 | | b | 20230809 | 511 | | b | 20230809 | 512 | | b | 20230809 | 511 | | c | 20230811 | 512 | | c | 20230811 | 513 | | c | 20230811 | 511 | | c | 20230811 | 512 | | c | 20230812 | 512 | | c | 20230812 | 513 | | c | 20230812 | 511 | | c | 20230812 | 512 | | c | 20230812 | 512 | | c | 20230812 | 513 | | c | 20230812 | 511 | | c | 20230812 | 512 | | c | 20230813 | 512 | | c | 20230813 | 513 | | c | 20230813 | 511 | | c | 20230813 | 512 | | a | 20230809 | 501 | | a | 20230809 | 502 | | a | 20230809 | 503 | | a | 20230809 | 501 | | a | 20230809 | 512 | | a | 20230811 | 501 | | a | 20230811 | 502 | | a | 20230811 | 503 | | a | 20230811 | 501 | | a | 20230811 | 512 | | a | 20230812 | 501 | | a | 20230812 | 502 | | a | 20230812 | 503 | | a | 20230812 | 501 | | a | 20230812 | 512 | | a | 20230812 | 501 | | a | 20230812 | 502 | | a | 20230812 | 503 | | a | 20230812 | 501 | | a | 20230812 | 512 | | a | 20230813 | 501 | | a | 20230813 | 502 | | a | 20230813 | 503 | | a | 20230813 | 501 | | a | 20230813 | 512 | +-------+------------+---------+
出现这种情况,是因为两个并列的explode()的hql没办法识别cur_day对应的rule是什么,对于多个数组的行转列可以使用posexplode()函数。
例如使用如下查询语句:
> select a.id,b.cur_day,c.rule > from test_2 a > lateral view posexplode(dt_array) b as idxb,cur_day, > lateral view posexplode(rule_array) c as idxc,rule where b.idxb=c.idxc > ; --结果: +-------+------------+---------+ | a.id | b.cur_day | c.rule | +-------+------------+---------+ | b | 20230809 | 511 | | b | 20230811 | 512 | | b | 20230812 | 513 | | b | 20230812 | 511 | | b | 20230813 | 512 | | b | 20230809 | 511 | | c | 20230811 | 512 | | c | 20230812 | 513 | | c | 20230812 | 511 | | c | 20230813 | 512 | | a | 20230809 | 501 | | a | 20230811 | 502 | | a | 20230812 | 503 | | a | 20230812 | 501 | | a | 20230813 | 512 | +-------+------------+---------+
- 不允许选择其他表达式。例如:
- 不能进行UDTF嵌套。例如:
- 不能与group by、cluster by 、distribute by、sort by联用。例如:
- 不能关联原有的表中其它字段。例如:
- parse_url_tuple()函数
-
文章版权声明:除非注明,否则均为主机测评原创文章,转载或复制请以超链接形式并注明出处。