我的XML文件具有以下结构。
<records>
<record customer_id=0001>
<msg>
<demographics gender=F agecat=1 edcat=1 jobcat=2 empcat=2 retire=0 jobsat=1 marital=1 spousedcat=1 residecat=4 homeown=0 hometype=2 addresscat=2/>
<demographics gender=F agecat=3 edcat=5 jobcat=2 empcat=0 retire=0 jobsat=3 marital=2 spousedcat=1 residecat=4 homeown=0 hometype=3 addresscat=2/>
.....
</msg>
</record>
</records>我想要最终的结果看起来像蜂巢
0001 F 1 1 2 2 0 1 1 1 4 0 2 2
0001 F 3 5 2 0 0 3 2 1 4 0 3 2
0001 ....
0001 ....我尝试了下面这样的方法。
add jar /usr/lib/hue/hivexmlserde-1.0.0.0.jar;
CREATE external TABLE pbp (
gender string, agecat int, edcat int, jobcat int, empcat int, retire int, jobsat int, spousedcat int, homeown int, hometype int, addresscat int
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.gender"="/demographics/@gender",
"column.xpath.agecat"="/demographics/@agecat",
"column.xpath.edcat"="/demographics/@edcat",
"column.xpath.jobcat"="/demographics/@jobcat",
"column.xpath.empcat"="/demographics/@empcat",
"column.xpath.retire"="/demographics/@retire",
"column.xpath.jobsat"="/demographics/@jobsat",
"column.xpath.spousedcat"="/demographics/@spousedcat",
"column.xpath.homeown"="/demographics/@homeown",
"column.xpath.hometype"="/demographics/@hometype",
"column.xpath.addresscat"="/demographics/@addresscat",
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<demographics",
"xmlinput.end"="/>"
);但是,在创建customer_id表时,我不知道如何在另一列中插入相同的PBP值。也请建议我正在做的事情是正确的,因为我是这个环境的新手。如有任何帮助,我们不胜感激!
发布于 2020-05-13 22:39:38
我知道这已经很晚了,如果你还在寻找答案,那么我们必须将开始和结束标记从“记录标记”本身开始,而不是“人口统计标记”,并将数据作为字符串数组使用。我已经修改了下面的create语句,
add jar /usr/lib/hue/hivexmlserde-1.0.0.0.jar;
CREATE external TABLE pbp (
customer_id string, gender ARRAY<string>, agecat ARRAY<int>, edcat ARRAY<int>, jobcat ARRAY<int>, empcat ARRAY<int>, retire ARRAY<int>, jobsat ARRAY<int>, spousedcat ARRAY<int>, homeown ARRAY<int>, hometype ARRAY<int>, addresscat ARRAY<int>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.customer_id"="/record/@customer_id",
"column.xpath.gender"="/record/msg/demographics/@gender",
"column.xpath.agecat"="/record/msg/demographics/@agecat",
"column.xpath.edcat"="/record/msg/demographics/@edcat",
"column.xpath.jobcat"="/record/msg/demographics/@jobcat",
"column.xpath.empcat"="/record/msg/demographics/@empcat",
"column.xpath.retire"="/record/msg/demographics/@retire",
"column.xpath.jobsat"="/record/msg/demographics/@jobsat",
"column.xpath.spousedcat"="/record/msg/demographics/@spousedcat",
"column.xpath.homeown"="/record/msg/demographics/@homeown",
"column.xpath.hometype"="/record/msg/demographics/@hometype",
"column.xpath.addresscat"="/record/msg/demographics/@addresscat",
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<record",
"xmlinput.end"="/record>"
);稍后,您必须使用POSEXPLODE数组逐行显示数据的非规范化视图以及customer_id,假设性别属性对于所有人口统计数据都是强制的。
CREATE TABLE IF NOT EXISTS pbp_denormalized STORED AS <FILE_FORMAT> AS
SELECT
customer_id, n.gender, agecat[pos] AS agecat, edcat[pos] AS edcat, jobcat[pos] AS jobcat, empcat[pos] AS empcat, retire[pos] AS retire, jobsat[pos] AS jobsat, spousedcat[pos] AS spousedcat, homeown[pos] AS homeown, hometype[pos] AS hometype, addresscat[pos] AS addresscat
FROM pbp
LATERAL VIEW POSEXPLODE(gender) n AS pos, gender;上面的代码应该会得到您想要的结果。
https://stackoverflow.com/questions/31996797
复制相似问题