首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >配置单元Xml serDe创建表

配置单元Xml serDe创建表
EN

Stack Overflow用户
提问于 2015-08-14 03:25:29
回答 1查看 911关注 0票数 3

我的XML文件具有以下结构。

代码语言:javascript
复制
<records>
    <record customer_id=0001>
      <msg>
        <demographics gender=F agecat=1 edcat=1 jobcat=2 empcat=2 retire=0 jobsat=1 marital=1 spousedcat=1 residecat=4 homeown=0 hometype=2 addresscat=2/>
        <demographics gender=F agecat=3 edcat=5 jobcat=2 empcat=0 retire=0 jobsat=3 marital=2 spousedcat=1 residecat=4 homeown=0 hometype=3 addresscat=2/>
      .....
      </msg>
    </record>
</records>

我想要最终的结果看起来像蜂巢

代码语言:javascript
复制
0001 F 1 1 2 2 0 1 1 1 4 0 2 2
0001 F 3 5 2 0 0 3 2 1 4 0 3 2
0001 ....
0001 ....

我尝试了下面这样的方法。

代码语言:javascript
复制
add jar /usr/lib/hue/hivexmlserde-1.0.0.0.jar;

CREATE external TABLE pbp (
  gender string, agecat int, edcat int, jobcat int, empcat int, retire int, jobsat int, spousedcat int, homeown int, hometype int, addresscat int
  )
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.gender"="/demographics/@gender",
  "column.xpath.agecat"="/demographics/@agecat",
  "column.xpath.edcat"="/demographics/@edcat",
  "column.xpath.jobcat"="/demographics/@jobcat",
  "column.xpath.empcat"="/demographics/@empcat",
  "column.xpath.retire"="/demographics/@retire",
  "column.xpath.jobsat"="/demographics/@jobsat",
  "column.xpath.spousedcat"="/demographics/@spousedcat",
  "column.xpath.homeown"="/demographics/@homeown",
  "column.xpath.hometype"="/demographics/@hometype",
  "column.xpath.addresscat"="/demographics/@addresscat",
  )
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
  "xmlinput.start"="<demographics",
  "xmlinput.end"="/>"
);

但是,在创建customer_id表时,我不知道如何在另一列中插入相同的PBP值。也请建议我正在做的事情是正确的,因为我是这个环境的新手。如有任何帮助,我们不胜感激!

EN

回答 1

Stack Overflow用户

发布于 2020-05-13 22:39:38

我知道这已经很晚了,如果你还在寻找答案,那么我们必须将开始和结束标记从“记录标记”本身开始,而不是“人口统计标记”,并将数据作为字符串数组使用。我已经修改了下面的create语句,

代码语言:javascript
复制
add jar /usr/lib/hue/hivexmlserde-1.0.0.0.jar;

CREATE external TABLE pbp (
  customer_id string, gender ARRAY<string>, agecat ARRAY<int>, edcat ARRAY<int>, jobcat ARRAY<int>, empcat ARRAY<int>, retire ARRAY<int>, jobsat ARRAY<int>, spousedcat ARRAY<int>, homeown ARRAY<int>, hometype ARRAY<int>, addresscat ARRAY<int>
  )
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.customer_id"="/record/@customer_id",
  "column.xpath.gender"="/record/msg/demographics/@gender",
  "column.xpath.agecat"="/record/msg/demographics/@agecat",
  "column.xpath.edcat"="/record/msg/demographics/@edcat",
  "column.xpath.jobcat"="/record/msg/demographics/@jobcat",
  "column.xpath.empcat"="/record/msg/demographics/@empcat",
  "column.xpath.retire"="/record/msg/demographics/@retire",
  "column.xpath.jobsat"="/record/msg/demographics/@jobsat",
  "column.xpath.spousedcat"="/record/msg/demographics/@spousedcat",
  "column.xpath.homeown"="/record/msg/demographics/@homeown",
  "column.xpath.hometype"="/record/msg/demographics/@hometype",
  "column.xpath.addresscat"="/record/msg/demographics/@addresscat",
  )
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
  "xmlinput.start"="<record",
  "xmlinput.end"="/record>"
);

稍后,您必须使用POSEXPLODE数组逐行显示数据的非规范化视图以及customer_id,假设性别属性对于所有人口统计数据都是强制的。

代码语言:javascript
复制
CREATE TABLE IF NOT EXISTS pbp_denormalized STORED AS <FILE_FORMAT> AS
SELECT
customer_id, n.gender, agecat[pos] AS agecat, edcat[pos] AS edcat, jobcat[pos] AS jobcat, empcat[pos] AS empcat, retire[pos] AS retire, jobsat[pos] AS jobsat, spousedcat[pos] AS spousedcat, homeown[pos] AS homeown, hometype[pos] AS hometype, addresscat[pos] AS addresscat
FROM pbp
LATERAL VIEW POSEXPLODE(gender) n AS pos, gender;

上面的代码应该会得到您想要的结果。

票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/31996797

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档