I'm trying to read a JSON file with the Spark 1.4.1 DataFrame reader and navigate within it. The inferred schema seems to be incorrect.
The JSON file is:
{
"FILE": {
"TUPLE_CLI": [{
"ID_CLI": "C3-00000004",
"TUPLE_ABO": [{
"ID_ABO": "T0630000000000004",
"TUPLE_CRA": {
"CRA": "T070000550330",
"EFF": "Success"
},
"TITRE_ABO": ["Mr",
"OOESGUCKDO"],
"DATNAISS": "1949-02-05"
},
{
"ID_ABO": "T0630000000100004",
"TUPLE_CRA": [{
"CRA": "T070000080280",
"EFF": "Success"
},
{
"CRA": "T070010770366",
"EFF": "Failed"
}],
"TITRE_ABO": ["Mrs",
"NP"],
"DATNAISS": "1970-02-05"
}]
},
{
"ID_CLI": "C3-00000005",
"TUPLE_ABO": [{
"ID_ABO": "T0630000000000005",
"TUPLE_CRA": [{
"CRA": "T070000200512",
"EFF": "Success"
},
{
"CRA": "T070010410078",
"EFF": "Success"
}],
"TITRE_ABO": ["Miss",
"OB"],
"DATNAISS": "1926-11-22"
}]
}]
}
}

The Spark code is:
val j = sqlContext.read.json("/user/arthur/test.json")
j.printSchema

The result is:
root
|-- FILE: struct (nullable = true)
| |-- TUPLE_CLI: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- ID_CLI: string (nullable = true)
| | | |-- TUPLE_ABO: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- DATNAISS: string (nullable = true)
| | | | | |-- ID_ABO: string (nullable = true)
| | | | | |-- TITRE_ABO: array (nullable = true)
| | | | | | |-- element: string (containsNull = true)
| | | | | |-- TUPLE_CRA: string (nullable = true)

Clearly, TUPLE_CRA is an array. I don't understand why it wasn't inferred as one. In my view, the inferred schema should be:
root
|-- FILE: struct (nullable = true)
| |-- TUPLE_CLI: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- ID_CLI: string (nullable = true)
| | | |-- TUPLE_ABO: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- DATNAISS: string (nullable = true)
| | | | | |-- ID_ABO: string (nullable = true)
| | | | | |-- TITRE_ABO: array (nullable = true)
| | | | | | |-- element: string (containsNull = true)
| | | | | |-- TUPLE_CRA: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- CRA: string (nullable = true)
| | | | | | | |-- EFF: string (nullable = true)

Does anyone have an explanation? And if the JSON schema were much more complex, is there an easy way to tell Spark what the actual schema is?
Posted on 2015-11-26 15:25:01

OK, I finally understood: the JSON is not what I expected. Notice that the first TUPLE_CRA is a single element without square brackets []. The other TUPLE_CRA fields are arrays, with brackets and several elements inside. That is why Spark cannot infer the structure accurately. So the problem comes from the generation of this JSON; I need to fix it so that every TUPLE_CRA is an array, even when it contains only one element.
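As for the second part of the question: you can bypass inference entirely by passing an explicit schema to the reader via `DataFrameReader.schema`, which is available in Spark 1.4. A minimal sketch, with field names taken from the JSON above (it assumes the corrected JSON in which every TUPLE_CRA is an array):

```scala
import org.apache.spark.sql.types._

// Build the schema bottom-up, matching the intended structure,
// so Spark does not have to infer it from inconsistent data.
val craType = StructType(Seq(
  StructField("CRA", StringType, nullable = true),
  StructField("EFF", StringType, nullable = true)))

val aboType = StructType(Seq(
  StructField("DATNAISS", StringType, nullable = true),
  StructField("ID_ABO", StringType, nullable = true),
  StructField("TITRE_ABO", ArrayType(StringType), nullable = true),
  StructField("TUPLE_CRA", ArrayType(craType), nullable = true)))

val cliType = StructType(Seq(
  StructField("ID_CLI", StringType, nullable = true),
  StructField("TUPLE_ABO", ArrayType(aboType), nullable = true)))

val schema = StructType(Seq(
  StructField("FILE", StructType(Seq(
    StructField("TUPLE_CLI", ArrayType(cliType), nullable = true)))))) 

// Pass the schema explicitly instead of letting Spark guess it.
val j = sqlContext.read.schema(schema).json("/user/arthur/test.json")
j.printSchema()
```

Note that an explicit schema only skips inference; it does not repair the data. Records where TUPLE_CRA is a bare object instead of an array will still not match the declared ArrayType and will come back as null, so the JSON generation still needs the fix described above.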
https://stackoverflow.com/questions/33940472