I am working with Spark DataFrames that may load data from one of three different schema versions:
// Original
{ "A": {"B": 1 } }
// Addition "C"
{ "A": {"B": 1 }, "C": 2 }
// Additional "A.D"
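{ "A": {"B": 1, "D": 3 }, "C": 2 }
I can handle the additional "C" by checking whether the schema contains a field "C" and, if it does not, adding a new column to the DataFrame. However, I can't work out how to create a field for the sub-object.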
public void evolvingSchema() {
    String versionOne = "{ \"A\": {\"B\": 1 } }";
    String versionTwo = "{ \"A\": {\"B\": 1 }, \"C\": 2 }";
    String versionThree = "{ \"A\": {\"B\": 1, \"D\": 3 }, \"C\": 2 }";
    process(spark.getContext(), "1", versionOne);
    process(spark.getContext(), "2", versionTwo);
    process(spark.getContext(), "3", versionThree);
}

private static void process(JavaSparkContext sc, String version, String data) {
    SQLContext sqlContext = new SQLContext(sc);
    DataFrame df = sqlContext.read().json(sc.parallelize(Arrays.asList(data)));
    // Top-level field: add "C" as a null column when the inferred schema lacks it.
    if (!Arrays.asList(df.schema().fieldNames()).contains("C")) {
        df = df.withColumn("C", org.apache.spark.sql.functions.lit(null));
    }
    // Not sure what to put here. The fieldNames does not contain the "A.D"
    try {
        df.select("C").collect();
    } catch (Exception e) {
        System.out.println("Failed to C for " + version);
    }
    try {
        df.select("A.D").collect();
    } catch (Exception e) {
        System.out.println("Failed to A.D for " + version);
    }
}

Posted 2015-11-23 18:05:29
The JSON source is not well suited to data with an evolving schema (consider Avro or Parquet instead), but a simple solution is to use the same schema for all sources and make the new fields optional/nullable:
import org.apache.spark.sql.types.{StructType, StructField, LongType}

val schema = StructType(Seq(
  StructField("A", StructType(Seq(
    StructField("B", LongType, true),
    StructField("D", LongType, true)
  )), true),
  StructField("C", LongType, true)))
You can pass the schema to the DataFrameReader like this:
val rddV1 = sc.parallelize(Seq("{ \"A\": {\"B\": 1 } }"))
val df1 = sqlContext.read.schema(schema).json(rddV1)
val rddV2 = sc.parallelize(Seq("{ \"A\": {\"B\": 1 }, \"C\": 2 }"))
val df2 = sqlContext.read.schema(schema).json(rddV2)
val rddV3 = sc.parallelize(Seq("{ \"A\": {\"B\": 1, \"D\": 3 }, \"C\": 2 }"))
val df3 = sqlContext.read.schema(schema).json(rddV3)
and you get a consistent structure regardless of the variant:
require(df1.schema == df2.schema && df2.schema == df3.schema)
with missing columns automatically set to null:
df1.printSchema
// root
// |-- A: struct (nullable = true)
// | |-- B: long (nullable = true)
// | |-- D: long (nullable = true)
// |-- C: long (nullable = true)
df1.show
// +--------+----+
// | A| C|
// +--------+----+
// |[1,null]|null|
// +--------+----+
df2.show
// +--------+---+
// | A| C|
// +--------+---+
// |[1,null]| 2|
// +--------+---+
df3.show
// +-----+---+
// | A| C|
// +-----+---+
// |[1,3]| 2|
// +-----+---+
Note:
This solution is data-source dependent. It may or may not work with other sources, and may even result in malformed records.
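To make the effect of reading every version against one superset schema concrete, here is a minimal Spark-free sketch in plain Java. It is only a model of the behaviour shown above, not how Spark implements it: the class `SupersetSchemaSketch` and the helpers `mapOf` and `conform` are hypothetical names, nested maps stand in for `StructType` and for parsed JSON records, and the string "long" is just a placeholder for a leaf type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, Spark-free model: nested maps stand in for StructType and JSON records.
class SupersetSchemaSketch {

    // Builds an insertion-ordered map from alternating keys and values.
    static Map<String, Object> mapOf(Object... keysAndValues) {
        Map<String, Object> m = new LinkedHashMap<>();
        for (int i = 0; i < keysAndValues.length; i += 2) {
            m.put((String) keysAndValues[i], keysAndValues[i + 1]);
        }
        return m;
    }

    // Projects a record onto the schema: every schema field appears in the result,
    // and fields absent from the record come out as null.
    @SuppressWarnings("unchecked")
    static Map<String, Object> conform(Map<String, Object> schema, Map<String, Object> record) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : schema.entrySet()) {
            Object value = record.get(field.getKey());
            boolean nestedField = field.getValue() instanceof Map;
            if (nestedField && value instanceof Map) {
                // Nested struct: recurse so a missing sub-field such as A.D becomes null.
                out.put(field.getKey(), conform((Map<String, Object>) field.getValue(),
                        (Map<String, Object>) value));
            } else if (nestedField) {
                // The whole struct is absent from the record.
                out.put(field.getKey(), null);
            } else {
                out.put(field.getKey(), value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Superset schema covering all three versions; "long" is only a placeholder.
        Map<String, Object> schema = mapOf("A", mapOf("B", "long", "D", "long"), "C", "long");

        Map<String, Object> v1 = mapOf("A", mapOf("B", 1L));
        Map<String, Object> v2 = mapOf("A", mapOf("B", 1L), "C", 2L);
        Map<String, Object> v3 = mapOf("A", mapOf("B", 1L, "D", 3L), "C", 2L);

        System.out.println(conform(schema, v1)); // {A={B=1, D=null}, C=null}
        System.out.println(conform(schema, v2)); // {A={B=1, D=null}, C=2}
        System.out.println(conform(schema, v3)); // {A={B=1, D=3}, C=2}
    }
}
```

All three results have the same set of keys, which mirrors why `df.select("A.D")` no longer fails for the older versions once the schema is supplied explicitly.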
Posted 2015-11-24 09:53:07
zero323 has already answered the question, but in Scala. Here is the same thing in Java.
import java.util.Arrays;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public void evolvingSchema() {
    String versionOne = "{ \"A\": {\"B\": 1 } }";
    String versionTwo = "{ \"A\": {\"B\": 1 }, \"C\": 2 }";
    String versionThree = "{ \"A\": {\"B\": 1, \"D\": 3 }, \"C\": 2 }";
    process(spark.getContext(), "1", versionOne);
    process(spark.getContext(), "2", versionTwo);
    process(spark.getContext(), "3", versionThree);
}

private static void process(JavaSparkContext sc, String version, String data) {
    // Superset schema: every field from all three versions, all nullable.
    StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("A",
            DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("B", DataTypes.LongType, true),
                DataTypes.createStructField("D", DataTypes.LongType, true))), true),
        DataTypes.createStructField("C", DataTypes.LongType, true)));
    SQLContext sqlContext = new SQLContext(sc);
    // Supplying the schema explicitly skips inference, so absent fields read as null.
    DataFrame df = sqlContext.read().schema(schema).json(sc.parallelize(Arrays.asList(data)));
    try {
        df.select("C").collect();
    } catch (Exception e) {
        System.out.println("Failed to C for " + version);
    }
    try {
        df.select("A.D").collect();
    } catch (Exception e) {
        System.out.println("Failed to A.D for " + version);
    }
}

https://stackoverflow.com/questions/33807145