The JSON file generated by the API looks like the one below, and it is not well-formed (the Component value contains the invalid escape sequence \P, so the record cannot be parsed). Can we use PySpark to handle the bad records, dropping them and loading only the good rows into a DataFrame?
{
"name": "PowerAmplifier",
"Component": "12uF Capacitor\n1/21Resistor\n3 Inductor In Henry\PowerAmplifier\n ",
"url": "https://www.onsemi.com/products/amplifiers-comparators/",
"image": "https://www.onsemi.com/products/amplifiers-comparators/",
"ThresholdTime": "48min",
"MFRDate": "2019-05-08",
"FallTime": "15Min",
"description": "PowerAmplifier"
}

Code:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local') \
    .appName('Fairchild') \
    .config('spark.executor.memory', '2g') \
    .getOrCreate()
df = spark.read.json("D:\\Fairchild\\component.json").select("name", "url", "image", "ThresholdTime", "MFRDate", "FallTime", "description")
#df.show()
df.registerTempTable("data")
spark.sql("select * from data").show()错误:
Traceback (most recent call last):
File "C:/Users/RAJEE/PycharmProjects/onsemi/test.py", line 24, in <module>
result = spark.sql("select * from data")
File "C:\spark\python\pyspark\sql\session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "C:\spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
File "C:\spark\python\pyspark\sql\utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException:

Please share your suggestions.
Posted on 2020-01-17 05:24:17
Starting with Spark 2.3, querying raw JSON/CSV files is disallowed when the referenced columns include only the internal corrupt-record column (_corrupt_record by default). I added two JSON records, one good and one bad, so that the DataFrame can be queried; by default, at least one good JSON record is required. You can also ignore all malformed records with option("mode", "DROPMALFORMED").

In your Python code that would be:
df = spark.read.option("mode", "DROPMALFORMED").json("D:\\Fairchild\\component.json").select("name", "url", "image", "ThresholdTime", "MFRDate", "FallTime", "description")

JSON data -
{"name": "PowerAmplifier","Component": "12uF Capacitor\n1/21Resistor\n3 Inductor In Henry\PowerAmplifier\n ","url": "https://www.onsemi.com/products/amplifiers-comparators/", "image": "https://www.onsemi.com/products/mplifiers-comparators/", "ThresholdTime": "48min", "MFRDate": "2019-05-08", "FallTime": "15Min", "description": "PowerAmplifier"}
{"name": "PowerAmplifier","Component": "good record","url": "https://www.onsemi.com/products/amplifiers-comparators/", "image": "https://www.onsemi.com/products/mplifiers-comparators/", "ThresholdTime": "48min", "MFRDate": "2019-05-08", "FallTime": "15Min", "description": "PowerAmplifier"}在不忽略格式错误的情况下读取它的代码。在这种情况下,您可以使用列"_corrupt_record“来过滤坏记录。
val j = spark.read.json("/Users/msayed2/Documents/temp/test.txt")
j.show()

Result:
+-----------+--------+----------+-------------+--------------------+--------------+--------------------+--------------+--------------------+
| Component|FallTime| MFRDate|ThresholdTime| _corrupt_record| description| image| name| url|
+-----------+--------+----------+-------------+--------------------+--------------+--------------------+--------------+--------------------+
| null| null| null| null|{"name": "PowerAm...| null| null| null| null|
|good record| 15Min|2019-05-08| 48min| null|PowerAmplifier|https://www.onsem...|PowerAmplifier|https://www.onsem...|
+-----------+--------+----------+-------------+--------------------+--------------+--------------------+--------------+--------------------+

You can simply ignore the malformed records like this -
val j = spark.read.option("mode", "DROPMALFORMED").json("/Users/msayed2/Documents/temp/test.txt")
j.show()

The result contains no malformed records:
+-----------+--------+----------+-------------+--------------+--------------------+--------------+--------------------+
| Component|FallTime| MFRDate|ThresholdTime| description| image| name| url|
+-----------+--------+----------+-------------+--------------+--------------------+--------------+--------------------+
|good record| 15Min|2019-05-08| 48min|PowerAmplifier|https://www.onsem...|PowerAmplifier|https://www.onsem...|
+-----------+--------+----------+-------------+--------------+--------------------+--------------+--------------------+

Source: https://stackoverflow.com/questions/59774527
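Putting it all together for the question's PySpark script, a minimal sketch (it assumes the file contains at least one well-formed record, as in the two-record sample above; createOrReplaceTempView is the current replacement for the deprecated registerTempTable):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local') \
    .appName('Fairchild') \
    .config('spark.executor.memory', '2g') \
    .getOrCreate()

# Drop unparseable rows at read time so only good records reach the query below.
df = (spark.read
      .option("mode", "DROPMALFORMED")
      .json("D:\\Fairchild\\component.json")
      .select("name", "url", "image", "ThresholdTime", "MFRDate", "FallTime", "description"))

df.createOrReplaceTempView("data")
spark.sql("select * from data").show()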