首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >丢弃坏记录,只将好记录从pyspark中的json文件加载到dataframe

丢弃坏记录,只将好记录从pyspark中的json文件加载到dataframe
EN

Stack Overflow用户
提问于 2020-01-17 00:59:53
回答 1查看 869关注 0票数 0

API生成的json文件如下所示。JSON文件的格式不正确。我们是否可以使用pyspark处理坏的记录,丢弃并只将好的行加载到dataframe中。

代码语言:javascript
复制
{
"name": "PowerAmplifier",
"Component": "12uF Capacitor\n1/21Resistor\n3 Inductor In Henry\PowerAmplifier\n ",
"url": "https://www.onsemi.com/products/amplifiers-comparators/", 
"image": "https://www.onsemi.com/products/amplifiers-comparators/", 
"ThresholdTime": "48min", 
"MFRDate": "2019-05-08", 
"FallTime": "15Min", 
"description": "PowerAmplifier"
}

代码:

代码语言:javascript
复制
spark = SparkSession.builder \
      .master('local') \
      .appName('Fairchild') \
      .config('spark.executor.memory', '2g') \
      .getOrCreate()

df = sqlContext.read.json ("D:\\Fairchild\\component.json").select("name" , "url", "image","ThresholdTime", "MFRDate", "FallTime", "description")
#df.show()
df.registerTempTable("data")
spark.sql("select * from data").show()

错误:

代码语言:javascript
复制
Traceback (most recent call last):
  File "C:/Users/RAJEE/PycharmProjects/onsemi/test.py", line 24, in <module>
    result = spark.sql("select * from data")
  File "C:\spark\python\pyspark\sql\session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "C:\spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
  File "C:\spark\python\pyspark\sql\utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException:

请分享你的建议。

EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2020-01-17 05:24:17

从Spark 2.3开始,当引用的列只包含内部损坏的记录列时,不允许来自原始JSON/CSV文件的查询。我添加了两条json记录,一条是好的,一条是坏的,这样我就可以查询数据帧了。默认情况下,它至少需要一个好的json记录。您还可以使用option("mode", "DROPMALFORMED")忽略所有格式错误的记录。

您也可以在python代码中使用option("mode", "DROPMALFORMED")

代码语言:javascript
复制
sqlContext.read.option("mode", "DROPMALFORMED").json ("D:\\Fairchild\\component.json").select("name" , "url", "image","ThresholdTime", "MFRDate", "FallTime", "description")

JSON数据-

代码语言:javascript
复制
{"name": "PowerAmplifier","Component": "12uF Capacitor\n1/21Resistor\n3 Inductor In Henry\PowerAmplifier\n ","url": "https://www.onsemi.com/products/amplifiers-comparators/", "image": "https://www.onsemi.com/products/mplifiers-comparators/", "ThresholdTime": "48min", "MFRDate": "2019-05-08", "FallTime": "15Min", "description": "PowerAmplifier"}
{"name": "PowerAmplifier","Component": "good record","url": "https://www.onsemi.com/products/amplifiers-comparators/", "image": "https://www.onsemi.com/products/mplifiers-comparators/", "ThresholdTime": "48min", "MFRDate": "2019-05-08", "FallTime": "15Min", "description": "PowerAmplifier"}

在不忽略格式错误的情况下读取它的代码。在这种情况下,您可以使用列"_corrupt_record“来过滤坏记录。

代码语言:javascript
复制
    val j = spark.read.json("/Users/msayed2/Documents/temp/test.txt")
    j.show()

结果

代码语言:javascript
复制
+-----------+--------+----------+-------------+--------------------+--------------+--------------------+--------------+--------------------+
|  Component|FallTime|   MFRDate|ThresholdTime|     _corrupt_record|   description|               image|          name|                 url|
+-----------+--------+----------+-------------+--------------------+--------------+--------------------+--------------+--------------------+
|       null|    null|      null|         null|{"name": "PowerAm...|          null|                null|          null|                null|
|good record|   15Min|2019-05-08|        48min|                null|PowerAmplifier|https://www.onsem...|PowerAmplifier|https://www.onsem...|
+-----------+--------+----------+-------------+--------------------+--------------+--------------------+--------------+--------------------+

你完全可以忽略像这样的畸形-

代码语言:javascript
复制
    val j = spark.read.option("mode", "DROPMALFORMED").json("/Users/msayed2/Documents/temp/test.txt")
    j.show()

结果没有格式错误

代码语言:javascript
复制
+-----------+--------+----------+-------------+--------------+--------------------+--------------+--------------------+
|  Component|FallTime|   MFRDate|ThresholdTime|   description|               image|          name|                 url|
+-----------+--------+----------+-------------+--------------+--------------------+--------------+--------------------+
|good record|   15Min|2019-05-08|        48min|PowerAmplifier|https://www.onsem...|PowerAmplifier|https://www.onsem...|
+-----------+--------+----------+-------------+--------------+--------------------+--------------+--------------------+
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/59774527

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档