I'm reading a CSV file with Pandas (it's two-column data) and then trying to convert it to a Spark DataFrame. The code for this is:
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(df)

The data file: print(df) gives the following:
Name Category
0 EDSJOBLIST apply at www.edsjoblist.com ['biotechnology', 'clinical', 'diagnostic', 'd...
1 Power Direct Marketing ['advertising', 'analytics', 'brand positionin...
2 CHA Hollywood Medical Center, L.P. ['general medical and surgical hospital', 'hea...
3 JING JING GOURMET [nan]
4 TRUE LIFE KINGDOM MINISTRIES ['religious organization']
5 fasterproms ['microsoft .net']
6 STEREO ZONE ['accessory', 'audio', 'car audio', 'chrome', ...
7 SAN FRANCISCO NEUROLOGICAL SOCIETY [nan]
8 Fl Advisors ['comprehensive financial planning', 'financia...
9 Fortunatus LLC ['bottle', 'bottling', 'charitable', 'dna', 'f...
10 TREADS LLC ['retail', 'wholesaling']

Can anyone help me?
Posted on 2018-07-03 17:32:03
Spark can have trouble with the object dtype. One potential workaround is to first convert everything to strings:
sdf = sqlCtx.createDataFrame(df.astype(str))

One consequence of this is that everything, including nan, will be converted to a string. You need to take care to handle those conversions correctly and cast the columns back to the appropriate types.
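To see why that cleanup step matters, here is a small pandas-only sketch (with made-up data in the same Name/Category shape as the question) showing that astype(str) turns a missing value into the literal string "nan":

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame mirroring the question's data.
df = pd.DataFrame({
    "Name": ["JING JING GOURMET", "STEREO ZONE"],
    "Category": [np.nan, "['accessory', 'audio']"],
})

str_df = df.astype(str)
# The NaN is now the 3-character string "nan", not a missing value,
# so Spark will happily ingest it as a plain string column.
print(repr(str_df.loc[0, "Category"]))  # 'nan'
```

This is exactly the value the answer later filters out with the != "nan" comparison.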
For example, if you have a column "colA" with float values, you can convert the string "nan" to null with something like:
from pyspark.sql.functions import col, when
sdf = sdf.withColumn("colA", when(col("colA") != "nan", col("colA").cast("float")))

https://stackoverflow.com/questions/51159672
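For readers without a running Spark cluster, the effect of that when(...).cast(...) expression can be illustrated in plain pandas (this is an analogy, not the Spark API; the sample values are made up):

```python
import pandas as pd

# Hypothetical column of stringified floats, as produced by astype(str).
col_a = pd.Series(["1.5", "nan", "2.0"], name="colA")

# Analogue of when(col("colA") != "nan", col("colA").cast("float")):
# values equal to the string "nan" become a real missing value,
# everything else is cast to float.
cleaned = col_a.where(col_a != "nan").astype(float)
print(cleaned.isna().tolist())  # [False, True, False]
```

In Spark, when() with no otherwise() likewise yields null for the unmatched rows, which is what turns the "nan" strings back into proper missing values.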