I'm trying to read a CSV file with Spark in a Jupyter notebook. So far I have:
spark = SparkSession.builder.master("local[4]").getOrCreate()
reviews_df = spark.read.option("header","true").csv("small.csv")
reviews_df.collect()
This is what reviews_df looks like:
[Row(reviewerID=u'A1YKOIHKQHB58W', asin=u'B0001VL0K2', overall=u'5'),
Row(reviewerID=u'A2YB0B3QOHEFR', asin=u'B000JJSRNY', overall=u'5'),
Row(reviewerID=u'AAI0092FR8V1W', asin=u'B0060MYKYY', overall=u'5'),
Row(reviewerID=u'A2TAPSNKK9AFSQ', asin=u'6303187218', overall=u'5'),
Row(reviewerID=u'A316JR2TQLQT5F', asin=u'6305364206', overall=u'5')...]
However, every row of the DataFrame includes the column names. How can I reformat the data so that it becomes:
[(u'A1YKOIHKQHB58W', u'B0001VL0K2', u'5'),
(u'A2YB0B3QOHEFR', u'B000JJSRNY', u'5')....]
Posted on 2017-05-07 23:27:51
A DataFrame always returns Row objects, which is why calling collect() on the DataFrame shows:
Row(reviewerID=u'A1YKOIHKQHB58W', asin=u'B0001VL0K2', overall=u'5')
To get what you want, you can do:
reviews_df.rdd.map(lambda row: (row.reviewerID, row.asin, row.overall)).collect()
This returns the values of each row as a list of tuples.
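As a variation on the answer above, here is a minimal sketch that avoids naming each column by hand. It relies on `pyspark.sql.Row` being a subclass of `tuple`, so `tuple(row)` keeps the values in column order and drops the field names. The helper names `rows_to_tuples` and `read_csv_as_tuples` are made up for illustration and are not part of Spark; the file name and master setting are taken from the question.

```python
def rows_to_tuples(rows):
    """Convert tuple-like rows (for example pyspark.sql.Row) to plain tuples.

    Hypothetical helper: Row subclasses tuple, so tuple(row) keeps the
    values in column order and discards the column names.
    """
    return [tuple(row) for row in rows]


def read_csv_as_tuples(path, master="local[4]"):
    """Read a CSV file that has a header row and return its rows as tuples.

    Hypothetical helper. PySpark is imported inside the function so that
    rows_to_tuples can be used on its own without Spark installed.
    """
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master(master).getOrCreate()
    reviews_df = spark.read.option("header", "true").csv(path)
    # collect() pulls every row to the driver, so this suits small files only.
    return rows_to_tuples(reviews_df.collect())
```

Calling `read_csv_as_tuples("small.csv")` should give a list of plain tuples like the one asked for in the question. If the conversion should run before collecting, `reviews_df.rdd.map(tuple).collect()` is an equivalent one-liner.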
https://stackoverflow.com/questions/43837606