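I have a file in the format shown below, and I need to parse it into a DataFrame with 7 columns. Can you help me with how to proceed? I am not familiar with PySpark, and the data uses both commas and pipes as delimiters.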
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
Posted on 2021-06-26 12:59:23
Here is my attempt. I think the tags should be a single array column rather than one column per tag, but I tried it anyway.
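from pyspark.sql import functions as f
df = spark.read.option("inferSchema", "true").csv("test.txt").toDF('id', 'title', 'tags')
# split() takes a regular expression, so the literal pipe is escaped (raw string avoids an invalid-escape warning).
df1 = df.withColumn('tags', f.split('tags', r'\|'))
df1.show(truncate=False)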
+---+------------------------+-------------------------------------------------+
|id |title                   |tags                                             |
+---+------------------------+-------------------------------------------------+
|1  |Toy Story (1995)        |[Adventure, Animation, Children, Comedy, Fantasy]|
|2  |Jumanji (1995)          |[Adventure, Children, Fantasy]                   |
|3  |Grumpier Old Men (1995) |[Comedy, Romance]                                |
|4  |Waiting to Exhale (1995)|[Comedy, Drama, Romance]                         |
+---+------------------------+-------------------------------------------------+
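The split pattern is a regular expression, where an unescaped `|` means alternation rather than a literal pipe. As a rough illustration outside Spark, the following plain-Python sketch applies the same two-step split to one sample line using the standard `re` module (Python 3.7 or later is assumed). The variable names are made up for this example, and Python's regex engine stands in for Spark's here, so treat it as an analogy and not as Spark's exact behavior.

```python
import re

line = "3,Grumpier Old Men (1995),Comedy|Romance"

# Split at the first two commas only; anything after them stays in the last field.
movie_id, title, tags = line.split(",", 2)

# Escaped pipe: matches the literal "|" character.
escaped = re.split(r"\|", tags)

# Unescaped pipe: an alternation of two empty patterns,
# so it does not split the string into the two tags.
unescaped = re.split("|", tags)

print(movie_id, title, escaped)
```

Comparing `escaped` with `unescaped` shows why the backslash matters for this pattern.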
df2 = df1
# Pull each array element into its own column; missing positions become null.
for i in range(0, 5):
    df2 = df2.withColumn('tag' + str(i), f.col('tags')[i])
df2.drop('tags').show(truncate=False)
+---+------------------------+---------+---------+--------+------+-------+
|id |title                   |tag0     |tag1     |tag2    |tag3  |tag4   |
+---+------------------------+---------+---------+--------+------+-------+
|1  |Toy Story (1995)        |Adventure|Animation|Children|Comedy|Fantasy|
|2  |Jumanji (1995)          |Adventure|Children |Fantasy |null  |null   |
|3  |Grumpier Old Men (1995) |Comedy   |Romance  |null    |null  |null   |
|4  |Waiting to Exhale (1995)|Comedy   |Drama    |Romance |null  |null   |
+---+------------------------+---------+---------+--------+------+-------+
https://stackoverflow.com/questions/68135950
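The loop above hardcodes five tag columns, which happens to match the longest row in the sample. As a minimal plain-Python sketch of the same reshaping, the width can instead be derived from the data, with shorter rows padded by `None` in the way the Spark output shows `null`. The names `rows`, `expand`, `header` and `wide_rows` are hypothetical and exist only for this illustration; it is not Spark code.

```python
# Sample rows after the tags have already been split into lists.
rows = [
    (1, "Toy Story (1995)", ["Adventure", "Animation", "Children", "Comedy", "Fantasy"]),
    (2, "Jumanji (1995)", ["Adventure", "Children", "Fantasy"]),
    (3, "Grumpier Old Men (1995)", ["Comedy", "Romance"]),
    (4, "Waiting to Exhale (1995)", ["Comedy", "Drama", "Romance"]),
]

# Derive the number of tag columns from the longest tag list instead of hardcoding 5.
width = max(len(tags) for _, _, tags in rows)


def expand(row, width):
    """Return (id, title, tag0, ..., tag{width-1}), padding missing tags with None."""
    movie_id, title, tags = row
    padded = tags + [None] * (width - len(tags))
    return (movie_id, title, *padded)


header = ["id", "title"] + ["tag" + str(i) for i in range(width)]
wide_rows = [expand(row, width) for row in rows]

print(header)
for wide_row in wide_rows:
    print(wide_row)
```

With this sample the header has 7 entries, matching the 7 columns asked for in the question. In PySpark, one possible way to get the same width would be to aggregate the maximum of `f.size('tags')` and use it as the loop bound, though that is not tested here.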