I'm new to Spark and I can't parallelize a collection. Here is my code:
from pyspark import SparkContext as sc
words = [
'Apache', 'Spark', 'is', 'an', 'open-source', 'cluster-computing',
'framework', 'Apache', 'Spark', 'open-source', 'Spark'
]
# Creates a RDD from a list of words
distributed_words = sc.parallelize(words)
distributed_words.count()
and I get:
TypeError: parallelize() missing 1 required positional argument: 'c'
Why?
Posted on 2020-05-20 01:23:49
You need to initialize a SparkContext: your import `from pyspark import SparkContext as sc` binds `sc` to the class itself, not to an instance, which is why `parallelize()` complains about a missing positional argument. From Spark 2 onward you can get a SparkContext from the SparkSession, and then parallelize the word list.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").master("local").getOrCreate()
sc = spark.sparkContext
words = [
'Apache', 'Spark', 'is', 'an', 'open-source', 'cluster-computing',
'framework', 'Apache', 'Spark', 'open-source', 'Spark'
]
distributed_words = sc.parallelize(words)
distributed_words.count()
#11
Source: https://stackoverflow.com/questions/61903455
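The error itself has nothing to do with Spark: calling an instance method on a class rather than an instance shifts every argument left by one, so `words` is consumed as `self` and the real first parameter (`c`) is never supplied. A minimal plain-Python sketch of the same mistake (the `Context` class below is a hypothetical stand-in, not part of pyspark):

```python
# Reproduce the asker's TypeError with an ordinary Python class.
class Context:
    def parallelize(self, c):
        """Stand-in for SparkContext.parallelize: `c` is the collection."""
        return list(c)

words = ['Apache', 'Spark', 'is']

# Correct: call the method on an instance; `self` is bound automatically.
ctx = Context()
print(ctx.parallelize(words))  # ['Apache', 'Spark', 'is']

# The asker's mistake: `Context` plays the role of the imported class
# (`from pyspark import SparkContext as sc`). Here `words` is bound to
# `self`, leaving `c` with no value.
try:
    Context.parallelize(words)
except TypeError as e:
    print(e)  # parallelize() missing 1 required positional argument: 'c'
```

This is also why the fix in the answer works: `spark.sparkContext` is a live SparkContext instance, so `sc.parallelize(words)` binds `self` correctly.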