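I am learning Apache Spark. After reading the Spark tutorials carefully, I understand how to pass a Python function to Spark to process an RDD dataset. However, I still do not know how Spark works with methods defined inside a class. For example, my code looks like this: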

import numpy as np
import copy
from pyspark import SparkConf, SparkContext

class A():
    def __init__(self, n):
        self.num = n

class B(A):
    ### Copy the item of class A to B.
    def __init__(self, A):
        self.num = copy.deepcopy(A.num)

    ### Print out the item of B
    def display(self, s):
        print(s.num)
        return s

def main():
    ### Locally run an application "test" using Spark.
    conf = SparkConf().setAppName("test").setMaster("local[2]")

    ### Setup the Spark configuration.
    sc = SparkContext(conf = conf)

    ### "data" is a list to store a list of instances of class A.
    data = []
    for i in np.arange(5):
        x = A(i)
        data.append(x)

    ### "lines" distributes "data" across Spark as an RDD.
    lines = sc.parallelize(data)

    ### Create a list of instances of class B in parallel using
    ### Spark "map".
    temp = lines.map(B)

    ### Now I got the error when it runs the following code:
    ### NameError: global name 'display' is not defined.
    temp1 = temp.map(display)

if __name__ == "__main__":
    main()

In the code above, I use class B to build a list of instances in parallel with temp = lines.map(B). After that, I call temp1 = temp.map(display), because I want to print each item in that list of class B instances in parallel. But now this error shows up: NameError: global name 'display' is not defined. I would like to know how to fix this error while still using Spark for parallel computation. I would really appreciate any help.
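
The NameError can be reproduced without Spark, because display exists only as an attribute of class B and is never bound as a global name. The following is a minimal sketch, not the asker's code: it swaps Spark's map for Python's built-in map and uses a simplified B that takes a number directly (both are assumptions made for illustration).

```python
class B(object):
    # Simplified stand-in for the question's class B (assumption:
    # it takes the number directly instead of an instance of A).
    def __init__(self, num):
        self.num = num

    def display(self):
        print(self.num)
        return self


items = [B(i) for i in range(3)]

# A method lives in the class namespace, so the bare name is undefined here.
try:
    list(map(display, items))
except NameError as err:
    error_message = str(err)

# Reaching the method through the class or through each instance works.
via_class = list(map(B.display, items))
via_lambda = list(map(lambda b: b.display(), items))
```

A lambda of this shape can be passed to an RDD operation in place of the bare name.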
Posted on 2015-07-08 06:04:21
Structure
.
├── ab.py
└── main.py

main.py

import numpy as np
from pyspark import SparkConf, SparkContext
import os
from ab import A, B

def main():
    ### Locally run an application "test" using Spark.
    conf = SparkConf().setAppName("test").setMaster("local[2]")

    ### Setup the Spark configuration.
    sc = SparkContext(
        conf = conf, pyFiles=[
            os.path.join(os.path.abspath(os.path.dirname(__file__)), 'ab.py')]
    )

    data = []
    for i in np.arange(5):
        x = A(i)
        data.append(x)

    lines = sc.parallelize(data)
    temp = lines.map(B)
    temp.foreach(lambda x: x.display())

if __name__ == "__main__":
    main()

ab.py

import copy

class A():
    def __init__(self, n):
        self.num = n

class B(A):
    ### Copy the item of class A to B.
    def __init__(self, A):
        self.num = copy.deepcopy(A.num)

    ### Print out the item of B
    def display(self):
        print(self.num)

Comments:
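
- To print on the driver instead, something like this can be used:
  for x in rdd.sample(False, 0.001).collect(): x.display()
- The code uses foreach instead of map.
- The display method was changed. I am not sure what s should be in this case.

https://stackoverflow.com/questions/31279206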
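
A likely reason for moving A and B into ab.py and listing that file in pyFiles is that Spark serializes the objects in an RDD with pickle, and pickle stores an instance as a reference to its class (module name plus class name) together with its attributes, not the code of the class. The worker therefore has to be able to import the class by that name. The sketch below shows this with plain pickle and no Spark; the class mirrors A from the answer, and reading this as the answer's intent is an assumption.

```python
import pickle


class A(object):
    def __init__(self, n):
        self.num = n


payload = pickle.dumps(A(3))

# The payload names the defining module and the class ...
module_in_payload = A.__module__.encode("ascii") in payload
# ... but it does not carry the method definitions themselves.
code_in_payload = b"__init__" in payload

# Unpickling only works because this process can look the class up by name.
restored = pickle.loads(payload)

print(module_in_payload, code_in_payload, restored.num)
```

When a script is run directly, the module recorded in the payload is __main__, which on a Spark worker is not the user's script. Classes that live in an importable module such as ab.py avoid that problem.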