I am trying to run the mrjob example from an IPython notebook:
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

Then I run it with this code:
mr_job = MRWordFrequencyCount(args=["testfile.txt"])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value

and I get this error:

TypeError: <module '__main__' (built-in)> is a built-in class

Is there a way to run mrjob from an IPython notebook?
Posted on 2015-10-27 00:26:22
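I haven't found the "perfect way" yet, but one thing you can do is use the %%file magic in a notebook cell, which writes the contents of that cell to a file: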
%%file wordcount.py
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

Then have mrjob run that file from a later cell:
import wordcount
reload(wordcount)

mr_job = wordcount.MRWordFrequencyCount(args=['example.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value

Note that I named my file wordcount.py and import the class MRWordFrequencyCount from the wordcount module; the file name and the module name must match. Also, Python caches imported modules, so when you change wordcount.py, IPython does not reload the module and keeps using the old cached copy. That is why I put the reload() call in.
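The caching behaviour described above can be seen without mrjob at all. Here is a minimal sketch, assuming Python 3, where the builtin reload() used in the cell above has moved to importlib.reload(). The module name reload_demo_module and the helper functions are made up for illustration; they are not part of mrjob or IPython.

```python
import importlib
import sys
import tempfile
import os

# Hypothetical module name, chosen only for this demonstration.
MODULE_NAME = "reload_demo_module"


def write_module(directory, version):
    """Write a tiny module whose VERSION constant we can change on disk."""
    path = os.path.join(directory, MODULE_NAME + ".py")
    with open(path, "w") as handle:
        handle.write("VERSION = {!r}\n".format(version))


def demo():
    """Return VERSION as seen after import, after an edit, and after reload."""
    with tempfile.TemporaryDirectory() as directory:
        sys.path.insert(0, directory)
        try:
            write_module(directory, "v1")
            importlib.invalidate_caches()
            module = importlib.import_module(MODULE_NAME)
            after_import = module.VERSION

            # Edit the file on disk. The new text has a different length, so
            # bytecode cached for the first version is not reused even if
            # both writes happen within the same second.
            write_module(directory, "version 2")

            # Importing again returns the cached object from sys.modules.
            after_edit = importlib.import_module(MODULE_NAME).VERSION

            # reload() re-executes the source and updates the module in place.
            importlib.reload(module)
            after_reload = module.VERSION
            return after_import, after_edit, after_reload
        finally:
            sys.path.remove(directory)
            sys.modules.pop(MODULE_NAME, None)


if __name__ == "__main__":
    print(demo())
```

The second import still sees the old value, and only the explicit reload picks up the edited file, which is the situation you are in every time you rerun the %%file cell.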
Reference: https://groups.google.com/d/msg/mrjob/CfdAgcEaC-I/8XfJPXCjTvQJ
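As for the error itself: its wording matches the message that Python's inspect.getfile() produces for a class whose module has no file behind it. A plausible reading, and this is an assumption rather than something confirmed from the mrjob source, is that mrjob asks Python which file defines the job class, and a class typed into a notebook cell lives in __main__ with no file. The sketch below reproduces that situation with the standard library only; fake_notebook_main and NotebookJob are hypothetical stand-ins.

```python
import inspect
import json
import sys
import types

# Stand-in for the notebook's __main__ module: a module object that was never
# loaded from a file and so has no __file__ attribute. The name is hypothetical.
fake_main = types.ModuleType("fake_notebook_main")
sys.modules[fake_main.__name__] = fake_main


class NotebookJob(object):
    """Pretend this class was typed into a notebook cell."""


# Make the class claim to live in the file-less module.
NotebookJob.__module__ = fake_main.__name__


def describe_source(cls):
    """Return the file that defines cls, or the error that inspect raises."""
    try:
        return inspect.getfile(cls)
    except (TypeError, OSError) as error:
        return "{}: {}".format(type(error).__name__, error)


if __name__ == "__main__":
    # No backing file, so inspect raises instead of returning a path.
    print(describe_source(NotebookJob))
    # A class imported from a real module resolves to a path on disk.
    print(describe_source(json.JSONDecoder))
```

If that reading is right, it would also explain why writing the class to wordcount.py helps: once the class is imported from a real module, there is a file to point at.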
Update (shorter)
For a shorter second notebook cell, you can run the mrjob job by invoking the shell from within the notebook:
! python mrjob.py shakespeare.txt

Reference: http://jupyter.cs.brynmawr.edu/hub/dblank/public/Jupyter%20Magics.ipynb
https://stackoverflow.com/questions/24701101