
Using a Common Crawl location as input to Amazon EMR with mrjob (Python)

Stack Overflow user
Asked 2015-09-27 19:47:31
1 answer · 373 views · 0 followers · 2 votes

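It has only been a few days since I started using mrjob, and I have tried some low- and medium-level tasks. Now I am stuck on using a Common Crawl (referred to as CC from here on) location as input to EMR with Python mrjob.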

My config file looks like this:

runners:
  emr:
    aws_access_key_id: <AWS Access Key>
    aws_secret_access_key: <AWS Secret Access Key>
    aws_region: us-east-1
    ec2_key_pair: cslab
    ec2_key_pair_file: ~/cslab.pem
    ec2_instance_type: m1.small
    num_ec2_instances: 5
  local:
    base_tmp_dir: /tmp

In short: I am trying to count the number of words in each web page of a site.

In detail, my code is below:

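import urlparse  # Python 2 module

import boto
import warc
from boto.s3.key import Key
from gzipstream import GzipStreamFile
from mrjob.job import MRJob


class MRcount(MRJob):
    # ...

    def mapper(self, _, s3_path):
        # Parsed but not used below: the bucket name is hard-coded.
        s3_url_parsed = urlparse.urlparse(s3_path)
        bucket_name = s3_url_parsed.netloc
        key_path = s3_url_parsed.path[1:]
        conn = boto.connect_s3()
        bucket = conn.get_bucket('aws-publicdatasets', validate=False)
        key = Key(bucket, s3_path)
        # Assumption: the elided step streams the gzipped WET file as WARC
        # records (GzipStreamFile comes from the gzipstream package).
        for record in warc.WARCFile(fileobj=GzipStreamFile(key)):
            webpage_text = record.payload.read()
            yield record.header['warc-target-uri'], len(webpage_text.split())


if __name__ == '__main__':
    MRcount.run()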
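
The counting step of the mapper can be checked locally, without S3 or EMR. This is a minimal sketch, assuming each record has already been reduced to a (target URI, text) pair; `count_words` and `map_records` are hypothetical helper names, not part of mrjob or warc:

```python
def count_words(text):
    """Count whitespace-separated tokens, as len(text.split()) does in the mapper."""
    return len(text.split())


def map_records(records):
    """Yield (uri, word_count) for an iterable of (uri, text) pairs."""
    for uri, text in records:
        yield uri, count_words(text)


# Hypothetical sample records standing in for WET file contents.
sample = [
    ("http://example.com/", "hello common crawl"),
    ("http://example.org/", ""),
]
print(list(map_records(sample)))
```

If this behaves as expected, any remaining failures are more likely to come from the S3 access or the EMR configuration than from the counting logic.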

Everything seemed fine up to this point, until I tried to run it.

Command:

$ python mr_crawl.py -r emr s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/wet.paths.gz

Error:

boto.exception.S3ResponseError: S3ResponseError: 301 Moved Permanently
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message><RequestId>06660583263444FC</RequestId><Bucket>smarkets-db</Bucket><HostId>TCZJTKZ8wo8V1h0xjkOI6grojs/r9IBkhMOcvolXv06QEtxTX89M55aLTPGOo/ht</HostId><Endpoint>eu-west-bucket.s3.amazonaws.com</Endpoint></Error>

I thought this was caused by the region in my config file, so I removed it, but then I got a new error.
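
The XML body of the 301 response names the bucket that was requested and the endpoint it must be addressed through, which can help confirm whether the region setting is involved. This is a minimal sketch using only the standard library; `redirect_details` is a hypothetical helper name, and the sample string is an abbreviated, well-formed copy of the error above:

```python
import xml.etree.ElementTree as ET


def redirect_details(error_xml):
    """Return the child elements of an S3 <Error> document as a dict."""
    root = ET.fromstring(error_xml)
    return {child.tag: child.text for child in root}


# Abbreviated copy of the error body (RequestId and HostId omitted).
sample = (
    "<Error><Code>PermanentRedirect</Code>"
    "<Message>The bucket you are attempting to access must be addressed "
    "using the specified endpoint.</Message>"
    "<Bucket>smarkets-db</Bucket>"
    "<Endpoint>eu-west-bucket.s3.amazonaws.com</Endpoint></Error>"
)

details = redirect_details(sample)
print(details["Bucket"], details["Endpoint"])
```

The bucket reported in this error is smarkets-db rather than aws-publicdatasets, so the redirect may not come from the Common Crawl bucket at all.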

My new config file:

runners:
  emr:
    aws_access_key_id: <AWS Access Key>
    aws_secret_access_key: <AWS Secret Access Key>
    ec2_key_pair: cslab
    ec2_key_pair_file: ~/cslab.pem
    ec2_instance_type: m1.small
    num_ec2_instances: 5
  local:
    base_tmp_dir: /tmp

I get the following SSH error:

using configs in /etc/mrjob.conf
using existing scratch bucket mrjob-4db6342a70e021ad
using s3://mrjob-4db6342a70e021ad/tmp/ as our scratch dir on S3
creating tmp directory /tmp/word_count.20140603.181541.006786
writing master bootstrap script to /tmp/word_count.20140603.181541.006786/b.py
Copying non-input files into s3://mrjob-4db6342a70e021ad/tmp/word_count.matthew.20140603.181541.006786/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-3DCN7LULSRILW
Created new job flow j-3DCN7LULSRILW
Job on job flow j-3DCN7LULSRILW failed with status FAILED: The given SSH key name was invalid
Logs are in s3://mrjob-4db6342a70e021ad/tmp/logs/j-3DCN7LULSRILW/
Scanning S3 logs for probable cause of failure
Waiting 5.0s for S3 eventual consistency
Terminating job flow: j-3DCN7LULSRILW
Traceback (most recent call last):
  File "word_count.py", line 16, in <module>
    MRcount.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 494, in run
    mr_job.execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 458, in run
    self._run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 809, in _run
    self._wait_for_job_to_complete()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 1599, in _wait_for_job_to_complete
    raise Exception(msg)
Exception: Job on job flow j-3DCN7LULSRILW failed with status FAILED: The given SSH key name was invalid

Thanks,

1 Answer

Stack Overflow user

Accepted answer

Answered 2016-04-30 17:41:54

In your mrjob config, you need to set ec2_key_pair to one of the names in the key pair list in the EC2 console.

runners:
  emr:
    aws_access_key_id: <AWS Access Key>
    aws_secret_access_key: <AWS Secret Access Key>
    ec2_key_pair: cslab # <---- this name doesn't exist in AWS, so AWS doesn't know which public key to use
    ec2_key_pair_file: ~/cslab.pem # <-- you can comment this out if you don't need to log in to the machine via SSH
    ec2_instance_type: m1.small
    num_ec2_instances: 5
  local:
    base_tmp_dir: /tmp

To see the list of key pairs stored in AWS, see the AWS documentation on EC2 key pairs.
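
The check described in this answer can also be scripted. This is a minimal sketch, assuming boto 2 (the library already used in the question) and working AWS credentials; `key_pair_is_missing` and `list_key_pair_names` are hypothetical helper names. Key pairs are registered per region, so the lookup should target the same region the EMR job runs in:

```python
def key_pair_is_missing(configured_name, available_names):
    """Return True when the configured key pair name is not among the available ones."""
    return configured_name not in set(available_names)


def list_key_pair_names(region):
    """Fetch the key pair names registered in one region (needs boto 2 and credentials)."""
    import boto.ec2  # imported lazily so the pure check above works without boto

    conn = boto.ec2.connect_to_region(region)
    return [key_pair.name for key_pair in conn.get_all_key_pairs()]


# Offline demonstration with a made-up list of names; no AWS call is made here.
print(key_pair_is_missing("cslab", ["some-other-key"]))
```

Against a real account, the second argument would come from something like `list_key_pair_names("us-east-1")`, with the region name being whichever one the job flow is created in.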

Votes: 1
Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/32812152
