文章/答案/技术大牛

发布

社区首页 >问答首页 >Scrapy: TypeError:字符串索引必须是整数，而不是str？

问Scrapy: TypeError:字符串索引必须是整数，而不是str？
EN

Stack Overflow用户

提问于 2017-03-15 23:33:38

回答 1查看 1.5K关注 0票数 0

我写了一只蜘蛛，它从新闻网站上抓取数据：

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from items import CravlingItem

import re


class CountrySpider(CrawlSpider):
    name = 'Post_and_Parcel_Human_Resource'

    allowed_domains = ['postandparcel.info']
    start_urls = ['http://postandparcel.info/category/news/human-resources/']

    rules = (
        Rule(LinkExtractor(allow='',
                           restrict_xpaths=(
                               '//*[@id="page"]/div[4]/div[1]/div[1]/div[1]/h1/a',
                               '//*[@id="page"]/div[4]/div[1]/div[1]/div[2]/h1/a',
                               '//*[@id="page"]/div[4]/div[1]/div[1]/div[3]/h1/a'
                           )),
             callback='parse_item',
             follow=False),
    )

    def parse_item(self, response):
        i = CravlingItem()
        i['title'] = " ".join(response.xpath('//div[@class="cd_left_big"]/div/h1/text()')
                              .extract()).strip() or " "
        i['headline'] = self.clear_html(
            " ".join(response.xpath('//div[@class="cd_left_big"]/div//div/div[1]/p')
                                 .extract()).strip()) or " "
        i['text'] = self.clear_html(
            " ".join(response.xpath('//div[@class="cd_left_big"]/div//div/p').extract()).strip()) or " "
        i['url'] = response.url
        i['image'] = (" ".join(response.xpath('//*[@id="middle_column_container"]/div[2]/div/img/@src')
                              .extract()).strip()).replace('wp-content/', 'http://postandparcel.info/wp-content/') or " "
        i['author'] = " "
        # print("\n")
        # print(i)
        return i

    @staticmethod
    def clear_html(html):
        text = re.sub(r'<(style).*?</\1>(?s)|<[^>]*?>|\n|\t|\r', '', html)
        return text

我还在管道中编写了一段代码来细化提取的文本，下面是管道：

from scrapy.conf import settings
from scrapy import log
import pymongo
import json
import codecs
import re
class RefineDataPipeline(object):
    def process_item(self, item, spider):
        #In this section: the below edits will be applied to all scrapy crawlers.
    item['text'] =str( item['text'].encode("utf-8"))
    replacements ={"U.S.":" US ", " M ":"Million", "same as the title":"", " MMH Editorial ":"", " UPS ":"United Parcel Service", " UK ":" United Kingdom "," Penn ":" Pennsylvania ", " CIPS ":" Chartered Institute of Procurement and Supply ", " t ":" tonnes ", " Uti ":" UTI ", "EMEA":" Europe, Middle East and Africa ", " APEC ":" Asia-Pacific Economic Cooperation ", " m ":" million ", " Q4 ":" 4th quarter ", "LLC":"", "Ltd":"", "Inc":"", "Published text":" Original text "}


    allparen= re.findall('\(.+?\)',item['text'])
    for item in allparen:
        if item[1].isupper() and item[2].isupper():
            replacements[str(item)]=''
        elif item[1].islower() or item[2].islower():
            replacements[str(item)]=item[1:len(item)-1]
        else:
            try:
                val = int(item[1:len(item)-1])
                replacements[str(item)]= str(val)
            except ValueError:
                pass
    def multireplace(s, replacements):
        substrs = sorted(replacements, key=len, reverse=True)
        regexp = re.compile('|'.join(map(re.escape, substrs)))
            return regexp.sub(lambda match: replacements[match.group(0)],s)
    item['text'] = multireplace(item['text'], replacements)
    item['text'] = re.sub( '\s+', ' ', item['text'] ).strip()
    return item

但是，有一个巨大的问题阻止蜘蛛成功地抓取数据：

回溯(最近一次调用)：文件"/usr/lib/python2.7/dist-packages/twisted/internet/defer.py"，第588行，在_runCallbacks current.result =回调(current.result，*args，**kw) File "/home/hathout/Desktop/updataed portcalls/thomas/thas.管线. be“第41行中，在process_item item‘’text‘=multireplace(item’‘text’，置换) TypeError:字符串索引必须是整数，而不是str索引。

我真的不知道如何克服"TypeError: string索引必须是整数，而不是str“的错误。

lambda

scrapy

python

json

python-2.7

回答 1

Stack Overflow用户

回答已采纳

发布于 2017-03-16 00:06:50

简单回答:变量item是一个字符串

很长的答案:在这一节

allparen= re.findall('\(.+?\)',item['text'])
for item in allparen:
    ...

在allparen上循环，应该是字符串列表或空列表，并使用与循环变量相同的变量名item。因此，item是一个字符串，而不是dict/Item对象。对循环变量使用不同的名称，如：

for paren in allparen:
    if paren[1].isupper() and paren[2].isupper():
    ...

基本上，您在循环中使用相同的变量名覆盖了原始的item变量。

票数 3

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/42822705

复制

相似问题

问Scrapy: TypeError:字符串索引必须是整数，而不是str？
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Scrapy: TypeError:字符串索引必须是整数，而不是str？EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Scrapy: TypeError:字符串索引必须是整数，而不是str？
EN