Q: How do I add this data to a database in ScraperWiki?

Stack Overflow user

Asked 2014-05-07 16:51:21

1 answer, 141 views, 0 followers, 0 votes

import scraperwiki
import urllib2, lxml.etree
url = 'http://eci.nic.in/eci_main/statisticalreports/SE_1998/StatisticalReport-DEL98.pdf'
pdfdata = urllib2.urlopen(url).read()
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.etree.fromstring(xmldata)
# how many pages in PDF
pages = list(root)
print "There are",len(pages),"pages"
#from page 86 to 107
for page in pages[86:107]:
    for el in page:
        data = {}
        if el.tag == "text":
            if int(el.attrib['left']) < 215: data = { 'Rank': el.text }
            elif int(el.attrib['left']) < 230: data['Name'] = el.text 
            elif int(el.attrib['left']) < 592: data['Sex'] = el.text
            elif int(el.attrib['left']) < 624: data['Party'] = el.text
            elif int(el.attrib['left']) < 750: data['Votes'] = el.text
            elif int(el.attrib['left']) < 801: data['Percentage'] = el.text
            print data

Now I would like to know how to save this data to a database with ScraperWiki. I have tried a few commands, such as

scraperwiki.sqlite.save(unique_keys=[], table_name='ecidata1998', data=data)

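But when I check the dataset, it does not contain the result I want. Is there something wrong with the code, or with the last statement? Please help. I am new to Python programming and to ScraperWiki.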

pdf

screen-scraping

scraperwiki

python

1 Answer

Stack Overflow user

Answered 2014-05-24 02:00:24

There are a few problems with your code.

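First, the conditions you use to pull the different fields out of the PDF need to be stricter and more precise. For example, if int(el.attrib['left']) < 215 picks up any text whose left position is less than 215 pixels, and that also matches other content on the PDF pages you are looking at, such as the text "Electors".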
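
To see why the loose threshold misfires, here is a minimal sketch on a hand-made XML fragment. The fragment, its coordinates, and the helper names loose_match and tight_match are illustrative assumptions, not values read from the real PDF. It uses the standard library's xml.etree.ElementTree and Python 3 syntax so that it runs without ScraperWiki or a network connection.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment: a heading at left=60, a rank at left=210
# and a name at left=222. These coordinates are made up for illustration.
SAMPLE_PAGE = """
<page number="1">
  <text top="50" left="60">Heading text</text>
  <text top="120" left="210">1</text>
  <text top="120" left="222">. Candidate A</text>
</page>
""".strip()


def loose_match(el):
    """The question's test: anything left of 215 is treated as a rank."""
    return int(el.attrib['left']) < 215


def tight_match(el):
    """A bounded test: only text that starts inside the rank column."""
    return 206 < int(el.attrib['left']) <= 214


page = ET.fromstring(SAMPLE_PAGE)
texts = [el for el in page if el.tag == 'text']

loose = [el.text for el in texts if loose_match(el)]
tight = [el.text for el in texts if tight_match(el)]

print(loose)  # the heading is wrongly picked up as a rank
print(tight)  # only the real rank remains
```

With the one-sided test, the heading lands in the Rank field; bounding the range on both sides keeps only the rank column.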

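Second, you need a way to check when you have all the data for a row and can move on to the next one. (You could try to extract the data row by row, but I found it easier to take each field in turn and start a new row once all the data for the current row has been collected.)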
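
The "fill a row, then start a new one" idea can be sketched on its own, away from the PDF. In this sketch the input is a flat list of (field, text) pairs standing in for the classified text elements; the sample values and the helper names blank_row and collect_rows are assumptions made for illustration, and the syntax is Python 3.

```python
FIELDS = ('Rank', 'Name', 'Sex', 'Party', 'Votes', 'Percentage')


def blank_row():
    """Return a row with every field present but unset."""
    return dict.fromkeys(FIELDS)


def collect_rows(cells):
    """Group a flat stream of (field, text) pairs into complete rows."""
    rows = []
    current = blank_row()
    for field, text in cells:
        if text:  # ignore empty cells instead of overwriting with None
            current[field] = text
        if all(current.values()):  # every field filled: the row is done
            rows.append(current)
            current = blank_row()
    return rows


# Made-up sample data: one complete row and one that lacks a percentage.
cells = [
    ('Rank', '1'), ('Name', 'Candidate A'), ('Sex', 'M'),
    ('Party', 'XYZ'), ('Votes', '1000'), ('Percentage', '60.00%'),
    ('Rank', '2'), ('Name', 'Candidate B'), ('Sex', 'F'),
    ('Party', 'ABC'), ('Votes', '500'),
]

rows = collect_rows(cells)
print(len(rows))
print(rows[0])
```

Only the complete row is emitted; the partial second row stays in the working dictionary and is never saved, which is what keeps half-filled records out of the database.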

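(As for why scraperwiki.sqlite.save is not working, it is probably because your data contains many empty values, but the data itself is not correct in any case.)

This works for me: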

import scraperwiki
import urllib2
import lxml.etree


def create_blank_row():
    """ Create an empty candidate data dictionary. """
    return {'Rank': None,
            'Name': None,
            'Sex': None,
            'Party': None,
            'Votes': None,
            'Percentage': None}


def row_is_filled(dictionary):
    """ Return True if all values of dictionary are filled; False if not. """
    for item in dictionary.values():
        if not item:
            return False
    return True


def main():
    url = ('http://eci.nic.in/eci_main/statisticalreports'
           '/SE_1998/StatisticalReport-DEL98.pdf')
    pdfdata = urllib2.urlopen(url).read()
    xmldata = scraperwiki.pdftoxml(pdfdata)
    root = lxml.etree.fromstring(xmldata)

    # how many pages in PDF
    pages = list(root)
    print "There are", len(pages), "pages"

    output_data = []
    candidate_data = create_blank_row()
    #from page 86 to 107
    for page in pages[86:107]:
        for el in page:
            if el.tag == "text":
                if 206 < int(el.attrib['left']) <= 214:
                    # There are some None values here which we want to ignore.
                    if el.text:
                        candidate_data['Rank'] = el.text

                if int(el.attrib['left']) == 222:
                    # Also removes ". " from start of names.
                    candidate_data['Name'] = el.text[2:]

                if int(el.attrib['left']) == 591:
                    candidate_data['Sex'] = el.text

                if int(el.attrib['left']) == 622:
                    candidate_data['Party'] = el.text

                if 725 < int(el.attrib['left']) <= 753:
                    candidate_data['Votes'] = el.text

                if 790 < int(el.attrib['left']) < 801:
                    candidate_data['Percentage'] = el.text

            if row_is_filled(candidate_data):
                output_data.append(candidate_data)
                candidate_data = create_blank_row()

    # Collect candidate data into a list, then add it to the SQL database.
    # Calls to this SQL write function are slow, so make as few as possible.
    scraperwiki.sqlite.save(unique_keys=['Rank', 'Name', 'Sex', 'Party', 
                                         'Votes'], 
                            table_name='ecidata1998',
                            data=output_data)

if __name__ == '__main__':
    main()
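
For readers without access to ScraperWiki, the effect of saving one batch with a set of key columns can be imitated with the standard library's sqlite3 module. This is a sketch under assumptions, not ScraperWiki's implementation: it assumes that a save with unique_keys replaces any existing row whose key columns match, and it models that with a UNIQUE constraint plus INSERT OR REPLACE. The helper save_rows and the sample rows are hypothetical, and the syntax is Python 3.

```python
import sqlite3


def save_rows(conn, rows):
    """Write all rows in one batch, replacing rows whose key columns match."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS ecidata1998 ('
        '"Rank" TEXT, "Name" TEXT, "Sex" TEXT, "Party" TEXT, '
        '"Votes" TEXT, "Percentage" TEXT, '
        'UNIQUE ("Rank", "Name", "Sex", "Party", "Votes"))'
    )
    conn.executemany(
        'INSERT OR REPLACE INTO ecidata1998 '
        'VALUES (:Rank, :Name, :Sex, :Party, :Votes, :Percentage)',
        rows,
    )
    conn.commit()


# Made-up sample rows in the same shape as the answer's output_data.
rows = [
    {'Rank': '1', 'Name': 'Candidate A', 'Sex': 'M',
     'Party': 'XYZ', 'Votes': '1000', 'Percentage': '60.00%'},
    {'Rank': '2', 'Name': 'Candidate B', 'Sex': 'F',
     'Party': 'ABC', 'Votes': '500', 'Percentage': '30.00%'},
]

conn = sqlite3.connect(':memory:')
save_rows(conn, rows)
save_rows(conn, rows)  # running again replaces rows instead of duplicating

count = conn.execute('SELECT COUNT(*) FROM ecidata1998').fetchone()[0]
print(count)
```

Because the key columns identify each candidate, re-running the scraper leaves two rows rather than four. With an empty key list, as in the question's call, nothing identifies a record, so no row could be matched and replaced in this model.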

0 votes

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/23513094
