
Q: Problem writing to csv with BeautifulSoup and Python

Stack Overflow user

Asked on 2017-07-31 17:09:55

Answers: 1, Views: 48, Followers: 0, Votes: 0

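I'm having trouble writing scraped data to a csv file. The page loads and the first part of the script works, but writing to csv causes problems. Right now I'm trying to convert the scraped data to integers, because that worked well in my other projects. In this project, however, it seems to cause a problem.

The error I get is: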

ValueError: invalid literal for int() with base 10: '\nNotes To A Friend: The Experience\n'

My question is: how can I write the data to csv in a more sophisticated way?

Code:

    import urllib.request
    from bs4 import BeautifulSoup
    from selenium import webdriver
    import pandas as pd
    import time 
    from datetime import datetime
    from collections import OrderedDict
    import re

    browser = webdriver.Firefox()
    browser.get('https://www.kickstarter.com/discover?ref=nav')
    categories = browser.find_elements_by_class_name('category-container')

    category_links = []
    for category_link in categories:
        # Each item in the list is a tuple of the category's name and its link.
        category_links.append((str(category_link.find_element_by_class_name('f3').text),
                               category_link.find_element_by_class_name('bg-white').get_attribute('href')))


    scraped_data = []
    now = datetime.now()
    counter = 1

    for category in category_links:
        browser.get(category[1])
        browser.find_element_by_class_name('sentence-open').click()
        time.sleep(2)
        browser.find_element_by_id('category_filter').click()
        time.sleep(2)

        # Click through the sub-category filters; ids that do not exist are skipped.
        for i in range(27):
            try:
                time.sleep(2)
                browser.find_element_by_id('category_'+str(i)).click()
                time.sleep(2)
            except Exception:
                pass

        # Collect the link of every project listed for this category.
        projects = []
        for project_link in browser.find_elements_by_class_name('clamp-3'):
            projects.append(project_link.find_element_by_tag_name('a').get_attribute('href'))

        for counter, project in enumerate(projects):
            page1 = urllib.request.urlopen(projects[counter])
            soup1 = BeautifulSoup(page1, "lxml")
            page2 = urllib.request.urlopen(projects[counter].split('?')[0]+'/community')
            soup2 = BeautifulSoup(page2, "lxml")
            time.sleep(2)
            print(str(counter)+': '+project+'\nStatus: Started.')
            project_dict = OrderedDict()
            project_dict['Category'] = category[0]
            browser.get(project)
            # This int() call raises the ValueError quoted above: the text is a title, not a number.
            project_dict['Name'] = int(soup1.find(class_='type-24 type-28-sm type-38-md navy-700 medium mb3').text)

            project_dict['Home State'] = int(soup1.find(class_='nowrap navy-700 flex items-center medium type-12').text)

            try:
                project_dict['Backer State'] = int(soup2.find(class_='location-list-wrapper js-location-list-wrapper').text)
            except Exception:
                pass

            print('Status: Done.')
            counter+=1
            scraped_data.append(project_dict)

    later = datetime.now()
    diff = later - now

    print('The scraping took '+str(round(diff.seconds/60.0,2))+' minutes, and scraped '+str(len(scraped_data))+' projects.')

    df = pd.DataFrame(scraped_data)
    df.to_csv('kickstarter-data1.csv')

Tags: beautifulsoup, python, csv, selenium, web-scraping

1 Answer

Stack Overflow user

Answered on 2017-07-31 21:04:31

Besides dropping the integer conversion of the parsed text, a few more changes are needed.
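On the integer conversion itself: `int()` fails because the scraped text is a project title wrapped in newlines, not a number. A minimal sketch of one way to handle this is to keep text fields as cleaned strings and convert only values that really are numeric. The helper names `clean_text` and `parse_int_or_none` are hypothetical and not part of the original code, and the handling of thousands separators is an assumption about how counts might appear on the page.

```python
import re


def clean_text(raw):
    """Collapse runs of whitespace and trim both ends of scraped text."""
    return re.sub(r"\s+", " ", raw).strip()


def parse_int_or_none(raw):
    """Return an int if the cleaned text is purely numeric, otherwise None.

    Thousands separators such as "1,234" are tolerated (an assumption).
    """
    digits = clean_text(raw).replace(",", "")
    return int(digits) if re.fullmatch(r"-?\d+", digits) else None


# The string from the error message in the question.
title = clean_text("\nNotes To A Friend: The Experience\n")
print(title)
print(parse_int_or_none("\n1,234\n"))
print(parse_int_or_none(title))
```

In the question's loop, this would mean replacing `int(soup1.find(...).text)` with `clean_text(soup1.find(...).text)` for the name and location fields.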

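Initialize BeautifulSoup with html5lib as the parser:

    BeautifulSoup(page1, "html5lib")

Read the response before parsing it, so that the page content itself is passed to BeautifulSoup as the first argument: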

    response = urllib.request.urlopen(projects[counter])
    page1 = response.read()
    soup1 = BeautifulSoup(page1, "html5lib")
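Once the pages parse, the remaining step from the question is writing the records to csv. The sketch below uses the standard library's `csv.DictWriter` in place of the question's `pandas.DataFrame.to_csv`, mainly to show how records with a missing key (such as 'Backer State' when the `try` block fails) can still be written as rows with an empty cell. The function name `write_projects_csv` and the sample rows are made up for illustration and are not scraped data.

```python
import csv
import io

# Column order taken from the keys used in the question's project_dict.
FIELDS = ["Category", "Name", "Home State", "Backer State"]


def write_projects_csv(records, handle, fields=FIELDS):
    """Write a list of dicts as csv; keys missing from a record become empty cells."""
    writer = csv.DictWriter(handle, fieldnames=fields, restval="", extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# Made-up rows standing in for scraped_data; the second one lacks 'Backer State'.
sample = [
    {"Category": "Music", "Name": "Hypothetical Album",
     "Home State": "Example City, ST", "Backer State": "ST"},
    {"Category": "Art", "Name": "Hypothetical Mural",
     "Home State": "Elsewhere, ST"},
]

buffer = io.StringIO()
write_projects_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open('kickstarter-data1.csv', 'w', newline='', encoding='utf-8')`.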

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/45411714
