Scraping; trying to add classes to a DataFrame
Stack Overflow user
Asked 2020-01-05 22:01:16

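I am scraping the job title, company, location, and job summary of postings on Indeed. Some postings have no data in the company class, and I need help working out how to handle that missing information.

I can collect all of the relevant data like this: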

company = driver.find_elements_by_class_name("company")

and so on.

Next, I loop over the elements the driver returned and append their text values to a list:

joblist = []
for i in range(len(jobs)):
    joblist.append(jobs[i].text)

Finally, I add this information to a pandas DataFrame:

df["Job Title"] = joblist

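and so on.

I have now found that some postings do not list the company name under the expected class and put it in the job title instead. When the company value is missing, or a value for a posting is empty, how can I add an empty slot to the company list/DataFrame so that it lines up with the correct job posting? Thanks for your help.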

1 Answer

Stack Overflow user

Answered 2020-01-06 12:22:25

It seems you are using the wrong approach: you search the whole page for all companies, then separately for all titles, and so on.

company = driver.find_elements_by_class_name("company")
summary = driver.find_elements_by_class_name("summary")
location = driver.find_elements_by_class_name("location")
title = driver.find_elements_by_class_name("title")
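
To see why page-wide lists go wrong, here is a minimal sketch with made-up values (the titles and company names are hypothetical, not taken from Indeed). When one posting has no company element, the company list is shorter than the title list, and pairing by position attaches companies to the wrong postings:

```python
# Hypothetical scraped values: the second posting has no company element,
# so the page-wide company list is one item shorter than the title list.
titles = ["Data Analyst", "Acme Corp - Data Engineer", "ML Engineer"]
companies = ["Initech", "Globex"]

# Pairing by position shifts every company after the gap onto the wrong
# posting, and zip drops the last title because it stops at the shorter list.
rows = list(zip(titles, companies))
for title, company in rows:
    print(title, "->", company)

# The lengths differ, so the two lists cannot serve as columns of one table.
print(len(titles), len(companies))
```

Here "Globex" ends up next to the second title even though it belongs to the third posting, and nothing in the lists records where the gap was.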

Instead, you should search for the element that holds all the information for a single offer, and then search inside that offer.

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

# find_element(By...) works in Selenium 3 and 4;
# the find_element_by_* helpers were removed in Selenium 4.3
all_offers = driver.find_elements(By.CLASS_NAME, "info")

joblist = []

for offer in all_offers:
    # use `offer` instead of `driver` to search only inside this one offer
    # use `find_element` instead of `find_elements` to get a single value

    try:
        company = offer.find_element(By.CLASS_NAME, "company").text.strip()
    except NoSuchElementException:
        # this offer has no company element, so store a placeholder
        company = 'NAN'

    summary = offer.find_element(By.CLASS_NAME, "summary").text.strip()
    location = offer.find_element(By.CLASS_NAME, "location").text.strip()
    title = offer.find_element(By.CLASS_NAME, "title").text.strip()

    joblist.append([company, summary, location, title])
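
The same per-offer idea can be tried without a browser. The sketch below is an illustration only: it swaps Selenium for the standard library's `xml.etree.ElementTree`, and the markup, the `PAGE` constant, and the `field` helper are all hypothetical stand-ins for a real results page. Each row is built from one container, so a missing company becomes a placeholder in the right row.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a results page;
# the second offer has no "company" element.
PAGE = """
<div>
  <div class="info">
    <span class="title">Data Analyst</span>
    <span class="company">Initech</span>
  </div>
  <div class="info">
    <span class="title">Acme Corp - Data Engineer</span>
  </div>
  <div class="info">
    <span class="title">ML Engineer</span>
    <span class="company">Globex</span>
  </div>
</div>
"""


def field(offer, class_name, default="NAN"):
    """Return the stripped text of the first matching span, or a default."""
    node = offer.find(f".//span[@class='{class_name}']")
    if node is None or node.text is None:
        return default
    return node.text.strip()


root = ET.fromstring(PAGE)
joblist = []
for offer in root.findall(".//div[@class='info']"):
    # Every lookup is scoped to this one offer, so the row stays aligned.
    joblist.append([field(offer, "company"), field(offer, "title")])

for row in joblist:
    print(row)
```

Because every row has the same number of fields, a list like this can later be turned into a table with one column per field.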

Edit: a minimal working example using https://books.toscrape.com, a site created for learning scraping.

This page has no class "other", so NAN is added to every row of data.

import selenium.webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

driver = selenium.webdriver.Firefox()
driver.get('https://books.toscrape.com')

# find_element(By...) works in Selenium 3 and 4;
# the find_element_by_* helpers were removed in Selenium 4.3
all_items = driver.find_elements(By.CLASS_NAME, 'product_pod')

data = []

for item in all_items:
    # catch only the "element is missing" error so other bugs stay visible
    try:
        name = item.find_element(By.XPATH, './/h3/a').get_attribute('title')
    except NoSuchElementException:
        name = ''

    try:
        price = item.find_element(By.CLASS_NAME, 'price_color').text.strip()
    except NoSuchElementException:
        price = ''

    try:
        other = item.find_element(By.CLASS_NAME, 'other').text.strip()
    except NoSuchElementException:
        other = 'NAN'

    data.append([name, price, other])

driver.quit()

for row in data:
    print(row)

Result:

['A Light in the Attic', '£51.77', 'NAN']
['Tipping the Velvet', '£53.74', 'NAN']
['Soumission', '£50.10', 'NAN']
['Sharp Objects', '£47.82', 'NAN']
['Sapiens: A Brief History of Humankind', '£54.23', 'NAN']
['The Requiem Red', '£22.65', 'NAN']
['The Dirty Little Secrets of Getting Your Dream Job', '£33.34', 'NAN']
['The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', '£17.93', 'NAN']
['The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', '£22.60', 'NAN']
['The Black Maria', '£52.15', 'NAN']
['Starving Hearts (Triangular Trade Trilogy, #1)', '£13.99', 'NAN']
["Shakespeare's Sonnets", '£20.66', 'NAN']
['Set Me Free', '£17.46', 'NAN']
["Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", '£52.29', 'NAN']
['Rip it Up and Start Again', '£35.02', 'NAN']
['Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', '£57.25', 'NAN']
['Olio', '£23.88', 'NAN']
['Mesaerion: The Best Science Fiction Stories 1800-1849', '£37.59', 'NAN']
['Libertarianism for Beginners', '£51.33', 'NAN']
["It's Only the Himalayas", '£45.17', 'NAN']
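
The repeated try/except blocks in the example above could be folded into one small helper. This is only a sketch under stated assumptions: `text_or_default` and `FakeItem` are hypothetical names, and `FakeItem` is a stand-in for a Selenium element so that the code runs without a browser (it raises `KeyError` where Selenium would raise its own "no such element" error).

```python
def text_or_default(lookup, default="", errors=(Exception,)):
    """Run lookup() and return its stripped text, or default if it raises."""
    try:
        return lookup().strip()
    except errors:
        return default


class FakeItem:
    """Hypothetical stand-in for a Selenium element (no browser needed)."""

    def __init__(self, fields):
        self.fields = fields

    def text_of(self, class_name):
        # Raises KeyError when the class is missing from this item.
        return self.fields[class_name]


items = [
    FakeItem({"price_color": " 51.77 "}),
    FakeItem({"price_color": "53.74", "other": "x"}),
]

data = []
for item in items:
    # Each lambda is called immediately, so it sees the current `item`.
    price = text_or_default(lambda: item.text_of("price_color"), errors=(KeyError,))
    other = text_or_default(lambda: item.text_of("other"), "NAN", errors=(KeyError,))
    data.append([price, other])

print(data)
```

With a real driver, the lambda would wrap the element search from the loop above, and `errors` would name the exception that Selenium raises for a missing element.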
Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/59604584
