
Scraping tables from Training.gov.au

Stack Overflow user
Asked 2019-06-18 15:13:25
Answers: 1 · Views: 78 · Followers: 0 · Votes: 0

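I am trying to automate some of my work. The website in question is training.gov.au, which has tables nested under specific pages, for example https://training.gov.au/Training/Details/BSBWHS402. What I really want is to be able to point at the module I want to use (BSBWHS402 in this case), iterate over specific tables nested on that page, and then reprocess those tables into a .csv or, ideally, a pre-formatted .doc.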

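To some extent I can get what I need out of code by butchering other people's work, but I can't get the result to look like the table on the site. I tried pasting it into a .csv and using delimiters, but that didn't work and obviously isn't real automation either.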

import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
website_url = requests.get('https://training.gov.au/Training/Details/BSBWHS402').text
soup = BeautifulSoup(website_url,'lxml')
tables = soup.findAll('table')
My_table = soup.find('Elements and Performance Criteria')
df = pd.read_html(str(tables))
results = (df[8].to_json(orient='records'))
print(results)

This gives me the following single line of output:

[{"0":"ELEMENT","1":"PERFORMANCE CRITERIA"},{"0":"Elements describe the essential outcomes.","1":"Performance criteria describe the performance needed to demonstrate achievement of the element."},{"0":"1 Assist with determining the legal framework for WHS in the workplace","1":"1.1 Access current WHS legislation and related documentation relevant to the organisation\u2019s operations 1.2 Use knowledge of the relationship between WHS Acts, regulations, codes of practice, standards and guidance material to assist with determining legal requirements in the workplace 1.3 Assist with identifying and confirming the duties, rights and obligations of individuals and parties as specified in legislation 1.4 Assist with seeking advice from legal advisers where necessary"},{"0":"2 Assist with providing advice on WHS compliance","1":"2.1 Assist with providing advice to individuals and parties about their legal duties, rights and obligations, and the location of relevant information in WHS legislation 2.2 Assist with providing advice to individuals and parties about the functions and powers of the WHS regulator and how they are exercised, and the objectives and principles underpinning WHS"},{"0":"3 Assist with WHS legislation compliance measures","1":"3.1 Assist with assessing how the workplace complies with relevant WHS legislation 3.2 Assist with determining the WHS training needs of individuals and parties, and with providing training to meet legal and other requirements 3.3 Assist with developing and implementing changes to workplace policies, procedures, processes and systems that will achieve compliance"}]

I'm not sure exactly how to use this, but I can at least see that it has assigned each value to the column it belongs in.

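I'm very open to criticism and ideas on how to make this better. I will create a UI for entering the module name, but that is a problem for future me. Thanks in advance.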


1 Answer

Stack Overflow user

Accepted answer

Posted 2019-06-18 15:23:08

Instead of

df[8].to_json

use

df[8].to_csv

and you will get what you want.
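For output that has already been saved as JSON records like the line shown in the question, a comparable CSV can be produced with the standard library alone. The following is only a sketch: `records_to_csv` is a hypothetical helper, the sample records are shortened from the question's output, and it assumes every record uses the numeric string keys ("0", "1", ...) that pandas emitted.

```python
import csv
import io
import json

# Two records in the shape shown in the question (keys "0" and "1"),
# shortened here for illustration.
records_json = (
    '[{"0":"ELEMENT","1":"PERFORMANCE CRITERIA"},'
    '{"0":"1 Assist with determining the legal framework for WHS in the workplace",'
    '"1":"1.1 Access current WHS legislation"}]'
)


def records_to_csv(json_text):
    """Convert a JSON list of {"0": ..., "1": ...} records to CSV text."""
    records = json.loads(json_text)
    if not records:
        return ""
    # Assumption: keys are numeric strings, so sort them numerically.
    columns = sorted(records[0], key=int)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for record in records:
        # Missing keys become empty cells.
        writer.writerow([record.get(column, "") for column in columns])
    return buffer.getvalue()


csv_text = records_to_csv(records_json)
print(csv_text)
```

Unlike `to_csv` with its default arguments, this writes no index column and no extra header row, because the records already carry the table's own header as their first entry.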

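To preserve the line breaks, you have to use another library, such as lxml, instead of pandas, because pd.read_html normalizes the cell content. See the related issue on the pandas GitHub.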
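If the text has already been flattened by pd.read_html, one possible workaround is to split it again on the criterion numbers. This is a heuristic sketch, not part of the original answer: `split_criteria` is a hypothetical helper, the sample string is shortened, and it assumes each criterion starts with a number of the form "2.1" and that such a number never appears in the middle of a sentence.

```python
import re

# Whitespace that is immediately followed by a criterion number such as "2.2 ".
# Assumption: this numbering only ever marks the start of a criterion.
_CRITERION_BOUNDARY = re.compile(r"\s+(?=\d+\.\d+\s)")


def split_criteria(flattened):
    """Split 'N.M text N.M text ...' into one string per criterion."""
    parts = _CRITERION_BOUNDARY.split(flattened)
    return [part.strip() for part in parts if part.strip()]


# Shortened sample in the style of the flattened pandas output.
flattened = (
    "2.1 Assist with providing advice to individuals "
    "2.2 Assist with providing advice about the regulator"
)
criteria = split_criteria(flattened)
print("\n".join(criteria))
```

Scraping the cells directly is the more reliable route, since this split would break on any sentence that happens to contain a number like "1.2".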

Here is an example with BeautifulSoup:

import csv

import requests
from bs4 import BeautifulSoup

website_url = requests.get('https://training.gov.au/Training/Details/BSBWHS402').text
soup = BeautifulSoup(website_url, 'lxml')
# The string argument is new in Beautiful Soup 4.4.0.
# In earlier versions it was called text.
table = soup.find("h2", string="Elements and Performance Criteria").find_next('table')

output_rows = []
for table_row in table.find_all('tr'):
    # Collect the text of every cell in this row.
    columns = table_row.find_all('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)

# newline='' lets the csv module control line endings, so cells that
# contain line breaks are written correctly.
with open('output.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(output_rows)
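
The same approach can be tried offline against a small HTML snippet. The sketch below is an illustration only: `SAMPLE_HTML` is made up (the one-`<p>`-per-criterion layout is an assumption about the real page), `table_rows` is a hypothetical helper, and it uses the built-in `html.parser` so that lxml is not required. It uses `get_text(separator="\n", strip=True)` so that each criterion in a cell ends up on its own line.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page; the layout is an assumption.
SAMPLE_HTML = """
<h2>Elements and Performance Criteria</h2>
<table>
  <tr><td>ELEMENT</td><td>PERFORMANCE CRITERIA</td></tr>
  <tr>
    <td><p>1 Assist with determining the legal framework</p></td>
    <td><p>1.1 Access current WHS legislation</p><p>1.2 Use knowledge of WHS Acts</p></td>
  </tr>
</table>
"""


def table_rows(html, heading):
    """Return the rows of the first table that follows the given <h2> heading."""
    soup = BeautifulSoup(html, "html.parser")
    h2 = soup.find("h2", string=heading)
    if h2 is None:
        raise ValueError(f"heading not found: {heading!r}")
    table = h2.find_next("table")
    if table is None:
        raise ValueError(f"no table after heading: {heading!r}")
    rows = []
    for tr in table.find_all("tr"):
        # Accept both <td> and <th> in case the header row uses <th>.
        cells = tr.find_all(["td", "th"])
        # separator="\n" puts each text fragment of a cell on its own line.
        rows.append([cell.get_text(separator="\n", strip=True) for cell in cells])
    return rows


rows = table_rows(SAMPLE_HTML, "Elements and Performance Criteria")

# Write to an in-memory buffer; the csv module quotes cells containing newlines.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Wrapping the lookup in a function that takes the heading as a parameter would also make it easier to loop over several tables on a page, which is what the question ultimately asks for.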
Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/56643603