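I am scraping data from the website below across multiple pages with Beautiful Soup, and that works. Can I scrape multiple pages with pandas as well? Below is my code for scraping a single page; the URL links to other pages, such as http://www.example.org/whats-on/calendar?page=3.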
import pandas as pd
url = 'http://www.example.org/whats-on/calendar?page=3'
dframe = pd.read_html(url,header=0)
dframe[0]
dframe[0].to_csv('out.csv')
Answered 2017-11-09 20:27:24
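Simply loop over a range of page numbers and append each page's data frame to a list. Afterwards, concatenate the list into one large data frame. Another problem with your current code is header=0, which treats the first row as the header. The page's table has no column headers, however, so use header=None and then rename the columns.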
The code below covers pages 0 to 3. Raise the loop limit to cover more pages.
import pandas as pd
dfs = []
# PAGES 0 - 3 SCRAPE
url = 'http://www.lapl.org/whats-on/calendar?page={}'
for i in range(4):
    dframe = pd.read_html(url.format(i), header=None)[0]\
                 .rename(columns={0:'Date', 1:'Topic', 2:'Location',
                                  3:'People', 4:'Category'})
    dfs.append(dframe)
finaldf = pd.concat(dfs)
finaldf.to_csv('Output.csv')
Output
print(finaldf.head())
# Date Topic Location People Category
# 0 Thu, Nov 09, 201710:00am to 12:30pm California Healthier Living : A Chronic Diseas... West Los Angeles Regional Library Seniors Health
# 1 Thu, Nov 09, 201710:00am to 11:30am Introduction to Microsoft WordLearn the basics... North Hollywood Amelia Earhart Regional Library Adults, Job Seekers, Seniors Computer Class
# 2 Thu, Nov 09, 201711:00am Board of Library Commissioners Central Library Adults Meeting
# 3 Thu, Nov 09, 201712:00pm to 1:00pm Tech TryOutCentral Library LobbyDid you know t... Central Library Adults, Teens Computer Class
# 4 Thu, Nov 09, 201712:00pm to 1:30pm Taller de Tejido/ Crochet WorkshopLearn how to... Benjamin Franklin Branch Library Adults, Seniors, Spanish Speakers Arts and Crafts, En Español
Answered 2020-01-13 12:41:31
The code below loops over the pages in the given range and collects the table from each page into a single DataFrame.
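import pandas as pd

def get_from_website():
    # Collect one table per page; DataFrame.append was removed in pandas 2.0,
    # so gather the frames in a list and concatenate once at the end.
    frames = []
    for num in range(1, 6):
        website = 'https://weburl/?page=' + str(num)
        datalist = pd.read_html(website)
        frames.append(datalist[0])
    Sample = pd.concat(frames)
    Sample.columns = ['Field1', 'Field2', 'Field3', 'Field4', 'Field5', 'Field6', 'Time', 'Field7', 'Field8']
    return Sample
Source: https://stackoverflow.com/questions/47209301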
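Both answers follow the same pattern: read the first table from each page, collect the frames in a list, and concatenate once. Here is a minimal sketch of that pattern as a reusable function. The names scrape_pages, read_page, and fake_reader are hypothetical and do not appear in the answers above; read_page is only an injection point that defaults to pd.read_html, so the loop can be tried without network access. It also assumes every page's table has the same number of columns.

```python
import pandas as pd

COLUMNS = ['Date', 'Topic', 'Location', 'People', 'Category']


def scrape_pages(url_template, pages, columns, read_page=pd.read_html):
    """Read the first table from each page and stack them into one DataFrame.

    read_page is a hypothetical hook (defaults to pd.read_html) that must
    return a list of DataFrames for a given URL.
    """
    frames = []
    for page in pages:
        tables = read_page(url_template.format(page), header=None)
        frame = tables[0]
        frame.columns = columns  # raises ValueError if the column count differs
        frames.append(frame)
    # ignore_index=True renumbers rows 0..n-1 instead of repeating each page's index
    return pd.concat(frames, ignore_index=True)


def fake_reader(url, header=None):
    """Stand-in for pd.read_html: returns one single-row table per URL."""
    page = int(url.rsplit('=', 1)[1])
    row = ['date{}'.format(page), 'topic{}'.format(page), 'loc', 'people', 'cat']
    return [pd.DataFrame([row])]


result = scrape_pages('http://example.org/calendar?page={}', range(3),
                      COLUMNS, read_page=fake_reader)
print(result.shape)  # (3, 5)
```

For real pages, drop the read_page argument so that pd.read_html is used. Passing ignore_index=True is a design choice that differs from the first answer, where pd.concat(dfs) keeps each page's own 0-based row index, so the same index values repeat in the CSV.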