I am trying to scrape scheduling data for my squadron from: data.aspx?sq=vt-9
I have figured out how to extract the data with BeautifulSoup using the following:
import urllib2
from urllib2 import urlopen
import bs4 as bs
url = 'https://www.cnatra.navy.mil/scheds/schedule_data.aspx?sq=vt-9'
html = urllib2.urlopen(url).read()
soup = bs.BeautifulSoup(html, 'lxml')
table = soup.find('table')
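print(table.text)
However, the table is hidden until a date is chosen (if it is not the current date) and the "View Schedule" button is pressed.
How can I modify my code to "press" the "View Schedule" button so that I can then scrape the data? Bonus points if the code can also select a date!
I attempted to use: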
import urllib2
from urllib2 import urlopen
import bs4 as bs
from selenium import webdriver
driver = webdriver.Chrome("/users/base/Downloads/chromedriver")
driver.get("https://www.cnatra.navy.mil/scheds/schedule_data.aspx?sq=vt-9")
button = driver.find_element_by_id('btnViewSched')
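button.click()
It successfully opens Chrome and "clicks" the button, but I cannot scrape from there because the address does not change.
Answered 2019-03-30 04:34:24
You can get the schedule using pure selenium: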
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')
driver.get("https://www.cnatra.navy.mil/scheds/schedule_data.aspx?sq=vt-9")
button = driver.find_element_by_id('btnViewSched')
button.click()
print(driver.find_element_by_id('dgEvents').text)
Output:
TYPE VT Brief EDT RTB Instructor Student Event Hrs Remarks Location
Flight VT-9 07:45 09:45 11:15 JARVIS, GRANT M [LT] LENNOX, KEVIN I [ENS] BI4101 1.5 2 HR BRIEF MASS BRIEF
Flight VT-9 07:45 09:45 11:15 MOYNAHAN, WILLIAM P [CDR] FINNERAN, MATTHEW P [1stLt] BI4101 1.5 2 HR BRIEF MASS BRIEF
Flight VT-9 07:45 12:15 13:45 JARVIS, GRANT M [LT] TAYLOR, ADAM R [1stLt] BI4101 1.5 2 HR BRIEF MASS BRIEF @ 0745 W/ JARVIS MEI OPS
Flight VT-9 07:45 12:15 13:45 MOYNAHAN, WILLIAM P [CDR] LOW, TRENTON G [ENS] BI4101 1.5 2 HR BRIEF MASS BRIEF @ 0745 W/ MOYNAHAN MEI OPS
Watch VT-9 00:00 14:00 ANDERSON, LAURA [LT] ODO (ON CALL) 14.0
Watch VT-9 00:00 14:00 ANDERSON, LAURA [LT] ODO (ON CALL) 14.0
Watch VT-9 00:00 23:59 ANDERSON, LAURA [LT] ODO (ON CALL) 24.0
Watch VT-9 00:00 23:59 ANDERSON, LAURA [LT] ODO (ON CALL) 24.0
Watch VT-9 07:00 19:00 STUY, JOHN [LTJG] DAY IWO 12.0
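Watch VT-9 19:00 07:00 STRACHAN, ALLYSON [LTJG] IWO 12.0
Answered 2019-03-30 04:39:47
As I read your question, you need to use selenium to scrape an .aspx page that requires input before it shows its data.
Read this article; it will help you scrape data from .aspx pages with selenium.
Answered 2019-03-30 08:49:17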
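For reference, here is a minimal sketch of what that approach might look like. The key point is that the address not changing does not matter: after the click, `driver.page_source` holds the updated DOM, so it can be parsed like any other HTML. The helper names (`TableRows`, `rows_as_dicts`, `fetch_schedule_html`) are made up for illustration, the element ids `btnViewSched` and `dgEvents` are taken from the code elsewhere in this thread, and the `find_element(By.ID, ...)` form is used because the `find_element_by_id` helper was removed in newer Selenium 4 releases. The parsing part uses only the standard library; BeautifulSoup would work just as well.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every row inside the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0      # > 0 while inside the target table
        self.rows = []      # one list of cell strings per <tr>
        self.cell = None    # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth or dict(attrs).get("id") == self.table_id:
                self.depth += 1
        elif self.depth and tag == "tr":
            self.rows.append([])
        elif self.depth and tag in ("td", "th"):
            self.cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag in ("td", "th") and self.cell is not None:
            if self.rows:
                # Join the pieces and collapse runs of whitespace.
                self.rows[-1].append(" ".join("".join(self.cell).split()))
            self.cell = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


def rows_as_dicts(html, table_id="dgEvents"):
    """Treat the first row as headers and map every later row onto them."""
    parser = TableRows(table_id)
    parser.feed(html)
    rows = [r for r in parser.rows if r]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def fetch_schedule_html(url):
    """Click 'View Schedule' in a real browser and return the updated DOM."""
    # Imported lazily so the parsing helpers work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes a Chrome driver can be located
    try:
        driver.get(url)
        driver.find_element(By.ID, "btnViewSched").click()
        # Assumption: the table only exists after the postback completes.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "dgEvents"))
        )
        return driver.page_source
    finally:
        driver.quit()
```

Usage would be something like `rows_as_dicts(fetch_schedule_html(url))`, with `url` being the schedule page address from the question.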
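When "View Schedule" is clicked, the browser sends a request to the same URL, but as a POST carrying the form data btnViewSched=View Schedule together with the page's tokens (__VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION). The code below collects the table data as a list of dicts: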
import requests
from bs4 import BeautifulSoup
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/73.0.3683.86 Safari/537.36',
    'DNT': '1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,'
              'application/signed-exchange;v=b3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'ru,en-US;q=0.9,en;q=0.8,tr;q=0.7',
}
response = requests.get('https://www.cnatra.navy.mil/scheds/schedule_data.aspx?sq=vt-9', headers=headers)
assert response.ok
page = BeautifulSoup(response.text, "lxml")
# get __VIEWSTATE, __EVENTVALIDATION and __VIEWSTATEGENERATOR for further requests
__VIEWSTATE = page.find("input", attrs={"id": "__VIEWSTATE"}).attrs["value"]
__EVENTVALIDATION = page.find("input", attrs={"id": "__EVENTVALIDATION"}).attrs["value"]
__VIEWSTATEGENERATOR = page.find("input", attrs={"id": "__VIEWSTATEGENERATOR"}).attrs["value"]
# View Schedule click set here
data = {
    '__EVENTTARGET': '',
    '__EVENTARGUMENT': '',
    '__VIEWSTATE': __VIEWSTATE,
    '__VIEWSTATEGENERATOR': __VIEWSTATEGENERATOR,
    '__EVENTVALIDATION': __EVENTVALIDATION,
    'btnViewSched': 'View Schedule',
    'txtNameSearch': ''
}
# request with params
response = requests.post('https://www.cnatra.navy.mil/scheds/schedule_data.aspx?sq=vt-9', headers=headers, data=data)
assert response.ok
page = BeautifulSoup(response.text, "lxml")
# get the table headers, to be used as keys in each result dict
table_headers = [td.text.strip() for td in page.select("#dgEvents tr:first-child td")]
# get all rows, without table headers
table_rows = page.select("#dgEvents tr:not(:first-child)")
result = []
for row in table_rows:
    table_columns = row.find_all("td")
    # build a dict for this row, keyed by the table headers
    row_result = {}
    for i in range(0, len(table_headers)):
        row_result[table_headers[i]] = table_columns[i].text.strip()
    # add row_result to result list
    result.append(row_result)
for r in result:
    print(r)
print("the end")
Sample output:
{'TYPE': 'Flight', 'VT': 'VT-9', 'Brief': '07:45', 'EDT': '09:45', 'RTB': '11:15', 'Instructor': 'JARVIS, GRANT M [LT]', 'Student': 'LENNOX, KEVIN I [ENS]', 'Event': 'BI4101', 'Hrs': '1.5', 'Remarks': '2 HR BRIEF MASS BRIEF', 'Location': ''}
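The token step can also be written generically, so that every hidden field on the form is carried over without naming each one. This is only a sketch using the standard library: `FormInputs`, `build_postback` and the `SAMPLE` markup (with its placeholder token values) are invented for illustration and are not copied from the real page.

```python
from html.parser import HTMLParser


class FormInputs(HTMLParser):
    """Collect name -> value for every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.hidden[attrs["name"]] = attrs.get("value") or ""


def build_postback(html, button_name, button_value, extra=None):
    """Return the form data that imitates pressing one submit button."""
    parser = FormInputs()
    parser.feed(html)
    payload = dict(parser.hidden)        # __VIEWSTATE and the other tokens
    payload[button_name] = button_value  # the button that was "pressed"
    payload.update(extra or {})          # any visible fields to send along
    return payload


# Hypothetical markup standing in for the GET response; values are placeholders.
SAMPLE = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="DEADBEEF" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="txtNameSearch" />
  <input type="submit" name="btnViewSched" value="View Schedule" />
</form>
"""

payload = build_postback(SAMPLE, "btnViewSched", "View Schedule",
                         {"txtNameSearch": ""})
print(payload)
```

The resulting dict could be passed as `data=` to `requests.post` in place of the hand-built one. If the page exposes a date input, its value could presumably be sent the same way through `extra`, but the name of that field is not shown in this thread, so it is left out here.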
https://stackoverflow.com/questions/55428119