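I am currently trying to get more involved with programming and Python. For a small project, I want to build a web crawler for a website. So I read up on Scrapy and Beautiful Soup.
So far, so good...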
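Website structure
It is a simple website with a drop-down menu to choose from. If I select one of the entries, the site's URL does not change; only the underlying HTML changes. After selecting a value, you get a result table with several columns/rows, in the following form: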
    <div id="result">
      <table class="table">
        <thead>
          <tr>
            <th>A</th>
            <th>B</th>
            <th>C</th>
            <th>D</th>
            <th>E</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td><b>...</b></td>
            <td>...</td>
            <td>...</td>
            <td>...</td>
            <td>...</td>
          </tr>
          ...
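    more follows here...
Task
I want to scrape all the data from the result table for every drop-down entry. So far, I have only extracted the drop-down values.
Problem
How can I actively select a value in the drop-down menu so that the page's HTML changes and shows the table I want to extract data from? After reading the Scrapy and Beautiful Soup documentation, I still don't understand this part.
Current code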
    from bs4 import BeautifulSoup
    import requests

    BASE_URL = "http://routerpasswords.com/"


    def get_router_types(url):
        # Fetch the page and collect the value of every <option> in the drop-down.
        r = requests.get(url)
        html_content = r.content
        # Name the parser explicitly so bs4 does not have to guess one.
        soup = BeautifulSoup(html_content, "html.parser")
        print("option values: \n")
        option_values = soup.find_all("option")
        print(option_values)
        print(" \n")
        print("router types: \n")
        router_types = [option.get('value') for option in option_values]
        print(router_types)
        return router_types


    # Stuck here!
    # ...
    def get_passwords():
        router_types = get_router_types(BASE_URL)
        passwords = []
        for types in router_types:
            # print(types)
            pass  # placeholder: this is the part I do not know how to write
        return passwords


    def main():
        get_router_types(BASE_URL)


    if __name__ == "__main__":
        main()
Posted on 2017-03-10 09:13:14
Every time you click the button, the browser sends data to the server. You can find the POST data in Chrome DevTools (F12).

You can use requests to simulate this POST:
    In [27]: data = {'findpass':'1',
        ...:         'router':'Belkin',
        ...:         'findpassword':'Find Password'}

    In [28]: r = requests.post('http://routerpasswords.com/', data=data)
Posted on 2017-03-10 09:40:23
First, I collect all the router names in a list,
then perform a new request for each router with the correct POST params (see get_passwords_via_name):
    from bs4 import BeautifulSoup
    import requests

    BASE_URL = "http://routerpasswords.com/"


    def get_router_types(url):
        # Fetch the page and return every <option> value plus the response itself.
        r = requests.get(url)
        html_content = r.content
        soup = BeautifulSoup(html_content, "html.parser")
        print("option values: \n")
        option_values = soup.find_all("option")
        print(option_values)
        print(" \n")
        print("router types: \n")
        router_types = [option.get('value') for option in option_values]
        return router_types, r


    def get_passwords_via_name(router_name, rcookie):
        # rcookie (the initial response) is accepted but not used yet.
        # requests form-encodes the values itself, so pass a plain space here;
        # a literal "+" in the value would be sent as %2B.
        data = {"findpass": "1", "router": router_name, "findpassword": "Find Password"}
        print(data)
        c = requests.post(BASE_URL, data=data)
        print(c.url)
        html_content = c.content
        print(c.status_code)
        soup = BeautifulSoup(html_content, "html.parser")
        # The result table lives inside <div id="result">.
        return soup.find("div", {"id": "result"})


    def main():
        rlist, r = get_router_types(BASE_URL)
        for i in rlist:
            print("debug")
            print(get_passwords_via_name(i, r))


    if __name__ == "__main__":
        main()
The curl way:
    curl 'http://routerpasswords.com/' --data 'findpass=1&router=ZyXEL&findpassword=Find+Password'
https://stackoverflow.com/questions/42714061
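The dict passed to requests and the curl `--data` string describe the same form body. Here is a small sketch of that relationship using only the standard library; the field names and values are the ones from the answers above. It assumes `urllib.parse.urlencode` is a fair stand-in for the form encoding that requests applies to a `data` dict. Whether the server actually checks the `findpassword` value is not something the thread confirms.

```python
from urllib.parse import urlencode, parse_qs

# Form fields as shown in the answers above.
form = {"findpass": "1", "router": "ZyXEL", "findpassword": "Find Password"}

# Form encoding turns the space into "+", which matches the curl --data string.
body = urlencode(form)
print(body)

# A literal "+" in the dict value is escaped as %2B instead, so the server
# would receive "Find+Password" rather than "Find Password".
double_encoded = urlencode({"findpassword": "Find+Password"})
print(double_encoded)

# Decoding the curl-style body recovers the original field values.
decoded = parse_qs(body)
print(decoded["findpassword"])
```

So when copying a payload out of curl or the browser's network panel into a Python dict, it is probably safest to write the decoded values and let the HTTP library do the encoding.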
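Once the POST returns the page, the remaining step from the question is turning the `<div id="result">` table into data. The following is a minimal sketch, assuming the table has the shape shown in the question. `parse_result_table` and `SAMPLE_HTML` are hypothetical names introduced for illustration, and the cell values are placeholders rather than real data from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample shaped like the result table in the question;
# the cell values are placeholders, not real data from the site.
SAMPLE_HTML = """
<div id="result">
  <table class="table">
    <thead><tr><th>A</th><th>B</th><th>C</th></tr></thead>
    <tbody>
      <tr><td><b>a1</b></td><td>b1</td><td>c1</td></tr>
      <tr><td><b>a2</b></td><td>b2</td><td>c2</td></tr>
    </tbody>
  </table>
</div>
"""


def parse_result_table(html):
    """Return the rows of the #result table as a list of dicts keyed by header."""
    soup = BeautifulSoup(html, "html.parser")
    result = soup.find("div", {"id": "result"})
    if result is None or result.table is None:
        return []  # no result table on the page
    headers = [th.get_text(strip=True) for th in result.table.find_all("th")]
    rows = []
    for tr in result.table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(dict(zip(headers, cells)))
    return rows


rows = parse_result_table(SAMPLE_HTML)
print(rows)
```

In the second answer's code, the same function could be fed `c.content` from each POST response instead of the sample string, which would give one list of rows per router name.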