文章/答案/技术大牛

发布

社区首页 >问答首页 >web爬虫在动态更改网站代码上的应用

问web爬虫在动态更改网站代码上的应用
EN

Stack Overflow用户

提问于 2017-03-10 08:51:33

回答 2查看 1.2K关注 0票数 1

我目前正试图让自己更多地参与编程和python。为了一个小项目，我想为一个网站建立一个网络爬虫。所以我读到了，scrapy，和

到目前为止还不错..。

网站结构

这是一个简单的网站，有一个下拉菜单，可供选择。如果我选择其中之一，网站的网址不会改变。只有底层的html代码才会改变。当选择一个值时，您将得到一个带有几个列/行的结果表，的形式如下：

<div id="result">
    <table class="table">
        <thead>
            <tr>
                <th>A</th>
                <th>B</th>
                <th>C</th>
                <th>D</th>
                <th>E</th>
            </tr>
        </thead>
    <tbody>
        <tr>
            <td><b>...</b></td>
            <td>...</td>
            <td>...</td>
            <td>...</td>
            <td>...</td>
        </tr>
        ...
        more follows here...

任务

我想把每个下拉菜单条目的结果表的所有日期都刮掉。到目前为止，我只提取了下拉菜单值。

问题

如何积极选择下拉菜单中的值，以便网站html代码发生变化，并显示我希望从其中提取数据的希望表？在看了关于刮擦和美丽汤的文档之后，我仍然不明白这部分。

当前代码

from bs4 import BeautifulSoup
import requests
import urllib3

BASE_URL = "http://routerpasswords.com/"

def get_router_types(url):
    r = requests.get(url)
    html_content = r.content
    soup = BeautifulSoup(html_content)
    print("option values: \n")
    option_values = soup.find_all("option")
    print(option_values)
    print(" \n")
    print("router types: \n")
    router_types = [option.get('value') for option in soup.find_all('option')]
    print(router_types)
    return router_types

'''
Stuck here!
...

def get_passwords():
    router_types = get_router_types(BASE_URL)
    passwords = []
    for types in router_types:
        #print(types)

def main():
    get_router_types(BASE_URL)

if __name__ == "__main__":
    main()

html

python-3.x

beautifulsoup

web-crawler

python

回答 2

Stack Overflow用户

回答已采纳

发布于 2017-03-10 09:13:14

每次单击按钮时，您都要将数据发送到服务器，您可以在chrome开发工具(F12)中找到post数据：

您可以使用requests模拟这篇文章

In [27]: data = {'findpass':'1',
    ...: 'router':'Belkin',
    ...: 'findpassword':'Find Password'}

In [28]: r = requests.post('http://routerpasswords.com/', data=data)

票数 0

Stack Overflow用户

发布于 2017-03-10 09:40:23

首先，我把所有的router_names都列在一个列表中，

然后为每个路由器执行一个新的请求，其中包含正确的post params (def: get_passwords_via_name)

from bs4 import BeautifulSoup
import requests

BASE_URL = "http://routerpasswords.com/"


def get_router_types(url):
    r = requests.get(url)
    html_content = r.content
    soup = BeautifulSoup(html_content)
    print("option values: \n")
    option_values = soup.find_all("option")
    print(option_values)
    print(" \n")
    print("router types: \n")
    router_types = [option.get('value') for option in     soup.find_all('option')]
    return router_types, r

def get_passwords_via_name(router_name, rcookie):
    data = {"findpass": "1", "router": router_name, "findpassword":  "Find+Password"}
    print data

    c = requests.post('http://routerpasswords.com/', data=data)
    print c.url

    html_content = c.content
    print c.status_code
    soup = BeautifulSoup(html_content)
    return soup.find("div", {"id": "result"})

def main():
    rlist, r = get_router_types(BASE_URL)
    for i in rlist:
        print "debug"
        print get_passwords_via_name(i, r)

if __name__ == "__main__":
    main()

卷曲方式：

 curl 'http://routerpasswords.com/'   --data 'findpass=1&router=ZyXEL&findpassword=Find+Password'

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/42714061

复制

相似问题

问web爬虫在动态更改网站代码上的应用
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问web爬虫在动态更改网站代码上的应用EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问web爬虫在动态更改网站代码上的应用
EN