
Trouble finding content from a Python Beautiful Soup scrape

Stack Overflow user
Asked on 2018-07-18 07:43:37
1 answer · 98 views · 0 followers · 0 votes

I am trying to scrape this page and get the URL of each article title, which is an 'h3' > 'a' element. For example, the first result is a link with the text "Functional annotation of a full-length with collection", which links to this page.

Everything my search returns is '[]'.

My code is as follows:

import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.lens.org/lens/scholar/search/results?q="edith%20cowan"')
soup = BeautifulSoup(req.content, "html5lib")
article_links = soup.select('h3 a')
print(article_links)

Where am I going wrong?


1 answer

Stack Overflow user

Accepted answer

Posted on 2018-07-18 14:51:05

You are running into this problem because you are using the wrong URL to get the article links. I didn't change much and came up with the code below (note that I removed the bs4 module, since it is no longer needed):

import requests

search = "edith cowan"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

json = {"scholarly_search":{"from":0,"size":"10","_source":{"excludes":["referenced_by_patent_hash","referenced_by_patent","reference"]},"query":{"bool":{"must":[{"query_string":{"query":f"\"{search}\"","fields":["title","abstract","default"],"default_operator":"and"}}],"must_not":[{"terms":{"publication_type":["unknown"]}}],"filter":[]}},"highlight":{"pre_tags":["<span class=\"highlight\">"],"post_tags":["</span>"],"fields":{"title":{}},"number_of_fragments":0},"sort":[{"referenced_by_patent_count":{"order":"desc"}}]},"view":"scholar"}

req = requests.post("https://www.lens.org/lens/api/multi/search?request_cache=true", headers = headers, json = json).json()

links = []
for x in req["query_result"]["hits"]["hits"]:
    links.append("https://www.lens.org/lens/scholar/article/{}/main".format(x["_source"]["record_lens_id"]))

The search variable holds the term you are searching for (in this case "edith cowan"). The links are stored in the links variable.
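For reference, the link construction used above can be pulled into a small helper; the id in the example call is a made-up placeholder, real record_lens_id values come from the API response:

```python
def article_url(lens_id):
    # Builds the public article URL from a record_lens_id, exactly as in
    # the loop above.
    return "https://www.lens.org/lens/scholar/article/{}/main".format(lens_id)

# Hypothetical id, for illustration only:
print(article_url("000-000-000-000-000"))
# -> https://www.lens.org/lens/scholar/article/000-000-000-000-000/main
```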

Edit: How I did it

So the main questions are probably where I got that URL from, and how I knew what to put in the json variable. For that I used a simple HTTP interceptor (in my case, Burp Suite).

This tool showed me that when you visit that URL (the one you send your request to in the question), your browser sends a POST request to https://www.lens.org/lens/api/multi/search?request_cache=true, which then retrieves all the information for the current page. Burp Suite also shows you what body was sent with that request, so I copy-pasted it into the json variable.
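The intercepted body can be rebuilt for any search term. The sketch below keeps only the parts of the capture that vary between requests ("from", "size", and the quoted query string) and drops the static "_source", "highlight", and "sort" keys shown in the full payload above; the helper name is mine, not part of the original answer:

```python
def build_payload(search, start=0, size=10):
    # Reconstructs the intercepted request body for an arbitrary query.
    # Only the offset, the page size, and the quoted search term change
    # from request to request; everything else is copied from the capture.
    return {
        "scholarly_search": {
            "from": start,
            "size": str(size),
            "query": {
                "bool": {
                    "must": [{"query_string": {
                        "query": f'"{search}"',
                        "fields": ["title", "abstract", "default"],
                        "default_operator": "and",
                    }}],
                    "must_not": [{"terms": {"publication_type": ["unknown"]}}],
                    "filter": [],
                }
            },
        },
        "view": "scholar",
    }

p = build_payload("edith cowan", start=0, size=100)
print(p["scholarly_search"]["query"]["bool"]["must"][0]["query_string"]["query"])
# -> "edith cowan"
```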

For better visualization, this is what it looks like in Burp Suite (screenshot not preserved in this copy):

Edit: scanning all the pages

To scan every page, you can use the following script:

import requests

search = "edith cowan" #Change this to the term you are searching for
r_to_show = 100 #Number of articles per page (I strongly recommend leaving it at 100)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

json = {"scholarly_search":{"from":0,"size":f"{r_to_show}","_source":{"excludes":["referenced_by_patent_hash","referenced_by_patent","reference"]},"query":{"bool":{"must":[{"query_string":{"query":f"\"{search}\"","fields":["title","abstract","default"],"default_operator":"and"}}],"must_not":[{"terms":{"publication_type":["unknown"]}}],"filter":[]}},"highlight":{"pre_tags":["<span class=\"highlight\">"],"post_tags":["</span>"],"fields":{"title":{}},"number_of_fragments":0},"sort":[{"referenced_by_patent_count":{"order":"desc"}}]},"view":"scholar"}

req = requests.post("https://www.lens.org/lens/api/multi/search?request_cache=true", headers = headers, json = json).json()

links = [] #links are stored here
count = 0

#link_before and link_after help determine when to stop going to the next page
link_before = 0
link_after = 0

while True:
    if count > 0:
        #Advance the offset only after the first page has been processed,
        #otherwise the second window of results would be skipped
        json["scholarly_search"]["from"] += r_to_show
        req = requests.post("https://www.lens.org/lens/api/multi/search?request_cache=true", headers = headers, json = json).json()
    for x in req["query_result"]["hits"]["hits"]:
        links.append("https://www.lens.org/lens/scholar/article/{}/main".format(x["_source"]["record_lens_id"]))
    count += 1
    link_after = len(links)
    if link_after == link_before:
        break
    link_before = len(links)
    print(f"page {count} done, links recorded {len(links)}")

I added some comments in the code to make it easier to understand.
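The stop condition above compares the number of collected links before and after each page. The same offset-based pagination can be written a little more directly by stopping as soon as a page comes back short; this is a self-contained sketch with a stand-in fetch function, not a call to the live API:

```python
def paginate(fetch, page_size=100):
    # fetch(offset, size) must return the list of items in that window.
    # Keep advancing the offset until a page returns fewer items than
    # requested (or none at all), which means we have reached the end.
    items, offset = [], 0
    while True:
        page = fetch(offset, page_size)
        items.extend(page)
        if len(page) < page_size:
            break
        offset += page_size
    return items

# Stand-in data source with 250 fake records:
data = list(range(250))
result = paginate(lambda off, size: data[off:off + size], page_size=100)
print(len(result))  # -> 250
```

With the real API, fetch would POST the json payload with "from" set to the offset and collect the record_lens_id values from the response.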

Votes: 1
Original page content provided by Stack Overflow; translation supported by Tencent Cloud.
Original link: https://stackoverflow.com/questions/51391618
