Q: Extraneous tags when webscraping subheadings

Stack Overflow user

Asked 2021-03-17 04:14:57

2 answers · 89 views · 0 followers · 1 vote

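I'm trying to scrape the Wikipedia page "Genome". I only want to grab the section headings, such as "Origin of term", "Sequencing and mapping", "Viral genomes", "Prokaryotic genomes", and "Eukaryotic genomes", including the subheadings beneath them, "Genome size", and so on. To do that, I wrote the following code: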

    def filter_headers(self, web_soup):
        # Grabs the headers from the web page
        """
        :param web_soup: the raw web soup from the webpage
        :return: header_soup: the headers in text form
        """
        # TO DO: how to separate out just the main body content while including the
        # title header
        # Find all tags with a pattern like h1,h2,h3,h4...
        headers = read_page_soup.find_all(re.compile(r'h\d+'))

        return headers

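The problem is that no matter how specifically I filter my tags, I still get the same results from the navigation menus, such as Personal tools, Namespace, Variants, Views ... Tools, Print/Export, In other projects, and Languages. For example, I first tried: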

soup = read_page_soup.find(re.compile(r'h\d+'), {'class': 'mw-body-content'})
results = sr.filter_headers(soup)
for result in results:
    print(result.text)

Then I tried filtering on the mw-parser-output class, like this:

soup = read_page_soup.find(re.compile(r'h\d+'), {'class': 'mw-parser-content'})
results = sr.filter_headers(soup)
for result in results:
    print(result.text)

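I don't understand this. When I hover over the div, it doesn't even highlight Wikipedia's sidebar. I'd like to find a solution that works across many Wikipedia pages so that I can scrape them with similar results. Later, I'd like to extend this to other sites such as Ars Technica. So if anyone can also give me fair warning about using this approach for basic web crawling, or tell me whether it is a mistake to attempt basic crawling case by case without a dedicated crawling application, please let me know.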

Tags: python, regex, web-scraping, beautifulsoup

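2 Answers

Stack Overflow user

Answered 2021-03-17 10:34:19

You're trying to fetch a heading tag, but both of the classes you describe belong to divs. The Wikipedia pages I tested typically have three elements with the mw-body-content class (all of them divs, as is the single mw-parser-output element):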

<div id="siteNotice" class="mw-body-content">
<div class="mw-indicators mw-body-content">
<div id="bodyContent" class="mw-body-content">
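
To make the tag-versus-class mismatch concrete, here is a minimal sketch. The HTML string is a hypothetical, heavily simplified stand-in for a Wikipedia page, and `html.parser` is used only so the snippet runs without `lxml`. It also looks as though the question's `filter_headers` searches `read_page_soup` rather than its `web_soup` argument, which would explain why the navigation headings kept appearing regardless of the filter; that reading is an inference from the posted code.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for a Wikipedia page (not real markup).
html = """
<div id="mw-navigation"><h3>Personal tools</h3></div>
<div id="bodyContent" class="mw-body-content">
  <h2>Origin of term</h2>
  <h3>Coding sequences</h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
heading = re.compile(r"^h[1-6]$")

# Asks for a heading tag that itself carries the class: nothing matches.
no_match = soup.find(heading, {"class": "mw-body-content"})

# Select the div by its class first, then search for headings inside it.
body = soup.find("div", {"class": "mw-body-content"})
inside = [h.get_text() for h in body.find_all(heading)]

print(no_match)
print(inside)
```

The first lookup returns `None` because no `h1`-`h6` element has that class; the second returns only the headings nested inside the selected div.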

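To get the headings, you can keep using the mw-body-content class, but switch to find_all and select the third item. From there, the logic in filter_headers (find_all with the heading regex) seems to produce the expected results.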

from bs4 import BeautifulSoup
import requests
import re

source = requests.get('https://en.wikipedia.org/wiki/Genome').text
read_page_soup = BeautifulSoup(source, 'lxml')

def filter_headers(web_soup):
    headers = web_soup.find_all(re.compile(r'h\d+'))
    return headers

soup = read_page_soup.find_all("div", {'class': 'mw-body-content'})[2]
results = filter_headers(soup)
for result in results:
    hN = int(result.name[1:])*3-3
    print(f"{result.name} {'-'*hN} {result.text}")

Output

h2 --- Contents
h2 --- Origin of term[edit]
h2 --- Sequencing and mapping[edit]
h2 --- Viral genomes[edit]
h2 --- Prokaryotic genomes[edit]
h2 --- Eukaryotic genomes[edit]
h3 ------ Coding sequences[edit]
h3 ------ Noncoding sequences[edit]
h4 --------- Tandem repeats[edit]
h4 --------- Transposable elements[edit]
h5 ------------ Retrotransposons[edit]
h5 ------------ DNA transposons[edit]
h2 --- Genome size[edit]
h3 ------ Genome size due to transposable elements[edit]
...
...
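
Two refinements may be worth considering: selecting the content div by its `bodyContent` id instead of by position, and stripping the trailing `[edit]` seen in the output. The following is a sketch under stated assumptions: the HTML is a hypothetical stand-in, `clean_headings` is an illustrative helper name, and whether the id is present on every Wikipedia page has not been verified here.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in mirroring the three mw-body-content divs listed above.
html = """
<div id="siteNotice" class="mw-body-content"></div>
<div class="mw-indicators mw-body-content"></div>
<div id="bodyContent" class="mw-body-content">
  <h2>Viral genomes[edit]</h2>
  <h3>Coding sequences[edit]</h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def clean_headings(page_soup):
    """Return (level, text) pairs for headings inside #bodyContent."""
    body = page_soup.find("div", id="bodyContent")
    if body is None:  # layout differs from the assumption: return nothing
        return []
    rows = []
    for h in body.find_all(re.compile(r"^h[1-6]$")):
        # Drop a trailing "[edit]" marker from the heading text.
        text = re.sub(r"\[edit\]\s*$", "", h.get_text()).strip()
        rows.append((int(h.name[1:]), text))
    return rows

headings = clean_headings(soup)
for level, text in headings:
    print("  " * (level - 2) + text)
```

Selecting by id avoids depending on how many elements share the class, at the cost of depending on the id instead.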

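Votes: 0

Stack Overflow user

Answered 2021-03-17 14:37:22

Couldn't you just take that information from the li elements of the contents list, #toc li? Each li has two child spans containing the item number and the name: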

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://en.wikipedia.org/wiki/Genome')
soup = bs(r.content, 'lxml')
contents = [i.select_one('.tocnumber').text + ' ' + i.select_one('.toctext').text  for i in soup.select('#toc li')]
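
As an illustration of what that selector yields, here is a sketch run against a hypothetical table-of-contents fragment (the numbering and nesting are invented for the example, and `toc_entries` is an illustrative helper name). It adds a guard for list items that lack either span, which the one-line comprehension would not tolerate.

```python
from bs4 import BeautifulSoup

# Hypothetical TOC fragment using the ids and classes named in the answer.
html = """
<div id="toc"><ul>
  <li><span class="tocnumber">1</span> <span class="toctext">Origin of term</span></li>
  <li><span class="tocnumber">2</span> <span class="toctext">Eukaryotic genomes</span>
    <ul>
      <li><span class="tocnumber">2.1</span> <span class="toctext">Coding sequences</span></li>
    </ul>
  </li>
</ul></div>
"""
soup = BeautifulSoup(html, "html.parser")

def toc_entries(page_soup):
    """Return 'number title' strings for every entry under #toc."""
    entries = []
    for li in page_soup.select("#toc li"):
        number = li.select_one(".tocnumber")
        title = li.select_one(".toctext")
        if number is None or title is None:
            continue  # skip items that do not match the assumed structure
        entries.append(f"{number.get_text()} {title.get_text()}")
    return entries

entries = toc_entries(soup)
print(entries)
```

Because `select_one` returns the first match in document order, a parent item picks up its own number and title rather than those of its nested children.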

Votes: 0

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/66662742
