
Extracting information from p tags and inserting it into Python with BeautifulSoup

Stack Overflow user
Asked on 2021-06-13 00:24:29
1 answer · 92 views · 0 followers · 0 votes

I am trying to scrape a public government web page that contains speeches and ministers' biographies. In the end, I want a dictionary like this:

data = {

    { "time": "18/05/2016",
    "author_speech": name
    "bio": [list, of, paragraphs_bio]
    "speech": [list, of, paragraphs_speech]
    "bio_link": "url"
    "speech_link": "url"
    }
    { "time": "01/01/2011",
    "author_speech": "name"
    "bio": [list, of, paragraphs]
    "speech": [list, of, paragraphs]
    "bio_link": "url"
    "speech_link": "url"
    }


}
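(As written, the structure above is not valid Python: a dict needs keys for its entries, and the fields lack commas. Not part of the original question, but a minimal sketch of what the intent might look like as a valid list of dicts, with placeholder values:)

```python
# A list of dicts, one entry per speech. All values are placeholders.
data = [
    {
        "time": "18/05/2016",
        "author_speech": "name",
        "bio": ["list", "of", "bio paragraphs"],
        "speech": ["list", "of", "speech paragraphs"],
        "bio_link": "url",
        "speech_link": "url",
    },
    {
        "time": "01/01/2011",
        "author_speech": "name",
        "bio": ["list", "of", "paragraphs"],
        "speech": ["list", "of", "paragraphs"],
        "bio_link": "url",
        "speech_link": "url",
    },
]
```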

Example of the page:

<div class="item-page">
   <p><strong>TITLE</strong></p>
   <p><strong>18/05/2016</strong>&nbsp; &nbsp; <a href="/link//paragraphs/bio">Author's name</a>&nbsp;|&nbsp;<a href="/link/paragraphs/speech">Speech</a></p>
   <p><strong>01/01/2011&nbsp; &nbsp;&nbsp;</strong><a href="/link/paragraphs/bio02">Author's name02</a></p>
   <p><strong>28/08/2013&nbsp; &nbsp; </strong><a href="/link/paragraphs/bio03">Author's name03</a>&nbsp;|&nbsp;<a href="/link/paragraphs/speech">Speech</a></p>
   <p><strong>01/01/2011&nbsp; &nbsp;&nbsp;</strong><a href="/link/paragraphs/bio04">Author's name04</a></p>
   <p><strong>01/01/2011&nbsp; &nbsp;&nbsp;</strong><a href="/link/paragraphs/bio05">Author's name05</a></p>
   <p><strong>28/08/2013&nbsp; &nbsp; </strong><a href="/link/paragraphs/bio03">Author's name06</a>&nbsp;|&nbsp;<a href="/link/paragraphs/speech">Speech</a></p>
</div>

Using BeautifulSoup, I am currently building separate lists of times, speech authors, links to speeches, and links to biographies, and then putting them into a dictionary or DataFrame. But I am having difficulty with two things:

  • (1) As you can see in the HTML example above, some paragraphs contain five pieces of information to extract, while others contain only three. Because of this, merging the lists does not work. Is it possible to iterate paragraph by paragraph and extract the information inside each one?
  • (2) The paragraphs I need are behind the href links (I can extract those), but I am having trouble integrating them into the same dictionary described above.
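(Difficulty (1) can be illustrated stand-alone: iterate row by row and guard on how many links each row contains. The sketch below uses only the standard library's `re` module on simplified copies of the sample markup, purely for illustration; a real scrape would iterate tags with BeautifulSoup, as the accepted answer does.)

```python
import re

# Simplified copies of two sample rows: one with a speech link, one without.
rows = [
    '<p><strong>18/05/2016</strong> <a href="/link//paragraphs/bio">Author\'s name</a>|<a href="/link/paragraphs/speech">Speech</a></p>',
    '<p><strong>01/01/2011</strong><a href="/link/paragraphs/bio02">Author\'s name02</a></p>',
]

entries = []
for row in rows:
    date = re.search(r"\d{2}/\d{2}/\d{4}", row)
    links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', row)
    entries.append({
        "time": date.group(0) if date else "",
        # The first link is the bio; a second link, when present, is the speech.
        "bio_link": links[0][0] if len(links) > 0 else "",
        "author_speech": links[0][1] if len(links) > 0 else "",
        "speech_link": links[1][0] if len(links) > 1 else "",
    })
```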
import re
import unicodedata
import urllib.parse
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = 'http://www.example.com/'
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'html.parser')

# Collect the dates from the <strong> tags.
times = []
for time in bs.find_all('strong'):
    times.append(time.get_text())

times_tmp2 = times[2:]
time_tmp2 = "".join([str(_) for _ in times_tmp2])
time_tmp2 = unicodedata.normalize("NFKD", time_tmp2)
time_tmp2 = re.split(r"(\d{2}[-/]\d{2}[-/]\d{4})", time_tmp2)
time_tmp2 = [x for x in time_tmp2 if x != '']
time_tmp2 = [elem for elem in time_tmp2 if elem.strip()]
times_final = list(set(time_tmp2))


links_to_speech = []
for link in bs.find_all('a', string='Speech'):
    # print(urllib.parse.urljoin(url, link.get('href')))
    links_to_speech.append(urllib.parse.urljoin(url, link.get('href')))


authors = []
for author in bs.find_all('a'):
    authors.append(author.get_text())

authors_final = []
for author in authors:
    init = 'First Author'
    final = 'Not Author'
    index_init = authors.index(init)
    index_final = authors.index(final)
    a = authors[index_init:index_final]
    a = [x for x in a if x != 'Speech']
    authors_final = a


links_bio = []
p = bs.find_all('p')
for link_bio in p:
    a = link_bio.find('a')
    links_bio.append(a)
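(The `re.split` call above depends on a detail worth noting: because the date pattern is wrapped in a capturing group, the matched dates themselves are kept in the output, interleaved with the text between them. A self-contained illustration on a made-up string:)

```python
import re

# A capturing group in re.split() keeps the matched delimiters (the dates)
# in the result list, interleaved with the surrounding text.
blob = "TITLE18/05/2016Author's name28/08/2013Author's name03"
parts = re.split(r"(\d{2}[-/]\d{2}[-/]\d{4})", blob)
parts = [p for p in parts if p.strip()]
# parts pairs each date with the text that follows it.
```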

1 Answer

Stack Overflow user

Accepted answer

Answered on 2021-06-13 02:24:12

Based on the target data structure provided above, it looks like you are using a dictionary. It is not clear what you want your keys to be, so I would suggest using a list/array instead.

I would suggest a slightly different approach to dissect the problem. One potential implementation is to iterate over each row (each <p> paragraph of the <div>) and use the data available there. This lets us populate the data array one index at a time.

From there, if a link exists, you can query the external data source (or read from a different location on the page) to collect the corresponding data. In the example below, I chose to do this in a separate iteration over data to help keep the code readable.

I have not used the BeautifulSoup4 library before, so apologies if my solution is not the most elegant use of it.

from typing import List
from urllib.request import urlopen

import bs4.element
from bs4 import BeautifulSoup

data: List = []  # <- we want the data here.

# Parse the webpage html
bs = BeautifulSoup('''\
<div class="item-page">
   <p><strong>TITLE</strong></p>
   <p><strong>18/05/2016</strong>&nbsp; &nbsp; <a href="/link//paragraphs/bio">Author's name</a>&nbsp;|&nbsp;<a href="/link/paragraphs/speech">Speech</a></p>
   <p><strong>01/01/2011&nbsp; &nbsp;&nbsp;</strong><a href="/link/paragraphs/bio02">Author's name02</a></p>
   <p><strong>28/08/2013&nbsp; &nbsp; </strong><a href="/link/paragraphs/bio03">Author's name03</a>&nbsp;|&nbsp;<a href="/link/paragraphs/speech">Speech</a></p>
   <p><strong>01/01/2011&nbsp; &nbsp;&nbsp;</strong><a href="/link/paragraphs/bio04">Author's name04</a></p>
   <p><strong>01/01/2011&nbsp; &nbsp;&nbsp;</strong><a href="/link/paragraphs/bio05">Author's name05</a></p>
   <p><strong>28/08/2013&nbsp; &nbsp; </strong><a href="/link/paragraphs/bio03">Author's name06</a>&nbsp;|&nbsp;<a href="/link/paragraphs/speech">Speech</a></p>
</div>''', features='html.parser')

# Grab the paragraphs within the `item-page` div, checkout CSS selectors :).
entries = bs.select('div.item-page p')

# Populate the entries with time and links (if they are present)
for entry in entries:
    entry: bs4.element.Tag  # https://github.com/il-vladislav/BeautifulSoup4/blob/master/bs4/element.py

    time = entry.select_one('strong').get_text()
    if time == 'TITLE':
        continue  # skip this entry

    # Grab a list of the links (may be of size 0-2 depending on the contents).
    links = [link.get('href') for link in entry.select('a')]

    # Populate the array with a document.
    data.append({
        'time': time,
        # In the sample markup the first link is the bio; the second, when
        # present, is the speech ("Speech" anchor text).
        'bio_link': links[0] if len(links) > 0 else '',
        'bio': [],
        'speech_link': links[1] if len(links) > 1 else '',
        'speech': [],
    })

# Collect speeches and bios if present.
for person in data:
    if person['speech_link']:  # empty strings evaluate as False and would be skipped.
        html = urlopen('https://baconipsum.com/api/?type=all-meat&paras=2&start-with-lorem=1&format=html')
        person['speech'] = [para.get_text() for para in BeautifulSoup(html, 'html.parser').select('p')]

    if person['bio_link']:
        html = urlopen('https://baconipsum.com/api/?type=all-meat&paras=2&start-with-lorem=1&format=html')
        person['bio'] = [para.get_text() for para in BeautifulSoup(html, 'html.parser').select('p')]

print(data)
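(One caveat: the `href` values stored above are relative paths, while `urlopen` needs absolute URLs. As the question's own code already does, `urllib.parse.urljoin` can resolve them against the page URL; the base URL below is a placeholder:)

```python
from urllib.parse import urljoin

# Placeholder base URL; substitute the real page URL when scraping.
base = "https://www.example.com/speeches/"

# A root-relative href replaces the whole path component of the base URL.
speech_url = urljoin(base, "/link/paragraphs/speech")
print(speech_url)  # https://www.example.com/link/paragraphs/speech
```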
Votes: 1

Original page content provided by Stack Overflow; translation supplied by Tencent Cloud's machine-translation engine.
Original link:

https://stackoverflow.com/questions/67953915