
Extracting information from p tags and inserting it into Python with BeautifulSoup

Stack Overflow user
Asked on 2021-06-13 00:24:29
1 answer · 92 views · 0 followers · 0 votes

I am trying to scrape a public government web page that contains speeches and ministers' biographies. In the end, I want a dictionary like this:

data = {

    { "time": "18/05/2016",
    "author_speech": name
    "bio": [list, of, paragraphs_bio]
    "speech": [list, of, paragraphs_speech]
    "bio_link": "url"
    "speech_link": "url"
    }
    { "time": "01/01/2011",
    "author_speech": "name"
    "bio": [list, of, paragraphs]
    "speech": [list, of, paragraphs]
    "bio_link": "url"
    "speech_link": "url"
    }


}
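(As written, the structure above is not valid Python: a dict needs keys for its entries, and the fields lack commas. Not part of the original question, but a minimal sketch of what the intent might look like as a valid list of dicts, with placeholder values:)

```python
# A list of dicts, one entry per speech. All values are placeholders.
data = [
    {
        "time": "18/05/2016",
        "author_speech": "name",
        "bio": ["list", "of", "bio paragraphs"],
        "speech": ["list", "of", "speech paragraphs"],
        "bio_link": "url",
        "speech_link": "url",
    },
    {
        "time": "01/01/2011",
        "author_speech": "name",
        "bio": ["list", "of", "paragraphs"],
        "speech": ["list", "of", "paragraphs"],
        "bio_link": "url",
        "speech_link": "url",
    },
]
```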

Example of the page:

<div class="item-page">
   <p><strong>TITLE</strong></p>
   <p><strong>18/05/2016</strong>&nbsp; &nbsp; <a href="/link//paragraphs/bio">Author's name</a>&nbsp;|&nbsp;<a href="/link/paragraphs/speech">Speech</a></p>
   <p><strong>01/01/2011&nbsp; &nbsp;&nbsp;</strong><a href="/link/paragraphs/bio02">Author's name02</a></p>
   <p><strong>28/08/2013&nbsp; &nbsp; </strong><a href="/link/paragraphs/bio03">Author's name03</a>&nbsp;|&nbsp;<a href="/link/paragraphs/speech">Speech</a></p>
   <p><strong>01/01/2011&nbsp; &nbsp;&nbsp;</strong><a href="/link/paragraphs/bio04">Author's name04</a></p>
   <p><strong>01/01/2011&nbsp; &nbsp;&nbsp;</strong><a href="/link/paragraphs/bio05">Author's name05</a></p>
   <p><strong>28/08/2013&nbsp; &nbsp; </strong><a href="/link/paragraphs/bio03">Author's name06</a>&nbsp;|&nbsp;<a href="/link/paragraphs/speech">Speech</a></p>
</div>

Using BeautifulSoup, I am currently building separate lists of times, speech authors, links to speeches, and links to biographies, and then putting them into a dictionary or DataFrame. But I am having difficulty with two things:

  • (1) As you can see in the HTML example above, some paragraphs contain five pieces of information to extract, while others contain only three. Because of this, merging the lists does not work. Is it possible to iterate paragraph by paragraph and extract the information inside each one?
  • (2) The paragraphs I need are behind the href links (I can extract those), but I am having trouble integrating them into the same dictionary described above.
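(Difficulty (1) can be illustrated stand-alone: iterate row by row and guard on how many links each row contains. The sketch below uses only the standard library's `re` module on simplified copies of the sample markup, purely for illustration; a real scrape would iterate tags with BeautifulSoup, as the accepted answer does.)

```python
import re

# Simplified copies of two sample rows: one with a speech link, one without.
rows = [
    '<p><strong>18/05/2016</strong> <a href="/link//paragraphs/bio">Author\'s name</a>|<a href="/link/paragraphs/speech">Speech</a></p>',
    '<p><strong>01/01/2011</strong><a href="/link/paragraphs/bio02">Author\'s name02</a></p>',
]

entries = []
for row in rows:
    date = re.search(r"\d{2}/\d{2}/\d{4}", row)
    links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', row)
    entries.append({
        "time": date.group(0) if date else "",
        # The first link is the bio; a second link, when present, is the speech.
        "bio_link": links[0][0] if len(links) > 0 else "",
        "author_speech": links[0][1] if len(links) > 0 else "",
        "speech_link": links[1][0] if len(links) > 1 else "",
    })
```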
import re
import unicodedata
import urllib.parse
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = 'http://www.example.com/'
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'html.parser')

# Collect the dates from the <strong> tags.
times = []
for time in bs.find_all('strong'):
    times.append(time.get_text())

times_tmp2 = times[2:]
time_tmp2 = "".join([str(_) for _ in times_tmp2])
time_tmp2 = unicodedata.normalize("NFKD", time_tmp2)
time_tmp2 = re.split(r"(\d{2}[-/]\d{2}[-/]\d{4})", time_tmp2)
time_tmp2 = [x for x in time_tmp2 if x != '']
time_tmp2 = [elem for elem in time_tmp2 if elem.strip()]
times_final = list(set(time_tmp2))


links_to_speech = []
for link in bs.find_all('a', string='Speech'):
    # print(urllib.parse.urljoin(url, link.get('href')))
    links_to_speech.append(urllib.parse.urljoin(url, link.get('href')))


authors = []
for author in bs.find_all('a'):
    authors.append(author.get_text())

authors_final = []
for author in authors:
    init = 'First Author'
    final = 'Not Author'
    index_init = authors.index(init)
    index_final = authors.index(final)
    a = authors[index_init:index_final]
    a = [x for x in a if x != 'Speech']
    authors_final = a


links_bio = []
p = bs.find_all('p')
for link_bio in p:
    a = link_bio.find('a')
    links_bio.append(a)
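(The `re.split` call above depends on a detail worth noting: because the date pattern is wrapped in a capturing group, the matched dates themselves are kept in the output, interleaved with the text between them. A self-contained illustration on a made-up string:)

```python
import re

# A capturing group in re.split() keeps the matched delimiters (the dates)
# in the result list, interleaved with the surrounding text.
blob = "TITLE18/05/2016Author's name28/08/2013Author's name03"
parts = re.split(r"(\d{2}[-/]\d{2}[-/]\d{4})", blob)
parts = [p for p in parts if p.strip()]
# parts pairs each date with the text that follows it.
```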

1 Answer

Stack Overflow user

Accepted answer

Answered on 2021-06-13 02:24:12

Based on the target data structure provided above, it looks like you are using a dictionary. It is not clear what you want your keys to be, so I would suggest using a list/array instead.

I would suggest a slightly different approach to dissect the problem. One potential implementation is to iterate over each row (each <p> paragraph of the <div>) and use the data available there. This lets us populate the data array one index at a time.

From there, if a link exists, you can query the external data source (or read from a different location on the page) to collect the corresponding data. In the example below, I chose to do this in a separate iteration over data to help keep the code readable.

I have not used the BeautifulSoup4 library before, so apologies if my solution is not the most elegant use of it.

from typing import List
from urllib.request import urlopen

import bs4.element
from bs4 import BeautifulSoup

data: List = []  # <- we want the data here.

# Parse the webpage html
bs = BeautifulSoup('''\
<div class="item-page">
   <p><strong>TITLE</strong></p>
   <p><strong>18/05/2016</strong>&nbsp; &nbsp; <a href="/link//paragraphs/bio">Author's name</a>&nbsp;|&nbsp;<a href="/link/paragraphs/speech">Speech</a></p>
   <p><strong>01/01/2011&nbsp; &nbsp;&nbsp;</strong><a href="/link/paragraphs/bio02">Author's name02</a></p>
   <p><strong>28/08/2013&nbsp; &nbsp; </strong><a href="/link/paragraphs/bio03">Author's name03</a>&nbsp;|&nbsp;<a href="/link/paragraphs/speech">Speech</a></p>
   <p><strong>01/01/2011&nbsp; &nbsp;&nbsp;</strong><a href="/link/paragraphs/bio04">Author's name04</a></p>
   <p><strong>01/01/2011&nbsp; &nbsp;&nbsp;</strong><a href="/link/paragraphs/bio05">Author's name05</a></p>
   <p><strong>28/08/2013&nbsp; &nbsp; </strong><a href="/link/paragraphs/bio03">Author's name06</a>&nbsp;|&nbsp;<a href="/link/paragraphs/speech">Speech</a></p>
</div>''', features='html.parser')

# Grab the paragraphs within the `item-page` div, checkout CSS selectors :).
entries = bs.select('div.item-page p')

# Populate the entries with time and links (if they are present)
for entry in entries:
    entry: bs4.element.Tag  # https://github.com/il-vladislav/BeautifulSoup4/blob/master/bs4/element.py

    time = entry.select_one('strong').get_text()
    if time == 'TITLE':
        continue  # skip this entry

    # Grab a list of the links (may be of size 0-2 depending on the contents).
    links = [link.get('href') for link in entry.select('a')]

    # Populate the array with a document.
    data.append({
        'time': time,
        # In the sample markup the first link is the bio; the second, when
        # present, is the speech ("Speech" anchor text).
        'bio_link': links[0] if len(links) > 0 else '',
        'bio': [],
        'speech_link': links[1] if len(links) > 1 else '',
        'speech': [],
    })

# Collect speeches and bios if present.
for person in data:
    if person['speech_link']:  # empty strings evaluate as False and would be skipped.
        html = urlopen('https://baconipsum.com/api/?type=all-meat&paras=2&start-with-lorem=1&format=html')
        person['speech'] = [para.get_text() for para in BeautifulSoup(html, 'html.parser').select('p')]

    if person['bio_link']:
        html = urlopen('https://baconipsum.com/api/?type=all-meat&paras=2&start-with-lorem=1&format=html')
        person['bio'] = [para.get_text() for para in BeautifulSoup(html, 'html.parser').select('p')]

print(data)
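(One caveat: the `href` values stored above are relative paths, while `urlopen` needs absolute URLs. As the question's own code already does, `urllib.parse.urljoin` can resolve them against the page URL; the base URL below is a placeholder:)

```python
from urllib.parse import urljoin

# Placeholder base URL; substitute the real page URL when scraping.
base = "https://www.example.com/speeches/"

# A root-relative href replaces the whole path component of the base URL.
speech_url = urljoin(base, "/link/paragraphs/speech")
print(speech_url)  # https://www.example.com/link/paragraphs/speech
```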
Votes: 1

Original page content provided by Stack Overflow; translation supplied by Tencent Cloud's machine-translation engine.
Original link:

https://stackoverflow.com/questions/67953915