I'm trying to scrape a public government page that contains speeches and ministers' biographies. In the end, I want a dictionary like this:
data = {
    { "time": "18/05/2016",
      "author_speech": name,
      "bio": [list, of, paragraphs_bio],
      "speech": [list, of, paragraphs_speech],
      "bio_link": "url",
      "speech_link": "url"
    }
    { "time": "01/01/2011",
      "author_speech": "name",
      "bio": [list, of, paragraphs],
      "speech": [list, of, paragraphs],
      "bio_link": "url",
      "speech_link": "url"
    }
}

Example page:
<div class="item-page">
<p><strong>TITLE</strong></p>
<p><strong>18/05/2016</strong> <a href="/link//paragraphs/bio">Author's name</a> | <a href="/link/paragraphs/speech">Speech</a></p>
<p><strong>01/01/2011 </strong><a href="/link/paragraphs/bio02">Author's name02</a></p>
<p><strong>28/08/2013 </strong><a href="/link/paragraphs/bio03">Author's name03</a> | <a href="/link/paragraphs/speech">Speech</a></p>
<p><strong>01/01/2011 </strong><a href="/link/paragraphs/bio04">Author's name04</a></p>
<p><strong>01/01/2011 </strong><a href="/link/paragraphs/bio05">Author's name05</a></p>
<p><strong>28/08/2013 </strong><a href="/link/paragraphs/bio03">Author's name06</a> | <a href="/link/paragraphs/speech">Speech</a></p>
</div>

Using BeautifulSoup, I'm currently building separate lists of the times, the speech authors, the links to the speeches, and the links to the biographies, and then putting them together into a dictionary or into data. But I'm stuck on two things:
import re
import unicodedata
import urllib.parse
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = 'www.example.com/'
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'html.parser')

# Collect the dates from the <strong> tags.
times = []
for time in bs.find_all('strong'):
    times.append(time.get_text())
times_tmp2 = times[2:]
time_tmp2 = "".join([str(_) for _ in times_tmp2])
time_tmp2 = unicodedata.normalize("NFKD", time_tmp2)
time_tmp2 = re.split(r"(\d{2}[-/]\d{2}[-/]\d{4})", time_tmp2)
time_tmp2 = [x for x in time_tmp2 if x != '']
time_tmp2 = [elem for elem in time_tmp2 if elem.strip()]
times_final = list(set(time_tmp2))

# Collect the links to the speeches.
links_to_speech = []
for link in bs.find_all('a', string='Speech'):
    links_to_speech.append(urllib.parse.urljoin(url, link.get('href')))

# Collect the author names.
authors = []
for author in bs.find_all('a'):
    authors.append(author.get_text())

authors_final = []
for author in authors:
    init = 'First Author'
    final = 'Not Author'
    index_init = authors.index(init)
    index_final = authors.index(final)
    a = authors[index_init:index_final]
    a = [x for x in a if x != 'Speech']
    authors_final = a

# Collect the links to the biographies.
links_bio = []
p = bs.find_all('p')
for link_bio in p:
    a = link_bio.find('a')
    links_bio.append(a)

Answered on 2021-06-13 02:24:12
Judging from the target data structure you provided above, you appear to be using a dictionary. It's not clear what you want your keys to be, so I would suggest using a list/array instead.

I'd suggest a slightly different way of dissecting the problem. One potential implementation is to iterate over each row (each <p> paragraph within the <div>) and populate one index of the data array at a time, using the data already present in that row.

From there, if a link is present, you can query the external data source (or read from a different location on the page) to collect the corresponding data. In the example below, I chose to do this in a separate iteration over data to help keep the code readable.

I haven't used the BeautifulSoup4 library much before, so my apologies if my solution isn't the most elegant use of the library.
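If you do want key-based lookups rather than a list, one option is to key the dictionary by a composite of the fields you will search on. This is only a sketch with made-up sample records, and it assumes each (date, author) pair is unique on the page:

```python
# Sketch: keying records by (date, author) instead of storing a bare list.
# Assumes each (date, author) pair occurs only once in the page.
records = [
    {"time": "18/05/2016", "author_speech": "Author's name"},
    {"time": "01/01/2011", "author_speech": "Author's name02"},
]

data = {(r["time"], r["author_speech"]): r for r in records}

# Lookup by the composite key:
entry = data[("18/05/2016", "Author's name")]
```

With a plain list you iterate to find a record; with this keyed form a lookup is a single indexing operation, at the cost of having to decide the key up front.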
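One detail to watch when querying those links: the `href` values in the markup are root-relative (e.g. `/link/paragraphs/bio02`), so before passing them to `urlopen` they need to be resolved against the page's URL. A minimal sketch using `urllib.parse.urljoin` (the base URL here is a placeholder, not the real page):

```python
from urllib.parse import urljoin

base_url = 'https://www.example.com/speeches'  # placeholder page URL

# Root-relative hrefs as they appear in the markup:
hrefs = ['/link/paragraphs/bio02', '/link/paragraphs/speech']

# urljoin resolves each path against the scheme and host of base_url.
absolute = [urljoin(base_url, href) for href in hrefs]
```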
from typing import List
from urllib.request import urlopen
import bs4.element
from bs4 import BeautifulSoup
data: List = [] # <- we want the data here.
# Parse the webpage html
bs = BeautifulSoup('''\
<div class="item-page">
<p><strong>TITLE</strong></p>
<p><strong>18/05/2016</strong> <a href="/link//paragraphs/bio">Author's name</a> | <a href="/link/paragraphs/speech">Speech</a></p>
<p><strong>01/01/2011 </strong><a href="/link/paragraphs/bio02">Author's name02</a></p>
<p><strong>28/08/2013 </strong><a href="/link/paragraphs/bio03">Author's name03</a> | <a href="/link/paragraphs/speech">Speech</a></p>
<p><strong>01/01/2011 </strong><a href="/link/paragraphs/bio04">Author's name04</a></p>
<p><strong>01/01/2011 </strong><a href="/link/paragraphs/bio05">Author's name05</a></p>
<p><strong>28/08/2013 </strong><a href="/link/paragraphs/bio03">Author's name06</a> | <a href="/link/paragraphs/speech">Speech</a></p>
</div>''', features='html.parser')
# Grab the paragraphs within the `item-page` div, checkout CSS selectors :).
entries = bs.select('div.item-page p')
# Populate the entries with time and links (if they are present)
for entry in entries:
    entry: bs4.element.Tag  # https://github.com/il-vladislav/BeautifulSoup4/blob/master/bs4/element.py
    time = entry.select_one('strong').get_text().strip()
    if time == 'TITLE':
        continue  # skip the title row
    # Grab the author and the links (0-2 links depending on the row).
    author = entry.select_one('a')
    links = [link.get('href') for link in entry.select('a')]
    # Populate the array with a document. In the markup, the first link is
    # the biography and the second (if present) is the speech.
    data.append({
        'time': time,
        'author_speech': author.get_text() if author else '',
        'bio_link': links[0] if len(links) > 0 else '',
        'bio': [],
        'speech_link': links[1] if len(links) > 1 else '',
        'speech': [],
    })

# Collect speeches and bios if present.
for person in data:
    if person['speech_link']:  # empty strings evaluate as False and are skipped.
        html = urlopen('https://baconipsum.com/api/?type=all-meat&paras=2&start-with-lorem=1&format=html')
        person['speech'] = [para.get_text() for para in BeautifulSoup(html, 'html.parser').select('p')]
    if person['bio_link']:
        html = urlopen('https://baconipsum.com/api/?type=all-meat&paras=2&start-with-lorem=1&format=html')
        person['bio'] = [para.get_text() for para in BeautifulSoup(html, 'html.parser').select('p')]

print(data)

https://stackoverflow.com/questions/67953915
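Since the `time` values are `dd/mm/yyyy` strings, a likely follow-up is sorting the collected entries chronologically; sorting the strings directly would order them alphabetically. A sketch on sample data using `datetime.strptime` with the `%d/%m/%Y` format:

```python
from datetime import datetime

entries = [
    {'time': '18/05/2016'},
    {'time': '01/01/2011'},
    {'time': '28/08/2013'},
]

# Parse the dd/mm/yyyy strings so entries sort by date, not alphabetically.
entries.sort(key=lambda e: datetime.strptime(e['time'], '%d/%m/%Y'))
```

The parsed `datetime` objects could also replace the raw strings in each record if you need to do further date arithmetic.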