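This is my first question on StackOverflow, because I'm genuinely stumped. I'm using BeautifulSoup (with Python, of course) to scrape a web database that had always been consistent and easy to scrape, but has now become difficult.
Previously, the content was scraped from HTML like this: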
<div class="title-class" valign="top">"Unique Title String"</div>
<div class="body-class" valign="top">"Unique Body String"</div>
<div class="title-class" valign="top">"Unique Title String 2"</div>
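<div class="body-class" valign="top">"Unique Body String 2"</div>
The number of these divs varied, but that didn't matter. I built lists from the titles and bodies, along with other relevant values, and then populated a spreadsheet. Easy.
Now, however, it seems someone on the backend has gone off the deep end (mind you, this is government data), and the pages look like this: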
<div class="title-class" valign="top">"Unique Title String"</div>
(HTML that is totally unique in every instance and contains random amount of tags and formatting.)
<div class="title-class" valign="top">"Unique Title String 2"</div>
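(More HTML that is a totally unrelated brand of complete anarchy. If any element between these is the same twice it is pure coincidence.)
Everything I'm scraping is contained within a single, uniquely classed container. Within that container, none of these tags appear to have children (as far as I can tell from my research). It is just a flat series of tags with no hierarchy.
So clearly what I need to do is scrape everything between each div of title-class and, for the last title-class on each page, scrape everything that follows it. The problem is that I cannot for the life of me figure out how to specify this in BeautifulSoup.
Any help on how I could do this would be greatly appreciated. Thanks very much!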
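Answered 2020-10-14 19:11:15
I hope I've understood your question correctly. You want to find the sections between the different titles, plus the section after the last title. This example groups the sections into a dictionary whose keys are the title tags: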
from pprint import pprint
from bs4 import BeautifulSoup
txt = '''
<b>I don't want this</b>
<div class="title-class" valign="top">"Unique Title String 1"</div>
<a>111</a><b>some</b><i>tags</i><b>I want</b><i>to scrap</i>
<div class="title-class" valign="top">"Unique Title String 2"</div>
<a>222</a><b>some</b><i>tags</i><b>I want</b><i>to scrap</i>
'''
soup = BeautifulSoup(txt, 'html.parser')
titles = soup.find_all('div', class_='title-class')
out = {}
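# Walk only the top-level tags (bare text nodes are not returned by find_all)
for tag in soup.find_all(recursive=False):
    # Nearest title div that precedes this tag, or None if there is none yet
    prev_title = tag.find_previous('div', class_='title-class')
    # Skip anything before the first title, and skip the titles themselves
    if prev_title and tag not in titles:
        out.setdefault(prev_title, []).append(tag)
pprint(out)
Prints: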
{<div class="title-class" valign="top">"Unique Title String 2"</div>: [<a>222</a>,
<b>some</b>,
<i>tags</i>,
<b>I want</b>,
<i>to scrap</i>],
<div class="title-class" valign="top">"Unique Title String 1"</div>: [<a>111</a>,
<b>some</b>,
<i>tags</i>,
<b>I want</b>,
<i>to scrap</i>]}
Answered 2020-10-14 18:56:46
If I understand you correctly, here is an approach that uses siblings:
from bs4 import BeautifulSoup
from io import StringIO
data = '''\
<div class="title-class" valign="top">Some title</div>
<div>Lorem ipsum dolor sit amet, consectetur adipiscing elit,</div>
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
<p>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris</p>
<div class="title-class" valign="top">Some other title</div>
nisi ut aliquip ex ea commodo consequat.
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse</p>
<div class="title-class" valign="top">Yet another title</div>
<p>cillum dolore eu fugiat nulla pariatur.</p>
Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum.
'''
f = StringIO(data)
soup = BeautifulSoup(f, 'html.parser')
sections = []
for d in soup.select('div.title-class'):
    # One (title, [content...]) entry per title div
    sections.append((d.text, []))
    n = d.next_sibling
    while n:
        # Stop at the next title; the last section runs to the end of the siblings
        if n.name == 'div' and 'title-class' in n.get('class', []):
            break
        sections[-1][-1].append(str(n))
        n = n.next_sibling
from pprint import pprint
pprint(sections)
https://stackoverflow.com/questions/64359328
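Since the question mentions filling a spreadsheet from the titles and bodies, the sibling-walking idea can also be reduced to plain (title, body text) rows. The following is only a minimal sketch under the question's stated assumption that the titles and the content between them are flat siblings. The function name split_by_title, the sample markup, and the CSV column names are made up for illustration and do not come from either answer.

```python
import csv
import io

from bs4 import BeautifulSoup, Tag


def split_by_title(html, title_class="title-class"):
    """Return (title, body_text) pairs, one per title div.

    Assumes the title divs and the content between them are siblings.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for title in soup.find_all("div", class_=title_class):
        parts = []
        for node in title.next_siblings:
            if isinstance(node, Tag):
                # The next title div ends the current section
                if node.name == "div" and title_class in node.get("class", []):
                    break
                parts.append(node.get_text(" ", strip=True))
            else:
                # Bare text between tags (a NavigableString)
                parts.append(str(node).strip())
        # Drop empty fragments such as whitespace-only text nodes
        body = " ".join(p for p in parts if p)
        rows.append((title.get_text(strip=True), body))
    return rows


# Hypothetical sample markup, shaped like the pages described in the question
sample = '''
<b>ignored preamble</b>
<div class="title-class" valign="top">Title 1</div>
<a>111</a><b>some</b> loose text <i>tags</i>
<div class="title-class" valign="top">Title 2</div>
<p>222 <b>more</b></p>trailing text
'''

rows = split_by_title(sample)

# Write the rows as CSV; an in-memory buffer stands in for a real file here
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["title", "body"])
writer.writerows(rows)
print(buf.getvalue())
```

Anything before the first title is ignored, and the last title picks up everything that follows it. If the real pages nest the content inside wrapper elements, the sibling walk would need to start from the wrapper rather than from the title div itself.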