I am trying to extract information from the following HTML block:
<div class="topicicons"><span title="This is a marketplace ad topic." class="icon icon-tag"></span></div>
<a data-nologvisit href="/pinball/forum/forum/games-for-sale" rel="7" class="subforum subforum-7" title="Pinball machines for sale">MFS</a>
<a class="t" href="/pinball/forum/topic/for-sale-pirates-of-the-caribbean-le-58">FS: Pirates of the Caribbean (LE)<span class="tag tag-price">$ 25,000 </span><span class="tag tag-loc">Whiteland, IN</span></a>
<span class="by">By ARW55 (1 year ago)<span class="last"> - Last post 3 days ago</span></span>
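</div><div rel="319235" data-vu="" class="topic topic-mb0 sf-7 has-new sfbox-1 topic-featured">
The fields I want to extract are the name (in this example, "Pirates of the Caribbean (LE)"), the price ($25,000), the location (Whiteland, IN), and the last post ("Last post 3 days ago"). So far I have used this line of code:
soup.findAll(True, {'class': ['t', 'by']})
to get the following output: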
FS: Pirates of the Caribbean (LE)$ 25,000 Whiteland, IN
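By ARW55 (1 year ago) - Last post 3 days ago
However, I don't know how to extract the information I want from these strings. There are hundreds of other similar entries, for example: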
FS: Teenage Mutant Ninja Turtles (Pro)$ 8,000 (OBO) Downers Grove, IL
By Thorn-in-pinball (3 days ago) - Last post 3 days ago
I don't know where to start. Any suggestions or guidance would be greatly appreciated.
Thanks!
Posted on 2022-07-27 12:10:19
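With Beautiful Soup there is a simple way to pull text out of these elements. Because they are nested, you can look up each one separately and read its text to get the information you want.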
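# Parent element: the <a class="t"> link that wraps the whole listing
movie_post_element = soup.find("a", class_="t")
# First child: the text node that holds the listing title
movie_element = movie_post_element.contents[0]
# Text nodes are str subclasses, so strip() returns the plain title
movie = movie_element.strip()
A complete example would then be: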
import bs4
html = """<div class="topicicons"><span title="This is a marketplace ad topic." class="icon icon-tag"></span></div>
<a data-nologvisit href="/pinball/forum/forum/games-for-sale" rel="7" class="subforum subforum-7" title="Pinball machines for sale">MFS</a>
<a class="t" href="/pinball/forum/topic/for-sale-pirates-of-the-caribbean-le-58">FS: Pirates of the Caribbean (LE)<span class="tag tag-price">$ 25,000 </span><span class="tag tag-loc">Whiteland, IN</span></a>
<span class="by">By ARW55 (1 year ago)<span class="last"> - Last post 3 days ago</span></span>
</div><div rel="319235" data-vu="" class="topic topic-mb0 sf-7 has-new sfbox-1 topic-featured">"""
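soup = bs4.BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
# Parent elements
movie_element = soup.find("a", class_="t")
author_element = soup.find("span", class_="by")
# contents[0] is a text node (a str subclass); the later children are <span> tags
movie = movie_element.contents[0].strip()
price = movie_element.contents[1].text.strip()
location = movie_element.contents[2].text.strip()
author = author_element.contents[0].strip()
post_date = author_element.contents[1].text.strip()
# Full text of the "by" span, including the nested "last" span
by_text = author_element.text
Posted on 2022-07-27 11:53:55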
The following code will give you the data you need:
from bs4 import BeautifulSoup
html = '''
<div class="topicicons"><span title="This is a marketplace ad topic." class="icon icon-tag"></span></div>
<a data-nologvisit href="/pinball/forum/forum/games-for-sale" rel="7" class="subforum subforum-7" title="Pinball machines for sale">MFS</a>
<a class="t" href="/pinball/forum/topic/for-sale-pirates-of-the-caribbean-le-58">FS: Pirates of the Caribbean (LE)<span class="tag tag-price">$ 25,000 </span><span class="tag tag-loc">Whiteland, IN</span></a>
<span class="by">By ARW55 (1 year ago)<span class="last"> - Last post 3 days ago</span></span>
</div><div rel="319235" data-vu="" class="topic topic-mb0 sf-7 has-new sfbox-1 topic-featured">
'''
soup = BeautifulSoup(html, 'html.parser')
name = soup.select_one('a.t').contents[0].strip()
price = soup.select_one('a.t').contents[1].text.strip()
location = soup.select_one('a.t').contents[2].text.strip()
last_post = soup.select_one('span.last').text.strip()
author = soup.select_one('span.by').contents[0].strip()
print(name)
print(price)
print(location)
print(last_post)
print(author)
Result:
FS: Pirates of the Caribbean (LE)
$ 25,000
Whiteland, IN
- Last post 3 days ago
By ARW55 (1 year ago)
bs4 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
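The positional lookups above (contents[1], contents[2]) assume every listing has both a price and a location span, in that order. Here is a minimal sketch of an alternative, assuming the tag-price, tag-loc, and last class names from the question are used consistently across listings: select each span by class, then trim the "FS:" prefix and the leading dash. The helper name text_or_none is made up for this example.

```python
import re
from bs4 import BeautifulSoup, NavigableString

html = """
<a class="t" href="/pinball/forum/topic/for-sale-pirates-of-the-caribbean-le-58">FS: Pirates of the Caribbean (LE)<span class="tag tag-price">$ 25,000 </span><span class="tag tag-loc">Whiteland, IN</span></a>
<span class="by">By ARW55 (1 year ago)<span class="last"> - Last post 3 days ago</span></span>
"""


def text_or_none(parent, selector):
    """Hypothetical helper: stripped text of the first match, or None if absent."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(html, "html.parser")
link = soup.select_one("a.t")

# Keep only the link's own text nodes, skipping the nested <span> tags.
own_text = "".join(c for c in link.children if isinstance(c, NavigableString))
# Drop the "FS:" prefix (assumed to mark every for-sale listing).
name = re.sub(r"^\s*FS:\s*", "", own_text).strip()

price = text_or_none(link, "span.tag-price")
location = text_or_none(link, "span.tag-loc")

last_post = text_or_none(soup, "span.by span.last")
if last_post is not None:
    # Remove the leading "- " separator.
    last_post = last_post.lstrip("- ")

print(name, price, location, last_post, sep="\n")
```

A listing without a price or location then yields None for that field instead of raising an IndexError or picking up the wrong span.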
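Posted on 2022-07-27 12:23:40
Extract each type of information separately. For example, if the <a class="t"> elements hold the name, find those elements and save the result in a variable, then move on to the "by" elements, and so on.
For more structured data, however, it is better to search within each container div: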
document.querySelectorAll('.containerClass').forEach((div) => {
const name = div.querySelector('a.t');
// etc
});
I don't know what .findAll(True, {'class': ['t', 'by']}) is or how it works, but presumably you do.
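The per-container idea can be sketched in Python with Beautiful Soup as well. This is only an illustration under assumptions: the markup below is hypothetical and wraps each listing in its own div with class topic (the question's snippet shows such a div opening but not a complete listing inside it), and parse_topics and text_or_none are made-up names. For reference, find_all(True, {'class': ['t', 'by']}) matches any tag whose class list contains either t or by, which is why it returns the link and the "by" span of every listing in one flat list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two listings, each wrapped in its own div.topic.
html = """
<div class="topic sf-7">
<a class="t" href="/a">FS: Pirates of the Caribbean (LE)<span class="tag tag-price">$ 25,000 </span><span class="tag tag-loc">Whiteland, IN</span></a>
<span class="by">By ARW55 (1 year ago)<span class="last"> - Last post 3 days ago</span></span>
</div>
<div class="topic sf-7">
<a class="t" href="/b">FS: Teenage Mutant Ninja Turtles (Pro)<span class="tag tag-price">$ 8,000 (OBO) </span><span class="tag tag-loc">Downers Grove, IL</span></a>
<span class="by">By Thorn-in-pinball (3 days ago)<span class="last"> - Last post 3 days ago</span></span>
</div>
"""


def text_or_none(parent, selector):
    """Hypothetical helper: stripped text of the first match, or None if absent."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


def parse_topics(markup):
    """Return one dict per div.topic container found in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for topic in soup.select("div.topic"):
        link = topic.select_one("a.t")
        if link is None:
            continue  # not a listing; skip it
        rows.append({
            # The first stripped string inside the link is the title text.
            "name": next(link.stripped_strings, None),
            "price": text_or_none(link, "span.tag-price"),
            "location": text_or_none(link, "span.tag-loc"),
            "last_post": text_or_none(topic, "span.last"),
        })
    return rows


rows = parse_topics(html)
for row in rows:
    print(row)

# The flat query from the question, for comparison: one <a> and one <span> per listing.
flat = BeautifulSoup(html, "html.parser").find_all(True, {"class": ["t", "by"]})
```

Searching inside each container keeps the fields of one listing together, so a missing price in one entry cannot shift the values of the next.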
https://stackoverflow.com/questions/73137295