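I have decided to learn Python 2.7 for data analysis and have been watching a lot of tutorials on YouTube to pick up the basics.
I am now at the stage where I want to build simple web crawlers for educational purposes, just to learn different techniques and get used to writing code.
I followed a web crawler tutorial, but I am unsure about a few things. This is what I have so far: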
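import requests
from bs4 import BeautifulSoup
url = 'http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts'
# Download the page and parse it with the built-in HTML parser
r = requests.get(url)
plain_text = r.text
soup = BeautifulSoup(plain_text, 'html.parser')
# Collect every <div> whose class is 'ec_statements'
statements = soup.findAll('div', 'ec_statements')
for link in statements:
    print(link.contents)
I can't seem to separate out the href links and display them along with the text and date information.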
I would like the output to look like this
Could anyone also explain why these steps are taken?
Thanks very much!
Answered 2016-10-31 02:38:21
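Here is a little code to help you. In bs4, all nodes are connected. Each link you read in the loop is a node (actually a div), and you want its child, the a tag, so link.a works.
A node then carries two kinds of values: its attributes, accessed with a['href'], and its content, accessed with a.text.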
for link in statements:
    print(link.a['href'])
PS: this is the link variable:
<div id="legalert_title"><a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act">Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"</a></div>
This is link.a:
<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act">Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"</a>
This is link.a['href']:
/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act
This is link.a.text:
Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"
All HTML is structured like this, so it may be worth learning some HTML.
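To try the same idea without hitting the network, here is a minimal sketch that parses a small inline snippet modelled on the div shown above. The sample markup, the shortened href, the BASE_URL value, and the extract_links helper are all assumptions made for illustration, not part of the original page or answer. It also uses urljoin to turn the relative href into an absolute URL, which the answer does not cover.

```python
from bs4 import BeautifulSoup

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

# Hypothetical sample markup shaped like the div above (href shortened).
SAMPLE_HTML = '''
<div id="legalert_title"><a href="/Legislation-and-Politics/Legislative-Alerts/Example-Letter">Example Letter</a></div>
'''
BASE_URL = 'http://www.aflcio.org'  # assumed site root for resolving relative links


def extract_links(html, base_url):
    """Return (absolute_url, link_text) pairs for each div containing an <a href>."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for div in soup.find_all('div'):
        a = div.a  # first <a> inside this div, or None if there is none
        if a is None or not a.has_attr('href'):
            continue  # skip divs without a usable link
        # a['href'] reads the attribute; get_text() reads the visible text
        results.append((urljoin(base_url, a['href']), a.get_text(strip=True)))
    return results


links = extract_links(SAMPLE_HTML, BASE_URL)
for url, text in links:
    print(url)
    print(text)
```

Checking for a missing a tag or href before indexing avoids the TypeError or KeyError that link.a['href'] would raise on a div with no link.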
https://stackoverflow.com/questions/40335301