I have the following fragment of HTML and want to extract the date information (e.g. 31-Dec-18). I would appreciate any guidance on doing this with BS4.
<th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-19</time></th><th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-18</time></th><th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-17</time></th><th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-16</time></th><th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-15</time></th>
When I search for the 'time' tag with bs4, every entry comes back without its text (e.g. 31-Dec-15 is missing). Does anyone know why?
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.reuters.com/companies/MBBM.KL/financials")
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('time')
[<time class="TextLabel__text-label___3oCVw TextLabel__gray___1V4fk TextLabel__regular___2X0ym"></time>, <time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg"></time>, <time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg"></time>, <time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg"></time>, <time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg"></time>, <time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg"></time>]
Answered on 2020-07-25 00:20:39
Try this:
from bs4 import BeautifulSoup

# The HTML fragment from the question, parsed directly as a string
html = '<th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-19</time></th><th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-18</time></th><th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-17</time></th><th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-16</time></th><th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-15</time></th>'
soup = BeautifulSoup(html, "html.parser")
# Collect the text of every <time> element
times = [tag.get_text() for tag in soup.select('time')]
for time in times:
    print(time)
Prints:
31-Dec-19
31-Dec-18
31-Dec-17
31-Dec-16
31-Dec-15
Edit: to get the times from the live site, use Selenium in Python:
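from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

# Selenium 4 style: the geckodriver path is passed through a Service object
driver = webdriver.Firefox(service=Service('c:/program/geckodriver.exe'))
driver.get('https://www.reuters.com/companies/MBBM.KL/financials')
driver.implicitly_wait(5)  # wait up to 5 seconds when locating elements
times = driver.find_elements(By.CSS_SELECTOR, 'time')
# Skip the first <time> element, which is not one of the column dates
for time in times[1:]:
    print(time.text)
driver.quit()
Output: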
31-Dec-19
31-Dec-18
31-Dec-17
31-Dec-16
31-Dec-15
Note that you need both selenium and geckodriver; in this example I load geckodriver from c:/program/geckodriver.exe.
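If the goal is to work with the dates rather than just print them, strings such as 31-Dec-19 can be converted into date objects. The following is only a minimal sketch using the standard library; the helper name parse_report_date and the hard-coded labels list are assumptions for illustration, standing in for whatever text was scraped. Note that %b matches abbreviated month names in the current locale, so this assumes an English (or default C) locale.

```python
from datetime import date, datetime


def parse_report_date(text: str) -> date:
    """Parse a string such as '31-Dec-19' into a datetime.date.

    Hypothetical helper, not part of bs4 or selenium.
    """
    # %d = day of month, %b = abbreviated month name, %y = two-digit year
    return datetime.strptime(text.strip(), "%d-%b-%y").date()


# Stand-in for the strings scraped from the <time> elements
labels = ["31-Dec-19", "31-Dec-18", "31-Dec-17", "31-Dec-16", "31-Dec-15"]
dates = [parse_report_date(label) for label in labels]

for d in dates:
    print(d.isoformat())
```

With real scraped data, labels would be replaced by the list of strings collected from the page.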
https://stackoverflow.com/questions/63074408