Extracting integers from a paragraph
Stack Overflow user
Asked 2020-12-24 22:38:01
Answers: 3, Views: 77, Followers: 0, Votes: 0

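I am trying to extract the fee amounts from a paragraph, but I am running into problems. There are two fees on the page and I want both of them. The page is http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx and this is my code: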

fees_div = soup.find('div', class_='Fees hiddenContent pad-around-large tabcontent')
if fees_div:
    fees_list = fees_div.find_all('\d+','p')
    course_data['Fees'] = fees_list
    print('fees : ', fees_list)
3 Answers

Stack Overflow user

Answered 2020-12-24 22:45:22

Try this:

In [10]: import requests
In [11]: from bs4 import BeautifulSoup
In [12]: page = requests.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
In [13]: soup = BeautifulSoup(page.content, 'html.parser')
In [14]: [x for x in soup.find('div', class_='Fees hiddenContent pad-around-large tabcontent').text.split() if u"\xA3" in x]
Out[14]: ['£9,250*', '£17,320']
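The tokens above are still strings that carry the pound sign, the thousands separator and, for the first fee, a trailing asterisk. If actual integers are wanted, one possible clean-up step is sketched below. The helper name `to_int` is hypothetical and not part of the answer, and the input list is simply copied from the output shown above.

```python
import re


def to_int(token):
    """Drop every non-digit character from a fee token and return an int."""
    digits = re.sub(r'\D', '', token)
    if not digits:
        raise ValueError(f'no digits found in {token!r}')
    return int(digits)


# Tokens copied from the output of the list comprehension above
tokens = ['£9,250*', '£17,320']
fees = [to_int(t) for t in tokens]
print(fees)  # [9250, 17320]
```

This assumes whole-pound amounts; a token with a decimal part would need a different rule, because the decimal point is stripped along with everything else.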
Votes: 0

Stack Overflow user

Answered 2020-12-24 23:07:02

import requests
from bs4 import BeautifulSoup
import re

r = requests.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
soup = BeautifulSoup(r.text, 'html.parser')
fees_div = soup.find('div', class_='Fees hiddenContent pad-around-large tabcontent')
# The fee amounts are read from the second and third <p> elements of the fees div
m = re.search(r'£[\d,]+', fees_div.select('p:nth-of-type(2)')[0].get_text())
fee1 = m[0]
m = re.search(r'£[\d,]+', fees_div.select('p:nth-of-type(3)')[0].get_text())
fee2 = m[0]
print(fee1, fee2)

Prints:

£9,250 £17,320
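The pattern `£[\d,]+` matches a pound sign followed by one or more digits or commas. `re.search` returns `None` when nothing matches, so indexing `m[0]` directly raises a `TypeError` if a paragraph ever lacks an amount. A small sketch of the same pattern with a guard, run on made-up text: the sample strings and the helper name `first_fee` are assumptions for illustration and are not taken from the page.

```python
import re

FEE_PATTERN = re.compile(r'£[\d,]+')


def first_fee(text):
    """Return the first pound amount found in text, or None if there is none."""
    m = FEE_PATTERN.search(text)
    return m[0] if m else None


# Made-up sample paragraphs standing in for the <p> elements of the fees div
sample_home = 'New UK/Republic of Ireland students: £9,250* per year'
sample_intl = 'New international students: £17,320 per year'
sample_none = 'Fees are to be confirmed.'

print(first_fee(sample_home))  # £9,250
print(first_fee(sample_intl))  # £17,320
print(first_fee(sample_none))  # None
```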

Update

You can also use Selenium to scrape the page, although it offers no advantage in this case. For example (using Chrome):

from selenium import webdriver
from bs4 import BeautifulSoup
import re


options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)

driver.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
soup = BeautifulSoup(driver.page_source,  'html.parser')
fees_div = soup.find('div', class_='Fees hiddenContent pad-around-large tabcontent')
m = re.search(r'£[\d,]+', fees_div.select('p:nth-of-type(2)')[0].get_text())
fee1 = m[0]
m = re.search(r'£[\d,]+', fees_div.select('p:nth-of-type(3)')[0].get_text())
fee2 = m[0]
print(fee1, fee2)
driver.quit()

Update

Consider simply using the following instead: scan the entire HTML source without BeautifulSoup, using a plain regular-expression findall to locate the fees.

import requests
import re

r = requests.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
print(re.findall(r'£[\d,]+', r.text))

Prints:

['£9,250', '£17,320']
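Scanning the whole source is short, but `findall` returns every pound amount anywhere in the HTML, so it relies on the fees being the only such amounts on the page. The sketch below uses a made-up HTML fragment (the markup and the £500 figure are invented for illustration) to show the difference between scanning everything and scanning only the text after an assumed marker.

```python
import re

# Invented HTML fragment; it is not the markup of the real page
html = '''
<div class="Scholarships"><p>Bursaries of up to £500 are available.</p></div>
<div class="Fees"><p>New UK students: £9,250*</p><p>New international students: £17,320</p></div>
'''

# Scanning the whole document picks up the unrelated amount as well
everything = re.findall(r'£[\d,]+', html)
print(everything)  # ['£500', '£9,250', '£17,320']

# Restrict the scan to the text that follows the (assumed) fees container
start = html.find('class="Fees"')
fees_only = re.findall(r'£[\d,]+', html[start:]) if start != -1 else []
print(fees_only)  # ['£9,250', '£17,320']
```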
Votes: 0

Stack Overflow user

Answered 2020-12-24 23:45:49

Try this:

import re
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
soup = BeautifulSoup(r.text,'html.parser')
item = soup.find(id='Panel5').text
fees = re.findall(r"students:[^£]+(.*?)[*\s]",item)
print(fees)

Output:

['£9,250', '£17,320']
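The pattern `students:[^£]+(.*?)[*\s]` finds the literal `students:`, skips at least one character that is not a pound sign, then lazily captures from the pound sign up to the first asterisk or whitespace character. A sketch on made-up text, where the sample string is an assumption and not the actual content of `Panel5`:

```python
import re

# Made-up text standing in for the .text of the fees panel
item = (
    'New UK students: £9,250* per year\n'
    'New international students: £17,320 per year\n'
)

# The capture group holds the amount; the trailing "*" or whitespace is left out
fees = re.findall(r"students:[^£]+(.*?)[*\s]", item)
print(fees)  # ['£9,250', '£17,320']
```

Because `[^£]+` needs at least one character, the pattern expects something (here a space) between `students:` and the pound sign.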
Votes: 0
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/65439746
