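I'm trying to scrape data from the Yahoo! Finance website with BeautifulSoup and store all of it in a list. In the list, the headers of the expandable rows (Total Revenue, Operating Expense) are missing, although the numbers are still there. Is there a way to include the headers in the output?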

import pandas as pd
from bs4 import BeautifulSoup
import urllib.request as ur

url = 'https://finance.yahoo.com/quote/AAPL/financials?p=AAPL'
read_data = ur.urlopen(url).read()
soup = BeautifulSoup(read_data, 'lxml')

ls = []  # Create empty list
for l in soup.find_all('div'):
    ls.append(l.string)  # .string is None when a div has more than one child
new_ls = list(filter(None, ls))  # drop None and empty entries

Current output:
'Expand All',
'ttm',
'9/30/2019',
'9/30/2018',
'9/30/2017',
'9/30/2016',
'273,857,000',
'260,174,000',
'265,595,000',
'229,234,000',
'215,639,000',

Expected output:
'Expand All',
'ttm',
'9/30/2019',
'9/30/2018',
'9/30/2017',
'9/30/2016',
'Total Revenue',
'273,857,000',
'260,174,000',
'265,595,000',
'229,234,000',
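'215,639,000',

Update: if I extract from the 'span' elements instead, the cells whose value is 0 are missing from the output, which creates another problem later when I build the DataFrame.

for l in soup.select(r'div.D\(tbr\)'):  # raw string keeps the CSS escapes intact
    for n in l.select('span'):
        print(n.text)

Posted 2020-08-11 02:50:57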
The following gets all of the data, after which you can filter out whatever you don't need:

for row in soup.select('div[data-test="fin-row"]'):
    for r in row:
        for l in r:
            print(l.text)  # first the row header, then one value per column
    print('-------\n')

Output:
Total Revenue
273,857,000
260,174,000
265,595,000
-
215,639,000
-------
Cost of Revenue
169,277,000
161,782,000
163,756,000
-
131,376,000
-------
Gross Profit

etc.
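
To go from rows like these to a DataFrame that keeps the row headers, one possible approach is sketched below. It is only a sketch under stated assumptions: SAMPLE_HTML is a simplified stand-in for the page markup rather than the real page, COLUMNS is typed in by hand from the column labels shown in the question, and rows_to_frame and to_number are hypothetical helper names. It reads each row with stripped_strings, which returns every text fragment in the row whether or not it sits inside a span, so the "-" placeholders stay in position and the columns remain aligned.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Assumption: a simplified stand-in for the financials table markup.
SAMPLE_HTML = """
<div data-test="fin-row"><div>
  <div><span>Total Revenue</span></div>
  <div><span>273,857,000</span></div>
  <div><span>260,174,000</span></div>
  <div><span>265,595,000</span></div>
  <div>-</div>
  <div><span>215,639,000</span></div>
</div></div>
<div data-test="fin-row"><div>
  <div><span>Cost of Revenue</span></div>
  <div><span>169,277,000</span></div>
  <div><span>161,782,000</span></div>
  <div><span>163,756,000</span></div>
  <div>-</div>
  <div><span>131,376,000</span></div>
</div></div>
"""

# Assumption: column labels copied by hand from the question.
COLUMNS = ["ttm", "9/30/2019", "9/30/2018", "9/30/2017", "9/30/2016"]


def to_number(text):
    """Turn '273,857,000' into an int; a '-' placeholder becomes NaN."""
    if text == "-":
        return float("nan")
    return int(text.replace(",", ""))


def rows_to_frame(html, columns):
    """Build a DataFrame with one line per fin-row, indexed by row header."""
    soup = BeautifulSoup(html, "html.parser")
    records = {}
    for row in soup.select('div[data-test="fin-row"]'):
        # All non-empty text fragments in document order: header first.
        cells = list(row.stripped_strings)
        label, values = cells[0], cells[1:]
        records[label] = [to_number(v) for v in values]
    return pd.DataFrame.from_dict(records, orient="index", columns=columns)


df = rows_to_frame(SAMPLE_HTML, COLUMNS)
print(df)
```

On the live page you would pass the downloaded HTML instead of SAMPLE_HTML. The row and cell structure there may differ from this stand-in, so check that every row yields one header plus one value per column before building the frame.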
If you also want to get the headers programmatically, try:

# data-reactid values of the header cells; these depend on the page markup
head_ind = [55, 58, 60, 62, 64, 66]
for i in head_ind:
    heads = f'span[data-reactid="{i}"]:not([class])'
    for head in soup.select(heads):
        print(head.text)

Output:
Breakdown
ttm
9/30/2019
9/30/2018
9/30/2017
9/30/2016

Posted 2020-08-11 02:27:58
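I know this is a bit off topic, but it looks like you just want the data from Yahoo Finance, right? If so, there is already a Python package that might be easier to use than web scraping.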
https://pypi.org/project/yahoo-finance/
You can import Share and create a share object:

from yahoo_finance import Share
apple = Share('AAPL')

and then get a lot of data with:

from pprint import pprint
pprint(apple.get_historical('2019-08-10', '2020-01-10'))

https://stackoverflow.com/questions/63350605