Q: Scraping a Yahoo Finance balance sheet with Python

Stack Overflow user

Asked 2016-07-27 05:36:11

2 answers · 1.3K views · 0 followers · 3 votes

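My question is a follow-up to a question asked here.

The function:

periodic_figure_values()

works well, except when the name of the line item being searched for appears twice. The specific case I mean is trying to get the data for "Long Term Debt". The function from the link above fails with the following error: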

Traceback (most recent call last):
  File "test.py", line 31, in <module>
    LongTermDebt=(periodic_figure_values(soup, "Long Term Debt"))
  File "test.py", line 21, in periodic_figure_values
    value = int(str_value)
ValueError: invalid literal for int() with base 10: 'Short/Current Long Term Debt'

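It seems to be tripped up by "Short/Current Long Term Debt": the page has both "Short/Current Long Term Debt" and "Long Term Debt". You can see an example of the source page, using Apple's balance sheet, here.

I am trying to find a way for the function to return the data for "Long Term Debt" without being tripped up by "Short/Current Long Term Debt".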
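To illustrate what seems to be going on (a minimal sketch, not the original scraper): as far as I can tell, BeautifulSoup applies a compiled pattern with a search rather than a full match, so a pattern built from "Long Term Debt" also hits the longer label, and the first hit in page order wins. The `labels` list below is a stand-in for the page's row labels, in the order they are assumed to appear.

```python
import re

# Stand-in for the row labels, in the order assumed from the page.
labels = ["Short/Current Long Term Debt", "Long Term Debt"]

pattern = re.compile("Long Term Debt")

# search() looks anywhere in the text, so both labels are hits.
hits = [label for label in labels if pattern.search(label)]
first_hit = hits[0]
print(first_hit)

# The label differs from the requested name, so it is not skipped
# and ends up in int(), which raises the ValueError from the traceback.
try:
    int(first_hit)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)
```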

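Here is the function, with an example that gets "Cash And Cash Equivalents", which works fine, and one that gets "Long Term Debt", which does not: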

import requests, bs4, re, sys    # sys is needed for sys.exit() below

def periodic_figure_values(soup, yahoo_figure):
    values = []
    pattern = re.compile(yahoo_figure)
    title = soup.find("strong", text=pattern)    # works for the figures printed in bold
    if title:
        row = title.parent.parent
    else:
        title = soup.find("td", text=pattern)    # works for any other available figure
        if title:
            row = title.parent
        else:
            sys.exit("Invalid figure '" + yahoo_figure + "' passed.")
    cells = row.find_all("td")[1:]    # exclude the <td> with figure name
    for cell in cells:
        if cell.text.strip() != yahoo_figure:    # needed because some figures are indented
            str_value = cell.text.strip().replace(",", "").replace("(", "-").replace(")", "")
            if str_value == "-":
                str_value = 0
            value = int(str_value)
            values.append(value)
    return values

res = requests.get('https://ca.finance.yahoo.com/q/bs?s=AAPL')
res.raise_for_status()    # must be called; the bare attribute does nothing
soup = bs4.BeautifulSoup(res.text, 'html.parser')
Cash=(periodic_figure_values(soup, "Cash And Cash Equivalents"))
print(Cash)
LongTermDebt=(periodic_figure_values(soup, "Long Term Debt"))
print(LongTermDebt)

Tags: python, regex, web-scraping, beautifulsoup, python-requests

2 Answers

Stack Overflow user

Answered 2016-07-27 05:52:53

The simplest way is to wrap the conversion in try/except and catch the ValueError that is raised:

import requests, bs4, re, sys    # sys is needed for sys.exit() below

def periodic_figure_values(soup, yahoo_figure):
    values = []
    pattern = re.compile(yahoo_figure)
    title = soup.find("strong", text=pattern)    # works for the figures printed in bold
    if title:
        row = title.parent.parent
    else:
        title = soup.find("td", text=pattern)    # works for any other available figure
        if title:
            row = title.parent
        else:
            sys.exit("Invalid figure '" + yahoo_figure + "' passed.")
    cells = row.find_all("td")[1:]    # exclude the <td> with figure name
    for cell in cells:
        if cell.text.strip() != yahoo_figure:    # needed because some figures are indented
            str_value = cell.text.strip().replace(",", "").replace("(", "-").replace(")", "")
            if str_value == "-":
                str_value = 0
### from here
            try:
                value = int(str_value)
                values.append(value)
            except ValueError:
                continue
### to here
    return values

res = requests.get('https://ca.finance.yahoo.com/q/bs?s=AAPL')
res.raise_for_status()    # must be called; the bare attribute does nothing
soup = bs4.BeautifulSoup(res.text, 'html.parser')
Cash=(periodic_figure_values(soup, "Cash And Cash Equivalents"))
print(Cash)
LongTermDebt=(periodic_figure_values(soup, "Long Term Debt"))
print(LongTermDebt)

This prints your numbers just fine.
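The same idea can be pulled out into a small helper, shown here as a sketch. `parse_cell` is a hypothetical name, and the sample cells are made up rather than taken from the page.

```python
def parse_cell(text):
    """Convert a balance-sheet cell such as '1,234' or '(567)' to an int.

    Returns None when the cell is not numeric (for example a row label).
    """
    cleaned = text.strip().replace(",", "").replace("(", "-").replace(")", "")
    if cleaned == "-":  # a lone dash marks an empty figure
        return 0
    try:
        return int(cleaned)
    except ValueError:
        return None

# Made-up sample row: a label followed by three figures.
cells = ["Short/Current Long Term Debt", "8,500", "(6,308)", "-"]
values = [v for v in (parse_cell(c) for c in cells) if v is not None]
print(values)
```

Keep in mind that this only skips cells that are not numbers. Which row gets picked is still decided by the `soup.find` call, so it is worth checking that the values come from the row you expect.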

Note that you do not really need the re module in this case, since you are only checking for literals (no wildcards, no boundaries, and so on).
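A sketch of what literal matching could look like, with the table reduced to plain lists of cell strings. `find_row` and the sample rows are hypothetical, and the numbers are made up.

```python
def find_row(rows, figure):
    """Return the first row containing a cell equal to `figure`, else None."""
    for row in rows:
        if any(cell.strip() == figure for cell in row):
            return row
    return None

# Made-up rows; the first one mimics an indented figure with a spacer cell.
rows = [
    ["", "Short/Current Long Term Debt", "100", "200"],
    ["Long Term Debt", "300", "400"],
]
row = find_row(rows, "Long Term Debt")
print(row)
```

Because the comparison is for equality rather than a substring search, the longer "Short/Current Long Term Debt" label is never selected.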

Votes: 1

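Stack Overflow user

Answered 2016-07-27 05:52:24

You can change the function so that it accepts a regular expression instead of a plain string. Then you can search for ^Long Term Debt to make sure no text precedes it. All you have to do is change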

if cell.text.strip() != yahoo_figure:

to

if not re.match(yahoo_figure, cell.text.strip()):
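A quick check of how that condition behaves, as a sketch. `is_label` is a hypothetical wrapper around the same `re.match` call. Since `re.match` already anchors at the start of the string, the `^` mainly matters for the lookup done through `soup.find`, which, as far as I know, uses a search.

```python
import re

def is_label(cell_text, figure_pattern):
    """True when the stripped cell text starts with the pattern."""
    return re.match(figure_pattern, cell_text.strip()) is not None

exact = is_label("  Long Term Debt ", "^Long Term Debt")
longer = is_label("Short/Current Long Term Debt", "^Long Term Debt")
print(exact, longer)

# match() only anchors the start; fullmatch() also rejects trailing text.
trailing = re.fullmatch("Long Term Debt", "Long Term Debt Total")
print(trailing is None)
```

If a label could also be a prefix of another label, `re.fullmatch` (or a trailing `$`) would be the stricter choice.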

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/38604506
