
Python scraping of a large HTML webpage

Stack Overflow user
Asked on 2018-04-20 17:04:50
1 answer · 257 views · 0 followers · 1 vote

I am trying to get all the historical data for a particular stock from Yahoo Finance. I am new to Python and web scraping.

I want to download all the historical data into a CSV file. The problem is that the code only downloads the first 100 entries for any stock on the site. When viewing a stock in a browser, you have to scroll to the bottom of the page for more table entries to load.

I think something similar happens when I download the page with the library. Some kind of optimization seems to be preventing the page from downloading fully. Try it here: https://in.finance.yahoo.com/quote/TVSMOTOR.NS/history?period1=-19800&period2=1524236374&interval=1d&filter=history&frequency=1d . Is there any way to overcome this?
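As a side note, the period1 and period2 values in that URL are Unix epoch timestamps (seconds since 1970-01-01 UTC). A quick way to compute one for an arbitrary date (the date below is just an illustration):

```python
from datetime import datetime, timezone

# Convert a calendar date to the epoch-seconds format Yahoo uses
# in its period1/period2 query parameters.
start = int(datetime(2018, 4, 20, tzinfo=timezone.utc).timestamp())
print(start)  # 1524182400
```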

Here is my code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://in.finance.yahoo.com/quote/TVSMOTOR.NS/history?period1=-19800&period2=1524236374&interval=1d&filter=history&frequency=1d'

page = uReq(my_url)
page_html = page.read()
page_data = soup(page_html, "html.parser")

# Only the rows present in the initial HTML are available here;
# the rest are loaded later by JavaScript as you scroll.
container = page_data.findAll("table", {"data-test": "historical-prices"})
container = container[0].tbody
rows = container.findAll("tr")

filename = "tvs.csv"
f = open(filename, "w")
headers = "date,open,high,low,close,adjusted_close_price,vol\n"
f.write(headers)

for row in rows:
    # Price rows have 7 plain cells; dividend/split rows use colspan and are skipped
    if len(row.find_all("td", {"colspan": ""})) == 7:
        col = row.findAll("td")
        date = col[0].span.text.strip()
        opend = col[1].span.text.strip().replace(",", "")
        if opend != 'null':
            high = col[2].span.text.strip().replace(",", "")
            low = col[3].span.text.strip().replace(",", "")
            close = col[4].span.text.strip().replace(",", "")
            adjclose = col[5].span.text.strip().replace(",", "")
            vol = col[6].span.text.strip().replace(",", "")
            f.write(date + "," + opend + "," + high + "," + low + "," + close + "," + adjclose + "," + vol + "\n")

f.close()
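Building CSV lines by string concatenation is fragile (a stray comma or missing field silently corrupts the file); the standard library's csv module handles quoting and separators for you. A minimal sketch, with made-up row values standing in for one scraped table row:

```python
import csv

# Hypothetical parsed values for one row of the historical-prices table
row = ["Apr 20, 2018", "620.00", "629.70", "618.10", "626.75", "626.75", "1284083"]

with open("tvs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "open", "high", "low", "close", "adjusted_close", "volume"])
    writer.writerow(row)  # the comma inside the date is quoted automatically
```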

Thanks in advance!

Edit:

Okay, I found another piece of code that works well, but I don't know how it works. Any help would be appreciated.

#!/usr/bin/env python

"""
get-yahoo-quotes.py:  Script to download Yahoo historical quotes using the new cookie authenticated site.
 Usage: get-yahoo-quotes SYMBOL
 History
 06-03-2017 : Created script
"""

__author__ = "Brad Lucas"
__copyright__ = "Copyright 2017, Brad Lucas"
__license__ = "MIT"
__version__ = "1.0.0"
__maintainer__ = "Brad Lucas"
__email__ = "brad@beaconhill.com"
__status__ = "Production"


import re
import sys
import time
import datetime
import requests


def split_crumb_store(v):
    return v.split(':')[2].strip('"')


def find_crumb_store(lines):
    # Looking for
    # ,"CrumbStore":{"crumb":"9q.A4D1c.b9
    for l in lines:
        if re.findall(r'CrumbStore', l):
            return l
    print("Did not find CrumbStore")


def get_cookie_value(r):
    return {'B': r.cookies['B']}


def get_page_data(symbol):
    url = "https://finance.yahoo.com/quote/%s/?p=%s" % (symbol, symbol)
    r = requests.get(url)
    cookie = get_cookie_value(r)

    # Code to replace possible \u002F value
    # ,"CrumbStore":{"crumb":"FWP\u002F5EFll3U"
    # FWP\u002F5EFll3U
    lines = r.content.decode('unicode-escape').strip().replace('}', '\n')
    return cookie, lines.split('\n')


def get_cookie_crumb(symbol):
    cookie, lines = get_page_data(symbol)
    crumb = split_crumb_store(find_crumb_store(lines))
    return cookie, crumb


def get_data(symbol, start_date, end_date, cookie, crumb):
    filename = '%s.csv' % (symbol)
    url = "https://query1.finance.yahoo.com/v7/finance/download/%s?period1=%s&period2=%s&interval=1d&events=history&crumb=%s" % (symbol, start_date, end_date, crumb)
    response = requests.get(url, cookies=cookie)
    with open(filename, 'wb') as handle:
        for block in response.iter_content(1024):
            handle.write(block)


def get_now_epoch():
    # @see https://www.linuxquestions.org/questions/programming-9/python-datetime-to-epoch-4175520007/#post5244109
    return int(time.time())


def download_quotes(symbol):
    start_date = 0
    end_date = get_now_epoch()
    cookie, crumb = get_cookie_crumb(symbol)
    get_data(symbol, start_date, end_date, cookie, crumb)


if __name__ == '__main__':
    # If we have at least one parameter, loop over all the parameters assuming they are symbols
    if len(sys.argv) == 1:
        print("\nUsage: get-yahoo-quotes.py SYMBOL\n\n")
    else:
        for i in range(1, len(sys.argv)):
            symbol = sys.argv[i]
            print("--------------------------------------------------")
            print("Downloading %s to %s.csv" % (symbol, symbol))
            download_quotes(symbol)
            print("--------------------------------------------------")
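To demystify the crumb handling in the script above: the Yahoo page source embeds a JSON fragment of the form ,"CrumbStore":{"crumb":"..."} . After replace('}', '\n') breaks the source into lines, find_crumb_store scans for the line containing CrumbStore, and split_crumb_store slices the token out of it. A self-contained demo on a sample fragment (using the example crumb from the script's own comment):

```python
def split_crumb_store(v):
    # The line looks like: ,"CrumbStore":{"crumb":"9q.A4D1c.b9"
    # Splitting on ':' yields [',"CrumbStore"', '{"crumb"', '"9q.A4D1c.b9"'];
    # element 2 is the quoted crumb, so strip the surrounding quotes.
    return v.split(':')[2].strip('"')

# A sample line as it appears after the page source is split on '}'
sample = ',"CrumbStore":{"crumb":"9q.A4D1c.b9"'
print(split_crumb_store(sample))  # -> 9q.A4D1c.b9
```

The crumb is then appended to the query1.finance.yahoo.com download URL, and the session cookie from the same page request is sent with it, which is what authenticates the CSV download.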

1 Answer

Stack Overflow user

Answered on 2018-04-20 17:20:00

Initially, only 100 results are downloaded into the browser. When you scroll to the bottom of the page, a JS event fires that triggers an AJAX call to download the next 50/100 data entries in the background, which are then displayed in the browser. In your Python code there is no way to trigger that JS event, because Python does not execute JavaScript, so the AJAX request never happens. It is therefore better to use https://intrinio.com/ or https://www.alphavantage.co

You can also try the yahoo-finance Python package: https://pypi.org/project/yahoo-finance/

Votes: 0

The original page content is provided by Stack Overflow. Translation supported by Tencent Cloud's IT-domain engine.
Original link:

https://stackoverflow.com/questions/49946597
