
Python scraping of a large HTML webpage

Stack Overflow user
Asked on 2018-04-20 17:04:50
1 answer · 257 views · 0 followers · 1 vote

I am trying to get all the historical data for a particular stock from Yahoo Finance. I am new to Python and web scraping.

I want to download all the historical data into a CSV file. The problem is that the code only downloads the first 100 entries for any stock on the site. When viewing a stock in a browser, you have to scroll to the bottom of the page for more table entries to load.

I think something similar happens when I download the page with the library. Some kind of optimization seems to be preventing the page from downloading fully. Try it here: https://in.finance.yahoo.com/quote/TVSMOTOR.NS/history?period1=-19800&period2=1524236374&interval=1d&filter=history&frequency=1d . Is there any way to overcome this?
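As a side note, the period1 and period2 values in that URL are Unix epoch timestamps (seconds since 1970-01-01 UTC). A quick way to compute one for an arbitrary date (the date below is just an illustration):

```python
from datetime import datetime, timezone

# Convert a calendar date to the epoch-seconds format Yahoo uses
# in its period1/period2 query parameters.
start = int(datetime(2018, 4, 20, tzinfo=timezone.utc).timestamp())
print(start)  # 1524182400
```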

Here is my code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://in.finance.yahoo.com/quote/TVSMOTOR.NS/history?period1=-19800&period2=1524236374&interval=1d&filter=history&frequency=1d'

page = uReq(my_url)
page_html = page.read()
page_data = soup(page_html, "html.parser")

# Only the rows present in the initial HTML are available here;
# the rest are loaded later by JavaScript as you scroll.
container = page_data.findAll("table", {"data-test": "historical-prices"})
container = container[0].tbody
rows = container.findAll("tr")

filename = "tvs.csv"
f = open(filename, "w")
headers = "date,open,high,low,close,adjusted_close_price,vol\n"
f.write(headers)

for row in rows:
    # Price rows have 7 plain cells; dividend/split rows use colspan and are skipped
    if len(row.find_all("td", {"colspan": ""})) == 7:
        col = row.findAll("td")
        date = col[0].span.text.strip()
        opend = col[1].span.text.strip().replace(",", "")
        if opend != 'null':
            high = col[2].span.text.strip().replace(",", "")
            low = col[3].span.text.strip().replace(",", "")
            close = col[4].span.text.strip().replace(",", "")
            adjclose = col[5].span.text.strip().replace(",", "")
            vol = col[6].span.text.strip().replace(",", "")
            f.write(date + "," + opend + "," + high + "," + low + "," + close + "," + adjclose + "," + vol + "\n")

f.close()
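Building CSV lines by string concatenation is fragile (a stray comma or missing field silently corrupts the file); the standard library's csv module handles quoting and separators for you. A minimal sketch, with made-up row values standing in for one scraped table row:

```python
import csv

# Hypothetical parsed values for one row of the historical-prices table
row = ["Apr 20, 2018", "620.00", "629.70", "618.10", "626.75", "626.75", "1284083"]

with open("tvs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "open", "high", "low", "close", "adjusted_close", "volume"])
    writer.writerow(row)  # the comma inside the date is quoted automatically
```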

Thanks in advance!

Edit:

Okay, I found another piece of code that works well, but I don't know how it works. Any help would be appreciated.

#!/usr/bin/env python

"""
get-yahoo-quotes.py:  Script to download Yahoo historical quotes using the new cookie authenticated site.
 Usage: get-yahoo-quotes SYMBOL
 History
 06-03-2017 : Created script
"""

__author__ = "Brad Lucas"
__copyright__ = "Copyright 2017, Brad Lucas"
__license__ = "MIT"
__version__ = "1.0.0"
__maintainer__ = "Brad Lucas"
__email__ = "brad@beaconhill.com"
__status__ = "Production"


import re
import sys
import time
import datetime
import requests


def split_crumb_store(v):
    return v.split(':')[2].strip('"')


def find_crumb_store(lines):
    # Looking for
    # ,"CrumbStore":{"crumb":"9q.A4D1c.b9
    for l in lines:
        if re.findall(r'CrumbStore', l):
            return l
    print("Did not find CrumbStore")


def get_cookie_value(r):
    return {'B': r.cookies['B']}


def get_page_data(symbol):
    url = "https://finance.yahoo.com/quote/%s/?p=%s" % (symbol, symbol)
    r = requests.get(url)
    cookie = get_cookie_value(r)

    # Code to replace possible \u002F value
    # ,"CrumbStore":{"crumb":"FWP\u002F5EFll3U"
    # FWP\u002F5EFll3U
    lines = r.content.decode('unicode-escape').strip().replace('}', '\n')
    return cookie, lines.split('\n')


def get_cookie_crumb(symbol):
    cookie, lines = get_page_data(symbol)
    crumb = split_crumb_store(find_crumb_store(lines))
    return cookie, crumb


def get_data(symbol, start_date, end_date, cookie, crumb):
    filename = '%s.csv' % (symbol)
    url = "https://query1.finance.yahoo.com/v7/finance/download/%s?period1=%s&period2=%s&interval=1d&events=history&crumb=%s" % (symbol, start_date, end_date, crumb)
    response = requests.get(url, cookies=cookie)
    with open(filename, 'wb') as handle:
        for block in response.iter_content(1024):
            handle.write(block)


def get_now_epoch():
    # @see https://www.linuxquestions.org/questions/programming-9/python-datetime-to-epoch-4175520007/#post5244109
    return int(time.time())


def download_quotes(symbol):
    start_date = 0
    end_date = get_now_epoch()
    cookie, crumb = get_cookie_crumb(symbol)
    get_data(symbol, start_date, end_date, cookie, crumb)


if __name__ == '__main__':
    # If we have at least one parameter, loop over all the parameters assuming they are symbols
    if len(sys.argv) == 1:
        print("\nUsage: get-yahoo-quotes.py SYMBOL\n\n")
    else:
        for i in range(1, len(sys.argv)):
            symbol = sys.argv[i]
            print("--------------------------------------------------")
            print("Downloading %s to %s.csv" % (symbol, symbol))
            download_quotes(symbol)
            print("--------------------------------------------------")
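To demystify the crumb handling in the script above: the Yahoo page source embeds a JSON fragment of the form ,"CrumbStore":{"crumb":"..."} . After replace('}', '\n') breaks the source into lines, find_crumb_store scans for the line containing CrumbStore, and split_crumb_store slices the token out of it. A self-contained demo on a sample fragment (using the example crumb from the script's own comment):

```python
def split_crumb_store(v):
    # The line looks like: ,"CrumbStore":{"crumb":"9q.A4D1c.b9"
    # Splitting on ':' yields [',"CrumbStore"', '{"crumb"', '"9q.A4D1c.b9"'];
    # element 2 is the quoted crumb, so strip the surrounding quotes.
    return v.split(':')[2].strip('"')

# A sample line as it appears after the page source is split on '}'
sample = ',"CrumbStore":{"crumb":"9q.A4D1c.b9"'
print(split_crumb_store(sample))  # -> 9q.A4D1c.b9
```

The crumb is then appended to the query1.finance.yahoo.com download URL, and the session cookie from the same page request is sent with it, which is what authenticates the CSV download.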

1 Answer

Stack Overflow user

Answered on 2018-04-20 17:20:00

Initially, only 100 results are downloaded into the browser. When you scroll to the bottom of the page, a JS event fires that triggers an AJAX call to download the next 50/100 data entries in the background, which are then displayed in the browser. In your Python code there is no way to trigger that JS event, because Python does not execute JavaScript, so the AJAX request never happens. It is therefore better to use https://intrinio.com/ or https://www.alphavantage.co

You can also try the yahoo-finance Python package: https://pypi.org/project/yahoo-finance/

Votes: 0

The original page content is provided by Stack Overflow. Translation supported by Tencent Cloud's IT-domain engine.
Original link:

https://stackoverflow.com/questions/49946597
