I built a web scraper with the BeautifulSoup library that pulls the stock data CSV files for given tickers from Yahoo Finance and plots the data with matplotlib. I'd like to know whether there is any way to improve the code I've written, because I think some parts of it could be better.
import urllib.request
from matplotlib import pyplot as plt
from bs4 import BeautifulSoup
import requests

def chartStocks(*tickers):
    # Run loop for each ticker passed in as an argument
    for ticker in tickers:
        # Convert URL into text for parsing
        url = "http://finance.yahoo.com/q/hp?s=" + str(ticker) + "+Historical+Prices"
        sourceCode = requests.get(url)
        plainText = sourceCode.text
        soup = BeautifulSoup(plainText, "html.parser")
        # Find all links on the page
        for link in soup.findAll('a'):
            href = link.get('href')
            link = []
            for c in href[:48]:
                link.append(c)
            link = ''.join(link)
            # Find the URL for the stock ticker CSV file and convert the data to text
            if link == "http://real-chart.finance.yahoo.com/table.csv?s=":
                csv_url = href
                res = urllib.request.urlopen(csv_url)
                csv = res.read()
                csv_str = str(csv)
                # Parse the CSV to create a list of data points
                point = []
                points = []
                curDay = 0
                day = []
                commas = 0
                lines = csv_str.split("\\n")
                lineOne = True
                for line in lines:
                    commas = 0
                    if lineOne == True:
                        lineOne = False
                    else:
                        for c in line:
                            if c == ",":
                                commas += 1
                            if commas == 4:
                                point.append(c)
                            elif commas == 5:
                                for x in point:
                                    if x == ",":
                                        point.remove(x)
                                point = ''.join(point)
                                point = float(point)
                                points.append(point)
                                day.append(curDay)
                                curDay += 1
                                point = []
                                commas = 0
                points = list(reversed(points))
                # Plot the data
                plt.plot(day, points)
                plt.ylabel(ticker)
                plt.show()

Posted on 2015-12-21 07:33:15
Compose smaller functions
If you split chartStocks into several smaller functions, it becomes much more readable, roughly like this:
def chartStocks(*tickers):
    for ticker in tickers:
        page = getTickerPage(ticker)
        csv_url = findCSVUrl(page)
        csv = getCSV(csv_url)
        day, points = parseCSV(csv)
        plot_data(ticker, day, points)

        # Or, if you're allergic to temporary variables:
        day, points = parseCSV(getCSV(findCSVUrl(getTickerPage(ticker))))

This approach lets you see clearly the "pipeline" your data passes through, and it lets you test and reuse the smaller pieces independently.
Arguably it would be even cleaner to define chartStock(ticker) to handle a single ticker, so that chartStocks is just:
def chartStocks(*tickers):
    for ticker in tickers:
        chartStock(ticker)

The only caveat here is making sure your functions handle errors gracefully: either check each return value for None before calling the next function, or let them accept None as an argument and simply return None in that case.
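The helpers named in the pipeline (getTickerPage, getCSV, plot_data) are never defined in the answer, so here is a minimal sketch of what they might look like, reusing requests, urllib, and matplotlib from the question and following the "propagate None" style of error handling described above. The bodies below are assumptions, not part of the original answer, and they lean on findCSVUrl and parseCSV as defined later.

import urllib.request

import requests
from bs4 import BeautifulSoup
from matplotlib import pyplot as plt


def getTickerPage(ticker):
    # Fetch and parse the historical-prices page for one ticker, or return None
    url = "http://finance.yahoo.com/q/hp?s=" + str(ticker) + "+Historical+Prices"
    response = requests.get(url)
    if response.status_code != 200:
        return None
    return BeautifulSoup(response.text, "html.parser")


def getCSV(csv_url):
    # Download the CSV as text; pass None through if no URL was found
    if csv_url is None:
        return None
    with urllib.request.urlopen(csv_url) as res:
        return res.read().decode("utf-8")


def plot_data(ticker, day, points):
    # Plot closing prices against day numbers (mirrors the plotting code in the question)
    if day is None or points is None:
        return
    plt.plot(day, points)
    plt.ylabel(ticker)
    plt.show()


def chartStock(ticker):
    # Chart one ticker, stopping quietly if any step in the pipeline fails
    page = getTickerPage(ticker)
    if page is None:
        return
    csv_text = getCSV(findCSVUrl(page))
    if csv_text is None:
        return
    day, points = parseCSV(csv_text)
    plot_data(ticker, day, points)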
This:
# Find all links on the page
for link in soup.findAll('a'):
    href = link.get('href')
    link = []
    for c in href[:48]:
        link.append(c)
    link = ''.join(link)
    # Find the URL for the stock ticker CSV file and convert the data to text
    if link == "http://real-chart.finance.yahoo.com/table.csv?s=":
        # ...

can be simplified with str.startswith:
def findCSVUrl(soupPage):
    CSV_URL_PREFIX = 'http://real-chart.finance.yahoo.com/table.csv?s='
    for link in soupPage.findAll('a'):
        href = link.get('href', '')
        if href.startswith(CSV_URL_PREFIX):
            return href

I've also supplied '' as the default value, so that if a link has no href, startswith isn't called on None.
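As a small usage note (my addition, assuming page is the soup object returned by a helper such as getTickerPage, and getCSV is the hypothetical downloader sketched earlier): because findCSVUrl simply falls off the end and returns None when no matching link exists, callers should check for that before fetching:

csv_url = findCSVUrl(page)
if csv_url is None:
    print("No historical-prices CSV link found for this ticker")
else:
    csv_text = getCSV(csv_url)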
Instead of using a lineOne flag while looping over lines:
lineOne = True
for line in lines:
    if lineOne == True:
        lineOne = False
    else:
        # continue parsing line...

you can use a slice that starts after the first line:
for line in lines[1:]:
    # ... continue parsing line

Python has a built-in csv parsing module that can simplify a lot of this work. It does the splitting on commas for you and, depending on what you ask for, returns each row as either a list of fields or a dict of fields. You'd end up with something like this:
import csv

def parseCSV(csv_text):
    # Skip the header row, then let csv.reader handle the field splitting
    csv_rows = csv.reader(csv_text.split('\n')[1:])
    days = []
    points = []
    for day, row in enumerate(csv_rows):
        close = float(row[4])
        days.append(day)
        points.append(close)
    return days, points

where the enumerate function gives you the same zero-based day numbers you have now.
In fact, since days is just going to be the list [0 .. len(points)], you can skip enumerate and simply define days after you've parsed all the points, throwing in a list comprehension for good measure:
def parseCSV(csv_text):
    csv_rows = csv.reader(csv_text.split('\n')[1:])
    points = [float(row[4]) for row in csv_rows]
    days = list(range(len(points)))
    return days, points

https://codereview.stackexchange.com/questions/114612
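To show the refactored parser in action, here is a short example of my own; the CSV text is made-up sample data in the Date,Open,High,Low,Close,Volume,Adj Close layout that Yahoo's old table.csv used, not real quotes:

sample = (
    "Date,Open,High,Low,Close,Volume,Adj Close\n"
    "2015-12-18,10.00,10.50,9.80,10.20,1000,10.20\n"
    "2015-12-17,9.90,10.10,9.70,10.00,900,10.00"
)

days, points = parseCSV(sample)
print(days)    # [0, 1]
print(points)  # [10.2, 10.0]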