I built a web scraper with the BeautifulSoup library that pulls the stock data CSV files for given tickers from Yahoo Finance and plots the data with matplotlib. I'd like to know whether there is any way to improve the code I've written, because I think some parts of it could be better.
import urllib.request
from matplotlib import pyplot as plt
from bs4 import BeautifulSoup
import requests

def chartStocks(*tickers):
    # Run loop for each ticker passed in as an argument
    for ticker in tickers:
        # Convert URL into text for parsing
        url = "http://finance.yahoo.com/q/hp?s=" + str(ticker) + "+Historical+Prices"
        sourceCode = requests.get(url)
        plainText = sourceCode.text
        soup = BeautifulSoup(plainText, "html.parser")
        # Find all links on the page
        for link in soup.findAll('a'):
            href = link.get('href')
            link = []
            for c in href[:48]:
                link.append(c)
            link = ''.join(link)
            # Find the URL for the stock ticker CSV file and convert the data to text
            if link == "http://real-chart.finance.yahoo.com/table.csv?s=":
                csv_url = href
                res = urllib.request.urlopen(csv_url)
                csv = res.read()
                csv_str = str(csv)
                # Parse the CSV to create a list of data points
                point = []
                points = []
                curDay = 0
                day = []
                commas = 0
                lines = csv_str.split("\\n")
                lineOne = True
                for line in lines:
                    commas = 0
                    if lineOne == True:
                        lineOne = False
                    else:
                        for c in line:
                            if c == ",":
                                commas += 1
                            if commas == 4:
                                point.append(c)
                            elif commas == 5:
                                for x in point:
                                    if x == ",":
                                        point.remove(x)
                                point = ''.join(point)
                                point = float(point)
                                points.append(point)
                                day.append(curDay)
                                curDay += 1
                                point = []
                                commas = 0
                points = list(reversed(points))
                # Plot the data
                plt.plot(day, points)
                plt.ylabel(ticker)
                plt.show()

Posted on 2015-12-21 07:33:15
Compose smaller functions
If you split chartStocks into several smaller functions, it becomes much more readable, roughly like this:
def chartStocks(*tickers):
    for ticker in tickers:
        page = getTickerPage(ticker)
        csv_url = findCSVUrl(page)
        csv = getCSV(csv_url)
        day, points = parseCSV(csv)
        plot_data(ticker, day, points)

        # Or, if you're allergic to temporary variables:
        day, points = parseCSV(getCSV(findCSVUrl(getTickerPage(ticker))))

This approach lets you see clearly the "pipeline" your data passes through, and it lets you test and reuse the smaller pieces independently.
Arguably it would be even cleaner to define chartStock(ticker) to handle a single ticker, so that chartStocks is just:
def chartStocks(*tickers):
    for ticker in tickers:
        chartStock(ticker)

The only caveat here is making sure your functions handle errors gracefully: either check each return value for None before calling the next function, or let them accept None as an argument and simply return None in that case.
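The helpers named in the pipeline (getTickerPage, getCSV, plot_data) are never defined in the answer, so here is a minimal sketch of what they might look like, reusing requests, urllib, and matplotlib from the question and following the "propagate None" style of error handling described above. The bodies below are assumptions, not part of the original answer, and they lean on findCSVUrl and parseCSV as defined later.

import urllib.request

import requests
from bs4 import BeautifulSoup
from matplotlib import pyplot as plt


def getTickerPage(ticker):
    # Fetch and parse the historical-prices page for one ticker, or return None
    url = "http://finance.yahoo.com/q/hp?s=" + str(ticker) + "+Historical+Prices"
    response = requests.get(url)
    if response.status_code != 200:
        return None
    return BeautifulSoup(response.text, "html.parser")


def getCSV(csv_url):
    # Download the CSV as text; pass None through if no URL was found
    if csv_url is None:
        return None
    with urllib.request.urlopen(csv_url) as res:
        return res.read().decode("utf-8")


def plot_data(ticker, day, points):
    # Plot closing prices against day numbers (mirrors the plotting code in the question)
    if day is None or points is None:
        return
    plt.plot(day, points)
    plt.ylabel(ticker)
    plt.show()


def chartStock(ticker):
    # Chart one ticker, stopping quietly if any step in the pipeline fails
    page = getTickerPage(ticker)
    if page is None:
        return
    csv_text = getCSV(findCSVUrl(page))
    if csv_text is None:
        return
    day, points = parseCSV(csv_text)
    plot_data(ticker, day, points)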
This:
# Find all links on the page
for link in soup.findAll('a'):
    href = link.get('href')
    link = []
    for c in href[:48]:
        link.append(c)
    link = ''.join(link)
    # Find the URL for the stock ticker CSV file and convert the data to text
    if link == "http://real-chart.finance.yahoo.com/table.csv?s=":
        # ...

can be simplified with str.startswith:
def findCSVUrl(soupPage):
    CSV_URL_PREFIX = 'http://real-chart.finance.yahoo.com/table.csv?s='
    for link in soupPage.findAll('a'):
        href = link.get('href', '')
        if href.startswith(CSV_URL_PREFIX):
            return href

I've also supplied '' as the default value, so that if a link has no href, startswith isn't called on None.
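As a small usage note (my addition, assuming page is the soup object returned by a helper such as getTickerPage, and getCSV is the hypothetical downloader sketched earlier): because findCSVUrl simply falls off the end and returns None when no matching link exists, callers should check for that before fetching:

csv_url = findCSVUrl(page)
if csv_url is None:
    print("No historical-prices CSV link found for this ticker")
else:
    csv_text = getCSV(csv_url)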
Instead of using a lineOne flag while looping over lines:
lineOne = True
for line in lines:
    if lineOne == True:
        lineOne = False
    else:
        # continue parsing line...

you can use a slice that starts after the first line:
for line in lines[1:]:
    # ... continue parsing line

Python has a built-in csv parsing module that can simplify a lot of this work. It does the splitting on commas for you and, depending on what you ask for, returns each row as either a list of fields or a dict of fields. You'd end up with something like this:
import csv

def parseCSV(csv_text):
    # Skip the header row, then let csv.reader handle the field splitting
    csv_rows = csv.reader(csv_text.split('\n')[1:])
    days = []
    points = []
    for day, row in enumerate(csv_rows):
        close = float(row[4])
        days.append(day)
        points.append(close)
    return days, points

where the enumerate function gives you the same zero-based day numbers you have now.
In fact, since days is just going to be the list [0 .. len(points)], you can skip enumerate and simply define days after you've parsed all the points, throwing in a list comprehension for good measure:
def parseCSV(csv_text):
    csv_rows = csv.reader(csv_text.split('\n')[1:])
    points = [float(row[4]) for row in csv_rows]
    days = list(range(len(points)))
    return days, points

https://codereview.stackexchange.com/questions/114612
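To show the refactored parser in action, here is a short example of my own; the CSV text is made-up sample data in the Date,Open,High,Low,Close,Volume,Adj Close layout that Yahoo's old table.csv used, not real quotes:

sample = (
    "Date,Open,High,Low,Close,Volume,Adj Close\n"
    "2015-12-18,10.00,10.50,9.80,10.20,1000,10.20\n"
    "2015-12-17,9.90,10.10,9.70,10.00,900,10.00"
)

days, points = parseCSV(sample)
print(days)    # [0, 1]
print(points)  # [10.2, 10.0]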