I'm trying to make an app that scrapes my top ten favorite space related stock prices. but列表项我的代码有一些问题,我对抓取很陌生。一旦我得到这个工作,我想把它放到一个csv文件,并用它做一个条形图,我希望得到一些帮助和建议。我也是在Anaconda中这样做的:
#import libraries
import bs4
from bs4 import BeautifulSoup
#grequests is a unique library that allows you to use many urls with ease
#must install qrequest in annacode use : conda install -c conda-forge grequests
#if you know a better way to do this, please let me know
import grequests
#scraping my top ten favorite space companies, attempted to pick compaines with pure play interest in space
urls = ['https://finance.yahoo.com/quote/GILT/', 'https://finance.yahoo.com/quote/LORL?p=LORL&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/I?p=I&.tsrc=fin-srch' , 'https://finance.yahoo.com/quote/VSAT?p=VSAT&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/RTN?p=RTN&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/UTX?ltr=1', 'https://finance.yahoo.com/quote/TDY?ltr=1', 'https://finance.yahoo.com/quote/ORBC?ltr=1', 'https://finance.yahoo.com/quote/SPCE?p=SPCE&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/BA?p=BA&.tsrc=fin-srch',]
unsent_request = (grequests.get(url) for url in urls)
results = grequests.map(unsent_request)
def parsePrice(r):
soup = bs4.BeautifulSoup(r.text,"html")
price=soup.find_all('div',{'class':'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="52">4.1500'})[0].find('span').text
return price
for r in results:
parsePrice(r)那么是什么代码带来了这个错误:
IndexError Traceback (most recent call last)
<ipython-input-6-9ac8cb94b6fb> in <module>
5
6 for r in results:
----> 7 parsePrice(r)
<ipython-input-6-9ac8cb94b6fb> in parsePrice(r)
1 def parsePrice(r):
2 soup = bs4.BeautifulSoup(r.text,"html")
----> 3 price=soup.find_all('div',{'class':'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="52">4.1500'})[0].find('span').text
4 return price
5
IndexError: list index out of range怎么了?
发布于 2019-12-06 00:14:31
页面上的数据位于<table>标记中。使用pandas的.read_html(),因为它在内部使用了BeautifulSoup。这样你就能抓到更多。
这些数据也可以通过API/XHR获得,但不会深入讨论,因为这会稍微复杂一些。
import pandas as pd
#scraping my top ten favorite space companies, attempted to pick compaines with pure play interest in space
urls = ['https://finance.yahoo.com/quote/GILT/', 'https://finance.yahoo.com/quote/LORL?p=LORL&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/I?p=I&.tsrc=fin-srch' , 'https://finance.yahoo.com/quote/VSAT?p=VSAT&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/RTN?p=RTN&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/UTX?ltr=1', 'https://finance.yahoo.com/quote/TDY?ltr=1', 'https://finance.yahoo.com/quote/ORBC?ltr=1', 'https://finance.yahoo.com/quote/SPCE?p=SPCE&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/BA?p=BA&.tsrc=fin-srch',]
def parsePrice(r):
df = pd.read_html(r)[0].T
cols = list(df.iloc[0,:])
temp_df = pd.DataFrame([list(df.iloc[1,:])], columns=cols)
temp_df['url'] = r
return temp_df
df = pd.DataFrame()
for r in urls:
df = df.append(parsePrice(r), sort=True).reset_index(drop=True)
df.to_csv('path/filename.csv', index=False)输出:
print (df.to_string())
52 Week Range Ask Avg. Volume Bid Day's Range Open Previous Close Volume url
0 7.32 - 9.87 8.09 x 800 23415 8.06 x 800 8.01 - 8.11 8.10 8.01 6337 https://finance.yahoo.com/quote/GILT/
1 32.14 - 42.77 32.74 x 1100 41759 32.59 x 1000 32.28 - 32.75 32.32 32.28 14685 https://finance.yahoo.com/quote/LORL?p=LORL&.t...
2 5.55 - 27.29 6.64 x 800 5746553 6.63 x 2900 6.51 - 6.68 6.64 6.65 995245 https://finance.yahoo.com/quote/I?p=I&.tsrc=fi...
3 55.93 - 97.31 72.21 x 800 281600 72.16 x 1000 71.51 - 72.80 72.26 72.32 74758 https://finance.yahoo.com/quote/VSAT?p=VSAT&.t...
4 144.27 - 220.03 215.54 x 1000 1560562 215.37 x 800 214.87 - 217.45 215.85 214.86 203957 https://finance.yahoo.com/quote/RTN?p=RTN&.tsr...
5 100.48 - 149.81 145.03 x 800 2749725 144.96 x 800 144.41 - 145.56 145.49 144.52 489169 https://finance.yahoo.com/quote/UTX?ltr=1
6 189.35 - 351.53 343.34 x 800 280325 342.80 x 800 342.84 - 346.29 344.16 343.58 42326 https://finance.yahoo.com/quote/TDY?ltr=1
7 3.5800 - 9.7900 4.1400 x 1300 778343 4.1300 x 800 4.1200 - 4.2000 4.1700 4.1500 62335 https://finance.yahoo.com/quote/ORBC?ltr=1
8 6.90 - 12.09 7.37 x 900 2280333 7.38 x 800 7.24 - 7.48 7.30 7.22 539082 https://finance.yahoo.com/quote/SPCE?p=SPCE&.t...
9 292.47 - 446.01 348.73 x 800 4420225 348.79 x 800 345.70 - 350.42 350.22 348.84 1258813 https://finance.yahoo.com/quote/BA?p=BA&.tsrc=...但是如果你必须走BeautifulSoup的路线,你的find_all()是不正确的。首先,该类严格是class=之后双引号之间的文本。您已经包含了元素的其他属性,比如datareact-id,以及您想要拉取的实际内容/文本。其次,该类是<span>标记的子类,而不是div标记的子类。如果您拉出div标记,这很好,但是仍然需要进入该元素以获取文本。
试一试:
import bs4
import requests
#scraping my top ten favorite space companies, attempted to pick compaines with pure play interest in space
urls = ['https://finance.yahoo.com/quote/GILT/', 'https://finance.yahoo.com/quote/LORL?p=LORL&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/I?p=I&.tsrc=fin-srch' , 'https://finance.yahoo.com/quote/VSAT?p=VSAT&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/RTN?p=RTN&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/UTX?ltr=1', 'https://finance.yahoo.com/quote/TDY?ltr=1', 'https://finance.yahoo.com/quote/ORBC?ltr=1', 'https://finance.yahoo.com/quote/SPCE?p=SPCE&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/BA?p=BA&.tsrc=fin-srch',]
def parsePrice(r):
resp = requests.get(r)
soup = bs4.BeautifulSoup(resp.text,"html")
price=soup.find_all('span',{'class':'Trsdu(0.3s) Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(b)'})[0].text
return price
for r in urls:
print (parsePrice(r))输出:
8.06
32.76
6.60
72.22
215.54
145.14
343.28
4.1550
7.43
348.32https://stackoverflow.com/questions/59198682
复制相似问题