下面是抓取的链接:https://stockanalysis.com/stocks/
我正在尝试获取表的所有行(6000+行),但我只得到前500个结果。我想这与要显示多少行的条件有关。
我几乎尽我所能。我也是网络抓取的初学者。
我的代码:
# Importing libraries
import numpy as np # numerical computing library
import pandas as pd # panel data library
import requests # http requests library
from bs4 import BeautifulSoup
url = 'https://stockanalysis.com/stocks/'
headers = {'User-Agent': ' user agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36'}
r = requests.get(url, headers)
soup = BeautifulSoup(r.text, 'html')
league_table = soup.find('table', class_ = 'symbol-table index')
col_df = ['Symbol', 'Company_name', 'Industry', 'Market_Cap']
for team in league_table.find_all('tbody'):
# i = 1
rows = team.find_all('tr')
df = pd.DataFrame(np.zeros([len(rows), len(col_df)]))
df.columns = col_df
for i, row in enumerate(rows):
s_symbol = row.find_all('td')[0].text
s_company_name = row.find_all('td')[1].text
s_industry = row.find_all('td')[2].text
s_market_cap = row.find_all('td')[3].text
df.iloc[i] = [s_symbol, s_company_name, s_industry, s_market_cap]
len(df) # should > 6000我该怎么办?
发布于 2021-12-07 03:54:29
向下看一下html的底部,您将看到以下内容
<script id="__NEXT_DATA__" type="application/json">
尝试使用bs4查找这个标记并从其中加载数据,我认为这是您所需要的一切。
发布于 2021-12-07 09:23:06
如前所述,它在<script>标记中。把它拉进去读一下。
import requests
from bs4 import BeautifulSoup
import json
import re
import pandas as pd
url = 'https://stockanalysis.com/stocks/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
jsonStr = str(soup.find('script', {'id':'__NEXT_DATA__'}))
jsonStr = re.search('({.*})', jsonStr).group(0)
jsonData = json.loads(jsonStr)
df = pd.DataFrame(jsonData['props']['pageProps']['stocks'])输出:
print(df)
s ... i
0 A ... Life Sciences Tools & Services
1 AA ... Metals & Mining
2 AAC ... Blank Check / SPAC
3 AACG ... Diversified Consumer Services
4 AACI ... Blank Check / SPAC
... ... ...
6033 ZWS ... Utilities-Regulated Water
6034 ZY ... Chemicals
6035 ZYME ... Biotechnology
6036 ZYNE ... Pharmaceuticals
6037 ZYXI ... Health Care Equipment & Supplies
[6038 rows x 4 columns]https://stackoverflow.com/questions/70254481
复制相似问题