I am trying to scrape the contents of the table attached to the different participants on a web page. To aid understanding, the information I want is marked in the image. At the moment my script only gives me the names of the participants; I also want the information attached to each participant.
https://www.bet365.com.au/#/AC/B151/C1/D50/E2/F163/
Since the content is dynamic, I have to use one of the public API endpoints that can be found with the browser dev tools.
https://filebin.net/ybwver7vt5mp1dju shows how the information is displayed on that page. The struck-through entries are what I want to grab.
https://pastebin.com/QsA3Pprr shows what the API response looks like.
What I have tried:
import re
import requests

url = 'https://www.bet365.com.au/SportsBook.API/web?'
params = {
    'lid': '30',
    'zid': '0',
    'pd': '#AC#B151#C1#D50#E2#F163#',
    'cid': '13',
    'ctid': '13'
}
r = requests.get(url, params=params, headers={'User-Agent': 'Mozilla/5.0'})
games = re.finditer(r'NA=(.*?);', r.text)
for game in games:
    if 'v' not in game.group():
        continue
    print(game.group(1))

The output I get looks like this (partial):
FunPlus Phoenix v Bilibili Gaming
Top Esports v Royal Never Give Up
Moops v Brute
eSuba v eXtatus
CS:GO - V4 Future Sports Festival
PACT v Capri Sun

The output I would like is something like this (partial):
26:42 FunPlus Phoenix v Bilibili Gaming 1-1 - - 21
09:00 Top Esports v Royal Never Give Up - 2.00 1.72 49
12:00 Moops v Brute - 2.10 1.66 17

How can I scrape the contents of the table attached to the different participants?
PS: the information visible here may not be identical, since it is refreshed every few minutes. I would like to do this with requests, along the lines of what I have already tried.
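For reference, the raw response is a flat string of `key=value;` fields. This sketch shows the kind of pairing I am after; the sample string is made up, shaped after the `NA=` and `OD=` fields visible in the pastebin excerpt:

```python
import re

# Hypothetical sample: fields end in ';', records are separated by '|'
sample = ('NA=FunPlus Phoenix v Bilibili Gaming;OD=8/11;|'
          'NA=Top Esports v Royal Never Give Up;OD=1/1;|')

# Pair each game name with the fractional odds that follow it
pairs = re.findall(r'NA=([^;]+);OD=(\d+/\d+);', sample)
print(pairs)
# [('FunPlus Phoenix v Bilibili Gaming', '8/11'),
#  ('Top Esports v Royal Never Give Up', '1/1')]
```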
Posted on 2019-09-03 16:12:41
I made the code for your first question related to this site. While the other two answers use Selenium, that is unnecessary, because the site's API endpoint can be used to find the games. This approach should be faster than Selenium. I was again able to parse the other information with regular expressions. However, on the live site I could not find anything like the '1-1' from your expected output. Hope this helps. The times may be off; I am not sure about them.
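The odds in the response are fractional (e.g. '8/11'), while the site displays decimals, so the code converts them by adding the stake and truncating to two decimals. The same conversion in isolation, using `fractions.Fraction` (a minimal sketch):

```python
from fractions import Fraction

def frac_to_decimal(odds):
    """Convert fractional odds like '8/11' to decimal odds
    (stake included), truncated to two decimals as on the site."""
    return int((Fraction(odds) + 1) * 100) / 100

print(frac_to_decimal('8/11'))  # 1.72
print(frac_to_decimal('1/1'))   # 2.0
```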
Code
import re
from datetime import datetime, timedelta

import pandas as pd
import requests

url = 'https://www.bet365.com.au/SportsBook.API/web?'
params = {
    'lid': '30',
    'zid': '0',
    'pd': '#AC#B151#C1#D50#E2#F163#',
    'cid': '13',
    'ctid': '13'
}
r = requests.get(url, params=params, headers={'User-Agent': 'Mozilla/5.0'})

# Game names look like "Team A v Team B"
games = re.finditer(r'NA=([\w\s\-._]+? v [\w\s\-._]+?);', r.text)
col_games = [game.group(1) for game in games]

def parse_prices(column):
    """Collect the fractional odds under the NA=<column> header and
    convert them to decimal odds, truncated to two decimals."""
    col = []
    for text in re.finditer(r'NA=%s;.*?((?:OD=\d+/\d+;(?:.*?))+?)NA=' % column, r.text):
        for segment in text.group(1).split('|'):
            price = re.search(r'OD=(\d+/\d+);', segment)
            if price:
                num, den = map(int, price.group(1).split('/'))
                col.append(int((num / den + 1) * 100) / 100)
    return col

col_1 = parse_prices('1')
col_2 = parse_prices('2')

# BC= holds a packed timestamp (YYYYMMDDHHMMSS); drop the seconds
times = re.finditer(r'BC=(\d+);', r.text)
col_times = []
for time in times:
    datetime_time = datetime.strptime(time.group(1)[:-2], '%Y%m%d%H%M')
    datetime_time = datetime_time + timedelta(hours=-1)
    col_times.append(datetime_time.time())

df = pd.DataFrame({'Time': col_times, 'Games': col_games, '1': col_1, '2': col_2})
print(df)

Output
Time Games 1 2
0 19:00:00 DETONA v Falkol 1.25 3.75
1 19:00:00 paiN Gaming v Keyd 1.53 2.37
2 19:00:00 W7M v Bulldozer 1.22 4.00
3 03:00:00 VP Game v Team WE Academy 2.62 1.44
4 05:00:00 Invictus Gaming Young v Top Esports Challenger 1.22 4.00
5 07:00:00 Vici Gaming Potential v FunPlus Phoenix Blaze 1.36 3.00
6 09:00:00 Edward Gaming Youth v Bilibili Gaming Junior 2.00 1.72
7 09:00:00 Gama Dream v LinGan e-Sports 1.80 1.90
8 03:00:00 Royal Club v Suning Gaming-S 1.66 2.10
9 05:00:00 Joy Dream v Oh My Dream 2.37 1.53
10 07:00:00 LNG Academy v Bilibili Gaming Junior 3.25 1.33
11 07:00:00 TS Gaming v Victorious Gaming 1.72 2.00
12 09:00:00 D7G Esports Club v Legend Esport Gaming 3.75 1.25
13 09:00:00 Dominus Esports.Y v Rogue Warriors Shark 2.50 1.50
14 05:00:00 Team WE Academy v Vici Gaming Potential 3.25 1.33
15 07:00:00 87 v Gama Dream 2.00 1.72
16 07:00:00 Invictus Gaming Young v LNG Academy 1.16 4.50
17 09:00:00 FunPlus Phoenix Blaze v VP Game 1.50 2.50
18 09:00:00 Scorpio Game v Young Miracles 3.40 1.30
19 09:00:00 Top Esports v Bilibili Gaming 1.53 2.37
20 08:00:00 FunPlus Phoenix v Royal Never Give Up 1.57 2.25
21 09:30:00 Maru v Solar 1.40 2.75
22 10:15:00 Stats v Rogue 1.57 2.25
23 04:00:00 Classic v RagnaroK 1.22 4.00
24 04:45:00 Dear v Zest 2.62 1.44
25 08:00:00 SANDBOX Gaming v KINGZONE DragonX 1.66 2.10
26 13:00:00 ENCE v Renegades 1.25 3.75
27 16:30:00 Team Vitality v AVANGAR 1.22 4.00
28 13:00:00 NRG v Natus Vincere 1.66 2.10
29 16:30:00 Astralis v Team Liquid 2.00 1.72
30 23:00:00 Vancouver Titans v Seoul Dynasty 1.33 3.25
31 02:00:00 Hangzhou Spark v Los Angeles Gladiators 1.72 2.00
32 08:00:00 MAD Team v G-Rex 1.53 2.37
33 08:00:00 Flash Wolves v Hong Kong Attitude 3.25 1.33
34 19:00:00 Clutch Gaming v FlyQuest 1.25 3.75
35 16:00:00 Flamengo v INTZ 1.16 4.50
36 16:00:00 Fnatic v Schalke 04 1.20 4.33
37 16:00:00 Origen v Splyce 3.50 1.28
38 09:00:00 GAM Esports v Team Flash 1.25 3.75

Posted on 2019-08-31 15:39:14
You can use selenium:
from selenium import webdriver
from bs4 import BeautifulSoup as soup

d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://www.bet365.com.au/#/AC/B151/C1/D50/E2/F163/')

def scrape_block(b):
    p = {'date': b.find('div', {'class': 'gll-MarketColumnHeader sl-MarketHeaderLabel sl-MarketHeaderLabel_Date '}).text}
    # Pre-match blocks use one participant class, in-play blocks another
    c1 = b.find_all('div', {'class': 'sl-CouponParticipantWithBookCloses sl-CouponParticipantWithBookCloses_NoAdditionalMarkets sl-CouponParticipantIPPGBase '})
    c2 = b.find_all('div', {'class': 'sl-CouponParticipantWithBookCloses sl-CouponParticipantWithBookCloses_NoAdditionalMarkets sl-CouponParticipantIPPGBase sl-CouponParticipantWithBookCloses_ClockPaddingLeft '})
    if c1:
        pl = [[i.find('div', {'class': 'sl-CouponParticipantWithBookCloses_BookCloses '}).text,
               i.find('div', {'class': 'sl-CouponParticipantWithBookCloses_Name '}).text] for i in c1]
    else:
        pl = [[i.find('div', {'class': 'pi-CouponParticipantClockInPlay '}).text,
               i.find('div', {'class': 'sl-CouponParticipantWithBookCloses_Name '}).text,
               i.find('div', {'class': 'pi-ScoreVariantDefault '}).text] for i in c2]
    odds1, odds2 = [[i.text for i in c.find_all('div', {'class': 'gll-ParticipantOddsOnlyDarker gll-Participant_General gll-ParticipantOddsOnly '})]
                    for c in b.find_all('div', {'class': 'sl-MarketCouponValuesExplicit2 gll-Market_General gll-Market_PWidth-15-4 '})]
    return {**p, 'data': [{'player': a, 1: b, 2: c}
                          for a, b, c in zip(pl, [None] if not odds1 else odds1, [None] if not odds2 else odds2)]}

new_d = list(map(scrape_block, soup(d.page_source, 'html.parser').find_all('div', {'class': 'gll-MarketGroupContainer gll-MarketGroupContainer_HasLabels '})))
final_result = list(filter(lambda x: bool(x['data']), new_d))

Output:
[{'date': 'Sat 31 Aug', 'data': [{'player': ['22:42', 'Royal Youth v SuperMassive', '1-2'], 1: None, 2: None}]}, {'date': 'Sun 01 Sep', 'data': [{'player': ['11:56', 'G2 Esports v Fnatic', '0-0'], 1: None, 2: None}]}, {'date': 'Sun 01 Sep', 'data': [{'player': ['01:20', 'Hjarnan (G2) v h$hjukken'], 1: '1.10', 2: '1.10'}]}, {'date': 'Sun 01 Sep', 'data': [{'player': ['02:00', 'Thijs v Kolento'], 1: '1.83', 2: '1.83'}, {'player': ['03:00', 'Orange v Hunterace'], 1: '2.25', 2: '1.57'}, {'player': ['04:00', 'Gallon v StrifeCro'], 1: '2.00', 2: '1.72'}, {'player': ['04:00', 'Rdu v SilverName'], 1: '2.00', 2: '1.72'}, {'player': ['05:00', 'Monsanto v PNC'], 1: '1.61', 2: '2.20'}, {'player': ['06:00', 'bloodyface v Amnesiac'], 1: '1.80', 2: '1.90'}, {'player': ['07:00', 'Eddie v Purple'], 1: '1.80', 2: '1.90'}, {'player': ['08:00', 'muzzy v Firebat'], 1: '1.72', 2: '2.00'}, {'player': ['09:00', 'ETC v Nalguidan'], 1: '2.10', 2: '1.66'}]}, {'date': 'Sun 01 Sep', 'data': [{'player': ['12:00', 'Mindfreak v ORDER'], 1: '1.53', 2: '2.37'}]}, {'date': 'Sun 01 Sep', 'data': [{'player': ['15:00', 'LinGan e-Sports v Bilibili Gaming Junior'], 1: '1.66', 2: '2.10'}, {'player': ['17:00', 'Scorpio Game v Suning Gaming-S'], 1: '3.00', 2: '1.36'}, {'player': ['17:00', 'Victorious Gaming v FunPlus Phoenix Blaze'], 1: '3.00', 2: '1.36'}, {'player': ['19:00', '87 v Top Esports Challenger'], 1: '1.66', 2: '2.10'}, {'player': ['19:00', 'Rogue Warriors Shark v Legend Esport Gaming'], 1: '2.62', 2: '1.44'}]}]

Posted on 2019-09-01 17:05:10
If you want to use the JS API, you need to figure out how to decode the site's output and how the JS renders what we can see on the actual page. I don't think that is an easy task. That is why I suggest loading the site in a browser tab with Selenium and then feeding the final HTML to BeautifulSoup, which reduces the complexity of working out what to extract from the site.
Below is an example of how to scrape the tournaments, dates and matches using Chrome in headless mode.
PS: the cookie part is not strictly necessary, but it helps the page we are trying to scrape load automatically.
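Stripped of Selenium, the cookie persistence amounts to pickling the driver's cookie list between runs. A minimal sketch (the cookie dict and file path are made up; the shape mimics what `get_cookies()` returns):

```python
import os
import pickle
import tempfile

# A made-up cookie, shaped like selenium's get_cookies() output
cookies = [{'name': 'session', 'value': 'abc', 'expiry': 1600000000}]
path = os.path.join(tempfile.mkdtemp(), 'cookies.pkl')

# dump_cookies: persist the list between runs
with open(path, 'wb') as f:
    pickle.dump(cookies, f)

# setup_cookies: reload, dropping 'expiry' (Chrome can reject a
# cookie re-added via add_cookie when that key is present)
with open(path, 'rb') as f:
    restored = pickle.load(f)
for cookie in restored:
    cookie.pop('expiry', None)
print(restored)  # [{'name': 'session', 'value': 'abc'}]
```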
First you need to install: pip install webdriver-manager, then:
import pickle
import time
from collections import defaultdict
from pprint import pprint

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup as bs

CHROME_OPTIONS = Options()
CHROME_OPTIONS.add_argument("--headless")

class Bet365:
    DRIVER = webdriver.Chrome(ChromeDriverManager().install(), options=CHROME_OPTIONS)
    DUMMY_URL = 'https://www.bet365.com'
    URL = 'https://www.bet365.com/#/AC/B1/C1/D13/E37628398/F2/:/AC/B1/C1/D13/E42294995/F2/:/AC/B1/C1/D13/E42535433/F2/'
    COOKIES_FILE = 'cookies.pkl'

    def __init__(self):
        self.DRIVER.get(self.DUMMY_URL)
        # Comment the next line if the cookies file is not set
        self.setup_cookies()
        self.DRIVER.get(self.URL)
        # self.DRIVER.maximize_window()
        # Wait for JS to populate the page
        time.sleep(15)
        self.source = self.DRIVER.page_source
        # Store new cookies for the next run
        self.dump_cookies()

    def dump_cookies(self):
        """Store cookies"""
        pickle.dump(self.DRIVER.get_cookies(), open(self.COOKIES_FILE, "wb"))

    def setup_cookies(self):
        """Add cookies"""
        cookies = pickle.load(open(self.COOKIES_FILE, "rb"))
        for cookie in cookies:
            if 'expiry' in cookie:
                del cookie['expiry']
            self.DRIVER.add_cookie(cookie)

    def get_source(self):
        """Get page HTML source"""
        return bs(self.source, "html.parser")

    def is_last_child(self, event):
        """Extract an event's data and flag the last child of its block"""
        out = {}
        out['last_child'] = 'sl-MarketCouponAdvancedBase_LastChild' in event['class']
        event_date = event.find('div', {'class': 'sl-CouponParticipantWithBookCloses_BookCloses'})
        out['date'] = event_date.get_text() if event_date else 'None'
        teams = event.findAll('div', {'class': 'sl-CouponParticipantWithBookCloses_Name'})
        if len(teams) > 1:
            out['teams'] = ' v '.join(k.text for k in teams)
        elif len(teams) == 1:
            out['teams'] = teams[0].text
        else:
            out['teams'] = 'None'
        return out

    def get_events(self, data):
        """Return all events"""
        dates, teams = [], []
        for event in data.findAll('div', {'class': 'sl-MarketCouponFixtureLabelBase gll-Market_General gll-Market_HasLabels'}):
            dates = [elm.text for elm in event.find_all('div', {'class': lambda x: all(k in x for k in 'gll-MarketColumnHeader sl-MarketHeaderLabel sl-MarketHeaderLabel_Date'.split())})]
            teams_events = event.findAll("div", {'class': lambda x: x and x.startswith("sl-CouponParticipantWithBookCloses sl-CouponParticipantIPPGBase")})
            teams = [self.is_last_child(elm) for elm in teams_events]
        if len(dates) == 1:
            if teams:
                teams[-1]['last_child'] = True
        return dates, teams

    def pretty_print_events(self, dates, teams):
        """Group events by date"""
        def groupby_last_child(data):
            out, tmp = [], []
            for elm in data:
                tmp.append(elm)
                if elm['last_child']:
                    out.append(tmp)
                    tmp = []
            return out

        out = defaultdict(list)
        for date, groupped in zip(dates, groupby_last_child(teams)):
            # use += instead of append in order to have a flattened list
            # instead of a list of lists
            out[date] += groupped
        return dict(out)

    def scrape_events(self):
        """Yield every league with its events"""
        for block in self.get_source().findAll('div', {'class': 'gll-MarketGroup cm-CouponMarketGroup cm-CouponMarketGroup_Open'}):
            ligue_name = block.find('span', {'class': 'cm-CouponMarketGroupButton_Text'}).get_text()
            dates, teams = self.get_events(block)
            out = self.pretty_print_events(dates, teams)
            yield ligue_name, out

    def to_dict(self):
        """Scrape events and return a dict"""
        return dict(self.scrape_events())

if __name__ == '__main__':
    instance = Bet365()
    out = instance.to_dict()
    pprint(out)

Output:
{'England League 2 - Full Time Result': {'Sat 07 Sep': [{'date': '15:00',
'last_child': False,
'teams': 'Bradford v '
'Northampton'},
{'date': '15:00',
'last_child': False,
'teams': 'Cambridge '
'Utd v '
'Forest '
'Green'},
{'date': '15:00',
'last_child': False,
'teams': 'Carlisle v '
'Exeter'},
{'date': '15:00',
'last_child': False,
'teams': 'Cheltenham '
'v '
'Stevenage'},
{'date': '15:00',
'last_child': False,
'teams': 'Colchester '
'v Walsall'},
{'date': '15:00',
'last_child': False,
'teams': 'Grimsby v '
'Crewe'},
{'date': '15:00',
'last_child': False,
'teams': 'Leyton '
'Orient v '
'Swindon'},
{'date': '15:00',
'last_child': False,
'teams': 'Macclesfield '
'v Crawley '
'Town'},
{'date': '15:00',
'last_child': False,
'teams': 'Mansfield v '
'Scunthorpe'},
{'date': '15:00',
'last_child': False,
'teams': 'Morecambe v '
'Salford '
'City'},
{'date': '15:00',
'last_child': False,
'teams': 'Newport '
'County v '
'Port Vale'},
{'date': '15:00',
'last_child': True,
'teams': 'Plymouth v '
'Oldham'}]},...

https://stackoverflow.com/questions/57737881
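The grouping logic in `pretty_print_events` can be illustrated in isolation; a sketch with made-up events, two under the first date and one under the second:

```python
def groupby_last_child(data):
    """Close a group whenever an event is flagged as the
    last child of its date block."""
    out, tmp = [], []
    for elm in data:
        tmp.append(elm)
        if elm['last_child']:
            out.append(tmp)
            tmp = []
    return out

events = [{'teams': 'A v B', 'last_child': False},
          {'teams': 'C v D', 'last_child': True},
          {'teams': 'E v F', 'last_child': True}]
print([len(group) for group in groupby_last_child(events)])  # [2, 1]
```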