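I have a CSV dataset of tennis match results in winner/loser format, structured much like this one: https://www.kaggle.com/jordangoblet/atp-tour-20002016
I want to create another table that shows, for each player, the number of wins, losses, days since the last match, average serve percentage, and so on.
A friend suggested trying something like this: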
games = [
    ["Elliot", 7, "Maya", 10],
    ["Elliot", 10, "Brendan", 8],
    ["Jamie", 7, "Omid", 10],
    ["Elliot", 6, "Jamie", 10],
    ["Brendan", 7, "Jamie", 10],
]
players = {}
K = 32.0
game_number = 1
for game in games:
    p1name, p1score, p2name, p2score = game
    if p1name not in players:
        players[p1name] = {
            'wins': 0,
            'losses': 0,
            'score': 1000.0,
        }
    if p2name not in players:
        players[p2name] = {
            'wins': 0,
            'losses': 0,
            'score': 1000.0,
        }
    # more code...
# ...
print('Rank Player Elo Wins Losses')
rank = 1
for player, p in sorted(players.items(), key=lambda item: (item[1]['score'], item[0]), reverse=True):
    print('%4s %20s %7.1f %5s %7s' % (rank, player, p['score'], p['wins'], p['losses']))
    rank += 1
I tried to adapt this code to my needs by turning my df into a list of lists:
lol = df.values.tolist()
players = {}
for game in lol:
    game_id, tournament_name, tournament_id, tournament_date, round_match, winner, winner_id, loser, loser_id, winner_elo,
    loser_elo, winner_delta, loser_delta, winner_set_1, loser_set_1, winner_set_2, loser_set_2, winner_set_3, loser_set_3,
    winner_set_4, loser_set_4, winner_set_5, loser_set_5, winner_sets, loser_sets, winner_serve_points_won,
    loser_serve_points_won, winner_serve_hold, loser_serve_hold, winner_points_won, loser_points_won, winner_true_serve_pct,
    loser_true_serve_pct = game
where all of the names above are the column names in the original df.
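But I get this error: NameError: name 'game_id' is not defined
The original code runs fine, so I don't understand why `= game` doesn't work in my case.
Am I missing something? Is there a better way to approach this? My dataset has about 60k rows and 33 columns.
Edit: found the solution. The whole `variables = game` assignment has to be on a single line. Still, if there is a better way to achieve what I described in the comments, feel free to comment.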
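Why the line break matters: without parentheses or a backslash, Python treats the first line as a complete statement, a bare tuple expression, and so it tries to evaluate `game_id` before anything has been assigned to it. Wrapping the targets in parentheses lets one assignment span several lines. A minimal sketch, assuming a shortened, made-up row (three values instead of the real 33; the values are hypothetical):

```python
# Hypothetical shortened row; the real rows have 33 values.
game = [101, "Spring Open", "Elliot"]

# Parentheses turn the targets into one logical line,
# so the unpacking can be split across physical lines.
(game_id,
 tournament_name,
 winner) = game

print(game_id, tournament_name, winner)
```

The same parenthesised form should work for the full 33-name unpacking, as long as the number of names matches the number of columns.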
Edit: column information
game/tournament/winner/loser_id = unique id's
tournament_name = matches are played in daily tournaments
tournament_date = date of tournament/match
round_match = round of tournament the match takes place
winner/loser= winner/loser name
winner/loser_elo = current elo for the day
winner/loser_delta = delta change after match
winner/loser_set_1,2,3,4,5 = score for winner/loser per game, match ends when a player reaches 3 games, 11 points needed to win a game or 2 point differential (extra points if game is tied at 10-10)
winner/loser_sets= number of sets
winner/loser_points_won = total points per player in the match
winner/loser_true_serve_pct = % of points player wins when he is serving
winner_serve_points_won/loser_serve_points_won is obsolete
winner_serve_hold/loser_serve_hold will be replaced by winner/loser_true_serve_pct
Posted on 2020-04-09 07:48:08
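You need to query the dataframe for rows where the player's name occurs in either the 'Winner' or the 'Loser' column, to get `matches`, the dataframe of all matches involving that player. This is harder than it looks at first glance: we can't just use a simple `df.groupby()`. Apologies for underestimating your question.
Some notes on the code below:
- `df[['Winner','Loser']].isin([player]).any(axis=1)` - Note that this parameterizes the list of columns where the name can occur; it doesn't hardcode them. It is a more elegant idiom than `(df['Winner'] == player) | (df['Loser'] == player)`.
- I also looked into `df.query("...")`, which takes an arbitrary SQL-style query string, but its syntax is annoying.
- There are 879 unique players in df['Winner'] and 1383 in df['Loser']. We combine the two into a set, `player_names`. You could use a list, but a set makes removing duplicates easier: `set(df['Winner'].unique()) | set(df['Loser'].unique())`, where `|` is the set union operator.
- For efficiency in handling the player names, we use a categorical instead of strings. Note how the code has one unified category whose values are merged across both the 'Winner' and 'Loser' columns (see code).
- `players = defaultdict(dict)` is elegant because you can now assign `players[player]['WhateverField']` directly, regardless of whether `players[player]` has been defined; if it hasn't, it is created automatically.
- The code shows examples of the aggregate fields you want to create: `mean()`, `sum()`, etc. For simple things, use `matches.agg(...)`.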
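The idioms in these notes can be tried on a small frame first. The following is a minimal sketch on made-up data: the four rows and the player names are hypothetical (borrowed from the question's toy example, not from the Kaggle file), and only the `Winner` and `Loser` column names are taken from this answer.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical toy data; only the column names match the answer.
df = pd.DataFrame({
    'Winner': ['Maya', 'Elliot', 'Omid', 'Jamie'],
    'Loser':  ['Elliot', 'Brendan', 'Jamie', 'Elliot'],
})

# Set union drops names that appear in both columns.
player_names = sorted(set(df['Winner'].unique()) | set(df['Loser'].unique()))

players = defaultdict(dict)
for player in player_names:
    # Boolean row mask: True where the player appears in either column.
    mask = df[['Winner', 'Loser']].isin([player]).any(axis=1)
    matches = df[mask]
    # defaultdict creates players[player] on first access.
    players[player]['Wins'] = int(matches['Winner'].eq(player).sum())
    players[player]['Losses'] = int(matches['Loser'].eq(player).sum())

print(dict(players))
```

The `int(...)` conversion is only there to store plain Python integers instead of NumPy scalars; it is a choice made for this sketch, not part of the answer's code.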
Code:
import pandas as pd
# https://www.kaggle.com/jordangoblet/atp-tour-20002016
df = pd.read_csv('../input/Data.csv', encoding='latin1', usecols=range(12+1), na_values='NR',
                 parse_dates=['Date'], dayfirst=True)
# na_values='NR' makes sure the 'NR' in the WRank/LRank columns is recognized as NaN, so they read in as float columns
# Columns 0-12 (basic data), 13-25 (results by set), 26-53 (betting odds: float)
# Create table of number of wins, losses, days since last match, average serve percentage etc. per player.
pd.options.display.max_rows = 100
pd.options.display.precision = 3
player_names = set(df['Winner'].unique()) | set(df['Loser'].unique())
player_names = sorted(player_names) # 1.0.x BUG: order seems totally unstable(!)
player_dtype = pd.CategoricalDtype(categories=player_names) # merge across multiple columns
# WRONG: categories only taken from each individual column, not merged
# df[['WinnerCat1','LoserCat1']] = df[['Winner','Loser']].astype('category')
# RIGHT:
df[['WinnerCat','LoserCat']] = df[['Winner','Loser']].astype(player_dtype)
from collections import defaultdict
players = defaultdict(dict)
# enumerate() supplies the running index `_` used in the debug output below
for _, player in enumerate(player_names):
    # query df where player occurs in either column, can't use `df[['Winner','Loser']].groupby(player)`
    matches = df[ df[['Winner','Loser']].isin([player]).any(axis=1) ]
    print(f'\n[MATCHES FOR {_} {player}]')
    print(matches.iloc[:, 0:13])
    players[player]['Wins'] = matches['Winner'].eq(player).sum()
    players[player]['Losses'] = matches['Loser'].eq(player).sum()
    # fraction of this player's matches that were wins
    players[player]['Avg'] = matches['Winner'].eq(player).mean().round(4)
    # You can do any arbitrary calculation involving both players' columns, e.g....
    players[player]['RankDiff'] = (matches['WRank'] - matches['LRank']).mean()
    #if _ >= 2: break # for debugging
Posted on 2020-04-08 06:13:02
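For example, if you have a CSV file, you can load it like this:
import pandas as pd
dataframe = pd.read_csv('file.csv')
Now select the columns you are interested in (I'll use the Kaggle dataset as a reference).
Note that I won't cover everything you asked, because I did not fully understand your question (there are no dates in the Kaggle dataset, and what is the average data per player?).
dataframe = dataframe[['Winner', 'Loser']]
Now we use the groupby function and the count function to get the number of wins and losses per player (each player then appears only once, as the index of the result, instead of repeatedly in the "Winner"/"Loser" columns).
# one row per player in each result, indexed by player name
wins = dataframe.groupby('Winner')['Loser'].count()
losses = dataframe.groupby('Loser')['Winner'].count()
Posted on 2020-04-08 06:55:29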
I ran the code below without any errors, so I don't think the problem is with `= game`.
import pandas as pd
df = pd.read_csv(r'Ten.csv')
lol = df.values.tolist()
players = {}
for game in lol:
    Player,Tournament,Wins,Losses = game
    if Player not in players:
        players[Player] = {
            'Tournament': 'foo',
            'wins': 1,
            'losses': 2,
        }
print(players)
However, if the variable name differs, as below (lowercase "p"):
    player,Tournament,Wins,Losses = game
or (an "s" at the end of Player):
        players[Players] = {
            'Tournament': 'foo',
            'wins': 1,
            'losses': 2,
        }
then an error such as NameError: name 'Player' is not defined will pop up. Are you 100% sure game_id isn't misspelled?
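The misspelling case is easy to reproduce in isolation. A minimal sketch, assuming a made-up four-value row (the values are hypothetical): the row is unpacked into a lowercase `player`, then the capitalised `Player`, which was never assigned, is looked up.

```python
# Hypothetical row with the same four fields as in the answer above.
game = ["Elliot", "Spring Open", 3, 1]

# Note the lowercase 'player': the name 'Player' is never assigned.
player, Tournament, Wins, Losses = game

try:
    Player  # names are case-sensitive, so this lookup fails
    message = ""
except NameError as exc:
    message = str(exc)

print(message)
```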
https://stackoverflow.com/questions/61090014