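I'm working on a pandas project about alcohol consumption. For your information, the dataset has the following columns:
Continent, Country, Beer, Spirit, Wine
Here is my code: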
# Separating data by continent
# ----------------------------
data_asia = data[data['Continent'] == 'Asia']
data_africa = data[data['Continent'] == 'Africa']
data_europe = data[data['Continent'] == 'Europe']
data_north = data[data['Continent'] == 'North America']
data_south = data[data['Continent'] == 'South America']
data_ocean = data[data['Continent'] == 'Oceania']
top_5_asia_beer = data_asia.nlargest(5, ['Beer Servings'])[['Country', 'Beer Servings']]
top_5_asia_spir = data_asia.nlargest(5, ['Spirit Servings'])[['Country', 'Spirit Servings']]
top_5_asia_wine = data_asia.nlargest(5, ['Wine Servings'])[['Country', 'Wine Servings']]
top_5_asia_pure = data_asia.nlargest(5, ['Total Litres of Pure Alcohol'])[['Country', 'Total Litres of Pure Alcohol']]
top_5_africa_beer = data_africa.nlargest(5, ['Beer Servings'])[['Country', 'Beer Servings']]
top_5_africa_spir = data_africa.nlargest(5, ['Spirit Servings'])[['Country', 'Spirit Servings']]
top_5_africa_wine = data_africa.nlargest(5, ['Wine Servings'])[['Country', 'Wine Servings']]
top_5_africa_pure = data_africa.nlargest(5, ['Total Litres of Pure Alcohol'])[['Country', 'Total Litres of Pure Alcohol']]
top_5_europe_beer = data_europe.nlargest(5, ['Beer Servings'])[['Country', 'Beer Servings']]
top_5_europe_spir = data_europe.nlargest(5, ['Spirit Servings'])[['Country', 'Spirit Servings']]
top_5_europe_wine = data_europe.nlargest(5, ['Wine Servings'])[['Country', 'Wine Servings']]
top_5_europe_pure = data_europe.nlargest(5, ['Total Litres of Pure Alcohol'])[['Country', 'Total Litres of Pure Alcohol']]
top_5_north_beer = data_north.nlargest(5, ['Beer Servings'])[['Country', 'Beer Servings']]
top_5_north_spir = data_north.nlargest(5, ['Spirit Servings'])[['Country', 'Spirit Servings']]
top_5_north_wine = data_north.nlargest(5, ['Wine Servings'])[['Country', 'Wine Servings']]
top_5_north_pure = data_north.nlargest(5, ['Total Litres of Pure Alcohol'])[['Country', 'Total Litres of Pure Alcohol']]
top_5_south_beer = data_south.nlargest(5, ['Beer Servings'])[['Country', 'Beer Servings']]
top_5_south_spir = data_south.nlargest(5, ['Spirit Servings'])[['Country', 'Spirit Servings']]
top_5_south_wine = data_south.nlargest(5, ['Wine Servings'])[['Country', 'Wine Servings']]
top_5_south_pure = data_south.nlargest(5, ['Total Litres of Pure Alcohol'])[['Country', 'Total Litres of Pure Alcohol']]
top_5_ocean_beer = data_ocean.nlargest(5, ['Beer Servings'])[['Country', 'Beer Servings']]
top_5_ocean_spir = data_ocean.nlargest(5, ['Spirit Servings'])[['Country', 'Spirit Servings']]
top_5_ocean_wine = data_ocean.nlargest(5, ['Wine Servings'])[['Country', 'Wine Servings']]
top_5_ocean_pure = data_ocean.nlargest(5, ['Total Litres of Pure Alcohol'])[['Country', 'Total Litres of Pure Alcohol']]
I realise how absurd my code is in terms of duplication and repetition. Could anyone share tips and tricks for refactoring it?
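Posted on 2020-02-25 09:03:17
It depends on what you want to do with the result. Storing each top 5 in its own variable seems a bit odd.
To start with, you can slice the DataFrame by continent using .groupby: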
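for continent, continent_data in data.groupby("Continent"):
    # `continent` is now the name of the continent (you don't have to type the continent names manually)
    # `continent_data` is a dataframe, being a subset of the `data` dataframe
    pass  # placeholder so the loop body is valid Python
Edit, in response to the first comment: if you want to plot the data, then storing each result in a separate variable is definitely not a good idea. Do you already know how you want to visualise your data? That is what you need to work out first. I can't really picture the top 5 countries for every alcoholic beverage on every continent as a single plot.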
continents = []
top5s = {}
for continent, continent_data in data.groupby("Continent"):
    continents.append(continent)
    for beverage_column in ["Beer Servings", "Spirit Servings", "Wine Servings"]:
        topcountries = continent_data.nlargest(5, beverage_column)
        # do something with the data, such as:
        print(f"Top 5 countries in {continent} for {beverage_column}:")
        # iterrows() yields (index, row) pairs, so unpack both
        for _, row in topcountries.iterrows():
            print(f"- {row['Country']}: {row[beverage_column]} servings")
To be precise: groupby() does not return an iterable of tuples as such, but a GroupBy object that implements iteration (via its __iter__() method).
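As a minimal sketch of that point: iterating over a GroupBy object yields (group name, sub-DataFrame) pairs, which is why the loop can unpack two names, and the results can be collected in a dict rather than in one variable per continent. The tiny frame below is made up for illustration and is not the question's dataset; the name `top_beer` is likewise hypothetical.

```python
import pandas as pd

# Hypothetical miniature dataset, only for illustration.
data = pd.DataFrame(
    {
        "Continent": ["Asia", "Asia", "Europe", "Europe", "Europe"],
        "Country": ["A", "B", "C", "D", "E"],
        "Beer Servings": [10, 30, 50, 20, 40],
    }
)

# Each iteration step gives the group key and the matching rows.
top_beer = {}
for continent, continent_data in data.groupby("Continent"):
    top_beer[continent] = continent_data.nlargest(2, "Beer Servings")[
        ["Country", "Beer Servings"]
    ]

print(top_beer["Europe"])
```

Looking up `top_beer["Asia"]` or `top_beer["Europe"]` then replaces the separate `top_5_asia_beer`-style variables.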
Posted on 2020-02-26 13:11:11
import numpy as np
import pandas as pd

# Dummy data: 6 continents x 10 countries, one random column per drink
np.random.seed(42)
drinks = ["Beer", "Spirit", "Wine"]
continents = [
    "Asia",
    "Africa",
    "Europe",
    "North America",
    "South America",
    "Oceania",
]
countries = [f"country_{i}" for i in range(10)]
index = pd.MultiIndex.from_product(
    (continents, countries), names=["continent", "country"]
)
data = np.random.randint(1_000_000, size=(len(index), len(drinks)))
df = pd.DataFrame(index=index, columns=drinks, data=data).reset_index()
What bothers me most is that every data point gets its own variable.
The first step is to use a dictionary:
data_by_continent = {
    continent: df.loc[df["continent"] == continent]
    for continent in continents
}
Note that I use .loc explicitly to create a copy rather than a view, so that changes in one part of the code do not contaminate another part.
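If the independence of each subset matters, one way to make it explicit is to call `.copy()` on the selection. This is a minimal sketch on a made-up frame, not part of the answer's code; the name `asia` is hypothetical.

```python
import pandas as pd

# Hypothetical miniature dataset, only for illustration.
df = pd.DataFrame(
    {"continent": ["Asia", "Asia", "Europe"], "Beer": [1, 2, 3]}
)

# .copy() makes the subset own its data, independent of `df`.
asia = df.loc[df["continent"] == "Asia"].copy()
asia["Beer"] = 0

# The original frame keeps its values.
print(df["Beer"].tolist())
```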
Then the top spirit consumption per continent is:
spirit_per_continent = {
    continent: data.loc[
        data["Spirit"].nlargest(5).index, ["country", "Spirit"]
    ]
    for continent, data in data_by_continent.items()
}
And nested for every drink:
consumption_per_drink_continent = {
    drink: {
        continent: data.loc[
            data[drink].nlargest(5).index, ["country", drink]
        ]
        for continent, data in data_by_continent.items()
    }
    for drink in drinks
}
If you convert your data into a tidy format, you can use a simple groupby.
pandas.melt is a very convenient way to reshape the data into that format.
df2 = pd.melt(
    df,
    id_vars=["continent", "country"],
    var_name="drink",
    value_name="consumption",
)
    continent    country drink  consumption
...
175   Oceania  country_5  Wine       456551
176   Oceania  country_6  Wine       894498
177   Oceania  country_7  Wine       899684
178   Oceania  country_8  Wine       158338
179   Oceania  country_9  Wine       623094
Now you can use groupby, and then join on the index of df2 to bring the country back in.
(
    df2.groupby(["continent", "drink"])["consumption"]
    .nlargest(5)
    .reset_index(["continent", "drink"])
    .sort_values(
        ["continent", "drink", "consumption"], ascending=[True, True, False]
    )
    .join(df2["country"])
)
         continent drink  consumption    country
17          Africa  Beer       953277  country_7
19          Africa  Beer       902648  country_9
15          Africa  Beer       527035  country_5
13          Africa  Beer       500186  country_3
14          Africa  Beer       384681  country_4
...            ...   ...          ...        ...
162  South America  Wine       837646  country_2
160  South America  Wine       742139  country_0
167  South America  Wine       688519  country_7
161  South America  Wine       516588  country_1
166  South America  Wine       136330  country_6
90 rows × 4 columns
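A possible alternative, which is not from the answer above: sort the tidy frame once and take the first rows of each group with `GroupBy.head`. Since `head` returns whole rows, the `country` column survives and no join is needed. This is only a sketch on a made-up frame with the same column names, taking the top 2 instead of the top 5 and assuming ties need no special handling.

```python
import pandas as pd

# Hypothetical tidy data with the same columns as df2.
df2 = pd.DataFrame(
    {
        "continent": ["Asia"] * 3 + ["Europe"] * 3,
        "country": ["a", "b", "c", "d", "e", "f"],
        "drink": ["Beer"] * 6,
        "consumption": [5, 9, 7, 2, 8, 4],
    }
)

top2 = (
    # Largest values first, so head() picks the top rows per group.
    df2.sort_values("consumption", ascending=False)
    .groupby(["continent", "drink"])
    .head(2)
    # Re-sort only to present the result grouped by continent and drink.
    .sort_values(
        ["continent", "drink", "consumption"], ascending=[True, True, False]
    )
)
print(top2)
```

The trade-off is a full sort of the frame up front in exchange for a shorter chain afterwards.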
https://codereview.stackexchange.com/questions/237876