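I am trying to clean up a code snippet, but once I moved part of the code into functions it started raising an exception, shown further down.
Here is the snippet I want to clean up: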

    import pandas as pd
    import os
    df = pd.read_csv('winequality-red.csv', sep=';')
    labels = list(df.columns)
    for index, label in enumerate(labels):
        labels[index] = labels[index].replace(' ', '_')
    substance = 'pH'
    median = df[substance].mean()
    for index, substance in enumerate(df[substance]):
        if substance >= median:
            df.loc[index, substance] = 'high'
        else:
            df.loc[index, substance] = 'low'
    print(df.groupby(substance).quality.mean())

The goal is to create two functions and call them whenever I need to evaluate a substance. With that in mind, I wrote:

    def substance_mean(substance):
        return df[substance].mean()
    def substance_evaluation(substance):
        for index, substance in enumerate(df[substance]):
            if substance >= substance_mean(substance):
                df.loc[index, substance] = 'high'
            else:
                df.loc[index, substance] = 'low'
        print(df.groupby(substance).quality.mean())
    substance_evaluation('pH')

When I run the code, the following exception is raised:

    Traceback (most recent call last):
      File "/home/atila/Desktop/estudos/udacity/aws_ML/venv-ml/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
        return self._engine.get_loc(key)
      File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: 3.51

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/atila/Desktop/estudos/udacity/aws_ML/app.py", line 34, in <module>
        substance_evaluation('pH')
      File "/home/atila/Desktop/estudos/udacity/aws_ML/app.py", line 28, in substance_evaluation
        if substance >= substance_mean(substance):
      File "/home/atila/Desktop/estudos/udacity/aws_ML/app.py", line 24, in substance_mean
        return df[substance].mean()
      File "/home/atila/Desktop/estudos/udacity/aws_ML/venv-ml/lib/python3.6/site-packages/pandas/core/frame.py", line 2800, in __getitem__
        indexer = self.columns.get_loc(key)
      File "/home/atila/Desktop/estudos/udacity/aws_ML/venv-ml/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
        return self._engine.get_loc(self._maybe_cast_indexer(key))
      File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: 3.51

Answered 2020-06-19 00:26:14
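I can't run it, but the whole problem is that inside substance_evaluation() you use the same name, substance, for two variables that should hold different values.
First you have substance in

    def substance_evaluation(substance)

This variable should hold 'pH', but later you use

    for ..., substance in ...:

which assigns a different value to it (replacing 'pH'), and after that you call

    ... >= substance_mean(substance)

to compute the mean for 'pH'. At that point, however, substance no longer holds 'pH' but 3.51, the first value in the column, which is exactly what the error KeyError: 3.51 shows.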
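The shadowing described above can be reproduced without pandas. The following is only a minimal sketch: the `columns` dictionary, the helper names, and the sample readings are hypothetical stand-ins for the DataFrame and the functions in the question.

```python
# Hypothetical stand-in for the DataFrame: one column of pH readings.
columns = {'pH': [3.51, 3.20, 3.26]}


def column_mean(name):
    values = columns[name]  # raises KeyError if `name` is not a column
    return sum(values) / len(values)


def evaluate_shadowed(name):
    # The loop target reuses the parameter name, so on the first
    # iteration `name` becomes 3.51 instead of 'pH'.
    for index, name in enumerate(columns[name]):
        column_mean(name)


def evaluate_fixed(name):
    labels = []
    mean = column_mean(name)      # `name` is still the column name here
    for value in columns[name]:   # distinct name for the loop variable
        labels.append('high' if value >= mean else 'low')
    return labels


try:
    evaluate_shadowed('pH')
    error_key = None
except KeyError as exc:
    error_key = exc.args[0]

print('KeyError key:', error_key)
fixed_labels = evaluate_fixed('pH')
print(fixed_labels)
```

The failing variant raises on the very first iteration, which is why the key in the error message is the first value of the column.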
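You would not have this problem if you had kept these lines in the function

    median = df[substance].mean()

and

    if substance >= median:

Besides, wrapping a single line of code in a function buys you nothing here.
If you keep that line, the value is computed only once, before the loop. Calling the function inside the loop recomputes the same value on every iteration, which wastes time. (Note that, despite its name, median holds the mean.)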
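To make the cost of recomputing concrete, here is a small sketch that counts how often the mean is calculated in each style. The `mean_of` helper, the global counter, and the sample readings are assumptions made for illustration only.

```python
call_count = 0


def mean_of(values):
    """Return the arithmetic mean and record that it was computed."""
    global call_count
    call_count += 1
    return sum(values) / len(values)


readings = [3.51, 3.20, 3.26, 3.16]  # hypothetical pH values

# Style 1: the mean is recomputed for every element.
call_count = 0
labels_in_loop = ['high' if v >= mean_of(readings) else 'low' for v in readings]
calls_in_loop = call_count

# Style 2: the mean is computed once, before the loop.
call_count = 0
mean = mean_of(readings)
labels_hoisted = ['high' if v >= mean else 'low' for v in readings]
calls_hoisted = call_count

print(calls_in_loop, calls_hoisted)
```

Both styles produce the same labels; only the number of mean computations differs, and it grows with the number of rows in the first style.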
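I think both versions (with and without the helper function) still have a problem with substance, because you also use it in df.loc[index, substance], so it may try to execute df.loc[index, 3.51] instead of df.loc[index, 'pH']. You should use a different name for the loop variable, e.g. value:

    for ..., value in ...:
        if value >= median:

You should end up with a function like this: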
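
    def substance_evaluation(substance):
        # computed once, before the loop (this is the mean, despite the name)
        median = df[substance].mean()
        # object dtype lets the numeric column accept string labels
        df[substance] = df[substance].astype(object)
        # `value` is the cell value; `substance` keeps the column name.
        # enumerate() positions match the row labels only for a default RangeIndex.
        for index, value in enumerate(df[substance]):
            if value >= median:
                df.loc[index, substance] = 'high'
            else:
                df.loc[index, substance] = 'low'
        print(df.groupby(substance).quality.mean())

But I think you can write it more simply: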
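
    def substance_evaluation(substance):
        median = df[substance].mean()
        mask = (df[substance] >= median)
        # object dtype lets the numeric column accept string labels
        df[substance] = df[substance].astype(object)
        # .loc avoids chained assignment (df[col][mask] = ...), which may write to a copy
        df.loc[mask, substance] = 'high'
        df.loc[~mask, substance] = 'low'
        print(df.groupby(substance).quality.mean())

Or, finally, using np.where():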
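
    import numpy as np

    def substance_evaluation(substance):
        median = df[substance].mean()
        mask = (df[substance] >= median)
        # replace the whole column with 'high'/'low' labels in one step
        df[substance] = np.where(mask, 'high', 'low')
        print(df.groupby(substance).quality.mean())

With this version you can also easily put the values into a new column:

    df["new column"] = np.where(mask, 'high', 'low')

EDIT: minimal working code for testing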

    import pandas as pd
    import random
    import numpy as np
    import time

    def version1(df, substance):
        # row-by-row loop
        median = df[substance].mean()
        # object dtype lets the numeric column accept string labels
        df[substance] = df[substance].astype(object)
        for index, value in enumerate(df[substance]):
            if value >= median:
                df.loc[index, substance] = 'high'
            else:
                df.loc[index, substance] = 'low'

    def version2(df, substance):
        # boolean mask with .loc (avoids chained assignment)
        median = df[substance].mean()
        mask = (df[substance] >= median)
        df[substance] = df[substance].astype(object)
        df.loc[mask, substance] = 'high'
        df.loc[~mask, substance] = 'low'

    def version3(df, substance):
        # vectorized np.where()
        median = df[substance].mean()
        mask = (df[substance] >= median)
        df[substance] = np.where(mask, 'high', 'low')

    # ---

    random.seed(0)  # to generate always the same values
    df = pd.DataFrame({'pH': [random.randint(0,7) for _ in range(5)]})
    substance = 'pH'
    print('--- before ---')
    print(df)

    # ---

    df1 = df.copy()
    start = time.time()
    version1(df1, substance)
    end = time.time()
    print('--- after --- time:', end-start)
    print(df1)

    # ---

    df2 = df.copy()
    start = time.time()
    version2(df2, substance)
    end = time.time()
    print('--- after --- time:', end-start)
    print(df2)

    # ---

    df3 = df.copy()
    start = time.time()
    version3(df3, substance)
    end = time.time()
    print('--- after --- time:', end-start)
    print(df3)

https://stackoverflow.com/questions/62459053
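As a follow-up to the np.where() version, here is one possible way to keep the original column intact and still get the grouped quality means the question asks for. This is a sketch under assumptions: the function name `label_by_mean`, the column name `pH_level`, and the four-row sample frame are all hypothetical and merely stand in for the real winequality-red.csv data.

```python
import numpy as np
import pandas as pd


def label_by_mean(df, column, new_column):
    """Return a copy of df with new_column set to 'high' or 'low',
    depending on whether each value in column is >= the column mean."""
    result = df.copy()
    mean = result[column].mean()  # computed once
    result[new_column] = np.where(result[column] >= mean, 'high', 'low')
    return result


# Hypothetical sample standing in for winequality-red.csv.
sample = pd.DataFrame({
    'pH': [3.51, 3.20, 3.26, 3.16],
    'quality': [5, 5, 5, 6],
})

labeled = label_by_mean(sample, 'pH', 'pH_level')
quality_by_level = labeled.groupby('pH_level')['quality'].mean()
print(quality_by_level)
```

Passing the DataFrame in as a parameter and returning a copy means the numeric pH values stay available for further analysis, at the cost of one extra copy of the data.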