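I'm writing a script that takes a sample from each category in an Excel file. The sampling percentage varies with the length of the data, but I'd like to know whether there is a way to set a floor of 5 items per sample, even if 1% would return, say, only 2 items. Any help would be greatly appreciated.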

import pandas as pd

df = pd.read_excel(r"C:\Users\****\Desktop\Audit_catalogs\****.xlsx")
df2 = df.loc[(df['Track Item']=='Y')]
print(len(df2))

def sample_per(df2):
    if len(df2) >= 15000:
        return (df2.groupby('Category').apply(lambda x: x.sample(frac=0.01)))
    elif len(df2) < 15000 and len(df2) > 10000:
        return (df2.groupby('Category').apply(lambda x: x.sample(frac=0.03)))
    else:
        return (df2.groupby('Category').apply(lambda x: x.sample(frac=0.05)))

final = sample_per(df2)
df.loc[df['Retailer Item ID'].isin(final['Retailer Item ID']), 'Track Item'] = 'Audit'
df.to_csv('****_Audit.csv', index=False)

Posted on 2020-04-08 16:01:37
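You can use len(x) * 0.01 to check how many rows a 1% sample would return, and call sample(n=5) instead of sample(frac=0.01) when that is fewer than 5. Use len(x) rather than x.size for this check: x.size counts cells (rows × columns), so it equals the row count only for a single-column frame.

.apply(lambda x: x.sample(n=5) if len(x)*0.01 < 5 else x.sample(frac=0.01))

import pandas as pd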
import random

random.seed(1)  # always generate the same random data

data = {'Category': [random.choice([1,2,2,2,3]) for x in range(1000)]}  # columns
df = pd.DataFrame(data)
print(df)

# --- before ---
df1 = df.groupby('Category').apply(lambda x: x.sample(frac=0.01))
print('--- before ---')
print(df1['Category'].value_counts())

# --- after ---
df2 = df.groupby('Category').apply(lambda x: x.sample(n=5) if len(x)*.01 < 5 else x.sample(frac=0.01))
print('--- after ---')
print(df2['Category'].value_counts())

Result:
--- before ---
2 6
3 2
1 2
Name: Category, dtype: int64
--- after ---
2 6
3 5
1 5
Name: Category, dtype: int64

Edit: the same logic written in a more readable way:
def myfunction(x):
    # len(x) is the number of rows in the group
    if len(x)*0.01 < 5:
        return x.sample(n=5)
    else:
        return x.sample(frac=0.01)

df1 = df.groupby('Category').apply(myfunction)

https://stackoverflow.com/questions/61104380
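
One thing the answer above does not cover: sample(n=5) raises a ValueError when a category has fewer than 5 rows, because sampling without replacement cannot draw more rows than exist. Below is a minimal sketch of the same "at least 5 per category" idea with that case handled. The helper name sample_with_floor, its parameters and the demo data are made up for illustration, and it loops over the groups instead of using apply. Treat it as one possible approach under those assumptions, not as the definitive way to do it.

```python
import pandas as pd


def sample_with_floor(df, by, frac, min_n=5, random_state=None):
    """Sample about `frac` of each group, but at least `min_n` rows.

    Groups with fewer than `min_n` rows are returned whole (shuffled),
    since sample(n=...) cannot draw more rows than the group contains.
    """
    parts = []
    for _, group in df.groupby(by):
        # Rows a plain frac-based sample would give, raised to the floor...
        n = max(min_n, round(len(group) * frac))
        # ...then capped at the group size.
        n = min(n, len(group))
        parts.append(group.sample(n=n, random_state=random_state))
    return pd.concat(parts)


# Hypothetical demo data: three categories of very different sizes.
demo = pd.DataFrame({
    "Category": ["A"] * 1000 + ["B"] * 200 + ["C"] * 3,
    "Retailer Item ID": range(1203),
})

picked = sample_with_floor(demo, "Category", frac=0.01, random_state=0)
counts = picked["Category"].value_counts().to_dict()
print(counts)
```

In the question's script, the fraction chosen from len(df2) (0.01, 0.03 or 0.05) could be passed as frac, and the returned frame could then be used in the isin(...) step in the same way as final.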