I am trying to extract the outliers from my dataset and tag them accordingly.
Sample data:
   Doctor Name  Hospital Assigned        Region  Claims  Illness Claimed
1  Albert       Some hospital Center     R-1     20      Sepsis
2  Simon        Another hospital Center  R-2     21      Pneumonia
3  Alvin        ...                      ...     ...     ...
4  Robert
5  Benedict
6  Cruz
So I am trying to group each Illness associated with a Doctor within a Region, and to find the outliers among them.
   Doctor Name  Hospital Assigned        Region  Claims  Illness Claimed  is_outlier
1  Albert       Some hospital Center     R-1     20      Sepsis           1
2  Simon        Another hospital Center  R-2     21      Pneumonia        0
3  Alvin        ...                      ...     ...     ...
4  Robert
5  Benedict
6  Cruz
I can do this in Power BI, but as a newcomer to Python I can't seem to figure it out.
This is what I am trying to achieve:

The algorithm goes like this:
Read data
Group data by Illness
    Group by Region
        Get IQR based on Claims count
        If Claims count > Q3 + 1.5 * IQR
            then tag it as an outlier (is_outlier = 1)
        else
            not an outlier (is_outlier = 0)
Export data
Any ideas?
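The steps above can be sketched in pandas roughly as follows. This is a minimal sketch and not a definitive implementation: the frame contents, the doctor labels `D1` to `D9`, the helper name `tag_outliers`, and the commented-out output file name are all hypothetical, and it assumes the columns are called `Illness Claimed`, `Region` and `Claims` as in the sample table. Only the upper fence (Q3 + 1.5 * IQR) is checked, since that is the only rule listed.

```python
import pandas as pd

# Hypothetical data for illustration; column names follow the sample table.
df = pd.DataFrame({
    "Doctor Name": ["D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8", "D9"],
    "Region": ["R-1"] * 5 + ["R-2"] * 4,
    "Illness Claimed": ["Sepsis"] * 5 + ["Pneumonia"] * 4,
    "Claims": [20, 21, 19, 22, 60, 21, 23, 22, 24],
})


def tag_outliers(frame, group_cols=("Illness Claimed", "Region"), value_col="Claims"):
    """Return a copy of frame with is_outlier = 1 above the group's upper fence."""
    grouped = frame.groupby(list(group_cols))[value_col]
    # transform broadcasts each group's quartile back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    upper_fence = q3 + 1.5 * (q3 - q1)
    return frame.assign(is_outlier=(frame[value_col] > upper_fence).astype(int))


tagged = tag_outliers(df)
print(tagged)

# Export step (hypothetical file name), left commented out to avoid side effects:
# tagged.to_csv("claims_tagged.csv", index=False)
```

With this made-up data only the row with 60 claims is tagged, because the Sepsis / R-1 group has Q1 = 20 and Q3 = 22, which puts the fence at 25. Using `transform` instead of an aggregation keeps the original row order, so the new column lines up with the input rows.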
Posted on 2019-01-28 16:21:25
Assuming you are using pandas for data analysis (which you should!), you can use the pandas DataFrame boxplot to produce a plot similar to yours:
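import pandas as pd
import numpy as np
# Sample frame, matching the output shown further down
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 10, 100, 100]})
# Horizontal box plot of column 'b'; whiskers at the 10th and 90th percentiles
df.boxplot(column=['b'], whis=[10, 90], vert=False,
           flierprops=dict(markerfacecolor='g', marker='D'))
Or, if you want to tag them as 0/1 as requested, use the DataFrame quantile() method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html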
# Flag a row when any column is at or above that column's 90th percentile
df.assign(outlier=(df >= df.quantile(.9)).any(axis=1).astype(np.int8))
   a    b  outlier
0  1    1        0
1  2   10        0
2  3  100        1
3  4  100        1
https://stackoverflow.com/questions/54396826