I have a Pandas DataFrame in which one column holds categorical data. Running value_counts on that column gives something like this:
HR 176
Coding 81
Reject 74
Database Administration 21
Finance 17
Project Management 16
Sales 15
DevOps 13
Core Electronics 10
Networking 10
Medical Science 9
Core Mechanical 8
Web Development 4
Puzzles 3
behavioural 3
not a question 2
civil engineering 1
Mathematics 1
Finance, Medical Science 1
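Sales, HR 1

What I want is to keep only the categories whose count is >= some threshold (say 10). All smaller categories should be lumped together into a single "Other" category, so the result should look like this: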
HR 176
Coding 81
Reject 74
*Other* 33
Database Administration 21
Finance 17
Project Management 16
Sales 15
DevOps 13
Core Electronics 10
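Networking 10

I have done this in the past by merging the counts into a defaultdict(int) and keeping only the entries whose count is >= the threshold. I am wondering whether there is a canonical Pandas way to achieve the same thing.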
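For reference, the manual approach described above might look roughly like the sketch below. The question does not show the original code, so the function name `collapse_rare`, its parameters, and the sample labels are all hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict


def collapse_rare(labels, threshold=10, other_label="*Other*"):
    """Count labels by hand and merge rare ones into a single bucket.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # First pass: count every label.
    counts = defaultdict(int)
    for label in labels:
        counts[label] += 1

    # Second pass: keep labels that meet the threshold, sum the rest.
    merged = defaultdict(int)
    for label, n in counts.items():
        key = label if n >= threshold else other_label
        merged[key] += n
    return dict(merged)


# Made-up sample data, not the counts from the question.
labels = ["HR"] * 12 + ["Coding"] * 10 + ["Puzzles"] * 3 + ["Mathematics"]
result = collapse_rare(labels)
print(result)
```

This works, but it leaves Pandas entirely, which is why a Pandas-native way is being asked for.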
Answered 2022-08-23 09:36:23
Is this the answer you are looking for?
Pandas: Selecting rows based on value counts of a particular column
Otherwise, maybe this is what you want:
import pandas as pd

data = pd.DataFrame([["researcher", 150], ["politician", 15], ["builder", 1], ["teacher", 5]])
data.columns = ["category", "count"]
filter_value = 10
# .copy() avoids SettingWithCopyWarning when the tag column is added below
d1 = data[data["count"] >= filter_value].copy()
d2 = data[data["count"] < filter_value].copy()
d1["tag"] = "filter_passed"
d2["tag"] = "Others"
data = pd.concat([d1, d2])
>>> data
category count tag
0 researcher 150 filter_passed
1 politician 15 filter_passed
2 builder 1 Others
3 teacher 5 Others

Answered 2022-08-23 09:45:44
I would use a boolean mask for indexing, then concat:
# s is the Series returned by value_counts() on the column
m = s >= 10
out = (pd.concat([s[m], pd.Series(s[~m].sum(), index=['Others'])])
         .sort_values(ascending=False)
      )
Output:
HR 176
Coding 81
Reject 74
Others 33
Database Administration 21
Finance 17
Project Management 16
Sales 15
DevOps 13
Core Electronics 10
Networking 10
https://stackoverflow.com/questions/73456315
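The mask above operates on the counts Series. If the goal is instead to relabel the rare categories in the DataFrame column itself, one possible variation is sketched below. This is not from either answer: the frame `df`, the column name `category`, and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frame; "category" stands in for the column from the question.
df = pd.DataFrame(
    {"category": ["HR"] * 12 + ["Coding"] * 10 + ["Puzzles"] * 3 + ["Mathematics"]}
)

threshold = 10
counts = df["category"].value_counts()

# For each row, look up how often that row's category occurs.
row_counts = df["category"].map(counts)

# Keep labels that meet the threshold; replace the rest with "Others".
collapsed = df["category"].where(row_counts >= threshold, "Others")

out = collapsed.value_counts()
print(out)
```

Because the relabeling happens before counting, `collapsed` can also be assigned back to the DataFrame if the merged label is needed for later grouping or plotting.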