
Question: Randomly split a DataFrame into smaller chunks based on multiple (and similar) string conditions (rows can only be used once)

Stack Overflow user

Asked 2021-04-18 04:26:41

1 answer, 283 views, 0 followers, 2 votes

I have a DataFrame with 800 items (rows), and each row belongs to one area. The areas are: Allston, Boston, Brighton, Fenway, Brookline, Cambridge, Newton.

Example pandas DataFrame:

       area       price      location                   Bedroom
1      boston     3074        1 Devonshire Place        1
2      boston     3310       72 Staniford Street        2
3      allston    1825  1156 Commonwealth Avenue        1
4      cambridge  3895         39 Clinton Street        3
5      fenway     2325     98 Queensberry Street        1

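I am trying to randomly split the rows of this DataFrame into three groups:

Group A gets 60% of the rows and may only contain the following areas: "Allston", "Boston", "Brighton", "Fenway", "Brookline", "Cambridge", "Newton"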

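Group B gets 30% of the rows and may only contain the following areas: "Allston", "Boston", "Brighton", "Fenway"

Group C gets 10% of the rows and may only contain the following areas: "Boston", "Brighton", "Fenway"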

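Each item (row) can be assigned only once. It does not matter if a group fails to cover some of its allowed areas. For example, it is fine if group C ends up with items from only Boston and/or Brighton, but group C must not contain items from Newton.

I have tried dataframe.sample(), np.split(), and np.random.choice(), but with all of these techniques rows end up duplicated across groups. I plan to run this in a loop, so that each time the groups are created the randomly selected rows are different.

Any idea how to solve this?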

python

dataframe

random

conditional-statements

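1 Answer

Stack Overflow user

Accepted answer

Answered 2021-04-19 04:49:32

Here is code for this particular case. I don't think it generalizes well.

From the rows whose area is allowed for group C, sample 0.1 * 800 = 80 and mark them as group C. Then, from the still-unmarked rows whose area is allowed for group B, sample 0.3 * 800 = 240 and mark them as group B. The remaining 480 rows must go to group A.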
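The counts and the C, B, A processing order can be checked with a few lines. This is a minimal sketch using the proportions and allowed areas from the question; `total_rows`, `sizes`, and `order` are illustrative names, and sorting by the number of allowed areas is only a heuristic that happens to work here because the allowed sets are nested (C within B within A).

```python
allowed = {
    'A': "Allston Boston Brighton Fenway Brookline Cambridge Newton".split(),
    'B': "Allston Boston Brighton Fenway".split(),
    'C': "Boston Brighton Fenway".split(),
}
weight = {'A': 0.6, 'B': 0.3, 'C': 0.1}
total_rows = 800

# round() rather than int(), so a float product just below a whole
# number is not truncated to one row too few
sizes = {group: round(total_rows * share) for group, share in weight.items()}

# Assumption: process the group with the fewest allowed areas first
order = sorted(allowed, key=lambda group: len(allowed[group]))

print(sizes)  # {'A': 480, 'B': 240, 'C': 80}
print(order)  # ['C', 'B', 'A']
```

Filling the most restricted group first keeps the rows from its few allowed areas from being used up by the less restricted groups.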

import pandas as pd
import random

allowed = {
    'A':"Allston Boston Brighton Fenway Brookline Cambridge Newton".split(),
    'B':"Allston Boston Brighton Fenway".split(),
    'C':"Boston Brighton Fenway".split()
}

weight = {
    'A':0.6,
    'B':0.3,
    'C':0.1
}

# create random areas that meet the requirements 10% group C, 30% group B and 60% group A
rows = []
for area in random.choices(list(weight.keys()), weights=weight.values(), k=800):
    rows.append(random.choice(allowed[area]))

# create a dummy data frame
df = pd.DataFrame({'areas':rows,
                   'price':[random.randrange(1000, 5000) for _ in range(len(rows))]})

# add a column for the group, set to '' to indicate unassigned
df['group'] = ['']*len(rows)

for group in 'CBA':
    # Select rows that are not assigned to a group and that have areas that are
    # allowed for the current group. Then randomly sample the selected rows.
    # round() avoids truncating a float product such as 239.99999999999997
    xs = df[(df.group=='') & df['areas'].isin(allowed[group])].sample(n=round(len(rows)*weight[group]))

    # Mark the sampled rows with the group
    df.loc[xs.index,'group'] = group

    # this just to see what's happening
    print(group, len(xs))
    print(df.head())
    print()

The end result is a DataFrame with a 'group' column whose values are assigned randomly, subject to the given constraints.
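The same idea can be wrapped in a function and checked against the constraints. The following is only a sketch, not the answerer's code: `assign_groups`, its parameters, and the synthetic `demo` frame are hypothetical, and it assumes the DataFrame has a unique index and that the groups are listed from most to least restricted.

```python
import pandas as pd


def assign_groups(df, allowed, weight, order, area_col="areas", seed=None):
    """Hypothetical helper: label each sampled row with exactly one group."""
    out = df.copy()
    out["group"] = ""  # empty string marks a row as unassigned
    for step, group in enumerate(order):
        # Rows still unassigned whose area is allowed for this group
        eligible = out[(out["group"] == "") & out[area_col].isin(allowed[group])]
        n = round(len(out) * weight[group])
        if n > len(eligible):
            raise ValueError(
                f"group {group}: need {n} rows, only {len(eligible)} eligible"
            )
        # Derive a different seed per step; the whole run stays reproducible
        state = None if seed is None else seed + step
        picked = eligible.sample(n=n, random_state=state)
        # Assumes a unique index, so each label lands on exactly one row
        out.loc[picked.index, "group"] = group
    return out


allowed = {
    "A": "Allston Boston Brighton Fenway Brookline Cambridge Newton".split(),
    "B": "Allston Boston Brighton Fenway".split(),
    "C": "Boston Brighton Fenway".split(),
}
weight = {"A": 0.6, "B": 0.3, "C": 0.1}

# Synthetic data (not the asker's): 700 rows, 100 per area
areas = allowed["A"]
demo = pd.DataFrame({"areas": [areas[i % len(areas)] for i in range(700)]})

result = assign_groups(demo, allowed, weight, order="CBA", seed=0)

# Every row is labelled once, and every label respects its allowed areas
assert (result["group"] != "").all()
for group, areas_ok in allowed.items():
    assert result.loc[result["group"] == group, "areas"].isin(areas_ok).all()
print(result["group"].value_counts().sort_index())
```

Calling the function with a different `seed` (or `seed=None`) on each pass of a loop would give a different random split each time, which is what the question asks for. The explicit `ValueError` covers the case where a group cannot be filled from the rows that remain.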

1 vote

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/67145174
