How do I generate synthetic data with random values in a pandas DataFrame?

Stack Overflow user
Asked 2019-12-30 03:52:22
Answers: 5 | Views: 925 | Followers: 0 | Votes: 1

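I have a DataFrame with 50K rows. I want to replace 20% of the data with random values drawn from a given interval. The purpose is to generate synthetic outliers for testing an algorithm. The DataFrame below is a small part of the df I have. The values that should be replaced with random outliers are in the "value" column.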

import pandas as pd
# named `data` rather than `dict` to avoid shadowing the built-in type
data = {'date': ["2016-11-10", "2016-11-10", "2016-11-11", "2016-11-11", "2016-11-11", "2016-11-11", "2016-11-11", "2016-11-11"],
        'time': ["22:00:00", "23:00:00", "00:00:00", "01:00:00", "02:00:00", "03:00:00", "04:00:00", "05:00:00"],
        'value': [90, 91, 80, 87, 84, 94, 91, 94]}

df = pd.DataFrame(data)

print(df)
        date      time  value
0  2016-11-10  22:00:00     90
1  2016-11-10  23:00:00     91
2  2016-11-11  00:00:00     80
3  2016-11-11  01:00:00     87
4  2016-11-11  02:00:00     84
5  2016-11-11  03:00:00     94
6  2016-11-11  04:00:00     91
7  2016-11-11  05:00:00     94

For example, given an interval of random values from 1 to 50, the desired df would look like this:

        date      time  value
0  2016-11-10  22:00:00     90
1  2016-11-10  23:00:00     91
2  2016-11-11  00:00:00     80
3  2016-11-11  01:00:00      4
4  2016-11-11  02:00:00     84
5  2016-11-11  03:00:00     94
6  2016-11-11  04:00:00     32
7  2016-11-11  05:00:00     94

Any ideas would be appreciated. Thanks!


5 Answers

Stack Overflow user

Accepted answer

Posted on 2019-12-30 04:43:49

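Here is a numpy example that should be fast. The example with both high and low replacements assumes you want to split the replacements evenly between the high and low ranges (50-50). If that is not the case, change p in mask_high = np.random.choice([0,1], p=[.5, .5], size=rand.shape).astype(bool) to whatever split you want.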

import numpy as np
import pandas as pd

d = {'date':["2016-11-10", "2016-11-10", "2016-11-11", "2016-11-11","2016-11-11","2016-11-11","2016-11-11", "2016-11-11" ],
        'time': ["22:00:00", "23:00:00", "00:00:00", "01:00:00", "02:00:00", "03:00:00", "04:00:00", "04:00:00"], 
        'value':[90, 91, 80, 87, 84,94, 91, 94]} 

df = pd.DataFrame(d) 

# create a function
def myFunc(df, replace_pct, start_range, stop_range, replace_col):
    # copy the column you want to replace into a writable array
    val = df[replace_col].to_numpy(copy=True)
    # create a boolean mask for the percent you want to replace
    # (plain `bool` is used because the np.bool alias was removed in NumPy 1.24)
    mask = np.random.choice([0,1], p=[1-replace_pct, replace_pct], size=val.shape).astype(bool)
    # create random ints in [start_range, stop_range), one per masked row
    rand = np.random.randint(start_range, stop_range, size=mask.sum())
    # replace values in the copied array
    val[mask] = rand
    # update column
    df[replace_col] = val
    return df

myFunc(df, .2, 1, 50, 'value')

         date      time  value
0  2016-11-10  22:00:00     90
1  2016-11-10  23:00:00     91
2  2016-11-11  00:00:00     80
3  2016-11-11  01:00:00     87
4  2016-11-11  02:00:00     46
5  2016-11-11  03:00:00     94
6  2016-11-11  04:00:00     91
7  2016-11-11  04:00:00     94

Timing

%%timeit
myFunc(df, .2, 1, 50, 'value')

397 µs ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Example with both high and low replacements

# create a function
def myFunc2(df, replace_pct, start_range_low, stop_range_low,
            start_range_high, stop_range_high, replace_col):
    # copy the column you want to replace into a writable array
    val = df[replace_col].to_numpy(copy=True)
    # create a boolean mask for the percent you want to replace
    # (plain `bool` is used because the np.bool alias was removed in NumPy 1.24)
    mask = np.random.choice([0,1], p=[1-replace_pct, replace_pct], size=val.shape).astype(bool)
    # create random ints in the low range, one per masked row
    rand = np.random.randint(start_range_low, stop_range_low, size=mask.sum())
    # create a mask selecting which replacements come from the high range
    mask_high = np.random.choice([0,1], p=[.5, .5], size=rand.shape).astype(bool)
    # create random ints in the high range
    rand_high = np.random.randint(start_range_high, stop_range_high, size=mask_high.sum())
    # overwrite the selected low values with high values
    rand[mask_high] = rand_high
    # replace values in the copied array
    val[mask] = rand
    # update column
    df[replace_col] = val
    return df

myFunc2(df, .2, 1, 50, 200, 300, 'value')


         date      time  value
0  2016-11-10  22:00:00     90
1  2016-11-10  23:00:00    216
2  2016-11-11  00:00:00     80
3  2016-11-11  01:00:00     49
4  2016-11-11  02:00:00     84
5  2016-11-11  03:00:00     94
6  2016-11-11  04:00:00    270
7  2016-11-11  04:00:00     94

Timing

%%timeit
myFunc2(df, .2, 1, 50, 200, 300, 'value')

493 µs ± 41.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
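
Note that a mask drawn independently per row replaces about replace_pct of the rows on average, not an exact count. If exactly 20% of the rows must change, one option is to sample row positions without replacement. The following is only a sketch under that assumption: the function name replace_exact_fraction, its seed parameter, and the ten-row sample frame are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd


def replace_exact_fraction(df, replace_pct, low, high, replace_col, seed=None):
    """Return a copy of df with round(replace_pct * len(df)) values replaced."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_replace = int(round(replace_pct * len(out)))
    # pick distinct row positions, so the count is exact rather than expected
    positions = rng.choice(len(out), size=n_replace, replace=False)
    # integers() excludes `high`, matching np.random.randint
    new_values = rng.integers(low, high, size=n_replace)
    col = out.columns.get_loc(replace_col)
    out.iloc[positions, col] = new_values
    return out


# hypothetical sample data for illustration
df = pd.DataFrame({"value": [90, 91, 80, 87, 84, 94, 91, 94, 88, 92]})
result = replace_exact_fraction(df, 0.2, 1, 50, "value", seed=0)
print(result)
```

Working on a copy leaves the input frame untouched, which makes it easier to compare the original and the synthetic data side by side.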
Votes: 1

Stack Overflow user

Posted on 2019-12-30 04:40:59

This might work.

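import numpy as np

outliers = []
def get_outlier(x):
    # collect values whose absolute z-score exceeds the threshold
    num = 3
    mean_ = np.mean(x)
    std_ = np.std(x)
    for y in x:
        z_score = (y - mean_) / std_
        if np.abs(z_score) > num:
            outliers.append(y)
    return outliers  # return the collected values, not the function itself

detect_outliers = get_outlier(df['value'])

# IQR fences: the upper bound adds 1.5 * IQR to the third quartile
q1, q3 = np.percentile(df['value'], [25, 75])
iqr = q3 - q1
lb = q1 - (1.5 * iqr)
ub = q3 + (1.5 * iqr)

for i in df.index:
    if (df.loc[i, 'value'] < lb) or (df.loc[i, 'value'] > ub):
        # .loc avoids chained assignment, which may not write back to df
        df.loc[i, 'value'] = np.random.randint(1, 50)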
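
The row-by-row loop above can probably be expressed as a single boolean-mask assignment, which tends to scale better on a 50K-row frame. This is a sketch of the same IQR-fence idea, not the answerer's code. The sample data, including the 300 added so that one value falls outside the fences, and the fixed seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical data: 300 is added so one value lies outside the IQR fences
df = pd.DataFrame({"value": [90, 91, 80, 87, 84, 94, 91, 300]})

q1, q3 = np.percentile(df["value"], [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# boolean mask of rows outside the IQR fences
is_outlier = (df["value"] < lower) | (df["value"] > upper)

# draw one replacement per flagged row and assign them in a single step
rng = np.random.default_rng(0)
df.loc[is_outlier, "value"] = rng.integers(1, 50, size=int(is_outlier.sum()))
print(df)
```

Like the loop, this only rewrites values that are already outside the fences, so it does not by itself inject outliers into 20% of the rows.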
Votes: 0

Stack Overflow user

Posted on 2019-12-30 05:22:22

Another attempt, using DataFrame.sample()

import numpy as np
import pandas as pd

d = {'date':["2016-11-10", "2016-11-10", "2016-11-11", "2016-11-11","2016-11-11","2016-11-11","2016-11-11", "2016-11-11" ],
     'time': ["22:00:00", "23:00:00", "00:00:00", "01:00:00", "02:00:00", "03:00:00", "04:00:00", "04:00:00"],
     'value':[90, 91, 80, 87, 84,94, 91, 94]}

df = pd.DataFrame(d)

random_rows = df.sample(frac=.2)    # 20% random rows from `df`

# we are replacing these 20% random rows with values from 1..50 and 200..300 (in ~1:1 ratio)
random_values = np.random.choice( np.concatenate( [np.random.randint(1, 50, size=len(random_rows) // 2 + 1),
                                                   np.random.randint(200, 300, size=len(random_rows) // 2 + 1)] ),
                size=len(random_rows) )
df.loc[random_rows.index, 'value'] = random_values
print(df)

This prints (for example):

         date      time  value
0  2016-11-10  22:00:00     31   <-- 31
1  2016-11-10  23:00:00     91
2  2016-11-11  00:00:00     80
3  2016-11-11  01:00:00     87
4  2016-11-11  02:00:00     84
5  2016-11-11  03:00:00    236   <-- 236
6  2016-11-11  04:00:00     91
7  2016-11-11  04:00:00     94
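
When the synthetic outliers are used to test an algorithm, it may help to make the injection reproducible so that two runs see the same corrupted rows. The sketch below assumes that goal. The helper name inject, its seed parameter, and the sample frame are hypothetical. It passes random_state to DataFrame.sample() and picks the low or high range per row instead of sampling from a concatenated pool.

```python
import numpy as np
import pandas as pd

# hypothetical sample data for illustration
df = pd.DataFrame({"value": [90, 91, 80, 87, 84, 94, 91, 94, 88, 92]})


def inject(df, frac, seed):
    """Return a copy of df with a reproducible fraction of 'value' replaced."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    # random_state makes the choice of rows repeatable
    rows = out.sample(frac=frac, random_state=seed).index
    # for each picked row, choose the high range with probability 0.5
    use_high = rng.random(len(rows)) < 0.5
    low_vals = rng.integers(1, 50, size=len(rows))
    high_vals = rng.integers(200, 300, size=len(rows))
    out.loc[rows, "value"] = np.where(use_high, high_vals, low_vals)
    return out


a = inject(df, 0.2, seed=42)
b = inject(df, 0.2, seed=42)
print(a.equals(b))  # the same seed gives the same frame
```

Changing the seed gives a different but equally repeatable set of outliers.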
Votes: 0
The original page content is provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/59522783