Q: Pandas: select a slice, then map a function onto a column

Stack Overflow user

Asked 2017-03-12 01:32:00

1 answer · 1.2K views · 0 followers · 0 votes

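I have a CSV database that I am trying to "sanitize" using k-anonymity (replacing private information with *'s). I am trying to select the set of rows that have the same values in several columns. If there are enough such rows, I want to modify a column value for those rows.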

subset= df.loc[(df['Gender'] == g) & (df['Date of birth'] == bd) & (df['Postal code'] == pc), :]
if subset.shape[0] < k :
    subset['Date of birth'] = subset['Date of birth'].apply(lambda db: f(db))

This code produces the following error:

> A value is trying to be set on a copy of a slice from a DataFrame. Try
> using .loc[row_indexer,col_indexer] = value instead
> 
> See the caveats in the documentation:
> http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
> subset['Date of birth'] = subset['Date of birth'].apply(lambda db:
> day_in_date_of_birth_with_stars(db))

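I am not sure how to fix this. I could repeat the row lookup instead of storing it in a variable, but this will be run many times and I want it to be as fast as possible.

This code also does not modify the values in the dataframe.

I have changed the code to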

num_with_all = df.loc[(df['Gender'] == g) & (df['Date of birth'] == bd) & (df['Postal code'] == pc)].shape[0]
if num_with_all < k:
    df.ix[(df['Gender'] == g) & (df['Date of birth'] == bd) & (df['Postal code'] == pc), 'Date of birth'] = df.loc[(df['Gender'] == g) & (df['Date of birth'] == bd) & (df['Postal code'] == pc), 'Date of birth'].apply(lambda bd: f(bd))

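This seems to run, but it takes a long time to go through all the subsets (gender, date of birth, postal code) in the database. Is there a way to make this more efficient?

Example:

I want to turn this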

    Name    Gender  Date of birth   Telephone   Postal code Disease
0   *************   M   18-7-1981   ************    N2L 6B5 Avian Influenza
1   *********** F   28-11-1976  *********** N2L 4T6 Human Pulmonary Syndrome (HPS)
2   *************** F   4-3-1962    ************    N2L 1L9 Chlamydial infection
3   *************** F   10-8-1967   ************    N2L 4M5 Dandy fever
4   ****************    F   19-3-1963   ************    N2L 2L1 Chlamydial infection
5   ************    F   2-2-1979    ************    N2L 5J1 Scarlet fever
6   *********** M   21-1-1985   *********** N2L 1S6 Scarlet fever
7   *********** M   7-6-1977    ************    N2L 2Q9 Chlamydia
8   *************** F   9-11-1987   ************    N2L 7H9 Chlamydia
9   *****************   M   7-7-1989    ************    N2L 3B1 SARS- Severe Acute Respiratory Syndrome
10  *********** M   1-3-1969    ************    N2L 6N9 Malaria
11  **************  M   21-4-1990   ************    N2L 0B0 North American blastomycosis
12  *************** F   9-12-1964   ************    N2L 7F6 Chlamydia
13  **********  M   21-7-1960   ************    N2L 3P3 Chickenpox
14  ******************  F   11-10-1972  *********** N2L 6E4 Diphtheria
15  **************  M   25-12-1988  ************    N2L 1T4 SARS- Severe Acute Respiratory Syndrome

into

    Name    Gender  Date of birth   Telephone   Postal code Disease
0   *************   M   **-7-1981   ************    N2L 6B5 Avian Influenza
1   *********** F   **-11-1976  *********** N2L 4T6 Human Pulmonary Syndrome (HPS)
2   *************** F   *-3-1962    ************    N2L 1L9 Chlamydial infection
3   *************** F   **-8-1967   ************    N2L 4M5 Dandy fever
4   ****************    F   19-3-1963   ************    N2L 2L1 Chlamydial infection
5   ************    F   2-2-1979    ************    N2L 5J1 Scarlet fever
6   *********** M   **-1-1985   *********** N2L 1S6 Scarlet fever
7   *********** M   *-6-1977    ************    N2L 2Q9 Chlamydia
8   *************** F   9-11-1987   ************    N2L 7H9 Chlamydia
9   *****************   M   *-7-1989    ************    N2L 3B1 SARS- Severe Acute Respiratory Syndrome
10  *********** M   1-3-1969    ************    N2L 6N9 Malaria
11  **************  M   21-4-1990   ************    N2L 0B0 North American blastomycosis
12  *************** F   *-12-1964   ************    N2L 7F6 Chlamydia
13  **********  M   **-7-1960   ************    N2L 3P3 Chickenpox
14  ******************  F   11-10-1972  *********** N2L 6E4 Diphtheria
15  **************  M   **-12-1988  ************    N2L 1T4 SARS- Severe Acute Respiratory Syndrome

Only some rows have their date of birth changed, with *s replacing some of the digits.

Tags: python-2.7, pandas, dataframe, slice

1 Answer

Stack Overflow user

Accepted answer

Answered 2017-03-12 10:48:40

This would be much easier if you had provided data to play with.

Execution times:

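1. df.loc ... df.ix ... = df.loc ... .apply(lambda: ...)
Looping 100 times over 1600 simulated records in 0:00:02.225785. This is the code in your question.

2. loc ... .apply(lambda series: ...)
Looping 100 times over 1600 simulated records in 0:00:00.757525.

def f2(series):
    df.loc[series.name, 'Date of birth'] = '**' + series['Date of birth']
    return series

query_df = df.loc[(df['Gender'] == g) & (df['Date of birth'] == bd) & (df['Postal code'] == pc)]
if len(query_df) < k:
    query_df.apply(lambda series: f2(series), axis=1)

3. loc ... loop over the index
Looping 100 times over 1600 simulated records in 0:00:00.666067.

query_df = df.loc[(df['Gender'] == g) & (df['Date of birth'] == bd) & (df['Postal code'] == pc)]
if len(query_df) < k:
    for idx in query_df.index:
        df.loc[idx, 'Date of birth'] = '**' + df.loc[idx, 'Date of birth']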
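The message quoted in the question is pandas' SettingWithCopyWarning: `subset` is a separate object from `df`, so assigning to it leaves `df` unchanged. One way around it, sketched below under a few assumptions, is to build the boolean mask once and then both count and assign through `df.loc` on the original frame. The helper `mask_day` is a hypothetical stand-in for the question's `f`, and the three rows and the values of `g`, `bd`, `pc`, and `k` are illustrative only.

```python
import pandas as pd


def mask_day(date_str):
    # Hypothetical stand-in for the question's f():
    # replace the day digits of a "d-m-yyyy" string with stars.
    day, rest = date_str.split("-", 1)
    return "*" * len(day) + "-" + rest


# Three rows taken from the question's sample table.
df = pd.DataFrame({
    "Gender": ["M", "F", "F"],
    "Date of birth": ["18-7-1981", "28-11-1976", "4-3-1962"],
    "Postal code": ["N2L 6B5", "N2L 4T6", "N2L 1L9"],
})

# Illustrative query values and threshold.
g, bd, pc, k = "F", "28-11-1976", "N2L 4T6", 2

# Build the boolean mask once; reuse it for counting and for assignment.
mask = (df["Gender"] == g) & (df["Date of birth"] == bd) & (df["Postal code"] == pc)

if mask.sum() < k:
    # Assigning through df.loc writes into the original frame, not into a copy.
    df.loc[mask, "Date of birth"] = df.loc[mask, "Date of birth"].map(mask_day)

print(df)
```

Reusing `mask` also avoids evaluating the three comparisons twice, which the rewritten code in the question does on every call.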

Note: regarding if len(query_df) < k:, shouldn't this be if len(query_df) >= k:?
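Whichever comparison is intended, the per-query loop can probably be avoided altogether by computing the size of every (Gender, Date of birth, Postal code) group in one pass. The sketch below is not part of the timings above; it assumes the question's `< k` condition, uses made-up toy rows, and swaps the question's `f` for a regex that stars out the leading day digits. Flip the comparison if `>= k` is what you want.

```python
import pandas as pd

# Toy data for illustration; rows 1 and 2 share the same key.
df = pd.DataFrame({
    "Gender": ["M", "F", "F", "F"],
    "Date of birth": ["18-7-1981", "19-3-1963", "19-3-1963", "4-3-1962"],
    "Postal code": ["N2L 6B5", "N2L 2L1", "N2L 2L1", "N2L 1L9"],
})

k = 2  # illustrative threshold
keys = ["Gender", "Date of birth", "Postal code"]

# Size of the group each row belongs to, aligned with df's index.
group_size = df.groupby(keys)["Gender"].transform("size")

# Rows whose group is smaller than k (assumption: the question's "< k" test).
small = group_size < k

# Replace the leading day digits with the same number of stars.
df.loc[small, "Date of birth"] = df.loc[small, "Date of birth"].str.replace(
    r"^\d+", lambda m: "*" * len(m.group()), regex=True
)

print(df)
```

Because the group sizes are computed before any value is changed, masking one row does not affect the count for another.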

You can get an additional speedup by using an index on the columns ('Gender', 'Date of birth', 'Postal code').
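As a rough sketch of what such an index could look like, the three columns can be moved into a sorted MultiIndex so that a lookup by key no longer scans three boolean comparisons over the whole frame. The data and the helper `count_matches` below are hypothetical, and no speedup figure is claimed here; it would need to be measured on your own data.

```python
import pandas as pd

# Toy data for illustration; rows 1 and 2 share the same key.
df = pd.DataFrame({
    "Gender": ["M", "F", "F", "F"],
    "Date of birth": ["18-7-1981", "19-3-1963", "19-3-1963", "4-3-1962"],
    "Postal code": ["N2L 6B5", "N2L 2L1", "N2L 2L1", "N2L 1L9"],
    "Disease": ["Avian Influenza", "Chlamydia", "Scarlet fever", "Malaria"],
})

keys = ["Gender", "Date of birth", "Postal code"]

# Move the key columns into a MultiIndex and sort it for faster lookups.
indexed = df.set_index(keys).sort_index()


def count_matches(frame, key):
    """Return how many rows of `frame` carry the full index key `key`."""
    if key not in frame.index:
        return 0
    # A list of one tuple selects every row with exactly that key.
    return len(frame.loc[[key]])


n_pair = count_matches(indexed, ("F", "19-3-1963", "N2L 2L1"))
n_none = count_matches(indexed, ("M", "1-1-2000", "N2L 0A0"))
print(n_pair, n_none)
```

Note that 'Date of birth' is part of the index here, so rewriting it would mean rebuilding the index afterwards; this sketch only covers the lookup side.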

Output (using your data):

I show only the first 5 records. Only record 1 is changed, because my query condition is 'F', '28-11-1976', 'N2L 4T6'.

               Name Gender Date of birth     Telephone Postal code
0     *************      M     18-7-1981  ************     N2L 6B5
1       ***********      F    **-11-1976   ***********     N2L 4T6
2   ***************      F      4-3-1962  ************     N2L 1L9
3   ***************      F     10-8-1967  ************     N2L 4M5
4  ****************      F     19-3-1963  ************     N2L 2L1

Tested with Python 3.4.2 and pandas 0.19.2

Votes: 2

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/42742711
