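I have a CSV database that I am trying to "sanitize" (replace private information with *'s) using k-anonymity. I am trying to select the set of rows that share the same values in several columns. If there are enough rows, I want to modify a column value for those rows.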
subset = df.loc[(df['Gender'] == g) & (df['Date of birth'] == bd) & (df['Postal code'] == pc), :]
if subset.shape[0] < k:
    subset['Date of birth'] = subset['Date of birth'].apply(lambda db: f(db))

This code produces an error:
> A value is trying to be set on a copy of a slice from a DataFrame. Try
> using .loc[row_indexer,col_indexer] = value instead
>
> See the caveats in the documentation:
> http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
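> subset['Date of birth'] = subset['Date of birth'].apply(lambda db: day_in_date_of_birth_with_stars(db))

Any idea how to fix this? I could repeat the row lookup instead of storing it in a variable, but this is going to be run many times and I want it to be as fast as possible.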
This code also does not modify the values in the dataframe.
I have changed the code to
num_with_all = df.loc[(df['Gender'] == g) & (df['Date of birth'] == bd) & (df['Postal code'] == pc)].shape[0]
if num_with_all < k:
    df.ix[(df['Gender'] == g) & (df['Date of birth'] == bd) & (df['Postal code'] == pc), 'Date of birth'] = df.loc[(df['Gender'] == g) & (df['Date of birth'] == bd) & (df['Postal code'] == pc), 'Date of birth'].apply(lambda bd: f(bd))

This seems to work, but it takes a long time to go through all the subsets of (gender, date of birth, postal code) in the database. Is there a way to make this more efficient?
Example:
I want to turn this
Name Gender Date of birth Telephone Postal code Disease
0 ************* M 18-7-1981 ************ N2L 6B5 Avian Influenza
1 *********** F 28-11-1976 *********** N2L 4T6 Human Pulmonary Syndrome (HPS)
2 *************** F 4-3-1962 ************ N2L 1L9 Chlamydial infection
3 *************** F 10-8-1967 ************ N2L 4M5 Dandy fever
4 **************** F 19-3-1963 ************ N2L 2L1 Chlamydial infection
5 ************ F 2-2-1979 ************ N2L 5J1 Scarlet fever
6 *********** M 21-1-1985 *********** N2L 1S6 Scarlet fever
7 *********** M 7-6-1977 ************ N2L 2Q9 Chlamydia
8 *************** F 9-11-1987 ************ N2L 7H9 Chlamydia
9 ***************** M 7-7-1989 ************ N2L 3B1 SARS- Severe Acute Respiratory Syndrome
10 *********** M 1-3-1969 ************ N2L 6N9 Malaria
11 ************** M 21-4-1990 ************ N2L 0B0 North American blastomycosis
12 *************** F 9-12-1964 ************ N2L 7F6 Chlamydia
13 ********** M 21-7-1960 ************ N2L 3P3 Chickenpox
14 ****************** F 11-10-1972 *********** N2L 6E4 Diphtheria
15 ************** M 25-12-1988 ************ N2L 1T4 SARS- Severe Acute Respiratory Syndrome

into
Name Gender Date of birth Telephone Postal code Disease
0 ************* M **-7-1981 ************ N2L 6B5 Avian Influenza
1 *********** F **-11-1976 *********** N2L 4T6 Human Pulmonary Syndrome (HPS)
2 *************** F *-3-1962 ************ N2L 1L9 Chlamydial infection
3 *************** F **-8-1967 ************ N2L 4M5 Dandy fever
4 **************** F 19-3-1963 ************ N2L 2L1 Chlamydial infection
5 ************ F 2-2-1979 ************ N2L 5J1 Scarlet fever
6 *********** M **-1-1985 *********** N2L 1S6 Scarlet fever
7 *********** M *-6-1977 ************ N2L 2Q9 Chlamydia
8 *************** F 9-11-1987 ************ N2L 7H9 Chlamydia
9 ***************** M *-7-1989 ************ N2L 3B1 SARS- Severe Acute Respiratory Syndrome
10 *********** M 1-3-1969 ************ N2L 6N9 Malaria
11 ************** M 21-4-1990 ************ N2L 0B0 North American blastomycosis
12 *************** F *-12-1964 ************ N2L 7F6 Chlamydia
13 ********** M **-7-1960 ************ N2L 3P3 Chickenpox
14 ****************** F 11-10-1972 *********** N2L 6E4 Diphtheria
15 ************** M **-12-1988 ************ N2L 1T4 SARS- Severe Acute Respiratory Syndrome

Only some of the rows have their date of birth changed, with *s replacing some of the digits.
Posted on 2017-03-12 10:48:40
It would be much easier if you had provided data to play with.
List of execution times:
The code in your Question

Note: Regarding if len(query_df) < k:, shouldn't this be if len(query_df) >= k:?
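To make the warning from the question concrete, here is a minimal sketch of one way to avoid it: build the boolean mask once, count the matches with it, and assign through df.loc on the original DataFrame instead of on a subset. The names mask_day and anonymize_one are hypothetical (mask_day stands in for the question's f), and the rows are a small excerpt of the question's data without the Name, Telephone and Disease columns. The comparison < k follows the question's code; flip it if the reading in the note above is the intended one.

```python
import pandas as pd


def mask_day(date_str):
    """Hypothetical stand-in for f: replace each digit of the day with '*'."""
    day, rest = date_str.split("-", 1)
    return "*" * len(day) + "-" + rest


def anonymize_one(df, g, bd, pc, k):
    # Build the boolean mask once and reuse it for counting and for assignment.
    mask = (df["Gender"] == g) & (df["Date of birth"] == bd) & (df["Postal code"] == pc)
    if mask.sum() < k:
        # Assigning through df.loc writes into df itself, not into a copy,
        # so no SettingWithCopyWarning is raised and the change persists.
        df.loc[mask, "Date of birth"] = df.loc[mask, "Date of birth"].map(mask_day)
    return df


# First three rows of the question's example data.
df = pd.DataFrame({
    "Gender": ["M", "F", "F"],
    "Date of birth": ["18-7-1981", "28-11-1976", "4-3-1962"],
    "Postal code": ["N2L 6B5", "N2L 4T6", "N2L 1L9"],
})

anonymize_one(df, "F", "28-11-1976", "N2L 4T6", k=2)
print(df)
```

Only the row matching the query has its day masked; the other rows are left untouched.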
You can gain an additional speedup if you use an index on the columns ('Gender', 'Date of birth', 'Postal code').
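As a rough illustration of that idea, the sketch below swaps in a related technique rather than a true DataFrame index: it builds a key-to-row-labels mapping once with groupby(...).groups, so each query becomes a dictionary lookup instead of a scan of three columns. This is an assumption about what "an index on the columns" could look like in practice, not the code that was timed above; mask_day and anonymize_lookup are hypothetical names, and the rows are an excerpt of the question's data.

```python
import pandas as pd

COLS = ["Gender", "Date of birth", "Postal code"]


def mask_day(date_str):
    """Hypothetical stand-in for f: replace each digit of the day with '*'."""
    day, rest = date_str.split("-", 1)
    return "*" * len(day) + "-" + rest


def anonymize_lookup(df, groups, key, k):
    # groups maps (gender, date of birth, postal code) -> row labels.
    labels = groups.get(key)
    if labels is not None and len(labels) < k:
        df.loc[labels, "Date of birth"] = df.loc[labels, "Date of birth"].map(mask_day)
    return df


# First three rows of the question's example data.
df = pd.DataFrame({
    "Gender": ["M", "F", "F"],
    "Date of birth": ["18-7-1981", "28-11-1976", "4-3-1962"],
    "Postal code": ["N2L 6B5", "N2L 4T6", "N2L 1L9"],
})

# Built once, before any values are masked; reused for every query.
groups = df.groupby(COLS).groups

anonymize_lookup(df, groups, ("F", "28-11-1976", "N2L 4T6"), k=2)
print(df)
```

The mapping is keyed by the original, unmasked values, so it should be built before any masking and queried with the original keys.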
Output: (using your data)
Only the first 5 records are shown. Only record 1 was changed, because my query condition was 'F', '28-11-1976', 'N2L 4T6'.
Name Gender Date of birth Telephone Postal code
0 ************* M 18-7-1981 ************ N2L 6B5
1 *********** F **-11-1976 *********** N2L 4T6
2 *************** F 4-3-1962 ************ N2L 1L9
3 *************** F 10-8-1967 ************ N2L 4M5
4 **************** F 19-3-1963 ************ N2L 2L1

Tested with Python: 3.4.2 - pandas: 0.19.2
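If the goal is to process every (gender, date of birth, postal code) combination rather than one query at a time, a vectorized alternative is to compute each row's group size in a single pass and mask all small groups at once. This is a sketch under assumptions, not something that was timed here: mask_day is a hypothetical stand-in for the question's f, the < k comparison follows the question's code, and the last row of the sample is a made-up duplicate of row 2, added only so that one group reaches size 2.

```python
import pandas as pd

COLS = ["Gender", "Date of birth", "Postal code"]


def mask_day(date_str):
    """Hypothetical stand-in for f: replace each digit of the day with '*'."""
    day, rest = date_str.split("-", 1)
    return "*" * len(day) + "-" + rest


def anonymize_all(df, k):
    # Size of the group each row belongs to, aligned with df's index.
    sizes = df.groupby(COLS)["Date of birth"].transform("size")
    small = sizes < k
    # One assignment covers every row that sits in a group smaller than k.
    df.loc[small, "Date of birth"] = df.loc[small, "Date of birth"].map(mask_day)
    return df


df = pd.DataFrame({
    "Gender": ["M", "F", "F", "F"],
    "Date of birth": ["18-7-1981", "28-11-1976", "4-3-1962", "4-3-1962"],
    # Row 3 is an illustrative duplicate of row 2, not part of the question's data.
    "Postal code": ["N2L 6B5", "N2L 4T6", "N2L 1L9", "N2L 1L9"],
})

anonymize_all(df, k=2)
print(df)
```

With k=2, rows 0 and 1 are alone in their groups and get masked, while the two identical rows keep their date of birth.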
https://stackoverflow.com/questions/42742711