
Duplicates, drop_duplicates malfunction

Stack Overflow user
Asked 2019-10-16 15:38:56
Answers: 2 | Views: 72 | Following: 0 | Votes: 0

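This is a follow-up to an earlier question about merging two files with protein data.

I imported my DataFrame with the biopandas package, and I cannot get duplicated/drop_duplicates to pick out my duplicates. The DataFrame is fairly large: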

# df:

col1    col2    col3    col4    col5    col6    col7    col8    col9

0   ATOM    N   SER     15  17.203  0.286   72.985  4pxz
1   ATOM    CA  SER     15  16.713  1.342   73.869  4pxz
2   ATOM    C   SER     15  17.885  2.188   74.412  4pxz
3   ATOM    O   SER     15  18.028  3.351   74.013  4pxz
4   ATOM    CB  SER     15  15.889  0.750   75.014  4pxz
...     ...     ...     ...     ...     ...     ...     ...     ...
3   ATOM    CD  ARG     93  12.319  8.102   61.886  hatp
4   ATOM    NE  ARG     93  11.978  6.754   61.425  hatp
5   ATOM    CZ  ARG     93  11.731  5.714   62.217  hatp
6   ATOM    NH2     ARG     93  11.430  4.535   61.694  hatp
7   ATOM    NH1     ARG     93  11.793  5.843   63.538  hatp

3148 rows × 8 columns

I want to inspect the duplicated rows with the following command:

df2 = df[df.duplicated(['col3','col4','col5'])] # show me duplicates containing identical type(col3), abbreviation(col4) and number(col5).

I get:

col1    col2    col3    col4    col5    col6    col7    col8

2132    ATOM    CA      HIS     1063    38.442  -16.479     -5.209  4pxz
2136    ATOM    CB      HIS     1063    37.502  -15.555     -6.008  4pxz
2138    ATOM    CG      HIS     1063    38.007  -15.211     -7.378  4pxz
2140    ATOM    ND1     HIS     1063    38.342  -16.194     -8.293  4pxz
2142    ATOM    CD2     HIS     1063    38.213  -14.000     -7.943  4pxz
2144    ATOM    CE1     HIS     1063    38.749  -15.553     -9.375  4pxz
2146    ATOM    NE2     HIS     1063    38.688  -14.231     -9.213  4pxz
0       ATOM    CA      ARG     93  11.357  9.429   58.493  hatp
1       ATOM    CB      ARG     93  12.236  9.564   59.757  hatp
2       ATOM    CG      ARG     93  11.569  9.166   61.087  hatp
3       ATOM    CD      ARG     93  12.319  8.102   61.886  hatp
4       ATOM    NE      ARG     93  11.978  6.754   61.425  hatp
5       ATOM    CZ      ARG     93  11.731  5.714   62.217  hatp
6       ATOM    NH2     ARG     93  11.430  4.535   61.694  hatp
7       ATOM    NH1     ARG     93  11.793  5.843   63.538  hatp

Expected output:

col1    col2    col3    col4    col5    col6    col7    col8    col9

606     ATOM    CA  ARG     93  11.357  9.429   58.493  4pxz
609     ATOM    CB  ARG     93  12.236  9.564   59.757  4pxz
610     ATOM    CG  ARG     93  13.088  8.333   60.120  4pxz
611     ATOM    CD  ARG     93  13.985  7.822   58.995  4pxz
612     ATOM    NE  ARG     93  14.503  6.485   59.295  4pxz
613     ATOM    CZ  ARG     93  15.012  5.642   58.400  4pxz
614     ATOM    NH1 ARG     93  15.074  5.979   57.116  4pxz
615     ATOM    NH2 ARG     93  15.455  4.453   58.780  4pxz
0   ATOM    CA      ARG     93  11.357  9.429   58.493  hatp
1   ATOM    CB      ARG     93  12.236  9.564   59.757  hatp
2   ATOM    CG      ARG     93  11.569  9.166   61.087  hatp
3   ATOM    CD      ARG     93  12.319  8.102   61.886  hatp
4   ATOM    NE      ARG     93  11.978  6.754   61.425  hatp
5   ATOM    CZ      ARG     93  11.731  5.714   62.217  hatp
6   ATOM    NH2     ARG     93  11.430  4.535   61.694  hatp
7   ATOM    NH1     ARG     93  11.793  5.843   63.538  hatp

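As you can see, the result does not match what I asked duplicated() for: the 4pxz rows for ARG 93 are missing (drop_duplicates behaves exactly the same way). To get them I have to use: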

df2 = df[df['col5'] == 93]

What is going on?


2 Answers

Stack Overflow user

Answered 2019-10-16 16:08:38

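Isn't the command df.duplicated?

Also make sure to pass the option keep=False; with the default keep='first', the first occurrence of each duplicated key is not marked as a duplicate.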
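To illustrate the effect of keep=False, here is a minimal sketch on a hypothetical five-row miniature of the merged table. The values and the reduced column set are made up for illustration and are not the asker's real data; only the column names follow the question.

```python
import pandas as pd

# Hypothetical miniature of the merged table; values are illustrative only.
df = pd.DataFrame({
    "col3": ["CA", "CB", "CA", "CB", "N"],
    "col4": ["ARG", "ARG", "ARG", "ARG", "SER"],
    "col5": [93, 93, 93, 93, 15],
    "col9": ["4pxz", "4pxz", "hatp", "hatp", "4pxz"],
})
key = ["col3", "col4", "col5"]

# Default keep="first": the first row of each repeated key is NOT flagged,
# so only the later copies (here the hatp rows) come back.
later_only = df[df.duplicated(subset=key)]

# keep=False: every row whose key occurs more than once is flagged.
all_members = df[df.duplicated(subset=key, keep=False)]

print(later_only["col9"].tolist())   # ['hatp', 'hatp']
print(all_members["col9"].tolist())  # ['4pxz', '4pxz', 'hatp', 'hatp']
```

This matches the symptom in the question: with the default setting, the 4pxz copies of ARG 93 are treated as the "originals" and left out of the result.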

Votes: 1

Stack Overflow user

Answered 2019-10-16 16:51:53

The correct answer is:

df2 = df[df.duplicated(subset = ['col3','col4','col5'], keep = False)]

Thanks a lot, friends!
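Since the question notes that drop_duplicates behaves the same way, here is a short sketch of how the keep option plays out there, again on a hypothetical miniature table (made-up values, column names taken from the question).

```python
import pandas as pd

# Hypothetical miniature of the merged table; values are illustrative only.
df = pd.DataFrame({
    "col3": ["CA", "CB", "CA", "CB", "N"],
    "col4": ["ARG", "ARG", "ARG", "ARG", "SER"],
    "col5": [93, 93, 93, 93, 15],
    "col9": ["4pxz", "4pxz", "hatp", "hatp", "4pxz"],
})
key = ["col3", "col4", "col5"]

# Default keep="first": one row per key survives (the first occurrence).
kept_first = df.drop_duplicates(subset=key)

# keep=False: all rows of a repeated key are dropped,
# leaving only keys that occur exactly once.
unique_only = df.drop_duplicates(subset=key, keep=False)

print(kept_first.index.tolist())   # [0, 1, 4]
print(unique_only.index.tolist())  # [4]
```

In other words, duplicated(..., keep=False) selects the rows that drop_duplicates(..., keep=False) would throw away.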

Votes: 0
Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/58408151
