首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >比较具有不同DataTypes的行

比较具有不同DataTypes的行
EN

Stack Overflow用户
提问于 2022-01-22 04:34:03
回答 1查看 74关注 0票数 0

背景- - I正在构建一个基于python的QC脚本,该脚本通过API调用从平台中检索数据,然后将其转换为熊猫df,然后根据2x行条件计算行是否应该删除。本质上,biproduct是一个例外报告,它留下了需要调查的数据行。

待测试的标准-

  1. Criteria #1解释- IF (%所有权== 100) & IF (所有权审计注释∋包含“忽略”或“忽略”)&(如果实体ID % ==帐户# %),则删除行。

标准#1代码- df = df[(~(df['Ownership Audit Note'].str.contains('ignore|Ignore'))) & (~(df['% Ownership'] == 100)) & (~(df['Account # %'] == 'Entity ID %'))]

  1. Criteria #2 (在我需要帮助的地方)--下面的可视化更好地解释了这一点,但是.我需要首先测试“审核”或“评审”是否在“所有权审核说明”中,然后测试是否包含一个部分字符串(例如,0.33),该字符串也包含在“实体ID %”和“Account#%”中。如果符合条件,则删除行。

如您所知,我正在尝试测试具有不同数据类型的列,并且不知道如何实现这一点。有人能把我放在正确的方向,并建议我如何处理编码这个标准吗?

来自Dataframe-的提取是从df中提取的,其中的列是在2x标准中测试的。注:索引位置1后面的分隔符表示标准1和2的示例行:

代码语言:javascript
复制
  (float64)       (object)                                       (float64)       (float64) 
  % Ownership     Ownership Audit Note                           Entity ID %     Account # % 
0 100.00          [ignore] 100% Ownership                        100.0000000     100.0000000
1 100.00          [Ignore] 100% Ownership                          0.0000000       0.0000000  
--------------------------------------------------------------------------------------------  
2 100.00          [Reviewed] Recurring 67% as of 2022-01-07       67.0000000      67.0000000
3 67.00           [Reviewed: 67%] Recur. as of 2022-01-07         67.0000000      67.0000000
4 100.00          [Review: 67.003%] Recurring 100% Ownership      67.0000000      67.0000000  
EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2022-01-22 07:47:25

因此,给定以下数据:

代码语言:javascript
复制
import pandas as pd

df = pd.DataFrame(
    {
        "% Ownership": {0: 100.0, 1: 100.0, 2: 100.0, 3: 67.0, 4: 100.0},
        "Ownership Audit Note": {
            0: "[ignore] 100% Ownership ",
            1: "[Ignore] 100% Ownership ",
            2: "[Reviewed] Recurring 67% as of 2022-01-07 ",
            3: "[reviewed: 67%] Recur. as of 2022-01-07",
            4: "[Review: 67.003%] Recurring 100% Ownership:",
        },
        "Entity ID": {
            0: 100.0,
            1: 0.0,
            2: 67.0,
            3: 67.0,
            4: 67.0,
        },
        "Account": {
            0: 100.0,
            1: 0.0,
            2: 67.0,
            3: 67.0,
            4: 67.0,
        },
    }
)

第3行是“所有权审计说明”中“审查”的唯一一项,在“所有权审计说明”、“实体ID”和“账户”中的百分比相同。

因此,它应该放弃,这是可以做到的:

代码语言:javascript
复制
# 1. Add a temporary column with the percentage, if any, in "Ownership Audit Note".
# Here, Pandas str.extract method is used in conjunction with a regex pattern meaning: 
# "a percentage, like 67% or 67.xxx%, in a string that begins and ends with brackets".
# mask1 garantees that this string contains the word "reviewed", so no need to specify
# that in the regex expression.
mask1 = df["Ownership Audit Note"].str.lower().str.contains("reviewed")
df["value"] = df.loc[mask1, "Ownership Audit Note"].str.extract(
    pat=r"\[\D*(\d+\.?\d*)%\]"
)
df["value"] = df["value"].astype("float")

# 2. Remove lines that match all conditions
mask2 = (
    ~df["value"].isna()
    & (df["Entity ID"] == df["value"])
    & (df["Account"] == df["value"])
)
df = df[~mask2].drop(columns="value")

因此,第3行现在移除:

代码语言:javascript
复制
print(df)

# Output
   % Ownership                         Ownership Audit Note  Entity ID  Account
0        100.0                     [ignore] 100% Ownership       100.0    100.0
1        100.0                     [Ignore] 100% Ownership         0.0      0.0
2        100.0   [Reviewed] Recurring 67% as of 2022-01-07        67.0     67.0
4        100.0  [Review: 67.003%] Recurring 100% Ownership:       67.0     67.0
票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/70810014

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档