文章/答案/技术大牛

发布

社区首页 >问答首页 >比较具有不同DataTypes的行

问比较具有不同DataTypes的行
EN

Stack Overflow用户

提问于 2022-01-22 04:34:03

回答 1查看 74关注 0票数 0

背景- - I正在构建一个基于python的QC脚本，该脚本通过API调用从平台中检索数据，然后将其转换为熊猫df，然后根据2x行条件计算行是否应该删除。本质上，biproduct是一个例外报告，它留下了需要调查的数据行。

待测试的标准-

Criteria #1解释- IF (%所有权== 100) & IF (所有权审计注释∋包含“忽略”或“忽略”)&(如果实体ID % ==帐户# %)，则删除行。

标准#1代码- df = df[(~(df['Ownership Audit Note'].str.contains('ignore|Ignore'))) & (~(df['% Ownership'] == 100)) & (~(df['Account # %'] == 'Entity ID %'))]

Criteria #2 (在我需要帮助的地方)--下面的可视化更好地解释了这一点，但是.我需要首先测试“审核”或“评审”是否在“所有权审核说明”中，然后测试是否包含一个部分字符串(例如，0.33)，该字符串也包含在“实体ID %”和“Account#%”中。如果符合条件，则删除行。

如您所知，我正在尝试测试具有不同数据类型的列，并且不知道如何实现这一点。有人能把我放在正确的方向，并建议我如何处理编码这个标准吗？

来自Dataframe-的提取是从df中提取的，其中的列是在2x标准中测试的。注：索引位置1后面的分隔符表示标准1和2的示例行：

  (float64)       (object)                                       (float64)       (float64) 
  % Ownership     Ownership Audit Note                           Entity ID %     Account # % 
0 100.00          [ignore] 100% Ownership                        100.0000000     100.0000000
1 100.00          [Ignore] 100% Ownership                          0.0000000       0.0000000  
--------------------------------------------------------------------------------------------  
2 100.00          [Reviewed] Recurring 67% as of 2022-01-07       67.0000000      67.0000000
3 67.00           [Reviewed: 67%] Recur. as of 2022-01-07         67.0000000      67.0000000
4 100.00          [Review: 67.003%] Recurring 100% Ownership      67.0000000      67.0000000

python

pandas

回答 1

Stack Overflow用户

回答已采纳

发布于 2022-01-22 07:47:25

因此，给定以下数据：

import pandas as pd

df = pd.DataFrame(
    {
        "% Ownership": {0: 100.0, 1: 100.0, 2: 100.0, 3: 67.0, 4: 100.0},
        "Ownership Audit Note": {
            0: "[ignore] 100% Ownership ",
            1: "[Ignore] 100% Ownership ",
            2: "[Reviewed] Recurring 67% as of 2022-01-07 ",
            3: "[reviewed: 67%] Recur. as of 2022-01-07",
            4: "[Review: 67.003%] Recurring 100% Ownership:",
        },
        "Entity ID": {
            0: 100.0,
            1: 0.0,
            2: 67.0,
            3: 67.0,
            4: 67.0,
        },
        "Account": {
            0: 100.0,
            1: 0.0,
            2: 67.0,
            3: 67.0,
            4: 67.0,
        },
    }
)

第3行是“所有权审计说明”中“审查”的唯一一项，在“所有权审计说明”、“实体ID”和“账户”中的百分比相同。

因此，它应该放弃，这是可以做到的：

# 1. Add a temporary column with the percentage, if any, in "Ownership Audit Note".
# Here, Pandas str.extract method is used in conjunction with a regex pattern meaning: 
# "a percentage, like 67% or 67.xxx%, in a string that begins and ends with brackets".
# mask1 garantees that this string contains the word "reviewed", so no need to specify
# that in the regex expression.
mask1 = df["Ownership Audit Note"].str.lower().str.contains("reviewed")
df["value"] = df.loc[mask1, "Ownership Audit Note"].str.extract(
    pat=r"\[\D*(\d+\.?\d*)%\]"
)
df["value"] = df["value"].astype("float")

# 2. Remove lines that match all conditions
mask2 = (
    ~df["value"].isna()
    & (df["Entity ID"] == df["value"])
    & (df["Account"] == df["value"])
)
df = df[~mask2].drop(columns="value")

因此，第3行现在移除：

print(df)

# Output
   % Ownership                         Ownership Audit Note  Entity ID  Account
0        100.0                     [ignore] 100% Ownership       100.0    100.0
1        100.0                     [Ignore] 100% Ownership         0.0      0.0
2        100.0   [Reviewed] Recurring 67% as of 2022-01-07        67.0     67.0
4        100.0  [Review: 67.003%] Recurring 100% Ownership:       67.0     67.0

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/70810014

复制

相似问题

问比较具有不同DataTypes的行
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问比较具有不同DataTypes的行EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问比较具有不同DataTypes的行
EN