背景- - I正在构建一个基于python的QC脚本,该脚本通过API调用从平台中检索数据,然后将其转换为熊猫df,然后根据2x行条件计算行是否应该删除。本质上,biproduct是一个例外报告,它留下了需要调查的数据行。
待测试的标准-
标准#1代码- df = df[(~(df['Ownership Audit Note'].str.contains('ignore|Ignore'))) & (~(df['% Ownership'] == 100)) & (~(df['Account # %'] == 'Entity ID %'))]

如您所知,我正在尝试测试具有不同数据类型的列,并且不知道如何实现这一点。有人能把我放在正确的方向,并建议我如何处理编码这个标准吗?
来自Dataframe-的提取是从df中提取的,其中的列是在2x标准中测试的。注:索引位置1后面的分隔符表示标准1和2的示例行:
(float64) (object) (float64) (float64)
% Ownership Ownership Audit Note Entity ID % Account # %
0 100.00 [ignore] 100% Ownership 100.0000000 100.0000000
1 100.00 [Ignore] 100% Ownership 0.0000000 0.0000000
--------------------------------------------------------------------------------------------
2 100.00 [Reviewed] Recurring 67% as of 2022-01-07 67.0000000 67.0000000
3 67.00 [Reviewed: 67%] Recur. as of 2022-01-07 67.0000000 67.0000000
4 100.00 [Review: 67.003%] Recurring 100% Ownership 67.0000000 67.0000000 发布于 2022-01-22 07:47:25
因此,给定以下数据:
import pandas as pd
df = pd.DataFrame(
{
"% Ownership": {0: 100.0, 1: 100.0, 2: 100.0, 3: 67.0, 4: 100.0},
"Ownership Audit Note": {
0: "[ignore] 100% Ownership ",
1: "[Ignore] 100% Ownership ",
2: "[Reviewed] Recurring 67% as of 2022-01-07 ",
3: "[reviewed: 67%] Recur. as of 2022-01-07",
4: "[Review: 67.003%] Recurring 100% Ownership:",
},
"Entity ID": {
0: 100.0,
1: 0.0,
2: 67.0,
3: 67.0,
4: 67.0,
},
"Account": {
0: 100.0,
1: 0.0,
2: 67.0,
3: 67.0,
4: 67.0,
},
}
)第3行是“所有权审计说明”中“审查”的唯一一项,在“所有权审计说明”、“实体ID”和“账户”中的百分比相同。
因此,它应该放弃,这是可以做到的:
# 1. Add a temporary column with the percentage, if any, in "Ownership Audit Note".
# Here, Pandas str.extract method is used in conjunction with a regex pattern meaning:
# "a percentage, like 67% or 67.xxx%, in a string that begins and ends with brackets".
# mask1 garantees that this string contains the word "reviewed", so no need to specify
# that in the regex expression.
mask1 = df["Ownership Audit Note"].str.lower().str.contains("reviewed")
df["value"] = df.loc[mask1, "Ownership Audit Note"].str.extract(
pat=r"\[\D*(\d+\.?\d*)%\]"
)
df["value"] = df["value"].astype("float")
# 2. Remove lines that match all conditions
mask2 = (
~df["value"].isna()
& (df["Entity ID"] == df["value"])
& (df["Account"] == df["value"])
)
df = df[~mask2].drop(columns="value")因此,第3行现在移除:
print(df)
# Output
% Ownership Ownership Audit Note Entity ID Account
0 100.0 [ignore] 100% Ownership 100.0 100.0
1 100.0 [Ignore] 100% Ownership 0.0 0.0
2 100.0 [Reviewed] Recurring 67% as of 2022-01-07 67.0 67.0
4 100.0 [Review: 67.003%] Recurring 100% Ownership: 67.0 67.0https://stackoverflow.com/questions/70810014
复制相似问题