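I'm trying to QC some data I have. I'm identifying all phone numbers and emails across the entire dataframe. So far I can find them, but I'd also like to show the column name and the data in it.
I'm not sure how to create columns containing a list of the data wherever the boolean is true.
Any thoughts? Also, could the output be limited to a list of the matching numbers/emails, rather than the whole dataset for that match?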
import re

import pandas as pd

# Create a small dummy data set
df = pd.DataFrame({
    'ID': [1, 2, 5, 25, 26],
    'Lineage': ['apple', 'square', 'please Gino Mcneill gm@yahoo.com', 'ball', '888-555-5556 Ryan Parkes rp@abc.io'],
    'ShortDesc': ['618-552-2255', 'Gino Mcneill gm@github.com', '', ' please call now', 'if you have trouble you should call 816-455-5599 ASAP'],
    'LongDesc': ['Eesha Hinton', 'for help with product 56789, call 618-578-0055 immediately, or email Gino Mcneill gm@yahoo.com', 'maybe six five today for ever', 'more random text that could be really long and annoying', 'over the hills and through the woods']
})
#Find and list where email/phones exist
PEList =(pd.DataFrame({'ID': df['ID'],
'Email': df.select_dtypes(object)
.applymap(lambda x: bool(re.findall(r'(?:[a-z0-9!#$%&''*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&''*+/=?^_`{|}~-]+)*|\"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])', str(x))))
.any(axis=1),
'Phone': df.select_dtypes(object)
.applymap(lambda x: bool(re.findall(r'(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?', str(x))))
.any(axis=1)}
).replace({True: 'x', False: None})
)
cols = ['Email','Phone']
PEList = PEList.dropna(subset=cols, thresh=1)
# Desired result:
# ID | Email | Phone | E_Col | E_Col_Dat | P_Col | P_Col_Dat
# 1  |       | x     |       |           | ShortDesc | '618-552-2255'
# 2  | x     | x     | ShortDesc, LongDesc | 'Gino Mcneill gm@github.com', 'for help with product 56789, call 618-578-0055 immediately, or email Gino Mcneill gm@yahoo.com' | |
# 5  | x     |       | Lineage | 'please Gino Mcneill gm@yahoo.com' | |
# 26 | x     | x     | Lineage | '888-555-5556 Ryan Parkes rp@abc.io' | Lineage, ShortDesc | '888-555-5556 Ryan Parkes rp@abc.io', 'if you have trouble you should call 816-455-5599 ASAP'

Posted on 2021-07-23 23:31:16
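You can test whether each column contains certain regex patterns and, if it does, create a new column that copies the value. Otherwise, the value in that row of the new column will be null: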
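for col in df.columns:
    if df[col].dtype == 'object':
        # Flag cells that contain a phone number or something that looks like an email
        has_hit = (df[col].str.contains('[0-9]{3}-[0-9]{3}-[0-9]{4}')
                   | df[col].str.contains('[A-Za-z0-9]{1,}@[A-Za-z0-9]{1,}'))
        # Copy the cell into a new column where flagged; other rows stay NaN
        df.loc[has_hit, col+'_phone_or_email'] = df[col]

This gives: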
   ID  ...                            LongDesc_phone_or_email
0   1  ...                                                NaN
1   2  ...  for help with product 56789, call 618-578-0055...
2   5  ...                                                NaN
3  25  ...                                                NaN
4  26  ...                                                NaN

You can also use str.extract() to extract each hit.
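As a rough sketch of that last idea: str.extract() only returns the first match in each cell, so the version below uses re.findall to collect every hit instead. It lists just the matched numbers/emails together with the column they came from, then folds them into one row per ID. The PHONE and EMAIL patterns are simplified assumptions for this example (not the long regexes from the question), and the names cells, find_hits, matches and summary are hypothetical.

```python
import re

import pandas as pd

# Simplified patterns, assumed for this sketch only (not the question's full regexes).
PHONE = r'\d{3}-\d{3}-\d{4}'
EMAIL = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'

df = pd.DataFrame({
    'ID': [1, 2, 5, 25, 26],
    'Lineage': ['apple', 'square', 'please Gino Mcneill gm@yahoo.com', 'ball',
                '888-555-5556 Ryan Parkes rp@abc.io'],
    'ShortDesc': ['618-552-2255', 'Gino Mcneill gm@github.com', '', ' please call now',
                  'if you have trouble you should call 816-455-5599 ASAP'],
    'LongDesc': ['Eesha Hinton',
                 'for help with product 56789, call 618-578-0055 immediately, '
                 'or email Gino Mcneill gm@yahoo.com',
                 'maybe six five today for ever',
                 'more random text that could be really long and annoying',
                 'over the hills and through the woods'],
})

# Reshape to one row per (ID, column) cell so every text cell is searched the same way.
cells = df.melt(id_vars='ID', var_name='column', value_name='text')


def find_hits(cells_df, pattern, kind):
    """Return one row per regex match with the columns ID, column, kind, match."""
    found = cells_df['text'].astype(str).apply(lambda s: re.findall(pattern, s))
    hits = cells_df.assign(match=found).explode('match').dropna(subset=['match'])
    return hits.assign(kind=kind)[['ID', 'column', 'kind', 'match']]


# Long form: only the matched values, with the column each one was found in.
matches = pd.concat(
    [find_hits(cells, EMAIL, 'Email'), find_hits(cells, PHONE, 'Phone')],
    ignore_index=True,
)

# Wide form: one row per ID, listing the source columns and the matches per kind.
summary = (
    matches.groupby(['ID', 'kind'])
    .agg(cols=('column', lambda s: ', '.join(dict.fromkeys(s))),
         data=('match', lambda s: ', '.join(s)))
    .unstack('kind')
)

print(matches)
print(summary)
```

IDs with no hits (25 here) simply do not appear in either table. If you need the full regexes from the question, note that patterns containing capture groups make re.findall return tuples rather than plain strings, so the groups would have to be made non-capturing first.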
https://stackoverflow.com/questions/68501485