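Below is a snippet of data from dates.txt: https://github.com/BRAKESH3336/sample/blob/master/dates.txt
The task is to extract dates in the following formats: 04/20/2009, 04/20/09, 4/20/09, 4/3/09
If the data is imported as a single string, the regular expression works: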
import re

df = '''
03/25/93 Total time of visit (in minutes):
6/18/85 Primary Care Doctor:
sshe plans to move as of 7/8/71 In-Home Services: None
7 on 9/27/75 Audit C Score Current:
2/6/96 sleep studyPain Treatment Pain Level (Numeric Scale): 7
.Per 7/06/79 Movement D/O note:
4, 5/18/78 Patient's thoughts about current substance abuse:
10/24/89 CPT Code: 90801 - Psychiatric Diagnosis Interview
3/7/86 SOS-10 Total Score:
(4/10/71)Score-1Audit C Score Current:
(5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC_12.6Activities of Daily Living (ADL) Bathing: Independent
4/09/75 SOS-10 Total Score:
'''
pattern = re.compile(r'\d{0,2}[/]\d{1,2}[/]\d{2,4}')
matches = pattern.finditer(df)
for match in matches:
    print(match)

However, when the data is imported using open(), the regular expression does not work:
import pandas as pd

doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)
df = pd.Series(doc)
df.head(10)
pattern = re.compile(r'\d{0,2}[/]\d{1,2}[/]\d{2,4}')
matches = pattern.finditer(df)  # this is the line that raises the TypeError
for match in matches:
    print(match)

Why does this happen? The error I get is:
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-83-c6639f3c12f4> in <module>
1 pattern= re.compile(r'\d{0,2}[/]\d{1,2}[/]\d{2,4}')
----> 2 matches=pattern.finditer(df)
3 for match in matches:
4 print(match)
TypeError: expected string or bytes-like object

Posted on 2020-03-05 23:58:26
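The error message is self-explanatory: the finditer method expects its argument to be a string or bytes-like object, but you passed it an instance of Series. Since you have already read the file's lines as strings into the doc variable, your code should be: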
matches = pattern.finditer(''.join(doc))

Also, your regular expression should be:
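r'\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?'

\d{1,2} matches 1 or 2 digits. You had \d{0,2}, which makes the month optional (for example, it allows /5/2020 to match), and that is not really what you want.
/ matches a forward slash. There is no need for [/] (although it is not wrong); a character class is more useful if you want to allow several separators, for example [/-].
\d{1,2} matches 1 or 2 digits.
/ matches a forward slash.
\d{2}(?:\d{2})? matches 2 or 4 digits (2 digits, optionally followed by 2 more). This is more precise than what you had, which matched 2, 3 or 4 digits.

Also, a more "Pythonic" (and more efficient) way to create a list of strings made up of each line in the text file is: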
with open('dates.txt') as file:
    doc = [line for line in file]

At this point, does using pandas serve any purpose? If not, read the entire file as a single string:
with open('dates.txt') as file:
    doc = file.read()

Then there is no need to join any lines later.
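To make the difference between the two patterns concrete, here is a minimal sketch that runs both on a few lines copied from the sample data. The names original, suggested, text and dates are chosen only for this illustration, and the in-memory doc list stands in for the lines read from dates.txt so that the snippet runs without the file. The string '/5/2020' is a made-up edge case, not something taken from the data.

```python
import re

# Pattern from the question and the pattern suggested in the answer.
original = re.compile(r'\d{0,2}[/]\d{1,2}[/]\d{2,4}')
suggested = re.compile(r'\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?')

# Stand-in for the list of lines read from dates.txt (assumption: the
# real file contains lines like these, as shown in the question).
doc = [
    "03/25/93 Total time of visit (in minutes):\n",
    "sshe plans to move as of 7/8/71 In-Home Services: None\n",
    "(4/10/71)Score-1Audit C Score Current:\n",
]

# finditer needs a single string, so join the lines first.
text = ''.join(doc)
dates = [match.group() for match in suggested.finditer(text)]
print(dates)  # ['03/25/93', '7/8/71', '4/10/71']

# Made-up edge case: a "date" with no month.
print(original.findall('/5/2020'))   # ['/5/2020'] (month treated as optional)
print(suggested.findall('/5/2020'))  # [] (a month digit is required)
```

If the file was read with file.read() instead, text is already a single string and the join step can be dropped.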
https://stackoverflow.com/questions/60530323