
Question: Running a regular expression on dates.txt

Stack Overflow user

Asked 2020-03-05 00:15:09


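Here is a snippet of the data in dates.txt: https://github.com/BRAKESH3336/sample/blob/master/dates.txt

The task is to extract dates in the following formats: 04/20/2009, 04/20/09, 4/20/09, 4/3/09

The regular expression works when the data is supplied as a single string: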

df='''
03/25/93 Total time of visit (in minutes):
6/18/85 Primary Care Doctor:
sshe plans to move as of 7/8/71 In-Home Services: None
7 on 9/27/75 Audit C Score Current:
2/6/96 sleep studyPain Treatment Pain Level (Numeric Scale): 7
.Per 7/06/79 Movement D/O note:
4, 5/18/78 Patient's thoughts about current substance abuse:
10/24/89 CPT Code: 90801 - Psychiatric Diagnosis Interview
3/7/86 SOS-10 Total Score:
(4/10/71)Score-1Audit C Score Current:
(5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC_12.6Activities of Daily Living (ADL) Bathing: Independent
4/09/75 SOS-10 Total Score:
'''
import re

pattern= re.compile(r'\d{0,2}[/]\d{1,2}[/]\d{2,4}')
matches=pattern.finditer(df)
for match in matches:
    print(match)

However, when the data is loaded with open(), the regular expression does not work:

doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)
df.head(10)

pattern= re.compile(r'\d{0,2}[/]\d{1,2}[/]\d{2,4}')
matches=pattern.finditer(df)
for match in matches:
    print(match)

Why does this happen? The error I get is:

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-83-c6639f3c12f4> in <module>
      1 pattern= re.compile(r'\d{0,2}[/]\d{1,2}[/]\d{2,4}')
----> 2 matches=pattern.finditer(df)
      3 for match in matches:
      4     print(match)

TypeError: expected string or bytes-like object

python

regex

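1 Answer

Stack Overflow user

Answered 2020-03-05 23:58:26

The error message is self-explanatory: the finditer method expects a string or bytes-like object, but you passed it a pandas Series. Since you have already read the lines of the file into the list doc, join them into a single string and search that instead: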

matches=pattern.finditer(''.join(doc))
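
As a rough sketch of that fix, the snippet below joins a list of lines and scans the result, and also shows an alternative that scans each line separately so the line index is kept. The list doc here is a stand-in built from three lines of the question's sample rather than the real dates.txt, the pattern is the question's original one, and the names joined_dates and dates_by_line are illustrative only.

```python
import re

# Stand-in for the lines read from dates.txt (copied from the question's sample).
doc = [
    "03/25/93 Total time of visit (in minutes):\n",
    "6/18/85 Primary Care Doctor:\n",
    "sshe plans to move as of 7/8/71 In-Home Services: None\n",
]

# The question's original pattern, unchanged.
pattern = re.compile(r'\d{0,2}[/]\d{1,2}[/]\d{2,4}')

# Option 1: join the lines into one string and scan it once.
joined_dates = [m.group() for m in pattern.finditer(''.join(doc))]

# Option 2: scan each line on its own and remember which line matched.
dates_by_line = [
    (index, m.group())
    for index, line in enumerate(doc)
    for m in pattern.finditer(line)
]

print(joined_dates)
print(dates_by_line)
```

Either way, finditer only ever receives a str, which is what the TypeError was complaining about.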

In addition, your regular expression should be:

r'\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?'

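\d{1,2} matches 1 or 2 digits. Your \d{0,2} makes the month optional (so it would match, for example, /5/2020), which is not what you want.
/ matches a forward slash. [/] is not needed (although it is not wrong); a character class is more useful when you want to allow several separators, for example [/-].
\d{1,2} matches 1 or 2 digits.
/ matches a forward slash.
\d{2}(?:\d{2})? matches 2 or 4 digits (2 digits, optionally followed by 2 more). This is more precise than your \d{2,4}, which matches 2, 3, or 4 digits.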
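
To make the difference between the two patterns concrete, here is a small comparison. The test strings in samples are hypothetical and were chosen only to expose the differences; they do not come from dates.txt.

```python
import re

# The question's original pattern and the pattern proposed in this answer.
original = re.compile(r'\d{0,2}[/]\d{1,2}[/]\d{2,4}')
revised = re.compile(r'\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?')

# Hypothetical test strings, not taken from dates.txt.
samples = ["due /5/2020", "seen 4/3/09", "on 04/20/2009", "odd 1/2/345"]

# findall returns the full match text because neither pattern has a capturing group.
original_hits = [original.findall(s) for s in samples]
revised_hits = [revised.findall(s) for s in samples]

for sample, old, new in zip(samples, original_hits, revised_hits):
    print(f"{sample!r}: original={old} revised={new}")
```

The original pattern accepts /5/2020 and the three-digit year in 1/2/345, while the revised one rejects the first. Note that the revised pattern still finds 1/2/34 inside 1/2/345, because nothing in it anchors the end of the year; rejecting that case entirely would need something like a trailing (?!\d), which goes beyond what this answer proposes.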

Also, a more "Pythonic" (and more efficient) way to build a list of strings, one per line of the text file, is:

with open('dates.txt') as file:
    doc = [line for line in file]

Does pandas serve any purpose at this point? If not, read the whole file as a single string:

with open('dates.txt') as file:
    doc = file.read()

Then there is no need to join any lines afterwards.
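
Putting the pieces together, a minimal end-to-end sketch might look like the following. So that it runs on its own, it first writes a small stand-in dates.txt into a temporary directory using three lines from the question's sample; with the real file you would skip that step and open dates.txt directly. The names sample_text, path, and dates are illustrative.

```python
import os
import re
import tempfile

# Stand-in content for dates.txt (three lines from the question's sample).
sample_text = (
    "10/24/89 CPT Code: 90801 - Psychiatric Diagnosis Interview\n"
    "3/7/86 SOS-10 Total Score:\n"
    "(4/10/71)Score-1Audit C Score Current:\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dates.txt")
    with open(path, "w") as file:
        file.write(sample_text)

    # Read the whole file as one string, so no joining is needed later.
    with open(path) as file:
        doc = file.read()

# The pattern proposed in this answer.
pattern = re.compile(r'\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?')
dates = pattern.findall(doc)
print(dates)
```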


Original page content provided by Stack Overflow; the Chinese translation was provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/60530323
