I have strings like these in a data file:
140 "14 Feb 1995 Primary Care Doctor:
"
141 "30 May 2016 SOS-10 Total Score:
"
142 "22 January 1996 @ 11 AMCommunication with referring physician?: Done
"
I want to extract the days and the months separately, so I made a list:
list=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
for i in range(500):
    for month in list:
        a= 'r(\d\d) '+month+'[a-z]{,8}'
        b=df[0].str.findall(a)[i]
        df['day'][i]=b
When I look at df['day'], I only want to get values like 14 and 22.
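Posted on 2020-08-06 17:43:41
Try the following pattern:
pattern = re.compile(r"(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]{2,}) (?P<year>\d{2,4})")
Named capture groups, such as (?P<day>\d{1,2}), mean that you can take the returned match and extract just that field by name.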
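As a minimal sketch of how the named groups can be read back, the snippet below runs the same pattern against one of the sample strings from the question. The variable names `fields` and `day` are only illustrative.

```python
import re

# Same pattern as in the answer; the sample string comes from the question.
pattern = re.compile(r"(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]{2,}) (?P<year>\d{2,4})")

match = pattern.search("30 May 2016 SOS-10 Total Score:")

fields = match.groupdict()  # every named group, as a dict
day = match.group("day")    # one named group, as a string

print(fields)
print(day)
```

Note that the captured values are strings, so `int(day)` would be needed if numeric days are wanted.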
Then you can do something like this:
>>> for match in re.finditer(pattern, text):
...     print(match.group("day"))
I would also use apply rather than a for loop to access your DataFrame:
>>> data = {"string": ["14 Feb 1995 Primary Care Doctor:",
...                    "30 May 2016 SOS-10 Total Score:",
...                    "22 January 1996 @ 11 AMCommunication with referring physician?: Done"]}
>>> df = pd.DataFrame.from_dict(data)
>>> df.string.apply(lambda x: re.search(pattern, x).group("day"))
0    14
1    30
2    22
Name: string, dtype: object
Then, if you want:
>>> df["day"] = df.string.apply(lambda x: re.search(pattern, x).group("day"))
>>> df["month"] = df.string.apply(lambda x: re.search(pattern, x).group("month"))
>>> df
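                                              string day    month
0                   14 Feb 1995 Primary Care Doctor:  14      Feb
1                    30 May 2016 SOS-10 Total Score:  30      May
2  22 January 1996 @ 11 AMCommunication with refe...  22  January
ETA: If you want to adjust it so that only the abbreviated month is extracted, whether or not the month is spelled out in full, try replacing the regex pattern above with: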
pattern = re.compile(r"(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]{2})[a-z]*? (?P<year>\d{2,4})")
This captures only the first 3 characters of the month name, but it still finds dates whose month is written out in full.
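One thing to watch for: `re.search` returns `None` when a string contains no date, so calling `.group(...)` directly would raise an `AttributeError`. The sketch below wraps the abbreviated-month pattern in a small helper that guards against that case. The helper name `extract_day_month` and the third sample string are assumptions made for illustration, not part of the original answer.

```python
import re

# Abbreviated-month pattern: keeps only the first three letters of the month.
pattern = re.compile(
    r"(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]{2})[a-z]*? (?P<year>\d{2,4})"
)

def extract_day_month(text):
    """Hypothetical helper: return (day, month), or (None, None) if no date is found."""
    match = pattern.search(text)
    if match is None:
        return (None, None)
    return (match.group("day"), match.group("month"))

samples = [
    "14 Feb 1995 Primary Care Doctor:",
    "22 January 1996 @ 11 AMCommunication with referring physician?: Done",
    "no date in this line",  # made-up string to exercise the no-match branch
]
results = [extract_day_month(s) for s in samples]
print(results)
```

With a DataFrame like the one above, the same helper could presumably be passed to `df.string.apply(extract_day_month)`, which would give a Series of tuples to split into `day` and `month` columns.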
Posted on 2020-08-06 17:17:54
Try using the following regular expression:
...
a = r"(\d{1,2}) \w+ \d{4}"
b = df[0].str.findall(a)[i]
df['day'][i] = b
https://stackoverflow.com/questions/63288613