I have a long list of elements, each of which is a string. See the example below:
data = ['BAT.A.100', 'Regulation 2020-1233', 'this is the core text of', 'the regulation referenced ',
'MOC to BAT.A.100', 'this', 'is', 'one method of demonstrating compliance to BAT.A.100',
'BAT.A.120', 'Regulation 2020-1599', 'core text of the regulation ...', ' more free text','more free text',
'BAT.A.145', 'Regulation 2019-3333', 'core text of', 'the regulation1111',
'MOC to BAT.A.145', 'here is how you can show compliance to BAT.A.145','more free text',
'MOC2 to BAT.A.145', ' here is yet another way of achieving compliance']

What I ultimately want as output is a Pandas DataFrame.

Posted on 2021-01-12 21:14:29
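Since some of the strings may need to be concatenated, I first join all elements into a single string, using ## as the separator so that the boundaries between the original elements stay visible. I use regular expressions for everything, because otherwise there would be a lot of conditions to check.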
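The joining step on its own can be sketched as follows. This is only an illustration: `sample` is a hypothetical three-element list, not the full `data` from the question.

```python
# Hypothetical three-element sample, not the full list from the question.
sample = ['BAT.A.100', 'Regulation 2020-1233', 'core text']

# Prefix the joined string with "##" so that every element, including
# the first one, is preceded by the separator.
joined = "##" + "##".join(sample)
print(joined)  # ##BAT.A.100##Regulation 2020-1233##core text
```

The leading "##" is what lets a pattern that starts with `##` match the very first record as well.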
import re
import pandas as pd

# Pattern applied to the "##"-joined string; \1 refers back to Short_ref.
re_req = re.compile(r'##(?P<Short_ref>BAT\.A\.\d{3})'
                    r'##(?P<Full_Reg_ref>Regulation\s\d{4}-\d{4})'
                    r'##(?P<Reg_text>.*?MOC to \1|.*?(?=##BAT\.A\.\d{3})(?!\1))'
                    r'(?:##)?(?:(?P<Moc_text>.*?MOC2 to \1)(?P<MOC2>(?:##)?.*?(?=##BAT\.A\.\d{3})(?!\1)|.+)'
                    r'|(?P<Moc_text_temp>.*?(?=##BAT\.A\.\d{3})(?!\1)))')

final_list = []
for match in re_req.finditer("##" + "##".join(data)):
    # The first three columns are always present; turn separators into spaces.
    inner_list = [match.group('Short_ref').replace("##", " "),
                  match.group('Full_Reg_ref').replace("##", " "),
                  match.group('Reg_text').replace("##", " ")]
    if match.group('Moc_text_temp'):  # only Moc_text is present
        inner_list += [match.group('Moc_text_temp').replace("##", " "), ""]
    elif match.group('Moc_text') and match.group('MOC2'):  # both Moc_text and MOC2 are present
        inner_list += [match.group('Moc_text').replace("##", " "), match.group('MOC2').replace("##", " ")]
    else:  # neither Moc_text nor MOC2 is present
        inner_list += ["", ""]
    final_list.append(inner_list)

final_df = pd.DataFrame(final_list, columns=['Short_ref', 'Full_Reg_ref', 'Reg_text', 'Moc_text', 'MOC2'])

The first and second lines of the regex are the same as in what you posted earlier, and they identify the first two columns.
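The third line of the regex, r'##(?P<Reg_text>.*?MOC to \1|.*?(?=##BAT\.A\.\d{3})(?!\1))', matches either all text up to and including "MOC to " followed by the current Short_ref, or all text up to the next Short_ref. The (?=##BAT\.A\.\d{3})(?!\1) part is meant to take text up to a Short_ref pattern, provided that Short_ref is not the current one.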
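The "take everything up to the next header" idea can be tried in isolation. The following is a minimal sketch on a hypothetical shortened input; the group names `ref` and `body` are made up for the illustration and are not part of the pattern above.

```python
import re

# Hypothetical shortened input: two records joined with "##".
text = "##BAT.A.100##first##second##BAT.A.120##third"

# .*? grows one character at a time until the lookahead sees the next
# "##BAT.A.ddd" header; the lookahead itself consumes nothing.
pattern = re.compile(r'##(?P<ref>BAT\.A\.\d{3})##(?P<body>.*?)(?=##BAT\.A\.\d{3})')

m = pattern.search(text)
ref, body = m.group('ref'), m.group('body')
print(ref)                      # BAT.A.100
print(body.replace("##", " "))  # first second
```

In this sketch the last record is not matched, because no further header follows it. That appears to be the reason the full pattern has the `|.+` fallback inside the MOC2 group.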
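The fourth line handles the case where both Moc_text and MOC2 are present. It is joined by an OR (|) to the fifth line, which handles the case where only Moc_text is present. This part of the regex works much like the third line.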
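How the two branches show up on the match object can be sketched with a much smaller pattern. Everything here is hypothetical: "X" stands in for the short reference, and `branches` and `classify` are names made up for the example. The point is that a named group in the branch that did not match returns None.

```python
import re

# Hypothetical mini-pattern: "X" stands in for the short reference.
# Branch 1 needs a "MOC2 to X" header; branch 2 takes whatever is there.
branches = re.compile(r'(?P<moc>.*?MOC2 to X)(?P<moc2>.+)|(?P<moc_only>.+)')

def classify(text):
    """Return [moc, moc2] for a non-empty text; moc2 is "" if absent."""
    m = branches.match(text)
    if m.group('moc_only') is not None:       # second branch matched
        return [m.group('moc_only'), ""]
    return [m.group('moc'), m.group('moc2')]  # first branch matched

print(classify("first way"))
print(classify("first way##MOC2 to X##second way"))
```

As in the full pattern, the header text itself ("MOC2 to X") stays inside the first captured group.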
Finally, finditer is used to iterate over all matches and build the rows of the DataFrame final_df.
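The row-building loop can also be sketched without pandas. This is a simplified illustration with a hypothetical two-column pattern and input; it uses `groupdict()` rather than the explicit `group(...)` calls in the code above.

```python
import re

# Hypothetical two-column pattern and input, much simpler than the full one.
pattern = re.compile(r'##(?P<ref>BAT\.A\.\d{3})##(?P<reg>Regulation\s\d{4}-\d{4})')
text = "##BAT.A.100##Regulation 2020-1233##BAT.A.120##Regulation 2020-1599"

# finditer yields one match object per non-overlapping match, left to right;
# groupdict() maps each named group to the text it captured.
rows = [m.groupdict() for m in pattern.finditer(text)]
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`, in which case the group names become the column names.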

https://stackoverflow.com/questions/65683975