Python regex to parse groups after a specific tag

Stack Overflow user
Asked 2022-09-13 09:40:36
Answers: 2 · Views: 45 · Followers: 0 · Votes: 0

I have a list of texts like this:

Something at the beginning
    
References
1. Ryff, C.D. (2014) Psychological Well-Being Revisited: Advances in the Science and Practice of Eudaimonia. 
2. Deci, E.L. & Ryan, R.M. (2002) Self-determination research: reflections and future directions. 
3. Acedo, F. J., & Casillas, J. C. (2005). Current paradigms in the international management field.
        
Other References
1. Tarelli, E. (2003), “How to transfer responsibilities from expatriates to local nationals”.
2. Riusala, K. and Suutari, V. (2004), “International knowledge transfers through expatriates”.
3. Wallace, J. (2001), “The benefits of mentoring for female lawyers”.

Something at the end
12. Wallace, J. (2001), “The benefits of mentoring for female lawyers”.
Something else at the end

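The "Other References" section is present in some texts and absent in others. Also, similar-looking strings may appear anywhere in the text.

I need a regex to use with re.findall that returns all the strings after "References" as a list of strings, like this: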

['Ryff, C.D. (2014) Psychological Well-Being Revisited: Advances in the Science and Practice of Eudaimonia.', 'Deci, E.L. & Ryan, R.M. (2002) Self-determination research: reflections and future directions.', 'Acedo, F. J., & Casillas, J. C. (2005). Current paradigms in the international management field.']

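But only the ones directly after "References", not similar strings that appear anywhere before or after that section.

I have tried this pattern: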

r = 'References\s*(\d+[.].*[.])'

But it returns only the first string, and I need all of them.

Can anyone suggest a better regex?
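A note on why the attempt stops early, as a minimal sketch rather than a definitive diagnosis: every match of that pattern has to start with the literal word "References", so `re.findall` can return at most one numbered entry per header. The shortened `sample` text below is made up for illustration. It also shows that the unanchored pattern picks up the header inside "Other References".

```python
import re

# Shortened, made-up sample with the same layout as the texts in the question.
sample = (
    "Intro\n\n"
    "References\n"
    "1. Ryff, C.D. (2014) Well-Being Revisited.\n"
    "2. Deci, E.L. (2002) Self-determination research.\n\n"
    "Other References\n"
    "1. Tarelli, E. (2003), Transfer of responsibilities.\n"
)

# The pattern from the question: each match must begin with "References",
# and `.` does not cross a newline, so only the entry on the line right
# after a header is captured.
attempt = r'References\s*(\d+[.].*[.])'
hits = re.findall(attempt, sample)

for hit in hits:
    print(hit)
```

On this sample the second "References" entry is never returned, while the first "Other References" entry is.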


2 Answers

Stack Overflow user

Accepted answer

Posted on 2022-09-13 09:51:03

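You can use re.findall twice. The strategy below first matches each "References" block as a single string. We then join all such blocks together and run re.findall again to pull out the individual references.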

import re

inp = """Something at the beginning

References
1. Ryff, C.D. (2014) Psychological Well-Being Revisited: Advances in the Science and Practice of Eudaimonia. 
2. Deci, E.L. & Ryan, R.M. (2002) Self-determination research: reflections and future directions. 
3. Acedo, F. J., & Casillas, J. C. (2005). Current paradigms in the international management field.
    
Other References
1. Tarelli, E. (2003), “How to transfer responsibilities from expatriates to local nationals”.
2. Riusala, K. and Suutari, V. (2004), “International knowledge transfers through expatriates”.
3. Wallace, J. (2001), “The benefits of mentoring for female lawyers”.

Something at the end
12. Wallace, J. (2001), “The benefits of mentoring for female lawyers”.
Something else at the end"""

refs = re.findall(r'^References\n((?:\d+\.\s*.*?\n)+)', inp, flags=re.M)
data = ''.join(refs)
output = re.findall(r'\d+\.\s*(.*?)\n', data)
print(output)

This prints:

[
    'Ryff, C.D. (2014) Psychological Well-Being Revisited: Advances in the Science and Practice of Eudaimonia. ',
    'Deci, E.L. & Ryan, R.M. (2002) Self-determination research: reflections and future directions. ',
    'Acedo, F. J., & Casillas, J. C. (2005). Current paradigms in the international management field.'
]
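Note that the first two entries keep their trailing space, which the expected output in the question does not have. The two-pass idea can be wrapped in a small helper that also strips that whitespace. This is only a sketch: the name `extract_references` and its `header` parameter are hypothetical, and the block pattern is slightly relaxed (`\n?`) on the assumption that the last reference may end the string without a newline.

```python
import re

def extract_references(text, header="References"):
    """Return the numbered entries that directly follow a `header` line.

    Hypothetical helper: pass 1 grabs each block of numbered lines after a
    line consisting of `header`; pass 2 strips the numbering and whitespace.
    """
    # `^` with re.M anchors at line starts, so "Other References" is skipped.
    block_re = re.compile(
        rf'^{re.escape(header)}[ \t]*\n((?:\d+\..*\n?)+)', flags=re.M
    )
    refs = []
    for block in block_re.findall(text):
        for line in block.splitlines():
            # Drop the leading "1. " style numbering, then trim whitespace.
            refs.append(re.sub(r'^\d+\.\s*', '', line).strip())
    return refs

# Made-up sample in the same layout as the question's text.
sample = (
    "Intro\n\nReferences\n"
    "1. Ryff, C.D. (2014) Well-Being. \n"
    "2. Deci, E.L. (2002) Research.\n"
    "    \nOther References\n"
    "1. Tarelli, E. (2003), Transfer.\n\n"
    "The end\n12. Wallace, J. (2001), Mentoring."
)
print(extract_references(sample))
```

Splitting each block with `splitlines()` instead of a second regex keeps the per-line cleanup easy to read. The entry-matching rule is still the same assumption as above: one reference per line, starting with digits and a period.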
Votes: 1

Stack Overflow user

Posted on 2022-09-14 11:04:26

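This is an answer to the linked question, which was closed as a duplicate, so I can't answer there. That duplicate is really an extension of this question, since it makes the input more complicated.

Using this as example input: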

text = """Something at the beginning

References 1. Ryff, C.D. (2014) Psychological Well-Being Revisited: Advances in the Science and Practice of Eudaimonia. Additional Fields. 2. Deci, E.L. & Ryan, R.M. (2002) Self-determination research: reflections and future directions. 3. Wallace, J. (2001), “The benefits of mentoring for female lawyers”. 315 – 326. DOI: doi.org/10.2224/sbp.2008.36.3.315 4. Acedo, F. J., & Casillas, J. C. (2005). Current paradigms in the international management field. Other References 1. Tarelli, E. (2003), “How to transfer responsibilities from expatriates to local nationals”.
2. Riusala, K. and Suutari, V. (2004), “International knowledge transfers through expatriates”.
3. Wallace, J. (2001), “The benefits of mentoring for female lawyers”.

Something at the end
12. Wallace, J. (2001), “The benefits of mentoring for female lawyers”.
Something else at the end"""

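and working under the following assumptions:

  • The word "References" (with a capital "R") is distinguishable as the marker of a references section; a lowercase "references" elsewhere in the text is not treated as one (otherwise, adjust the pattern). The word may be preceded by one other capitalized word (a single word only; "Really Bad References" is currently excluded as a separator).

  • Within a section, the individual references are separated by the following: one or more digits, followed by a period, followed by whitespace, followed by non-digit characters (letters, commas, - signs, etc.), followed by an opening parenthesis (indicating the year).

  • Only the first section, "References", is the section of interest (this can probably be amended directly).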
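The second assumption can be checked in isolation. The sketch below runs the delimiter pattern over a shortened, made-up `section` string, to show that page ranges and DOI digits are not mistaken for the start of a new entry, because no opening parenthesis follows them before the next digit.

```python
import re

# Delimiter for individual references: digits, period, whitespace,
# non-digits, then an opening parenthesis (the year).
delimiter = re.compile(r'\s*(\d+\.\s+\D+\()')

# Shortened, made-up section text (entries on one line, as in the input).
section = (
    " 1. Ryff, C.D. (2014) Title. "
    "2. Wallace, J. (2001), Title. 315 - 326. "
    "DOI: doi.org/10.2224/sbp.2008.36.3.315 "
    "3. Acedo, F. J. (2005). Title."
)

# With one capturing group, findall returns the captured entry starts.
starts = delimiter.findall(section)
print(starts)
```

Only the three numbered author prefixes are reported. "326." is followed by whitespace and non-digits, but the run of non-digits ends at a digit without ever reaching a "(", so it is rejected.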

Then the following code works:

import re

# split the sections
pattern = r'[A-Z][a-z\s]+References'
sections = re.split(pattern, text, flags=re.M)

# split the individual references, by conditions as mentioned in point 2 above
pattern = r'\s*(\d+\.\s+\D+\()'
# The first section is blank (`''`), so `sections[1]` is 
# the first actual reference section, "References"
parts = re.split(pattern, sections[1], flags=re.M)

# the split includes the part before the year, and the year + rest. 
# We need to concatenate those items for each reference. 
# Also here, the first group is blank, so skip that
refs = [part1 + part2 for part1, part2 in zip(parts[1::2], parts[2::2])]

# Show the result
for ref in refs:
    print(ref)

which yields

1. Ryff, C.D. (2014) Psychological Well-Being Revisited: Advances in the Science and Practice of Eudaimonia. Additional Fields.
2. Deci, E.L. & Ryan, R.M. (2002) Self-determination research: reflections and future directions.
3. Wallace, J. (2001), “The benefits of mentoring for female lawyers”. 315 – 326. DOI: doi.org/10.2224/sbp.2008.36.3.315
4. Acedo, F. J., & Casillas, J. C. (2005). Current paradigms in the international management field.
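One consequence of the delimiter assumption is worth keeping in mind: an entry without a parenthesised year is not recognised as a new reference and gets folded into the previous one. A minimal sketch with a made-up `section` string (the "Anonymous" entry is invented purely to trigger the case):

```python
import re

pattern = r'\s*(\d+\.\s+\D+\()'

# Made-up section: entry 2 has no "(year)", so it cannot act as a delimiter.
section = (
    " 1. Ryff, C.D. (2014) First title. "
    "2. Anonymous. Undated report. "
    "3. Acedo, F. J. (2005). Third title."
)

parts = re.split(pattern, section)
# parts[0] is the (empty) text before the first delimiter; after that the
# list alternates between captured entry starts and the remaining text.
refs = [head + tail for head, tail in zip(parts[1::2], parts[2::2])]

for ref in refs:
    print(ref)
```

Here only two references come back, and the first one contains the text of entry 2. If such entries occur in the real data, the delimiter pattern would need to be adjusted.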
Votes: 1
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/73700753
