Q: How to capture all regex groups in one regex?

Stack Overflow user

Asked 2016-04-18 06:20:28

4 answers, 113 views, 0 followers, 4 votes

Given a file like this:

# For more information about CC-CEDICT see:
# http://cc-cedict.org/wiki/
A A [A] /(slang) (Tw) to steal/
AA制 AA制 [A A zhi4] /to split the bill/to go Dutch/
AB制 AB制 [A B zhi4] /to split the bill (where the male counterpart foots the larger portion of the sum)/(theater) a system where two actors take turns in acting the main role, with one actor replacing the other if either is unavailable/
A咖 A咖 [A ka1] /class "A"/top grade/
A圈兒 A圈儿 [A quan1 r5] /at symbol, @/
A片 A片 [A pian4] /adult movie/pornography/

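I want to build a JSON object by:

skipping lines that start with #
splitting each remaining line into 4 parts
1. traditional characters (from the start ^ to the first space)
2. simplified characters (from the first space to the second space)
3. pinyin (between the square brackets [...])
4. glosses, from the first / to the last / (note that a gloss can itself contain slashes, e.g. /adult movie/pornography/)

I am currently doing this: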

>>> for line in text.split('\n'):
...     if line.startswith('#'): continue  # skip comment lines
...     line = line.strip()
...     trad, _, line = line.partition(' ')    # 1st field: traditional
...     simple, _, line = line.partition(' ')  # 2nd field: simplified
...     print(trad, simple)
... 
A A
AA制 AA制
AB制 AB制
A咖 A咖
A圈兒 A圈儿
A片 A片

To get the [...] part, I had to:

>>> import re
>>> line = "A片 A片 [A pian4] /adult movie/pornography/"
>>> trad, _, line = line.partition(' ')
>>> simple, _, line = line.partition(' ')
>>> re.findall(r'\[.*\]', line)[0].strip('[]')
'A pian4'

To get the /.../ part, I had to:

>>> line = "A片 A片 [A pian4] /adult movie/pornography/"
>>> re.findall(r'\/.*\/$', line)[0].strip('/')
'adult movie/pornography'

How can I use regex groups to capture all of these parts at once, instead of chaining several partition/split/findall calls?

Tags: delimiter, regex-group, python, regex, string

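4 Answers

Stack Overflow user

Accepted answer

Answered 2016-04-18 06:48:52

I would use a regular expression to extract the information. That way you can capture each block in its own group and then handle the groups as needed: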

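import re

# the sample contains non-ASCII text, so pass an explicit encoding (UTF-8 assumed)
with open("myfile", encoding="utf-8") as f:
    data = f.read().split('\n')
    for line in data:
        if line.startswith('#'): continue  # skip comment lines
        # groups: 1) first word, 2) second word, 3) text in [...], 4) text in /.../
        m = re.search(r"^([^ ]*) ([^ ]*) \[([^]]*)\] \/(.*)\/$", line)
        if m:
            print(m.groups())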

That is, the regex splits the string into the following groups:

^([^ ]*) ([^ ]*) \[([^]]*)\] \/(.*)\/$
  ^^^^^   ^^^^^     ^^^^^       ^^
   1)      2)        3)         4)

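That is:

1) The first word.
2) The second word.
3) The text between [ and ].
4) The text from the first / up to the / just before the end of the line.

It returns: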

('A', 'A', 'A', '(slang) (Tw) to steal')
('AA制', 'AA制', 'A A zhi4', 'to split the bill/to go Dutch')
('AB制', 'AB制', 'A B zhi4', 'to split the bill (where the male counterpart foots the larger portion of the sum)/(theater) a system where two actors take turns in acting the main role, with one actor replacing the other if either is unavailable')
('A咖', 'A咖', 'A ka1', 'class "A"/top grade')
('A圈兒', 'A圈儿', 'A quan1 r5', 'at symbol, @')
('A片', 'A片', 'A pian4', 'adult movie/pornography')
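
Since the question asks for a JSON object, the tuples above could be turned into dictionaries with named groups. This is only a minimal sketch: the group names (traditional, simplified, pinyin, gloss), the helper parse_entries, and the inline sample text are assumptions made for illustration and are not part of the original answer.

```python
import json
import re

# Named groups mirror the four fields described in the question.
# The group names are illustrative choices, not part of the original answer.
ENTRY = re.compile(
    r"^(?P<traditional>\S+) (?P<simplified>\S+) "
    r"\[(?P<pinyin>[^\]]*)\] /(?P<gloss>.*)/$"
)

def parse_entries(text):
    """Return one dict per entry line, skipping comments and non-matching lines."""
    entries = []
    for line in text.splitlines():
        if line.startswith("#"):
            continue
        m = ENTRY.match(line.strip())
        if m:
            entries.append(m.groupdict())
    return entries

sample = """# For more information about CC-CEDICT see:
# http://cc-cedict.org/wiki/
A咖 A咖 [A ka1] /class "A"/top grade/
A片 A片 [A pian4] /adult movie/pornography/"""

entries = parse_entries(sample)
# ensure_ascii=False keeps the Chinese characters readable in the JSON output
print(json.dumps(entries, ensure_ascii=False, indent=2))
```

groupdict() returns the captured text keyed by group name, so each entry is already in a shape that json.dumps can serialize directly.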

6 votes

Stack Overflow user

Answered 2016-04-18 06:48:26

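import re

p = re.compile(r"(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/$")
m = p.match(line)  # line is one entry from the file
if m:
    # field order in the file: traditional, simplified, pinyin, gloss
    trad, simple, pinyin, gloss = m.groups()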

For more details, see https://docs.python.org/2/howto/regex.html#grouping.
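
Because the last group is followed by /$, it runs up to the final slash of the line and keeps any slashes inside the gloss. If the individual senses are wanted, one possible follow-up step is to split the captured fields afterwards. The variable names senses and syllables below are hypothetical, and this is a sketch rather than part of the original answer.

```python
import re

p = re.compile(r"(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/$")

line = "AA制 AA制 [A A zhi4] /to split the bill/to go Dutch/"
m = p.match(line)
trad, simple, pinyin, gloss = m.groups()

# The gloss group still contains the inner slashes; split it into senses.
senses = gloss.split("/")
# The pinyin group is space-separated, one token per syllable.
syllables = pinyin.split()

print(senses)
print(syllables)
```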

3 votes

Stack Overflow user

Answered 2016-04-18 06:44:08

This might help:

import re

# \w+ cannot match '#', so comment lines never match
preg = re.compile(r'^(\w+)\s(\w+)\s\[(.*?)\]\s/(.+)/$',
                  re.MULTILINE | re.UNICODE)

with open('your_file', encoding='utf-8') as f:
    for line in f:
        match = preg.match(line)
        if match:
            print(match.groups())

See here for a detailed explanation of the regular expression used.
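
The re.MULTILINE flag only matters when the pattern is applied to text that contains several lines. As a sketch of that use, the whole file content can be scanned in one pass with finditer, and comment lines drop out because \w+ cannot match #. The pattern here is a variant that uses literal spaces so a match cannot span two lines; the name ENTRY and the inline sample text are assumptions for illustration.

```python
import re

# With re.MULTILINE, ^ and $ match at the start and end of every line,
# so one pass over the whole text finds every entry.
# Literal spaces (instead of \s) keep a match from crossing a newline.
ENTRY = re.compile(r"^(\w+) (\w+) \[(.*?)\] /(.+)/$", re.MULTILINE)

text = """# For more information about CC-CEDICT see:
# http://cc-cedict.org/wiki/
A A [A] /(slang) (Tw) to steal/
A圈兒 A圈儿 [A quan1 r5] /at symbol, @/
"""

rows = [m.groups() for m in ENTRY.finditer(text)]
for row in rows:
    print(row)
```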

2 votes

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/36686732
