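Q: Regex Python [python-2.7]

Stack Overflow user

Asked 2015-09-22 06:15:11

Answers: 3, Views: 170, Followers: 0, Votes: 1

I am writing a Python program that scans a .txt file for genus and species names. The lines are formatted like this (yes, the equals signs always surround the common name):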

1. =Common Name= Genus Species some other words that I don't want.
2. =Common Name= Genus Species some other words that I don't want.

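I can't seem to come up with a regular expression that matches only the genus and species and not the common name. I know the equals signs (=) could probably help somehow, but I don't know how to use them.

Edit: some real data: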

1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; western species, chiefly interior regions of North America.

2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north; range, all of North America.

3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant; range, almost the same as the last.

4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident; rare in eastern, common in western Colorado; breeds from plains to 8,000 feet; partial to alkali lakes; western species.

python

regex

python-2.7

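3 Answers

Stack Overflow user

Answered 2015-09-22 06:23:58

You may not need a regular expression for this. If the order and the number of the words you need are always the same, you can split each line into a list of substrings and take the third (genus) and fourth (species) elements of that list. The code might look like this: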

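# The with-block closes the file automatically when the loop finishes.
with open('myfilename.txt', 'r') as myfile:
    for line in myfile:
        words = line.split()  # split on whitespace
        genus, species = words[2], words[3]  # third and fourth tokens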

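In my opinion, this looks more "Pythonic".

If the common name can consist of more than one word, the code above returns wrong results. To get correct results in that case as well, you can use the following code: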

with open('myfilename.txt', 'r') as myfile:
    for line in myfile:
        # If the program returns wrong results, try changing the index from 2 to 1 or 3.
        # The right index depends on whether any "=" can appear before the common name.
        words = line.split('=')[2].split()
        genus, species = words[0], words[1]
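As a minimal sketch of the same split-on-equals idea applied to two of the real lines from the question: the helper name `extract_genus_species` is hypothetical, and stripping the trailing period from each name is an assumption about the desired output, not something stated in the question.

```python
def extract_genus_species(line):
    """Return (genus, species) for a line like '1. =Common name.= GENUS SPECIES. ...'.

    Returns None when the line does not have the expected shape.
    """
    # Split on at most two "=" so any later "=" in the description is left alone.
    parts = line.split('=', 2)
    if len(parts) < 3:
        return None
    words = parts[2].split()
    if len(words) < 2:
        return None
    # Assumption: the trailing "." after a name is punctuation, not part of the name.
    return words[0].rstrip('.'), words[1].rstrip('.')


sample_lines = [
    "2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north; range, all of North America.",
    "4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident; rare in eastern, common in western Colorado; breeds from plains to 8,000 feet; partial to alkali lakes; western species.",
]

for line in sample_lines:
    print(extract_genus_species(line))
```

For the fourth entry this keeps only the first two words after the common name, so the subspecies CALIFORNICUS is dropped; whether that is wanted depends on the data.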

Votes: 4

Stack Overflow user

Answered 2015-09-22 06:21:05

If capturing the words in groups is enough (and you don't need to match them directly), you can try:

(?=\d\.\s*=[^=]+=\s(?:(?P<genus>\w+)\s(?P<species>\w+)))

DEMO

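The values you want will be in the groups <genus> and <species>. The whole regex is a positive lookahead, so the match itself is zero-width (it matches the position at the start of the entry), but it still captures content into the groups.

(?=\d\.\s*=[^=]+=\s - a digit followed by a dot, then the text between the equals signs and one whitespace character,
(?:(?P<genus>\w+)\s(?P<species>\w+))) - captures the first word into the genus group and the second word into the species group.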
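To illustrate how the lookahead behaves in practice, here is a small sketch that runs the pattern above on the first real line from the question. It is written for Python 3, where `\w` matches non-ASCII letters such as Æ in `str` input by default; on Python 2.7 the pattern would presumably need unicode strings and the `re.UNICODE` flag for the same result. The names `PATTERN` and `line` are only for illustration.

```python
import re

# The pattern from the answer: one big lookahead containing two named groups.
PATTERN = re.compile(
    r'(?=\d\.\s*=[^=]+=\s(?:(?P<genus>\w+)\s(?P<species>\w+)))'
)

line = ('1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; '
        'western species, chiefly interior regions of North America.')

m = PATTERN.search(line)
if m:
    # The overall match is empty because a lookahead consumes no characters.
    print(repr(m.group(0)))
    # The named groups still hold the captured words.
    print(m.group('genus'), m.group('species'))
```

Because the match is zero-width, read the result through `m.group('genus')` and `m.group('species')` rather than `m.group(0)`.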

Votes: 1

Stack Overflow user

Answered 2015-09-22 06:25:15

You can try the following:

import re

txt='1. =Common Name= Genus Species some other words that I don\'t want.'

re1='.*?'   # Non-greedy match on filler
re2='(?:[a-z][a-z]+)'   # Uninteresting: word
re3='.*?'   # Non-greedy match on filler
re4='(?:[a-z][a-z]+)'   # Uninteresting: word
re5='.*?'   # Non-greedy match on filler
re6='((?:[a-z][a-z]+))' # Word 1
re7='.*?'   # Non-greedy match on filler
re8='((?:[a-z][a-z]+))' # Word 2

rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
if m:
    word1=m.group(1)
    word2=m.group(2)
    print("("+word1+")"+"("+word2+")"+"\n")  # parenthesized form works on Python 2.7 and 3

For your test input, as given in txt, this prints the following:

(Genus)(Species)
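Note that the pattern above works by counting words, so it only gives the right result when the common name has exactly two words. As a hedged alternative (not part of the original answer), here is a sketch that anchors on the equals signs instead and scans all four real entries from the question in one pass. It assumes Python 3, one entry per line, and that only the first two words after the common name are wanted; `ENTRY` and `pairs` are illustrative names.

```python
import re

text = "\n".join([
    "1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; western species, chiefly interior regions of North America.",
    "2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north; range, all of North America.",
    "3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant; range, almost the same as the last.",
    "4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident; rare in eastern, common in western Colorado; breeds from plains to 8,000 feet; partial to alkali lakes; western species.",
])

# ^\d+\.     entry number at the start of a line (re.MULTILINE makes ^ match per line)
# =[^=]+=    the common name, whatever its length, between the equals signs
# (\w+)\s+(\w+)  the next two words: genus and species
ENTRY = re.compile(r'^\d+\.\s*=[^=]+=\s*(\w+)\s+(\w+)', re.MULTILINE)

pairs = ENTRY.findall(text)
for genus, species in pairs:
    print(genus, species)
```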

You can use this great website to help build regular expressions like this one!

Hope this helps.

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/32705353
