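I'm writing a Python program that sifts through a .txt file to find genus and species names. The lines are formatted like this (yes, the equals signs always surround the common name):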
1. =Common Name= Genus Species some other words that I don't want.
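2. =Common Name= Genus Species some other words that I don't want.
I can't seem to come up with a regular expression that matches only the genus and species and not the common name. I know the equals signs (=) will probably help in some way, but I can't figure out how to use them.
Edit: some real data: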
1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; western species, chiefly interior regions of North America.
2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north; range, all of North America.
3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant; range, almost the same as the last.
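4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident; rare in eastern, common in western Colorado; breeds from plains to 8,000 feet; partial to alkali lakes; western species.
Posted 2015-09-22 06:23:58
You may not need a regular expression for this. If the order and number of the words you need are always the same, you can split each line into a list of substrings and take the third (genus) and fourth (species) elements of that list. The code might look like this: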
# Open the file in a with-block so it is closed automatically.
with open('myfilename.txt', 'r') as myfile:
    for line in myfile:
        words = line.split()
        genus, species = words[2], words[3]
In my opinion, this looks more "Pythonic".
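As a quick check of the fixed-index idea, the sketch below applies it to two in-memory lines instead of a file. The function name `genus_species_by_index` and both sample lines are assumptions made for illustration; the first line has a one-word common name, the second a two-word one.

```python
def genus_species_by_index(line):
    """Return the third and fourth whitespace-separated words of a line."""
    words = line.split()
    return words[2], words[3]


# Hypothetical sample lines in the question's format.
one_word = "1. =Grebe= Genus Species some other words"
two_words = "2. =Common Name= Genus Species some other words"

print(genus_species_by_index(one_word))   # ('Genus', 'Species')
print(genus_species_by_index(two_words))  # ('Name=', 'Genus')
```

With a one-word common name the indices line up. With a two-word name everything shifts by one position, so the result picks up part of the common name.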
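If the common name can consist of several words, the code suggested above returns wrong results. To get correct results in that case too, you can use the following code: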
with open('myfilename.txt', 'r') as myfile:
    for line in myfile:
        # If the program returns wrong results, try changing the index from 2 to 1 or 3.
        # Which number is right depends on whether there can be any symbols before the first "=".
        words = line.split('=')[2].split()
        genus, species = words[0], words[1]
Posted 2015-09-22 06:21:05
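If capturing the words in groups is enough (and you don't need them to be the match itself), you can try:
(?=\d\.\s*=[^=]+=\s(?:(?P<genus>\w+)\s(?P<species>\w+)))
The values you want will be in the groups <genus> and <species>. The whole regex is a positive lookahead, so it matches a zero-width position at the start of the string, but it still captures content into the groups.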
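To see the zero-width match and the named groups in action, here is a minimal sketch that runs the pattern above against one of the real data lines from the question. The variable names are assumptions for illustration; only `re.compile`, `search`, and `group` from the standard `re` module are used.

```python
import re

# Pattern copied from the answer: one lookahead wrapping two named groups.
PATTERN = re.compile(r"(?=\d\.\s*=[^=]+=\s(?:(?P<genus>\w+)\s(?P<species>\w+)))")

line = "2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north."
match = PATTERN.search(line)
if match:
    print(repr(match.group(0)))     # '' because the match itself is zero-width
    print(match.group("genus"))     # COLYMBUS
    print(match.group("species"))   # HOLBOELLII
```

Because `\w+` stops at punctuation, the trailing period after the species name is not captured. For a three-word scientific name such as COLYMBUS NIGRICOLLIS CALIFORNICUS, only the first two words end up in the groups.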
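(?=\d\.\s*=[^=]+=\s - a digit followed by a period, then the content between the equals signs and one whitespace character,
(?:(?P<genus>\w+)\s(?P<species>\w+))) - captures the first word into the genus group and the second word into the species group.
Posted 2015-09-22 06:25:15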
You could try something like this:
import re
txt='1. =Common Name= Genus Species some other words that I don\'t want.'
re1='.*?' # Non-greedy match on filler
re2='(?:[a-z][a-z]+)' # Uninteresting: word
re3='.*?' # Non-greedy match on filler
re4='(?:[a-z][a-z]+)' # Uninteresting: word
re5='.*?' # Non-greedy match on filler
re6='((?:[a-z][a-z]+))' # Word 1
re7='.*?' # Non-greedy match on filler
re8='((?:[a-z][a-z]+))' # Word 2
rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
if m:
    word1=m.group(1)
    word2=m.group(2)
    print("("+word1+")"+"("+word2+")"+"\n")
For your test input, as shown in txt, this prints the following:
(Genus)(Species)
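For comparison, a shorter pattern can lean on the equals signs directly instead of counting words, which matters because the word-counting pattern assumes a two-word common name. This is only a sketch and not part of the answer above: the pattern and the helper name `extract_genus_species` are assumptions, and it assumes the genus and species are the first two words after the closing equals sign.

```python
import re

# Assumed pattern: skip "=common name=", then capture the next two words.
NAME_RE = re.compile(r"=[^=]+=\s+(\w+)\s+(\w+)")


def extract_genus_species(line):
    """Return (genus, species) from one line, or None if the line does not fit."""
    match = NAME_RE.search(line)
    return match.groups() if match else None


print(extract_genus_species("1. =Common Name= Genus Species some other words that I don't want."))
print(extract_genus_species("4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident."))
```

The first call prints `('Genus', 'Species')` and the second prints `('COLYMBUS', 'NIGRICOLLIS')`, regardless of how many words the common name has.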
You can use this great website to help build regular expressions like this!
Hope this helps.
https://stackoverflow.com/questions/32705353