
Q: How can I use \w+ (without using A-Ba-b or \d) to separate words from numbers?

Stack Overflow user

Asked on 2020-04-09 05:33:50

Answers 1 · Views 24 · Followers 0 · Votes 0

I am using OCR (optical character recognition) to extract text from a file, and I get the following string:

Lisboa                       187      
      Santo Tirso                  8\n\n        Porto                        137            Vila do Conde
 8\n\n        Maia
   119            Penafiel
       7\n\n        Vila Nova de Gaia   
         83             Portimão        
             7\n\n        Oliveira de Azeméis          18             Évora
         5\n\n

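I want to get a list of tuples, each of the form ("city name", "the number that follows it"), like this: ("Lisboa", "187"), ("Santo Tirso", "8"), ("Porto", "137"), ...

I wrote this expression: r"(A-Z?) (\d+)"

I wrote it that way because city names can include tildes and spaces, but what I get is ("the name of one city followed by its number and the name of the next city", "the number after the second city"), like this: ("Lisboa187 Santo Tirso", "8").

So: I would like to use \w+, but without any digits ending up in the first group (which becomes the first element of the tuple). How can I do that?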

ocr

python

regex

1 Answer

Stack Overflow user

Answered on 2020-04-09 05:44:27

You could use

import re

junk = """
Lisboa                       187      
      Santo Tirso                  8

        Porto                        137            Vila do Conde
 8

        Maia
   119            Penafiel
       7

        Vila Nova de Gaia   
         83             Portimão        
             7

        Oliveira de Azeméis          18             Évora
         5


"""

rx = re.compile(r'\b(?P<city>(?:[A-Za-zéÉã]+\s)+)\D+(?P<number>\d+)')

cities = [(m.group('city').strip(), m.group('number'))
          for m in rx.finditer(junk)]

print(cities)

which yields

[('Lisboa', '187'), ('Santo Tirso', '8'), ('Porto', '137'), ('Vila do Conde', '8'), ('Maia', '119'), ('Penafiel', '7'), ('Vila Nova de Gaia', '83'), ('Portimão', '7'), ('Oliveira de Azeméis', '18'), ('Évora', '5')]


The expression breaks down as follows:

\b                      # a word boundary
(?P<city>               # a capturing group named "city"
    (?:[A-Za-zéÉã]+\s)+ # one or more words made of the allowed letters,
                        # each followed by one whitespace character
)
\D+                     # one or more non-digit characters
(?P<number>\d+)         # the first run of digits after the city name
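
The character class above lists only the accented letters that happen to occur in this sample, so a city containing, say, ç or õ would not match. Closer to what the question asks for ("\w+ but without digits"), one possible sketch uses the class [^\W\d_], which in a Python 3 str pattern means "a word character that is neither a digit nor an underscore", i.e. any Unicode letter. This is not part of the original answer. It assumes that the words of a city name are separated by exactly one space and that columns are separated by wider gaps or newlines, as in the OCR output shown in the question. The names `letter` and `text` are illustrative, and `text` is a shortened sample.

```python
import re

# Shortened, illustrative sample in the same shape as the OCR output.
text = (
    "Lisboa      187      \n"
    "   Santo Tirso     8\n\n"
    "  Porto   137    Vila do Conde\n"
    " 8\n\n"
    "  Évora\n"
    "   5\n"
)

# [^\W\d_]: a word character that is not a digit and not an underscore,
# which leaves letters only (Unicode-aware for str patterns in Python 3).
letter = r"[^\W\d_]"

# city   : one word, optionally followed by more words, each preceded by
#          exactly one space (assumption about the OCR layout)
# \s+    : the gap between the name and its number (spaces or newlines)
# number : the run of digits that follows
rx = re.compile(rf"(?P<city>{letter}+(?:[ ]{letter}+)*)\s+(?P<number>\d+)")

cities = [(m.group("city"), m.group("number")) for m in rx.finditer(text)]
print(cities)
```

Because the whitespace is matched outside the "city" group, no .strip() call is needed here. If the OCR ever puts two or more spaces between the words of one name, the single-space assumption breaks and the pattern would need adjusting.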

Votes 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/61110399
