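I have a task that requires extracting numeric values from text. However, I am only interested in values with at most 6 digits, with an optional decimal part.
For example, given the following text:
Total compensation for Mr. XYZ was $5,123,456 and other salary which was $650,000 in fiscal 2018, was determined to be approximately 8.78 times the median annual compensation for all of the firm's other employees, which was approximately $74,000. Some other salaries are 56000.
I need to extract
["650,000", "2018", "8.78", "74,000", "56000"] from it.
The regex I am using: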
((\d{1,3})(?:,[0-9]{3}){0,1}|(\d{1,6}))(\.\d{1,2})?
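It correctly recognizes 650,000 and 74,000, but not the other values.
I found a regex for 7-digit currency amounts and tried to adapt it to 6 digits, but without success. How can I correct my regex?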
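To see where the pattern in the question goes wrong, it can help to print every full match. The following is only a minimal diagnostic sketch, assuming Python's standard `re` module and the English sample sentence; the names `SAMPLE`, `ASKER_PATTERN` and `full_matches` are introduced here for illustration and are not part of the original post.

```python
import re

# Sample sentence from the question (assumed to be the intended input).
SAMPLE = (
    "Total compensation for Mr. XYZ was $5,123,456 and other salary which was "
    "$650,000 in fiscal 2018, was determined to be approximately 8.78 times the "
    "median annual compensation for all of the firm's other employees, which was "
    "approximately $74,000. Some other salaries are 56000. "
)

# The pattern from the question, unchanged.
ASKER_PATTERN = re.compile(r"((\d{1,3})(?:,[0-9]{3}){0,1}|(\d{1,6}))(\.\d{1,2})?")


def full_matches(pattern, text):
    """Return the full text of every match (group 0), ignoring capture groups."""
    return [m.group(0) for m in pattern.finditer(text)]


print(full_matches(ASKER_PATTERN, SAMPLE))
```

Two things stand out when this runs. The first alternative, `\d{1,3}`, always succeeds before `\d{1,6}` is tried, so 2018 is split into 201 and 8, and 56000 into 560 and 00. In addition, nothing prevents a match from starting or ending inside a longer number, so 5,123,456 yields 5,123 and 456.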
Posted on 2019-10-31 07:28:56
Try this: (?<![\d,.])(?:\d,?){0,5}\d(?:\.\d+)?(?!,?\d)
Here is a detailed explanation:
(?x) # flag for readable (verbose) mode: whitespace and comments are ignored
# Make sure not to start in the middle of a number: no digit, comma or dot before the match
(?<![\d,.])
# The first k-1 digits, each optionally followed by a comma. For simplicity the comma positions are not checked, so "5,4,3,2" also matches; be aware of that
(?:\d,?){0,5}
# The k-th (last) digit
\d
# Optional dot and decimal part
(?:\.\d+)?
# Make sure not to stop in the middle of a larger number: no digit may follow. A trailing comma is allowed, but only as punctuation, so comma + digit is forbidden
(?!,?\d)
There may be room for improvement, but I think this is what you want. Some cases may not be handled; let me know if you find one. You can test it here: https://regex101.com/r/Wxi5Sj/2
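For readers who want to try this pattern outside regex101, here is a minimal sketch in Python, assuming the standard `re` module with `re.VERBOSE` and the English sample sentence from the question. The names `NUMBER_UP_TO_6_DIGITS`, `SAMPLE` and `extract_numbers` are hypothetical and chosen only for this example.

```python
import re

# The pattern from this answer, written in verbose mode.
NUMBER_UP_TO_6_DIGITS = re.compile(
    r"""
    (?<![\d,.])      # not preceded by a digit, comma or dot
    (?:\d,?){0,5}    # up to 5 leading digits, each optionally followed by a comma
    \d               # the last digit
    (?:\.\d+)?       # optional decimal part
    (?!,?\d)         # not followed by a digit or by comma + digit
    """,
    re.VERBOSE,
)

# Sample sentence from the question (assumed to be the intended input).
SAMPLE = (
    "Total compensation for Mr. XYZ was $5,123,456 and other salary which was "
    "$650,000 in fiscal 2018, was determined to be approximately 8.78 times the "
    "median annual compensation for all of the firm's other employees, which was "
    "approximately $74,000. Some other salaries are 56000. "
)


def extract_numbers(text):
    """Return all numbers with at most 6 digits, as matched by the pattern."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return NUMBER_UP_TO_6_DIGITS.findall(text)


print(extract_numbers(SAMPLE))
# ['650,000', '2018', '8.78', '74,000', '56000']
```

As the explanation warns, comma placement is not validated, so `extract_numbers("5,4,3,2")` returns the whole string as one match, while a 7-digit run such as `"1234567"` produces no match at all.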
Posted on 2019-10-31 07:05:46
Try the code below:
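import re

# Sample text from the question (named `text` to avoid shadowing the built-in input())
text = "Total compensation for Mr. XYZ was $5,123,456 and other salary which was $650,000 in fiscal 2018, was determined to be approximately 8.78 times the median annual compensation for all of the firm's other employees, which was approximately $74,000. Some other salaries are 56000. "
# Optional "$", up to 6 integer digits with an optional thousands comma, optional 2-digit decimal part.
# The match must follow whitespace and must not be followed by more digits.
print(re.findall(r'(?<=\s)\$?\d{0,3},?\d{1,3}(?:\.\d{2})?(?!,?\d)', text))
Output:
['$650,000', '2018', '8.78', '$74,000', '56000']
https://stackoverflow.com/questions/58637991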