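I want to parse decimal numbers regardless of their format, which is unknown. The language of the source text is unknown and may vary. In addition, the source string may contain extra text before or after the number, such as a currency or a unit.
I am using the following approach: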
import re

# NOTE: Do not use, this algorithm is buggy. See below.
def extractnumber(value):
    if (isinstance(value, int)): return value
    if (isinstance(value, float)): return value
    result = re.sub(r'&#\d+', '', value)
    result = re.sub(r'[^0-9\,\.]', '', result)
    if (len(result) == 0): return None
    numPoints = result.count('.')
    numCommas = result.count(',')
    result = result.replace(",", ".")
    if ((numPoints > 0 and numCommas > 0) or (numPoints == 1) or (numCommas == 1)):
        decimalPart = result.split(".")[-1]
        integerPart = "".join ( result.split(".")[0:-1] )
    else:
        integerPart = result.replace(".", "")
    result = int(integerPart) + (float(decimalPart) / pow(10, len(decimalPart) ))
    return result
This kind of works...
>>> extractnumber("2")
2
>>> extractnumber("2.3")
2.3
>>> extractnumber("2,35")
2.35
>>> extractnumber("-2 000,5")
-2000.5
>>> extractnumber("EUR 1.000,74 €")
1000.74
>>> extractnumber("20,5 20,8") # Testing failure...
ValueError: invalid literal for int() with base 10: '205 208'
>>> extractnumber("20.345.32.231,50") # Returns false positive
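2034532231.5
So my approach seems very fragile to me, and it produces many false positives.
Is there any library or smart function that can handle this? Ideally, 20.345.32.231,50 should not pass, but numbers in other formats such as 1.200,50 or 1 200'50 should be extracted, regardless of how much other text and how many other characters (including newlines) surround them.
(Updated implementation based on the accepted answer: https://github.com/jjmontesl/cubetl/blob/master/cubetl/text/functions.py#L91)
Posted on 2013-11-23 06:26:18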
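You can do this with a suitably fancy regular expression. Here is my best attempt at one. I use named capturing groups because, with a pattern this complex, numeric groups would be much more confusing to use in backreferences.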
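As a small aside on what a named backreference does, here is a much smaller pattern than the full one in this answer. It is only an illustrative sketch: `GROUPED_INT` and `uses_consistent_separator` are made-up names, and the pattern handles nothing but grouped integers.

```python
import re

# Hypothetical, simplified pattern (not the full one from this answer).
# The named group "sep" captures the first thousands separator, and
# (?P=sep) requires every later separator to be that same character.
GROUPED_INT = re.compile(r"^\d{1,3}(?P<sep>[ ,.])\d{3}(?:(?P=sep)\d{3})*$")

def uses_consistent_separator(text):
    """Return True if text is a grouped integer that uses one separator throughout."""
    return GROUPED_INT.match(text) is not None

print(uses_consistent_separator("1.234.567"))  # True
print(uses_consistent_separator("1.234,567"))  # False: the separator changes
```

With numbered groups the backreference would be `\1`, which gets hard to track once a pattern has several nested groups.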
First, the regexp pattern:
_pattern = r"""(?x)       # enable verbose mode (which ignores whitespace and comments)
    ^                     # start of the input
    [^\d+-\.]*            # prefixed junk
    (?P<number>           # capturing group for the whole number
        (?P<sign>[+-])?       # sign group (optional)
        (?P<integer_part>     # capturing group for the integer part
            \d{1,3}               # leading digits in an int with a thousands separator
            (?P<sep>              # capturing group for the thousands separator
                [ ,.]                 # the allowed separator characters
            )
            \d{3}                 # exactly three digits after the separator
            (?:                   # non-capturing group
                (?P=sep)              # the same separator again (a backreference)
                \d{3}                 # exactly three more digits
            )*                    # repeated 0 or more times
        |                     # or
            \d+                   # simple integer (just digits with no separator)
        )?                    # integer part is optional, to allow numbers like ".5"
        (?P<decimal_part>     # capturing group for the decimal part of the number
            (?P<point>            # capturing group for the decimal point
                (?(sep)               # conditional pattern, only tested if sep matched
                    (?!                   # a negative lookahead
                        (?P=sep)              # backreference to the separator
                    )
                )
                [.,]                  # the accepted decimal point characters
            )
            \d+                   # one or more digits after the decimal point
        )?                    # the whole decimal part is optional
    )
    [^\d]*                # suffixed junk
    $                     # end of the input
"""
Here is a function that uses it:
import re

def parse_number(text):
    match = re.match(_pattern, text)
    if match is None or not (match.group("integer_part") or
                             match.group("decimal_part")):    # failed to match
        return None                      # consider raising an exception instead
    num_str = match.group("number")      # get all of the number, without the junk
    sep = match.group("sep")
    if sep:
        num_str = num_str.replace(sep, "")     # remove thousands separators
    if match.group("decimal_part"):
        point = match.group("point")
        if point != ".":
            num_str = num_str.replace(point, ".")  # regularize the decimal point
        return float(num_str)
    return int(num_str)
Some numeric strings with a single comma or period followed by exactly three digits (like "1,234" and "1.234") are ambiguous. This code parses both of them as integers with a thousands separator (1234) rather than as floating-point values (1.234), regardless of which separator character is actually used. If you want a different result for these numbers (for example, if you would prefer 1.234 to become a float), you can handle them with a special case.
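If you do want the ambiguous strings treated as floats, one way to special-case them is to test for exactly that shape before falling back to parse_number. The following is a minimal sketch under that assumption; `_AMBIGUOUS` and `parse_ambiguous_as_float` are hypothetical names and are not part of the code above.

```python
import re

# Hypothetical helper pattern: optional junk, an optional sign, one to three
# digits, a single "." or ",", exactly three digits, then non-digit junk.
_AMBIGUOUS = re.compile(
    r"^[^\d+.,-]*(?P<sign>[+-])?(?P<int>\d{1,3})[.,](?P<frac>\d{3})\D*$"
)

def parse_ambiguous_as_float(text):
    """Return a float for inputs shaped like "1.234" or "1,234", else None."""
    match = _AMBIGUOUS.match(text)
    if match is None:
        return None  # not the ambiguous shape; let the general parser decide
    sign = match.group("sign") or ""
    # Treat the lone separator as a decimal point.
    return float("{}{}.{}".format(sign, match.group("int"), match.group("frac")))
```

A caller could try this helper first and call parse_number only when it returns None. Whether that is the right default depends on where your data comes from, since the input alone cannot settle the ambiguity.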
Some test output:
>>> test_cases = ["2", "2.3", "2,35", "-2 000,5", "EUR 1.000,74 €",
                  "20,5 20,8", "20.345.32.231,50", "1.234"]
>>> for s in test_cases:
    print("{!r:20}: {}".format(s, parse_number(s)))
'2'                 : 2
'2.3'               : 2.3
'2,35'              : 2.35
'-2 000,5'          : -2000.5
'EUR 1.000,74 €'    : 1000.74
'20,5 20,8'         : None
'20.345.32.231,50'  : None
'1.234'             : 1234
Posted on 2013-11-23 01:31:24
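I reworked your code a bit. This, together with the valid_number function below, should do the trick.
The main reason I spent time writing this awful piece of code, though, is to show future readers how awful number parsing can get if you do not know how to use regexps (like me, for instance).
Hopefully someone who knows regexps better than I do can show us how it should be done :)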
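As one possible shape for such a regexp, the grouping rules can be sketched as a single verbose pattern. This is only an illustration under assumed rules (one thousands separator reused throughout, three-digit groups after the first, and an optional decimal separator that must differ from the thousands separator). The names `_VALID_NUMBER` and `looks_like_valid_number` are hypothetical, and the sketch has not been checked to agree with valid_number on every input.

```python
import re

# Assumed rules only; not a drop-in replacement for valid_number.
_VALID_NUMBER = re.compile(r"""
    ^-?                        # optional minus sign
    (?=.*\d)                   # require at least one digit somewhere
    (?:
        \d{1,3}                # first group: one to three digits
        (?P<tho>[.,'])\d{3}    # thousands separator, then exactly three digits
        (?:(?P=tho)\d{3})*     # later groups must reuse the same separator
      |
        \d*                    # or an ungrouped run of digits (may be empty, as in ".2")
    )
    (?:
        (?(tho)(?!(?P=tho)))   # if a thousands separator was seen, the decimal one must differ
        [.,']\d+               # decimal separator followed by the fractional digits
    )?
    $
""", re.VERBOSE)

def looks_like_valid_number(s):
    """Return True if s satisfies the assumed grouping rules."""
    return _VALID_NUMBER.match(s) is not None
```

Note that this only validates a token. Deciding whether a lone separator is a decimal point or a thousands separator is still left to the caller.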
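Constraints
- ., , and ' are accepted as both thousands separators and decimal separators.
- A separator that occurs only once, with no other separator present, is treated as the decimal separator (i.e. 123,456 is interpreted as 123.456, not 123456).
- The string is split on spaces (' ') into a list of numbers.
- Every part of a thousands-separated number except the first must be exactly three digits long (123,456.00 and 1,345.00 are both considered valid, but 2345,11.00 is not considered valid).
Code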
import re
from itertools import combinations

def extract_number(value):
    if (isinstance(value, int)) or (isinstance(value, float)):
        yield float(value)
    else:
        # Strip the string for leading and trailing whitespace
        value = value.strip()
        if len(value) == 0:
            # End the generator; raising StopIteration inside a generator
            # is a RuntimeError since Python 3.7 (PEP 479)
            return
        for s in value.split(' '):
            s = re.sub(r'&#\d+', '', s)
            s = re.sub(r'[^\-\s0-9\,\.]', ' ', s)
            s = s.replace(' ', '')
            if len(s) == 0:
                continue
            if not valid_number(s):
                continue
            if not sum(s.count(sep) for sep in [',', '.', '\'']):
                yield float(s)
            else:
                # Treat the last separator as the decimal point
                s = s.replace('.', '@').replace('\'', '@').replace(',', '@')
                integer, decimal = s.rsplit('@', 1)
                integer = integer.replace('@', '')
                s = '.'.join([integer, decimal])
                yield float(s)
OK, the code below could probably be replaced by a few regexp statements.
def valid_number(s):
    def _correct_integer(integer):
        # First number should have length of 1-3
        if not (0 < len(integer[0].replace('-', '')) < 4):
            return False
        # All the rest of the integers should be of length 3
        for num in integer[1:]:
            if len(num) != 3:
                return False
        return True
    seps = ['.', ',', '\'']
    n_seps = [s.count(k) for k in seps]
    # If no separator is present
    if sum(n_seps) == 0:
        return True
    # If all separators are present
    elif all(n_seps):
        return False
    # If two separators are present
    elif any(all(c) for c in combinations(n_seps, 2)):
        # Find thousand separator
        for c in s:
            if c in seps:
                tho_sep = c
                break
        # Find decimal separator:
        for c in reversed(s):
            if c in seps:
                dec_sep = c
                break
        s = s.split(dec_sep)
        # If it is more than one decimal separator
        if len(s) != 2:
            return False
        integer = s[0].split(tho_sep)
        return _correct_integer(integer)
    # If one separator is present, and it is more than one of it
    elif sum(n_seps) > 1:
        for sep in seps:
            if sep in s:
                s = s.split(sep)
                break
        return _correct_integer(s)
    # Otherwise, this is a regular decimal number
    else:
        return True
Output
extract_number('2' ): [2.0]
extract_number('.2' ): [0.2]
extract_number(2 ): [2.0]
extract_number(0.2 ): [0.2]
extract_number('EUR 200' ): [200.0]
extract_number('EUR 200.00 -11.2' ): [200.0, -11.2]
extract_number('EUR 200 EUR 300' ): [200.0, 300.0]
extract_number('$ -1.000,22' ): [-1000.22]
extract_number('EUR 100.2345,3443' ): []
extract_number('111,145,234.345.345'): []
extract_number('20,5 20,8' ): [20.5, 20.8]
extract_number('20.345.32.231,50' ): []
https://stackoverflow.com/questions/20157375