我又带着一个老问题回来了。在试验了许多基于Python的Damerau-Levenshtein编辑远程实现之后,我终于找到了下面列出的作为editdistance_reference()。它似乎提供了正确的结果,并似乎有一个有效的实施。
因此,我开始将代码转换为Cython。在我的测试数据上,参考方法能够提供11000个比较的结果(对于12个字母的单词),而Cythonized每秒进行的比较超过20万次。遗憾的是,结果是不正确的:当您查看我打印出来的用于调试的变量thisrow时,我的版本中充满了变量,不管我抛出了什么数据,而引用输出显示了另一张图片。例如,对'helo'进行'world'测试会产生以下输出(ED标记了我的函数,EDR是正确工作的参考):
来自editdistance()
#ED A [0, 0, 0, 0, 0, 1]
#ED B [1, 0, 0, 0, 0, 1]
#ED B [1, 1, 0, 0, 0, 1]
#ED B [1, 1, 1, 0, 0, 1]
#ED B [1, 1, 1, 1, 0, 1]
#ED B [1, 1, 1, 1, 1, 1]
#ED A [0, 0, 0, 0, 0, 2]
#ED B [1, 0, 0, 0, 0, 2]
#ED B [1, 1, 0, 0, 0, 2]
#ED B [1, 1, 1, 0, 0, 2]
#ED B [1, 1, 1, 1, 0, 2]
#ED B [1, 1, 1, 1, 1, 2]
#ED A [0, 0, 0, 0, 0, 3]
#ED B [1, 0, 0, 0, 0, 3]
#ED B [1, 1, 0, 0, 0, 3]
#ED B [1, 1, 1, 0, 0, 3]
#ED B [1, 1, 1, 1, 0, 3]
#ED B [1, 1, 1, 1, 1, 3]
#ED A [0, 0, 0, 0, 0, 4]
#ED B [1, 0, 0, 0, 0, 4]
#ED B [1, 1, 0, 0, 0, 4]
#ED B [1, 1, 1, 0, 0, 4]
#ED B [1, 1, 1, 1, 0, 4]
#ED B [1, 1, 1, 1, 1, 4]来自editdistance_reference()
#EDR A [0, 0, 0, 0, 0, 1]
#EDR B [1, 0, 0, 0, 0, 1]
#EDR B [1, 2, 0, 0, 0, 1]
#EDR B [1, 2, 3, 0, 0, 1]
#EDR B [1, 2, 3, 4, 0, 1]
#EDR B [1, 2, 3, 4, 5, 1]
#EDR A [0, 0, 0, 0, 0, 2]
#EDR B [2, 0, 0, 0, 0, 2]
#EDR B [2, 2, 0, 0, 0, 2]
#EDR B [2, 2, 3, 0, 0, 2]
#EDR B [2, 2, 3, 4, 0, 2]
#EDR B [2, 2, 3, 4, 5, 2]
#EDR A [0, 0, 0, 0, 0, 3]
#EDR B [3, 0, 0, 0, 0, 3]
#EDR B [3, 3, 0, 0, 0, 3]
#EDR B [3, 3, 3, 0, 0, 3]
#EDR B [3, 3, 3, 3, 0, 3]
#EDR B [3, 3, 3, 3, 4, 3]
#EDR A [0, 0, 0, 0, 0, 4]
#EDR B [4, 0, 0, 0, 0, 4]
#EDR B [4, 4, 0, 0, 0, 4]
#EDR B [4, 4, 4, 0, 0, 4]
#EDR B [4, 4, 4, 4, 0, 4]
#EDR B [4, 4, 4, 4, 4, 4]我一定很笨,因为这个错误可能是非常明显的事情之一。但我似乎找不到它。
还有第二个问题:I malloc空间twoago、oneago和thisrow三个数组,然后以循环方式交换它们。当我尝试使用free( twoago )等的时候,我得到了glibc抱怨double free or corruption的一行。我在谷歌上搜索了一下,难道是因为指针交换业务让滑翔机有点晕眩,所以它无法正确释放内存吗?
下面我首先列出编译所需的setup.py (/path/to/python3.1 ./setup.py build_ext --inplace),然后列出编辑距离代码本身,因此感兴趣的人发现复制起来更容易。
还有一件事:这是与Python3.1一起运行的;有趣的是,在*.pyx文件中,我们确实有空的unicode字符串,但是print仍然是一个语句,而不是一个函数。
是的,我知道这是很多要粘贴的代码,但是问题是,当你削减太多代码的时候,代码是无法运行的。我相信除了editdistance()之外,所有的方法都能正常工作,但是可以自由地指出你感知到的任何问题。
setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
setup(
name = 'cython_dameraulevenshtein',
ext_modules = [
Extension( 'cython_dameraulevenshtein', [ 'cython_dameraulevenshtein.pyx', ] ), ],
cmdclass = {
'build_ext': build_ext }, )cython_dameraulevenshtein.pyx (一直滚动到最后看到有趣的东西):
############################################################################################################
cdef extern from "stdlib.h":
ctypedef unsigned int size_t
void *malloc(size_t size)
void *realloc( void *ptr, size_t size )
void free(void *ptr)
#-----------------------------------------------------------------------------------------------------------
cdef inline unsigned int _minimum_of_two_uints( unsigned int a, unsigned int b ):
if a < b: return a
return b
#-----------------------------------------------------------------------------------------------------------
cdef inline unsigned int _minimum_of_three_uints( unsigned int a, unsigned int b, unsigned int c ):
if a < b:
if c < a:
return c
return a
if c < b:
return c
return b
#-----------------------------------------------------------------------------------------------------------
cdef inline int _warp( unsigned int limit, int value ):
return value if value >= 0 else limit + value
############################################################################################################
# ARRAYS THAT SAY SIZE ;-)
#-----------------------------------------------------------------------------------------------------------
cdef class Array_of_unsigned_int:
cdef unsigned int *data
cdef unsigned int length
#---------------------------------------------------------------------------------------------------------
def __cinit__( self, unsigned int length, fill_value = None ):
self.length = length
self.data = <unsigned int *>malloc( length * sizeof( unsigned int ) ) ###OBS### must check malloc doesn't return NULL pointer
if fill_value is not None:
self.fill( fill_value )
#---------------------------------------------------------------------------------------------------------
cdef fill( self, unsigned int value ):
cdef unsigned int idx
cdef unsigned int *d = self.data
for idx from 0 <= idx < self.length:
d[ idx ] = value
#---------------------------------------------------------------------------------------------------------
cdef resize( self, unsigned int length ):
self.data = <unsigned int *>realloc( self.data, length * sizeof( unsigned int ) ) ###OBS### must check realloc doesn't return NULL pointer
self.length = length
#---------------------------------------------------------------------------------------------------------
def free( self ):
"""Always remember the milk: Free up memory."""
free( self.data ) ###OBS### should free memory here
#---------------------------------------------------------------------------------------------------------
def as_list( self ):
"""Return the array as a Python list."""
R = []
cdef unsigned int idx
cdef unsigned int *d = self.data
for idx from 0 <= idx < self.length:
R.append( d[ idx ] )
return R
############################################################################################################
# CONVERTING UNICODE TO CHARACTER IDs (CIDs)
#---------------------------------------------------------------------------------------------------------
cdef unsigned int _UMX_surrogate_lower_bound = 0x10000
cdef unsigned int _UMX_surrogate_upper_bound = 0x10ffff
cdef unsigned int _UMX_surrogate_hi_lower_bound = 0xd800
cdef unsigned int _UMX_surrogate_hi_upper_bound = 0xdbff
cdef unsigned int _UMX_surrogate_lo_lower_bound = 0xdc00
cdef unsigned int _UMX_surrogate_lo_upper_bound = 0xdfff
cdef unsigned int _UMX_surrogate_foobar_factor = 0x400
#---------------------------------------------------------------------------------------------------------
cdef Array_of_unsigned_int _cids_from_text( text ):
"""Givn a ``text`` either as a Unicode string or as a ``bytes`` or ``bytearray``, return an instance of
``Array_of_unsigned_int`` that enumerates either the Unicode codepoints of each character or the value of
each byte. Surrogate pairs will be condensed into single values, so on narrow Python builds the length of
the array returned may be less than ``len( text )``."""
#.........................................................................................................
# Make sure ``text`` is either a Unicode string (``str``) or a ``bytes``-like thing:
is_bytes = isinstance( text, ( bytes, bytearray, ) )
assert is_bytes or isinstance( text, str ), '#121'
#.........................................................................................................
# Whether it is a ``str`` or a ``bytes``, we know the result can only have at most as many elements as
# there are characters in ``text``, so we can already reserve that much space (in the case of a Unicode
# text, there may be fewer CIDs if there happen to be surrogate characters):
cdef unsigned int length = <unsigned int>len( text )
cdef Array_of_unsigned_int R = Array_of_unsigned_int( length )
#.........................................................................................................
# If ``text`` is empty, we can return an empty array right away:
if length == 0: return R
#.........................................................................................................
# Otherwise, prepare to copy data:
cdef unsigned int idx = 0
#.........................................................................................................
# If ``text`` is a ``bytes``-like thing, use simplified processing; we just have to copy over all byte
# values and are done:
if is_bytes:
for idx from 0 <= idx < length:
R.data[ idx ] = <unsigned int>text[ idx ]
return R
#.........................................................................................................
cdef unsigned int cid = 0
cdef bool is_surrogate = False
cdef unsigned int hi = 0
cdef unsigned int lo = 0
cdef unsigned int chr_count = 0
#.........................................................................................................
# Iterate over all indexes in text:
for idx from 0 <= idx < length:
#.......................................................................................................
# If we met with a surrogate CID in the last cycle, then that was a high surrogate CID, and the
# corresponding low CID is on the current position. Having both, we can compute the intended CID
# and reset the flag:
if is_surrogate:
lo = <unsigned int>ord( text[ idx ] )
# IIRC, this formula was documented in Unicode 3:
cid = ( ( hi - _UMX_surrogate_hi_lower_bound ) * _UMX_surrogate_foobar_factor
+ ( lo - _UMX_surrogate_lo_lower_bound ) + _UMX_surrogate_lower_bound )
is_surrogate = False
#.......................................................................................................
else:
# Otherwise, we retrieve the CID from the current position:
cid = <unsigned int>ord( text[ idx ] )
#.....................................................................................................
if _UMX_surrogate_hi_lower_bound <= cid <= _UMX_surrogate_hi_upper_bound:
# If this CID is a high surrogate CID, set ``hi`` to this value and set a flag so we'll come back
# in the next cycle:
hi = cid
is_surrogate = True
continue
#.......................................................................................................
R.data[ chr_count ] = cid
chr_count += 1
#.........................................................................................................
# Surrogate CIDs take up two characters but end up as a single resultant CID, so the return value may
# have fewer elements than the naive string length indicated; in this case, we want to free some memory
# and correct array length data:
if chr_count != length:
R.resize( chr_count )
#.........................................................................................................
return R
#---------------------------------------------------------------------------------------------------------
def cids_from_text( text ):
cdef Array_of_unsigned_int c_R =_cids_from_text( text )
R = c_R.as_list()
c_R.free() ###OBS### should free memory here
return R
############################################################################################################
# SECOND-ORDER SIMILARITY
#-----------------------------------------------------------------------------------------------------------
cpdef float similarity( char *a, char *b ):
"""Given two byte strings ``a`` and ``b``, return their Damerau-Levenshtein similarity as a float between
0.0 and 1.1. Similarity is computed as ``1 - relative_editdistance( a, b )``, so a result of ``1.0``
indicates identity, while ``0.0`` indicates complete dissimilarity."""
return 1.0 - relative_editdistance( a, b )
#-----------------------------------------------------------------------------------------------------------
cpdef float relative_editdistance( char *a, char *b ):
"""Given two byte strings ``a`` and ``b``, return their relative Damerau-Levenshtein distance. The return
value is a float between 0.0 and 1.0; it is calculated as the absolute edit distance, divided by the
length of the longer string. Therefore, ``0.0`` indicates identity, while ``1.0`` indicates complete
dissimilarity."""
cdef int length = max( len( a ), len( b ) )
if length == 0: return 0.0
return editdistance( a, b ) / <float>length
############################################################################################################
# EDIT DISTANCE
#-----------------------------------------------------------------------------------------------------------
cpdef unsigned int editdistance( text_a, text_b ):
"""Given texts as Unicode strings or ``bytes`` / ``bytearray`` objects, return their absolute
Damerau-Levenshtein distance. Each deletion, insertion, substitution, and transposition is counted as one
difference, so the edit distance between ``abc`` and ``ab``, ``abcx``, ``abx``, ``acb``, respectively, is
``1``."""
#.........................................................................................................
# This should be fast in Python, as it can (and probably is) implemented by doing an identity check in
# the case of ``bytes`` and ``str`` objects:
if text_a == text_b: return 0
#.........................................................................................................
# Convert Unicode text to C array of unsigned integers:
cdef Array_of_unsigned_int a = _cids_from_text( text_a )
cdef Array_of_unsigned_int b = _cids_from_text( text_b )
R = c_editdistance( a, b )
#.........................................................................................................
# Always remember the milk:
a.free()
b.free()
#.........................................................................................................
return R
#-----------------------------------------------------------------------------------------------------------
cdef unsigned int c_editdistance( Array_of_unsigned_int cids_a, Array_of_unsigned_int cids_b ):
# Conceptually, this is based on a len(a) + 1 * len(b) + 1 matrix.
# However, only the current and two previous rows are needed at once,
# so we only store those.
#.........................................................................................................
# This shortcut is pretty useless if comparison is not very fast; therefore, it is done in the function
# that deals with the Python objects, q.v.
# if cids_a.equals( cids_b ): return 0
#.........................................................................................................
cdef unsigned int a_length = cids_a.length
cdef unsigned int b_length = cids_b.length
#.........................................................................................................
# Another shortcut: if one of the texts is empty, then the edit distance is trivially the length of the
# other text. This also works for two empty texts, but those have already been taken care of by the
# previous shortcut:
#.........................................................................................................
if a_length == 0: return b_length
if b_length == 0: return a_length
#.........................................................................................................
cdef unsigned int row_length = b_length + 1
cdef unsigned int row_length_1 = row_length - 1
cdef unsigned int row_bytecount = sizeof( unsigned int ) * row_length
cdef unsigned int *oneago = <unsigned int *>malloc( row_bytecount ) ###OBS### must check malloc doesn't return NULL pointer
cdef unsigned int *twoago = <unsigned int *>malloc( row_bytecount ) ###OBS### must check malloc doesn't return NULL pointer
cdef unsigned int *thisrow = <unsigned int *>malloc( row_bytecount ) ###OBS### must check malloc doesn't return NULL pointer
cdef unsigned int idx = 0
cdef unsigned int idx_a = 0
cdef unsigned int idx_b = 0
cdef int idx_a_1_text = 0
cdef int idx_b_1_row = 0
cdef int idx_b_2_row = 0
cdef int idx_b_1_text = 0
cdef unsigned int deletion_cost = 0
cdef unsigned int addition_cost = 0
cdef unsigned int substitution_cost = 0
#.........................................................................................................
# Equivalent of ``thisrow = list( range( 1, b_length + 1 ) ) + [ 0 ]``:
#print( '#305', cids_a.as_list(), cids_b.as_list(), a_length, b_length, row_length, row_length_1 )
for idx from 1 <= idx < row_length:
thisrow[ idx - 1 ] = idx
thisrow[ row_length - 1 ] = 0
#.........................................................................................................
for idx_a from 0 <= idx_a < a_length:
idx_a_1_text = _warp( a_length, idx_a - 1 )
twoago, oneago = oneago, thisrow
#.......................................................................................................
# Equivalent of ``thisrow = [ 0 ] * b_length + [ idx_a + 1 ]``:
for idx from 0 <= idx < row_length_1:
thisrow[ idx ] = 0
thisrow[ row_length - 1 ] = idx_a + 1
#.......................................................................................................
# some diagnostic output:
x = []
for idx from 0 <= idx < row_length: x.append( thisrow[ idx ] )
print
print '#ED A', x
#.......................................................................................................
for idx_b from 0 <= idx_b < b_length:
#.....................................................................................................
idx_b_1_row = _warp( row_length, idx_b - 1 )
idx_b_1_text = _warp( b_length, idx_b - 1 )
#.....................................................................................................
assert 0 <= idx_b_1_row < row_length, ( '#323', idx_b_1_row, )
assert 0 <= idx_a_1_text < a_length, ( '#324', idx_a_1_text, )
assert 0 <= idx_b_1_text < b_length, ( '#325', idx_b_1_text, )
#.....................................................................................................
deletion_cost = oneago[ idx_b ] + 1
addition_cost = thisrow[ idx_b_1_row ] + 1
substitution_cost = oneago[ idx_b_1_row ] + ( 1 if cids_a.data[ idx_a ]
!= cids_b.data[ idx_b ] else 0 )
thisrow[ idx_b ] = _minimum_of_three_uints( deletion_cost, addition_cost, substitution_cost )
#.....................................................................................................
# Transpositions:
if ( idx_a > 0
and idx_b > 0
and cids_a.data[ idx_a ] == cids_b.data[ idx_b_1_text ]
and cids_a.data[ idx_a_1_text ] == cids_b.data[ idx_b ]
and cids_a.data[ idx_a ] != cids_b.data[ idx_b ] ):
#...................................................................................................
idx_b_2_row = _warp( row_length, idx_b - 2 )
assert 0 <= idx_b_2_row < row_length, ( '#340', idx_b_2_row, )
thisrow[ idx_b ] = _minimum_of_two_uints( thisrow[ idx_b ], twoago[ idx_b_2_row ] + 1 )
#.....................................................................................................
# some diagnostic output:
x = []
for idx from 0 <= idx < row_length: x.append( thisrow[ idx ] )
print '#ED B', x
#.........................................................................................................
# Here, ``b_length - 1`` can't become negative, since we already tested for ``b_length == 0`` in the
# shortcut above:
cdef unsigned int R = thisrow[ b_length - 1 ]
#.........................................................................................................
# Always remember the milk:
# BUG: Activating below lines leads to glibc failing with ``double free or corruption``
#free( twoago )
#free( oneago )
#free( thisrow )e
#.........................................................................................................
return R
#-----------------------------------------------------------------------------------------------------------
def editdistance_reference( text_a, text_b ):
"""This method is believed to compute a correct Damerau-Levenshtein edit distance, with deletions,
insertions, substitutions, and transpositions. Do not touch it; it is here to validate results returned
from the above method. Code adapted from
http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance"""
# Conceptually, the implementation is based on a ``( len( seq1 ) + 1 ) * ( len( seq2 ) + 1 )`` matrix.
# However, only the current and two previous rows are needed at once, so we only store those. Python
# lists wrap around for negative indices, so we put the leftmost column at the *end* of the list. This
# matches with the zero-indexed strings and saves extra calculation.
b_length = len( text_b )
oneago = None
thisrow = list( range( 1, b_length + 1 ) ) + [ 0 ]
for idx_a in range( len( text_a ) ):
twoago, oneago, thisrow = oneago, thisrow, [ 0 ] * b_length + [ idx_a + 1 ]
#.......................................................................................................
# some diagnostic output:
print
print '#EDR A', thisrow
#.......................................................................................................
for idx_b in range( b_length ):
deletion_cost = oneago[ idx_b ] + 1
addition_cost = thisrow[ idx_b - 1 ] + 1
substitution_cost = oneago[ idx_b - 1 ] + ( text_a[ idx_a ] != text_b[ idx_b ] )
thisrow[ idx_b ] = min( deletion_cost, addition_cost, substitution_cost )
if ( idx_a > 0
and idx_b > 0
and text_a[ idx_a ] == text_b[ idx_b - 1 ]
and text_a[ idx_a - 1 ] == text_b[ idx_b ]
and text_a[ idx_a ] != text_b[ idx_b ] ):
thisrow[ idx_b ] = min( thisrow[ idx_b ], twoago[ idx_b - 2 ] + 1 )
#.....................................................................................................
# some diagnostic output:
print '#EDR B', thisrow
#.....................................................................................................
return thisrow[ len( text_b ) - 1 ]编辑我还把这个文本发布到巴斯丁和Cython。
发布于 2010-08-07 23:55:21
做一些初步的调试。您知道,在标记为#ED B的第二输出行中出现了错误。错误的值似乎表明它在早期就找到了一个编辑,并且再也找不到了。这可能是因为min()的一个args卡在1.打印deletion_cost,substitution_cost,addition_cost .哪一个是错的?怎么搞错了?打印输入文本值。暂时禁用换位部分,以查看这是否使问题消失。检查并重新检查_warp抓取器(如果我曾经见过它的话,这是一种诡计多端的花招)及其用法。如果你把"aaaaa“和"aaaaa”相提并论会发生什么?"qwerty“和"qwerty"?"xxxxx“和"yyyyy"?问题是否发生在所有的bytes、bytearray和str输入中?
免费的问题:我怀疑腐败,而不是头昏眼花。打印这三个数组;它们的内容是否与预期的一样?试着一次启用free()一个数组--所有这些都损坏了吗?只有一个?哪个?
内存管理方面的一些问题:您可能喜欢阅读这,并考虑使用Python特定的例程,而不是malloc/free。如果有代孕者,缩小数组的规模似乎太过了。
更新:遵循我自己的建议。删除成本被填充。“曾经”和“这一行”是一样的。造成错误答案和加倍的问题(-!没有被破坏!-)自由:指针的循环洗牌不是循环的。
# twoago, oneago = oneago, thisrow ### BUG ###
twoago, oneago, thisrow = oneago, thisrow, twoago ### FIXED ###更新2:注释容量太小了--没有莫乔,就像我建议的那样,只是普通的调试过程。“为我的修复而专注于此”并不是“超理想的”。引用代码确实为每一次传递创建了一个新列表,这是可以做到的,因为thisrow引用的不是上次传递的任何内容。它不需要这样做,事实上,除了第一个和最后一个元素之外,初始化可以由随机数组成,并且只有在那里填写列表,这样它才能被编入索引,而不是像一些非技巧的实现那样附加到其中。因此,您可以盲目地模仿“引用实现”,而代价是执行额外的(浪费的) malloc/free,或者您可以忽略Python特定的实现细节,只将引用实现作为一个大概正确答案的来源。然后,您可以接受我的修复,然后继续通过削减thisrow数组的大部分初始化来节省时间。
更新3:为您提供了一个替代参考实现。它最初分配3行,以避免在外部循环中创建列表的开销。它还避免了除了thisrow的最后一个元素之外的所有不必要的初始化。这简化了C/Cython的转换。
def damlevref2(seq1, seq2):
# For Python 2.x as was the original.
# Appears to work on Python 1.5.2 as well :-)
seq2len = len(seq2)
twoago = [-777] * (seq2len + 1) # pseudo-malloc; any old rubbish will do
oneago = [-666] * (seq2len + 1) # ditto
thisrow = range(1, seq2len + 1) + [0]
for x in xrange(len(seq1)):
twoago, oneago, thisrow = oneago, thisrow, twoago # circular "pointer" shuffle
thisrow[-1] = x + 1
for y in xrange(seq2len):
delcost = oneago[y] + 1
addcost = thisrow[y - 1] + 1
subcost = oneago[y - 1] + (seq1[x] != seq2[y])
thisrow[y] = min(delcost, addcost, subcost)
if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
return thisrow[seq2len - 1] https://stackoverflow.com/questions/3431933
复制相似问题