| Text1 | Text2 | Change |
|:-------------------------------|:------:-----------------| -----:---------|
| This is Mango. This is Banana | This is Banana | This is Mango. |
| This is Mango. | This is Mango, Banana | , Banana |我想如上所述从Text1和Text2派生Change列。上面是excel data / DATAFRAME
下面的代码可以很好地处理文本,但不能处理DATAFRAME
import difflib定义取自:https://en.wikipedia.org/wiki/Internet_Information_Services的原始文本
original = ["About the IIS", "", "IIS 8.5 has several improvements related", "to performance in large-scale scenarios, such", "as those used by commercial hosting providers and Microsoft's", "own cloud offerings."]定义修改后的文本
edited = ["About the IIS", "", "It has several improvements related", "to performance in large-scale scenarios."]启动不同的对象
d = difflib.Differ()计算两个文本之间的差异
diff = d.compare(original, edited)输出结果
print ('\n'.join(diff))=>输出如下所示
python comparing-strings-difflib.py
About the IIS
- IIS 8.5 has several improvements related
? ^^^^^^
+ It has several improvements related
? ^
- to performance in large-scale scenarios, such
? ^^^^^^
+ to performance in large-scale scenarios.
? 发布于 2021-05-10 14:12:00
创建diff是为了比较字符串(特别是source code)
你应该使用
df['Change'] = df.apply(function_name, axis=1) 在每一行上运行自己的函数并比较行中的两个文本。
但使用Diff()获取更改并不有用,因为它以文本形式提供结果。
您应该使用SequenceMatcher将其作为元组获取
(operation, text1_start, text1_end, text2_start, text2_end) 要创建DataFrame,我必须在Text2列中添加缺少的行,并且我使用None来识别缺少的行。
diff_row_number = len(original) - len(edited)
if diff_row_number > 0:
edited = edited + [None]*diff_row_number
elif diff_row_number < 0:
original = original + [None]*(-diff_row_number)如果使用空字符串"",则不需要part if text2 is None:
最小工作代码
import pandas as pd
import difflib
original = ["About the IIS", "", "IIS 8.5 has several improvements related", "to performance in large-scale scenarios, such", "as those used by commercial hosting providers and Microsoft's", "own cloud offerings."]
edited = ["About the IIS", "", "It has several improvements related", "to performance in large-scale scenarios."]
#d = difflib.Differ()
#diff = d.compare(original, edited)
#for line in diff:
# print(line)
# fill missing lines
diff_row_number = len(original) - len(edited)
if diff_row_number > 0:
edited = edited + [None]*diff_row_number
elif diff_row_number < 0:
original = original + [None]*(-diff_row_number)
df = pd.DataFrame({
'Text1': original,
'Text2': edited,
})
def compare_row(row):
text1, text2 = row
if text2 is None:
return 'remove: ' + text1
else:
sm = difflib.SequenceMatcher(a=text1, b=text2)
opcodes = sm.get_opcodes()
changes = []
for item in opcodes:
if item[0] != 'equal':
name, a1,a2, b1,b2 = item
changes.append( name + ' : ' + text1[a1:a2] + ' : ' + text2[b1:b2] )
return '\n'.join(changes)
# --- main ---
df['Change'] = df.apply(compare_row, axis=1)
print(df['Change'])结果:
0
1
2 replace : IS 8.5 : t
3 replace : , such : .
4 remove: as those used by commercial hosting pr...
5 remove: own cloud offerings.
Name: Change, dtype: object编辑:
用空字符串代替None也是如此
import pandas as pd
import difflib
original = ["About the IIS", "", "IIS 8.5 has several improvements related", "to performance in large-scale scenarios, such", "as those used by commercial hosting providers and Microsoft's", "own cloud offerings."]
edited = ["About the IIS", "", "It has several improvements related", "to performance in large-scale scenarios."]
#d = difflib.Differ()
#diff = d.compare(original, edited)
#for line in diff:
# print(line)
# fill missing lines
diff_row_number = len(original) - len(edited)
if diff_row_number > 0:
edited = edited + [""]*diff_row_number
elif diff_row_number < 0:
original = original + [""]*(-diff_row_number)
df = pd.DataFrame({
'Text1': original,
'Text2': edited,
})
def compare_row(row):
text1, text2 = row
sm = difflib.SequenceMatcher(a=text1, b=text2)
opcodes = sm.get_opcodes()
changes = []
for item in opcodes:
if item[0] != 'equal':
name, a1,a2, b1,b2 = item
changes.append( name + ' : ' + text1[a1:a2] + ' : ' + text2[b1:b2] )
return '\n'.join(changes)
# --- main ---
df['Change'] = df.apply(compare_row, axis=1)
print(df['Change'])结果:(用delete代替remove)
0
1
2 replace : IS 8.5 : t
3 replace : , such : .
4 delete : as those used by commercial hosting p...
5 delete : own cloud offerings. :
Name: Change, dtype: objecthttps://stackoverflow.com/questions/67464247
复制相似问题