首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >Python Excel字符串列差异

Python Excel字符串列差异
EN

Stack Overflow用户
提问于 2021-05-10 10:59:03
回答 1查看 54关注 0票数 1
代码语言:javascript
复制
| Text1                          | Text2                   | Change         |
|:-------------------------------|:------:-----------------| -----:---------|
| This is Mango. This is Banana  | This is Banana          | This is Mango. |
| This is Mango.                 | This is Mango, Banana   | , Banana       |

我想如上所述从Text1和Text2派生Change列。上面是excel data / DATAFRAME

下面的代码可以很好地处理文本,但不能处理DATAFRAME

代码语言:javascript
复制
import difflib

定义取自:https://en.wikipedia.org/wiki/Internet_Information_Services的原始文本

代码语言:javascript
复制
original = ["About the IIS", "", "IIS 8.5 has several improvements related", "to performance in large-scale scenarios, such", "as those used by commercial hosting providers and Microsoft's", "own cloud offerings."]

定义修改后的文本

代码语言:javascript
复制
edited = ["About the IIS", "", "It has several improvements related", "to performance in large-scale scenarios."]

启动不同的对象

代码语言:javascript
复制
d = difflib.Differ()

计算两个文本之间的差异

代码语言:javascript
复制
diff = d.compare(original, edited)

输出结果

代码语言:javascript
复制
print ('\n'.join(diff))

=>输出如下所示

代码语言:javascript
复制
 python comparing-strings-difflib.py
  About the IIS
  
- IIS 8.5 has several improvements related
?  ^^^^^^

+ It has several improvements related
?  ^

- to performance in large-scale scenarios, such
?                                        ^^^^^^

+ to performance in large-scale scenarios.
?                       
EN

回答 1

Stack Overflow用户

发布于 2021-05-10 14:12:00

创建diff是为了比较字符串(特别是source code)

你应该使用

代码语言:javascript
复制
df['Change'] = df.apply(function_name, axis=1) 

在每一行上运行自己的函数并比较行中的两个文本。

但使用Diff()获取更改并不有用,因为它以文本形式提供结果。

您应该使用SequenceMatcher将其作为元组获取

代码语言:javascript
复制
(operation, text1_start, text1_end, text2_start, text2_end) 

要创建DataFrame,我必须在Text2列中添加缺少的行,并且我使用None来识别缺少的行。

代码语言:javascript
复制
diff_row_number = len(original) - len(edited)

if diff_row_number > 0:
    edited = edited + [None]*diff_row_number
elif diff_row_number < 0:    
    original = original + [None]*(-diff_row_number)

如果使用空字符串"",则不需要part if text2 is None:

最小工作代码

代码语言:javascript
复制
import pandas as pd
import difflib

original = ["About the IIS", "", "IIS 8.5 has several improvements related", "to performance in large-scale scenarios, such", "as those used by commercial hosting providers and Microsoft's", "own cloud offerings."]
edited = ["About the IIS", "", "It has several improvements related", "to performance in large-scale scenarios."]

#d = difflib.Differ()
#diff = d.compare(original, edited)

#for line in diff:
#    print(line)

# fill missing lines    
diff_row_number = len(original) - len(edited)

if diff_row_number > 0:
    edited = edited + [None]*diff_row_number
elif diff_row_number < 0:    
    original = original + [None]*(-diff_row_number)

df = pd.DataFrame({
   'Text1': original,
   'Text2': edited,
})

def compare_row(row):
    text1, text2 = row
    
    if text2 is None:
        return 'remove: ' + text1
    else:
        sm = difflib.SequenceMatcher(a=text1, b=text2)
        opcodes = sm.get_opcodes()
        
        changes = []
        
        for item in opcodes:
            if item[0] != 'equal':
                name, a1,a2, b1,b2 = item
                changes.append( name + ' : ' + text1[a1:a2] + ' : ' + text2[b1:b2] )
                
        return '\n'.join(changes)

# --- main ---

df['Change'] = df.apply(compare_row, axis=1)

print(df['Change'])

结果:

代码语言:javascript
复制
0                                                     
1                                                     
2                                 replace : IS 8.5 : t
3                                 replace : , such : .
4    remove: as those used by commercial hosting pr...
5                         remove: own cloud offerings.
Name: Change, dtype: object

编辑:

用空字符串代替None也是如此

代码语言:javascript
复制
import pandas as pd
import difflib

original = ["About the IIS", "", "IIS 8.5 has several improvements related", "to performance in large-scale scenarios, such", "as those used by commercial hosting providers and Microsoft's", "own cloud offerings."]
edited = ["About the IIS", "", "It has several improvements related", "to performance in large-scale scenarios."]

#d = difflib.Differ()
#diff = d.compare(original, edited)

#for line in diff:
#    print(line)

# fill missing lines    
diff_row_number = len(original) - len(edited)

if diff_row_number > 0:
    edited = edited + [""]*diff_row_number
elif diff_row_number < 0:    
    original = original + [""]*(-diff_row_number)    

df = pd.DataFrame({
   'Text1': original,
   'Text2': edited,
})

def compare_row(row):
    text1, text2 = row
    
    sm = difflib.SequenceMatcher(a=text1, b=text2)
    opcodes = sm.get_opcodes()
    
    changes = []
    
    for item in opcodes:
        if item[0] != 'equal':
            name, a1,a2, b1,b2 = item
            changes.append( name + ' : ' + text1[a1:a2] + ' : ' + text2[b1:b2] )
            
    return '\n'.join(changes)

# --- main ---

df['Change'] = df.apply(compare_row, axis=1)

print(df['Change'])

结果:(用delete代替remove)

代码语言:javascript
复制
0                                                     
1                                                     
2                                 replace : IS 8.5 : t
3                                 replace : , such : .
4    delete : as those used by commercial hosting p...
5                     delete : own cloud offerings. : 
Name: Change, dtype: object
票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/67464247

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档