Restore a string to its original casing and punctuation (pandas)

Stack Overflow user
Asked 2022-04-11 20:08:14
Answers: 2 · Views: 45 · Followers: 0 · Votes: 1

Is there a way to modify this code so that it keeps its logic but returns the strings with their original casing and punctuation?

import pandas as pd

data = {'duplicate_column':["Adidas Women's Womens A004 Snow Boot", 'Amul Milk, 100ml, 100ML', 'L-OCCITANE L´Occitane CREMA MANI', 'Corneto Ice Cream Ice, 300 ml -300ml', 'Béaba BÉABA, Set di 6 Contenitori,set']}
df = pd.DataFrame(data)
punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~´'
transtab = str.maketrans(dict.fromkeys(punct, ''))

# Strip punctuation, lowercase, then drop repeated words while keeping their order
df['new_column'] = [
    ' '.join(dict.fromkeys(s.translate(transtab).lower().split()))
    for s in df['duplicate_column']
]

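This code removes duplicate words from the `duplicate_column` column and writes the result to a new column. I need the resulting strings to keep their original casing and punctuation.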
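To see where the casing and punctuation get lost, it may help to run the same steps on a single string without pandas. This is only a minimal sketch: the function name `dedupe_lossy` is hypothetical, and `punct` is copied from the code above. `dict.fromkeys` keeps the first occurrence of each key in insertion order, but the keys are already lowercased and stripped at that point, so the original tokens are no longer available.

```python
# Punctuation set copied from the question.
punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~´'
transtab = str.maketrans(dict.fromkeys(punct, ''))

def dedupe_lossy(s):
    # Normalise first, then deduplicate: the original tokens are discarded here.
    words = s.translate(transtab).lower().split()
    # dict.fromkeys keeps the first occurrence of each word, in order.
    return ' '.join(dict.fromkeys(words))

result = dedupe_lossy("Amul Milk, 100ml, 100ML")
print(result)  # amul milk 100ml
```

Any fix therefore has to keep the original tokens around and use the normalised form only for comparison.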

Duplicate column (initial data):

Adidas Women's Womens A004 Snow Boot
Amul Milk, 100ml, 100ML
L-OCCITANE L´Occitane CREMA MANI
Corneto Ice Cream Ice, 300 ml -300ml
Béaba BÉABA, Set di 6 Contenitori,set

New column created by the code above:

adidas womens a004 snow boot
amul milk 100ml
loccitane crema mani
corneto ice cream 300 ml 300ml
béaba set di 6 contenitori set

Desired output:

Adidas Women's A004 Snow Boot
Amul Milk, 100ml
L-OCCITANE CREMA MANI
Corneto Ice Cream, 300 ml
Béaba, Set di 6 Contenitori

2 Answers

Stack Overflow user

Accepted answer

Posted 2022-04-11 20:38:21

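Rather than trying to repair the output afterwards, don't discard the punctuation and casing in the first place. You can use a custom function based on a set: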

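import re

# Character class matching any character in `punct` (defined in the question)
regex = re.compile('[%s]' % re.escape(punct))

def remove_dup(s):
    seen = set()  # normalised words already encountered
    keep = []     # original tokens to keep, in order
    for w in s.split():
        # Compare on a lowercased, punctuation-free key, but keep the original token
        w2 = regex.sub('', w.lower())
        if w2 in seen:
            continue
        seen.add(w2)
        keep.append(w)
    # Trim punctuation left dangling at either end, e.g. a trailing comma
    return ' '.join(keep).strip(punct)

df['new_column'] = list(map(remove_dup, df['duplicate_column']))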

Output:

                        duplicate_column                       new_column
0   Adidas Women's Womens A004 Snow Boot    Adidas Women's A004 Snow Boot
1                Amul Milk, 100ml, 100ML                 Amul Milk, 100ml
2       L-OCCITANE L´Occitane CREMA MANI            L-OCCITANE CREMA MANI
3   Corneto Ice Cream Ice, 300 ml -300ml  Corneto Ice Cream 300 ml -300ml
4  Béaba BÉABA, Set di 6 Contenitori,set   Béaba Set di 6 Contenitori,set
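
Rows 3 and 4 still differ from the desired output because `s.split()` splits on whitespace only: `Contenitori,set` stays a single token whose comparison key is `contenitoriset`, so the trailing `set` is never seen as a duplicate. One way to inspect this is to print each token next to its key. This is a sketch; the helper name `token_keys` is hypothetical, and `punct` is copied from the question.

```python
import re

punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~´'
regex = re.compile('[%s]' % re.escape(punct))

def token_keys(s):
    # Pair each whitespace-separated token with the key used for comparison.
    return [(w, regex.sub('', w.lower())) for w in s.split()]

pairs = token_keys("Béaba BÉABA, Set di 6 Contenitori,set")
for token, key in pairs:
    print(repr(token), '->', repr(key))
```

The first two tokens share the key `béaba`, which is why `BÉABA,` is dropped together with its comma.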

Alternative approach:

import re
pat = r'[\s%s]' % re.escape(punct)  # raw string, so \s is not treated as an invalid string escape
regex = re.compile(pat)
regex2 = re.compile(fr'({pat}+)(?!s\b|\s*ml\b)')

def remove_dup(s):
    seen = set()
    keep = []
    for w in regex2.split(s):
        if len(w)>1:
            w2 = regex.sub('', w.lower())
            if w2 in seen:
                continue 
            seen.add(w2)
            keep.append(w.strip())
        else:
            keep.append(w)
    return ''.join(keep).strip(punct)
        
df['new_column'] = list(map(remove_dup, df['duplicate_column']))

print(df)

Output:

                        duplicate_column                      new_column
0   Adidas Women's Womens A004 Snow Boot  Adidas Women's  A004 Snow Boot
1                Amul Milk, 100ml, 100ML                 Amul Milk,100ml
2       L-OCCITANE L´Occitane CREMA MANI        L-OCCITANE L´ CREMA MANI
3   Corneto Ice Cream Ice, 300 ml -300ml       Corneto Ice Cream ,300 ml
4  Béaba BÉABA, Set di 6 Contenitori,set     Béaba ,Set di 6 Contenitori
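
The alternative relies on a property of `re.split`: when the pattern contains a capturing group, the matched separators are returned in the result list alongside the pieces. That is what allows the original punctuation to be put back with `''.join(keep)`. A minimal illustration, using a simpler pattern that is an assumption for this example and not the one from the answer:

```python
import re

# Capturing group around the separator: the separators are kept in the result.
sep = re.compile(r'([\s,]+)')

parts = sep.split("Amul Milk, 100ml")
print(parts)  # ['Amul', ' ', 'Milk', ', ', '100ml']

# Joining the parts with an empty string reproduces the input exactly.
rejoined = ''.join(parts)
print(rejoined == "Amul Milk, 100ml")  # True
```

In the function above, a separator longer than one character (such as `, `) also passes the `len(w) > 1` check, so it appears to be stripped and deduplicated like a word. That would explain the uneven spacing in this output, for example `Amul Milk,100ml`.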
Votes: 1

Stack Overflow user

Posted 2022-04-11 20:48:47

Another version:

import re

# Note: inside the character class, `+-.` is a range (+ to .) that also covers `,` and `-`
remove_punct = re.compile("""[!"#$%&'()*+-./:;<=>?@[\\]^_`{}~´]""")
millilitres = re.compile(r"(\d+)\s+(ml)", flags=re.I)


def remove_duplicates(x):
    # Basic preprocessing: turn commas into spaces and join e.g. "300 ml" into "300ml"
    x = x.replace(",", " ")
    x = millilitres.sub(r"\1\2", x)

    words = x.split()
    words_without_punct = remove_punct.sub("", x).lower().split()
    dupl, out = set(), []
    for w, wwp in zip(words, words_without_punct):
        if wwp not in dupl:
            out.append(w)
            dupl.add(wwp)
    return " ".join(out)


df["new_column"] = df["duplicate_column"].apply(remove_duplicates)
print(df)

Prints:

                        duplicate_column                     new_column
0   Adidas Women's Womens A004 Snow Boot  Adidas Women's A004 Snow Boot
1                Amul Milk, 100ml, 100ML                Amul Milk 100ml
2       L-OCCITANE L´Occitane CREMA MANI          L-OCCITANE CREMA MANI
3   Corneto Ice Cream Ice, 300 ml -300ml        Corneto Ice Cream 300ml
4  Béaba BÉABA, Set di 6 Contenitori,set     Béaba Set di 6 Contenitori
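
One thing to watch in this version: `words` and `words_without_punct` are paired with `zip`, which assumes both lists line up one to one. A token made only of punctuation, such as a free-standing `-`, disappears from the second list and shifts the pairing; for instance `Ice - Cream Ice` would come out as `Ice -`. A possible variant, sketched here under the assumption that punctuation-only tokens should simply be dropped, computes the key for each word individually. The name `remove_duplicates_per_word` is hypothetical, and the character class is written with explicit escapes instead of the `+-.` range.

```python
import re

remove_punct = re.compile(r"""[!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{}~´]""")
millilitres = re.compile(r"(\d+)\s+(ml)", flags=re.I)

def remove_duplicates_per_word(x):
    # Same preprocessing as the answer: commas become spaces, "300 ml" -> "300ml".
    x = x.replace(",", " ")
    x = millilitres.sub(r"\1\2", x)
    seen, out = set(), []
    for w in x.split():
        # Build the comparison key from this word only, so nothing can shift.
        key = remove_punct.sub("", w).lower()
        if not key:
            continue  # assumption: punctuation-only tokens are dropped
        if key not in seen:
            seen.add(key)
            out.append(w)
    return " ".join(out)

result = remove_duplicates_per_word("Corneto Ice Cream Ice, 300 ml -300ml")
print(result)  # Corneto Ice Cream 300ml
```

With a DataFrame it would be applied the same way as above, via `df["duplicate_column"].apply(remove_duplicates_per_word)`.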
Votes: 0
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/71833727
