Removing duplicates with Pandas while preserving order [python]

This is what my data looks like:

3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador
Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo
Bright-Starts Mordedor Twist & Teethe, Starts, Multicor

This is how it should look:

3sprouts Cesto de Roupa Cisne, Organizador
Bright-Starts Mordedor Chocalho Rattle & Teethe, Rosa/Roxo
Bright-Starts Mordedor Twist & Teethe, Multicor

2 answers

User

#1 · edited 2024-07-04 07:55:16

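Assumption:

Words containing '-' will not be removed

Some ideas:

Case of duplicates: IMO the comparison should be case-insensitive, so compare with .lower()
Keep the first occurrence: remove the others
Words separated by ',' or joined by '-': if a '-' is present, split the word on it, then strip ',' before comparing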

import re
import itertools

sentences = [
    '3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador',
    'Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo',
    'Bright-Starts Mordedor Twist & Teethe, Starts, Multicor'
]

for s in sentences:
    s_split = s.split(' ')  # keep the original sentence, split by ' '
    s_split_without_comma = [i.strip(',') for i in s_split]

    # Build the comparison words by splitting on both '-' and ' '.
    # The three methods below are interchangeable; each produces the same list.
    # method 1: re
    compare_words = re.split(' |-', s)
    # method 2: itertools
    compare_words = list(itertools.chain.from_iterable(i.split('-') for i in s_split))
    # method 3: DIY
    compare_words = []
    for i in s_split:
        compare_words += i.split('-')

    # strip ','
    compare_words_without_comma = [i.strip(',') for i in compare_words]

    # For each comparison word, find every token that contains it
    # (case-insensitively) and mark all but the first match for removal.
    need_removed_index = set()
    for word in compare_words_without_comma:
        matched_indexes = [
            idx for idx, w in enumerate(s_split_without_comma)
            if word.lower() in w.lower().split('-')
        ]
        if len(matched_indexes) > 1:  # has duplicates
            need_removed_index.update(matched_indexes[1:])

    # keep the remaining tokens and join with ' '
    print(" ".join(i for idx, i in enumerate(s_split) if idx not in need_removed_index))

This should print:

3sprouts Cesto de Roupa Cisne Sprouts, Organizador
Bright-Starts Mordedor Chocalho Rattle & Teethe, Rosa/Roxo
Bright-Starts Mordedor Twist & Teethe, Multicor
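Since the question is about a DataFrame column, the same rule can probably be wrapped in a function and applied with Series.apply. This is only a sketch: the column name col and the helper name dedupe_tokens are my own assumptions, and the one-pass set-based loop is a rewrite of the nested loops above that should give the same result, not the original code.

```python
import pandas as pd

def dedupe_tokens(sentence):
    """Drop a token if any of its '-'-separated parts (case-insensitive,
    trailing ',' ignored) already appeared in an earlier token."""
    seen = set()   # lowercased parts seen so far, including dropped tokens
    kept = []
    for token in sentence.split(' '):
        parts = {p.lower() for p in token.strip(',').split('-')}
        if not (parts & seen):
            kept.append(token)
        seen |= parts
    return ' '.join(kept)

# hypothetical frame; the column name 'col' is an assumption
df = pd.DataFrame({'col': [
    '3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador',
    'Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo',
    'Bright-Starts Mordedor Twist & Teethe, Starts, Multicor',
]})
df['col'] = df['col'].apply(dedupe_tokens)
```

Series.apply keeps the row order, so nothing is reordered; only the tokens inside each string change.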

This differs slightly from the expected output, but I still don't understand why Sprouts is also removed in line 1 (does '3sprouts' match 'sprouts'?)
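One possible reading, and it is only my guess since the question does not state the rule, is that a word counts as a duplicate when it is a case-insensitive substring of an earlier kept word, so 'sprouts' matches '3sprouts'. A minimal sketch under that assumption (drop_substring_duplicates is a hypothetical name); it also moves the comma of a dropped word back to the previous kept word:

```python
def drop_substring_duplicates(sentence):
    kept = []     # kept words, without trailing commas
    commas = []   # commas[i] is True if kept[i] is followed by ','
    for token in sentence.split(' '):
        has_comma = token.endswith(',')
        word = token.rstrip(',')
        key = word.lower()
        # assumed rule: duplicate if contained in any earlier kept word
        if any(key in earlier.lower() for earlier in kept):
            if has_comma and commas:
                commas[-1] = True   # carry the comma back
            continue
        kept.append(word)
        commas.append(has_comma)
    return ' '.join(w + (',' if c else '') for w, c in zip(kept, commas))

print(drop_substring_duplicates(
    '3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador'))
# 3sprouts Cesto de Roupa Cisne, Organizador
```

Be careful with this rule on real data: plain substring matching would also drop a short word such as 'de' if it came after 'Mordedor'.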

Never mind... this is just meant to give you some ideas.

FYI

User
#2 · edited 2024-07-04 07:55:16

# sample dataframe I used for testing:
import pandas as pd

df = pd.DataFrame({'col': {
    0: '3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador',
    1: 'Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo',
    2: 'Bright-Starts Mordedor Twist & Teethe, Starts, Multicor',
}})

Try:
out = df['col'].str.title().str.split(', ', expand=True)  # for checking purposes
real = df['col'].str.split(', ', expand=True)  # for assigning
# check whether any value from col 1 of out occurs in col 0 of out, and pass that mask to real
real[1] = real[1].mask(out[0].str.contains(f'({"|".join(out[1])})'))
# same check for the values from col 2 of out
real[2] = real[2].mask(out[0].str.contains(f'({"|".join(out[2])})'))
# finally join the remaining values with ', '
df['col'] = real.apply(lambda x: ', '.join(x.dropna()), axis=1)
Output of df:

col
0  3sprouts Cesto de Roupa Cisne Sprouts, Organizador
1  Bright-Starts Mordedor Chocalho Rattle & Teethe, Rosa/Roxo
2  Bright-Starts Mordedor Twist & Teethe, Multicor
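One caveat: "|".join(out[1]) builds a single pattern from the whole column, so every row is also tested against values from other rows. On the sample data, row 1 is masked only because 'Starts' from row 2 is part of the pattern ('Bright Starts' with a space does not occur in 'Bright-Starts'). A row-wise sketch that avoids this, assuming the first comma-separated segment is the reference and a later segment is dropped when all of its words occur there; drop_repeated_segments is a hypothetical name:

```python
import re
import pandas as pd

def drop_repeated_segments(text):
    first, *rest = text.split(', ')
    # words of the first segment, split on ' ' and '-', lowercased
    reference_words = set(re.split(r'[ -]', first.lower()))
    kept = [seg for seg in rest
            if not all(w in reference_words for w in seg.lower().split())]
    return ', '.join([first] + kept)

df = pd.DataFrame({'col': [
    '3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador',
    'Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo',
    'Bright-Starts Mordedor Twist & Teethe, Starts, Multicor',
]})
df['col'] = df['col'].apply(drop_repeated_segments)
```

Because the words are compared directly rather than through a regular expression, values such as 'Rosa/Roxo' or '&' need no escaping.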
