Removing DataFrame rows that contain repeated characters

Posted 2024-09-23 22:28:29


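I have a large dataset from a CSV file that I need to clean based on patterns I have identified. I can't upload the file here, so I have hard-coded a small sample to outline what I'm looking for. The pattern is repeated characters within a value. Looking at the DataFrame below, there are repeated single characters such as ssss, fffff and aaaa, repeated two-character groups such as dgdg, bvbv and tutu, and repeated three-character groups such as yutyut and fdgfdg.

Is it possible to delete every row that contains any repeated single, double, or triple characters, so that the same approach can be applied to the large dataset? The DataFrame here only shows the patterns I identified above, but the large dataset may contain repeats of any letters, such as "uuuu", "zzzz", "eded", "rsrs", "xyzxyz", and so on.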

        Address1        Address2            Address3        Address4
0    High Street     Park Avenue     St. John’s Road       The Grove
1        wssssss    The Crescent             tyutyut       Mill Road
2      qfdgfdgdg        dddfffff  qdffgfdgfggfbvbvbv  sefsdfdyuytutu
3     Green Lane  Highfield Road    Springfield Road     School Lane
4       Kingsway    Stanley Road       George Street     Albert Road
5  Church Street      New Street           Queensway        Broadway
6       qaaaaass          mjkhjk           chfghfghh         fghfhfh

The code:

import pandas as pd
import numpy as np

data = {'Address1': ['High Street', 'wssssss', 'qfdgfdgdg', 'Green Lane', 'Kingsway', 'Church Street', 'qaaaaass'],
        'Address2': ['Park Avenue', 'The Crescent', 'dddfffff', 'Highfield Road', 'Stanley Road', 'New Street', 'mjkhjk'],
        'Address3': ['St. John’s Road', 'tyutyut', 'qdffgfdgfggfbvbvbv', 'Springfield Road', 'George Street', 'Queensway', 'chfghfghh'],
        'Address4': ['The Grove', 'Mill Road', 'sefsdfdyuytutu', 'School Lane', 'Albert Road', 'Broadway', 'fghfhfh']}


address_details = pd.DataFrame(data)

#Code to delete the data for the identified patterns




print(address_details)

My expected result is:

       Address1         Address2            Address3        Address4
0    High Street     Park Avenue     St. John’s Road       The Grove
1     Green Lane  Highfield Road    Springfield Road     School Lane
2       Kingsway    Stanley Road       George Street     Albert Road
3  Church Street      New Street           Queensway        Broadway

Please advise. Thanks.


1 answer
User
#1 · Posted 2024-09-23 22:28:29

Try agg with str.contains and loc:

print(address_details.loc[~address_details.agg(lambda x: x.str.contains(r"(.)\1+\b"), axis=1).any(axis=1)])

Output:

        Address1        Address2          Address3     Address4
0    High Street     Park Avenue   St. John’s Road    The Grove
3     Green Lane  Highfield Road  Springfield Road  School Lane
4       Kingsway    Stanley Road     George Street  Albert Road
5  Church Street      New Street         Queensway     Broadway
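
To see which cells this pattern actually flags, here is a small sketch using Python's re module directly. The helper name has_trailing_repeat and the sample list are made up for illustration. (.) captures one character, \1+ requires one or more immediate repeats of it, and \b requires the run to end at a word boundary, which is why the "ee" in "Street" is not flagged.

```python
import re

# Pattern from the answer: one character, one or more repeats of it,
# then a word boundary.
PATTERN = re.compile(r"(.)\1+\b")

def has_trailing_repeat(text):
    """Return True if a doubled (or longer) character run ends at a word boundary."""
    return PATTERN.search(text) is not None

# Values taken from the question's sample data.
samples = ["wssssss", "dddfffff", "High Street", "tyutyut", "Mill Road"]
results = {s: has_trailing_repeat(s) for s in samples}

for s, hit in results.items():
    print(f"{s!r}: {hit}")
```

Note that "tyutyut" is not flagged on its own. The junk rows in the sample are dropped because each of them also has a cell ending in a doubled character. Ordinary words that end in a double letter, such as "Mill", are flagged as well, so it is probably worth inspecting the dropped rows before applying this to real data.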

Or, if you care about the index:

print(address_details.loc[~address_details.agg(lambda x: x.str.contains(r"(.)\1+\b"), axis=1).any(axis=1)].reset_index(drop=True))

Output:

        Address1        Address2          Address3     Address4
0    High Street     Park Avenue   St. John’s Road    The Grove
1     Green Lane  Highfield Road  Springfield Road  School Lane
2       Kingsway    Stanley Road     George Street  Albert Road
3  Church Street      New Street         Queensway     Broadway
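
The one-liner can be hard to read, so here is a sketch of the same idea split into named steps. It assumes a small frame built from a few of the question's values. The names df, cell_hits and row_is_junk are illustrative, and apply is used column by column instead of agg with axis=1. The axis is passed to any as a keyword, which recent pandas releases require.

```python
import pandas as pd

# Small frame reusing a few values from the question.
df = pd.DataFrame({
    "Address1": ["High Street", "wssssss", "Green Lane"],
    "Address2": ["Park Avenue", "The Crescent", "Highfield Road"],
})

pattern = r"(.)\1+\b"

# Boolean frame: True where a cell matches the pattern.
# na=False treats missing values as "no match".
# pandas may warn that the pattern has a match group; the result is unaffected.
cell_hits = df.apply(lambda col: col.str.contains(pattern, regex=True, na=False))

# A row is junk if any of its cells matched.
row_is_junk = cell_hits.any(axis=1)

# Keep the clean rows and renumber the index from 0.
cleaned = df.loc[~row_is_junk].reset_index(drop=True)
print(cleaned)
```

Keeping cell_hits around also makes it easy to check which column caused a row to be dropped.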

Edit:

To restrict the match to lowercase letters, try:

print(address_details.loc[~address_details.agg(lambda x: x.str.contains(r"([a-z]+)\1{1,}\b"), axis=1).any(axis=1)].reset_index(drop=True))
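
Both patterns above only look at repeats that end at a word boundary. If repeated one, two and three letter units anywhere in a value should count, one possible variation is sketched below. The rule is an assumption of mine, not part of the answer: a single letter must repeat at least three times (so "ee" in "Street" passes), while a two or three letter unit must repeat at least twice. The name has_repeated_unit is hypothetical.

```python
import re

# Assumed rule (not from the answer): flag a value if it contains
#   - one letter repeated 3 or more times in a row, or
#   - a 2-3 letter unit repeated 2 or more times in a row.
REPEAT = re.compile(r"([a-z])\1{2,}|([a-z]{2,3})\2+", re.IGNORECASE)

def has_repeated_unit(text):
    """Return True if text contains a repeated 1-3 letter unit under the rule above."""
    return REPEAT.search(text) is not None

# Rows from the question's sample data.
rows = [
    ["High Street", "Park Avenue", "St. John’s Road", "The Grove"],
    ["wssssss", "The Crescent", "tyutyut", "Mill Road"],
    ["qfdgfdgdg", "dddfffff", "qdffgfdgfggfbvbvbv", "sefsdfdyuytutu"],
    ["Green Lane", "Highfield Road", "Springfield Road", "School Lane"],
    ["Kingsway", "Stanley Road", "George Street", "Albert Road"],
    ["Church Street", "New Street", "Queensway", "Broadway"],
    ["qaaaaass", "mjkhjk", "chfghfghh", "fghfhfh"],
]

# Keep a row only if none of its cells is flagged.
kept = [row for row in rows if not any(has_repeated_unit(cell) for cell in row)]

for row in kept:
    print(row)
```

On a DataFrame the same helper could be applied with address_details.apply(lambda col: col.map(has_repeated_unit)).any(axis=1) to build the row mask. This rule still has gaps in both directions: "mjkhjk" is not flagged on its own, and a real name such as "Mississippi" is flagged because of "ississ", so the thresholds may need tuning for the actual data.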
