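I have a large dataset from a CSV file that I need to clean using patterns I have identified. I can't upload the file here, so I have hard-coded a small sample to outline what I'm looking for. The pattern is repeated characters within a value. Looking at the data frame below, there are repeated single characters, such as ssss, fffff and aaaa; repeated two-character sequences, such as dgdg, bvbv and tutu; and repeated three-character sequences, such as yutyut and fdgfdg.
That said, is it possible to delete every row that contains any repeated single, double or triple characters, so that the same approach can be applied to the large dataset? The data frame here only shows the patterns I identified above, but the large dataset may contain repeated characters made of any letters, such as "uuuu", "zzzz", "eded", "rsrs", "xyzxyz" and so on.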
        Address1        Address2            Address3        Address4
0    High Street     Park Avenue     St. John’s Road       The Grove
1        wssssss    The Crescent             tyutyut       Mill Road
2      qfdgfdgdg        dddfffff  qdffgfdgfggfbvbvbv  sefsdfdyuytutu
3     Green Lane  Highfield Road    Springfield Road     School Lane
4       Kingsway    Stanley Road       George Street     Albert Road
5  Church Street      New Street           Queensway        Broadway
6       qaaaaass          mjkhjk           chfghfghh         fghfhfh
The code is as follows:
import pandas as pd
import numpy as np
data = {'Address1': ['High Street', 'wssssss', 'qfdgfdgdg', 'Green Lane', 'Kingsway', 'Church Street', 'qaaaaass'],
        'Address2': ['Park Avenue', 'The Crescent', 'dddfffff', 'Highfield Road', 'Stanley Road', 'New Street', 'mjkhjk'],
        'Address3': ['St. John’s Road', 'tyutyut', 'qdffgfdgfggfbvbvbv', 'Springfield Road', 'George Street', 'Queensway', 'chfghfghh'],
        'Address4': ['The Grove', 'Mill Road', 'sefsdfdyuytutu', 'School Lane', 'Albert Road', 'Broadway', 'fghfhfh']}
address_details = pd.DataFrame(data)
#Code to delete the data for the identified patterns
print(address_details)
The result I expect is:
        Address1        Address2          Address3     Address4
0    High Street     Park Avenue   St. John’s Road    The Grove
1     Green Lane  Highfield Road  Springfield Road  School Lane
2       Kingsway    Stanley Road     George Street  Albert Road
3  Church Street      New Street         Queensway     Broadway
Please advise. Thank you.
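Use str.contains and loc, and try combining them with agg so that the check runs over every column. Rows in which any value contains a repeated-character pattern are then dropped.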
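A minimal sketch of the str.contains / loc / agg approach follows. The regular expression is an assumption, not something given in the post: it flags a single character repeated three or more times, or a two- or three-character sequence repeated at least twice. The threshold of three for single characters is chosen so that ordinary words such as "Street" or "Mill" are not flagged. The names `pattern`, `has_repeat` and `cleaned` are illustrative.

```python
import pandas as pd

data = {'Address1': ['High Street', 'wssssss', 'qfdgfdgdg', 'Green Lane', 'Kingsway', 'Church Street', 'qaaaaass'],
        'Address2': ['Park Avenue', 'The Crescent', 'dddfffff', 'Highfield Road', 'Stanley Road', 'New Street', 'mjkhjk'],
        'Address3': ['St. John’s Road', 'tyutyut', 'qdffgfdgfggfbvbvbv', 'Springfield Road', 'George Street', 'Queensway', 'chfghfghh'],
        'Address4': ['The Grove', 'Mill Road', 'sefsdfdyuytutu', 'School Lane', 'Albert Road', 'Broadway', 'fghfhfh']}
address_details = pd.DataFrame(data)

# Assumed pattern: one character repeated 3+ times ("sss"),
# or a 2-character sequence repeated ("dgdg"),
# or a 3-character sequence repeated ("fdgfdg").
pattern = r"(.)\1{2,}|(..)\2|(...)\3"

# Apply str.contains to each column; the result is a boolean frame.
# pandas may emit a UserWarning about match groups here; the groups are
# required for the backreferences, so the warning can be ignored.
has_repeat = address_details.agg(lambda col: col.str.contains(pattern))

# Keep only the rows where no column was flagged.
cleaned = address_details.loc[~has_repeat.any(axis=1)]
print(cleaned)
```

With the sample data, this keeps the rows with the original index labels 0, 3, 4 and 5. On a real dataset the thresholds in the pattern will likely need tuning, since genuine addresses can contain short repeats.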
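Alternatively, if you care about the index, reset it after filtering so that the remaining rows are renumbered from 0, as in the expected result.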
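As a sketch of the index-preserving variant, the same filter can be followed by reset_index(drop=True). The pattern is the same assumed one (a single character repeated three or more times, or a two- or three-character sequence repeated), and `pattern`, `mask` and `cleaned` are illustrative names rather than anything from the post.

```python
import pandas as pd

data = {'Address1': ['High Street', 'wssssss', 'qfdgfdgdg', 'Green Lane', 'Kingsway', 'Church Street', 'qaaaaass'],
        'Address2': ['Park Avenue', 'The Crescent', 'dddfffff', 'Highfield Road', 'Stanley Road', 'New Street', 'mjkhjk'],
        'Address3': ['St. John’s Road', 'tyutyut', 'qdffgfdgfggfbvbvbv', 'Springfield Road', 'George Street', 'Queensway', 'chfghfghh'],
        'Address4': ['The Grove', 'Mill Road', 'sefsdfdyuytutu', 'School Lane', 'Albert Road', 'Broadway', 'fghfhfh']}
address_details = pd.DataFrame(data)

# Assumed pattern, as before: "sss", "dgdg" or "fdgfdg" style repeats.
pattern = r"(.)\1{2,}|(..)\2|(...)\3"

# True for rows in which at least one column contains a repeat.
mask = address_details.agg(lambda col: col.str.contains(pattern)).any(axis=1)

# Drop the flagged rows, then discard the old labels and renumber from 0.
cleaned = address_details.loc[~mask].reset_index(drop=True)
print(cleaned)
```

With the sample data, the four remaining rows are labelled 0 to 3, which matches the layout of the expected result in the question.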
Edit:
To match lowercase letters only, try:
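One way to restrict the check to lowercase letters is to replace the "any character" dot with the class [a-z]. The sketch below is an assumption about what such a pattern could look like, and the sample values in `values` are hypothetical, chosen to show what is and is not flagged.

```python
import pandas as pd

# Hypothetical sample values, not taken from the dataset in the question.
values = pd.Series(['High Street', 'wssssss', 'Flat 1111', 'rsrs', 'xyzxyz', 'AAAA'])

# Assumed lowercase-only pattern: a lowercase letter repeated 3+ times,
# or a sequence of 2 or 3 lowercase letters repeated at least twice.
lower_pattern = r"([a-z])\1{2,}|([a-z]{2})\2|([a-z]{3})\3"

# Digits and uppercase letters no longer count as repeats.
flags = values.str.contains(lower_pattern)
print(flags.tolist())  # [False, True, False, True, True, False]
```

Here "Flat 1111" and "AAAA" are left alone because their repeats are not lowercase letters. The same `lower_pattern` can be dropped into the agg / loc filter in place of the more general pattern.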