A simple way to remove special characters and alphanumerics from a DataFrame

nonhashtag
['want', 'better', 'than', 'Dhabi,', 'United', 'Arab', 'Emirates']
['Just', 'posted', 'photo', 'Rasim', 'Villa']
['Dhabi', 'International', 'Airport', '(AUH)', '\xd9\x85\xd8\xb7\xd8\xa7\xd8\xb1', '\xd8\xa3\xd8\xa8\xd9\x88', '\xd8\xb8\xd8\xa8\xd9\x8a', '\xd8\xa7\xd9\x84\xd8\xaf\xd9\x88\xd9\x84\xd9\x8a', 'Dhabi']
['just', 'shrug', 'off!', 'Dubai', 'Mall', 'Burj', 'Khalifa']
['out!', 'Cowboy', 'steppin', 'Notorious', 'going', 'sleep!', 'Make', 'happen']
['Buona', 'notte', '\xd1\x81\xd0\xbf\xd0\xbe\xd0\xba\xd0\xbe\xd0\xb9\xd0\xbd\xd0\xbe\xd0\xb9', '\xd0\xbd\xd0\xbe\xd1\x87\xd0\xb8', '\xd9\x84\xd9\x8a\xd9\x84\xd8\xa9', '\xd8\xb3\xd8\xb9\xd9\x8a\xd8\xaf\xd8\xa9!', '\xd8\xa3\xd8\xa8\xd9\x88', '\xd8\xb8\xd8\xa8\xd9\x8a', 'Viceroy', 'Hotel,', 'Yas\xe2\x80\xa6']

nonhashtag
['want', 'better', 'than', 'Dhabi,', 'United', 'Arab', 'Emirates']
['Just', 'posted', 'photo', 'Rasim', 'Villa']
['Dhabi', 'International', 'Airport', '(AUH)', 'Dhabi']
['just', 'shrug', 'off!', 'Dubai', 'Mall', 'Burj', 'Khalifa']
['out!', 'Cowboy', 'steppin', 'Notorious', 'going', 'sleep!', 'Make', 'happen']
['Buona', 'notte', 'Viceroy', 'Hotel,']

2 answers

User

#1 · edited 2024-05-18 15:20:07

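I import a lot of files, and the column names are often dirty: they pick up unwanted special characters, and I cannot know in advance which characters will show up. I only want letters, digits, and underscores in the column names, with no spaces.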

df.columns = df.columns.str.strip()  # trim leading/trailing whitespace
df.columns = df.columns.str.replace(' ', '_')  # spaces become underscores
df.columns = df.columns.str.replace(r"[^a-zA-Z\d_]+", "", regex=True)  # drop everything except letters, digits, underscores
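The same three steps can be sketched with the standard-library re module, which makes the behaviour easy to check on a plain list of names. This is only an illustration: clean_name and the sample names are hypothetical and not part of the answer above.

```python
import re

def clean_name(name: str) -> str:
    """Normalize one column name to letters, digits, and underscores."""
    name = name.strip()                          # drop leading/trailing whitespace
    name = re.sub(r"\s+", "_", name)             # each run of whitespace becomes one underscore
    return re.sub(r"[^A-Za-z0-9_]+", "", name)   # drop every other character

# Hypothetical dirty column names, for illustration only.
raw_names = [" first name ", "price ($)", "id#"]
cleaned_names = [clean_name(n) for n in raw_names]
print(cleaned_names)  # ['first_name', 'price_', 'id']

# With a DataFrame, the same helper could be applied as:
#   df.columns = [clean_name(str(c)) for c in df.columns]
```

Note that "price ($)" ends up as "price_": the space is turned into an underscore before the parentheses and dollar sign are dropped.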

User

#2 · edited 2024-05-18 15:20:07

Is this what you want?

In [71]: df.nonhashtag.apply(' '.join).str.replace(r'[^A-Za-z\s]+', '', regex=True) \
           .str.split(expand=False)
Out[71]:
0    [want, better, than, Dhabi, United, Arab, Emir...
1                  [Just, posted, photo, Rasim, Villa]
2          [Dhabi, International, Airport, AUH, Dhabi]
3       [just, shrug, off, Dubai, Mall, Burj, Khalifa]
4    [out, Cowboy, steppin, Notorious, going, sleep...
5                  [Buona, notte, Viceroy, Hotel, Yas]
Name: nonhashtag, dtype: object

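'[^A-Za-z\s]+' is a regular expression that matches one or more characters that are NOT any of the following:

ASCII uppercase letters A to Z
ASCII lowercase letters a to z
whitespace (\s covers spaces, tabs, and newlines)

So .str.replace(r'[^A-Za-z\s]+', '', regex=True) deletes every character that is not an English letter or whitespace. The regex=True argument is needed in pandas 2.0 and later, where str.replace treats the pattern as a literal string by default.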
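To see what the pattern does to a single row, here is a minimal sketch using only the standard-library re module. The helper name keep_english_tokens is made up for this example, and the row is modelled on the third row of the question with one Arabic word written as Unicode escapes.

```python
import re

# Anything that is not an ASCII letter or whitespace.
NON_LETTERS = re.compile(r"[^A-Za-z\s]+")

def keep_english_tokens(tokens):
    """Join tokens with spaces, delete non-letter characters, split back into words."""
    text = " ".join(tokens)
    return NON_LETTERS.sub("", text).split()

# Sample row: English words, a parenthesised code, and one Arabic word.
row = ["Dhabi", "International", "Airport", "(AUH)", "\u0645\u0637\u0627\u0631", "Dhabi"]
result = keep_english_tokens(row)
print(result)  # ['Dhabi', 'International', 'Airport', 'AUH', 'Dhabi']
```

Because characters are removed one by one rather than whole tokens, a token that mixes letters with other characters keeps its letters, which is why 'Yas…' comes out as 'Yas' in the output above instead of disappearing.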
