How to remove meaningless words from a column of text observations

Posted 2024-10-04 09:29:36


I have about 2 million reviews, and most of the text in them is pure junk, like this:

[('tcsworklife', 1),
 ('freshs', 1),
 ('elserun', 1),
 ('anathor', 1),
 ('ontract', 1),
 ('locationadibatla', 1),
 ('hindiname', 1),
 ('culturenegotiation', 1),
 ('ਵਭਗ', 1),
 ('ਵਗਰ', 1),
 ('ਭਰਭ', 1),
 ('ਬਹਤ', 1),
 ('ਹ', 1),
 ('ਵਧਆ', 1),
 ('happybcz', 1),
 ('qriruduif', 1),
 ('carpanter', 1),
 ('ghule', 1),
 ('intrapolitics', 1),
 ('collasan', 1),
 ('tcsthe', 1),
 ('oftion', 1),
 ('shiftit', 1),
 ('tellycalling', 1),
 ('majour', 1),
 ('securitied', 1),
 ('balaraju', 1),
 ('minupuri', 1),
 ('sdcvbhgvfcrdxs', 1),
 ('vgfcdxsza', 1),
 ('dscdc', 1),
 ('qdwd', 1),
 ('njn', 1),
 ('njnjn', 1),
 ('njnjnjn', 1),
 ('gbjk', 1),
 ('skhgksd', 1),
 ('kshdsgsd', 1),
 ('sbkhgsdjsg', 1),
 ('shkddshkjsd', 1),
 ('siddharthai', 1),
 ('nbwjh', 1),
 ('satilment', 1),
 ('mallinath', 1),
 ('tippanna', 1),
 ('djciajd', 1),
 ('fnjec', 1),
 ('jxrjcidcjtvm', 1),
 ('aporchunet', 1),
 ('thoraibakkamchennai', 1),
 ('chooseeverything', 1),
 ('thatâs', 1),
 ('understandbest', 1),
 ('intercomany', 1),
 ('experiancelow', 1),
 ('anythingmachine', 1),
 ('lifetraveling', 1),
 ('timenight', 1),
 ('hollidayyou', 1),
 ('trsnsport', 1),
 ('workplacegreat', 1),
 ('webdriver', 1),
 ('freinely', 1)]

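How can I get rid of these meaningless words and keep the meaningful ones? Note: some of the words are meaningful but are missing a space or are simply misspelled, unlike pure junk such as weqwioeuwiouewq2rtg. I am looking for the best possible way to clean this up.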


1 answer
User
#1 · Posted 2024-10-04 09:29:36

You can check each of your words against a dictionary for the relevant language, for example:

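import nltk   # if not installed yet, run: pip install nltk
from nltk.corpus import wordnet

nltk.download('wordnet')   # one-time download of the WordNet corpus

# synsets() returns an empty list when WordNet has no entry for the word
if wordnet.synsets("Human"):
    print("this word belongs to the English Dictionary")
else:
    print("it does not belong to the English Dictionary")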

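This tells you whether a word is in the English dictionary. Note that WordNet only covers nouns, verbs, adjectives and adverbs, so function words such as "the" will be reported as missing.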
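To apply the same check to a whole list of (word, count) pairs like the one in the question, something along these lines might work. This is only a sketch: `partition_by_dictionary` and `DEMO_VOCAB` are hypothetical names introduced here, and the tiny vocabulary is a stand-in so the example runs without downloading anything. With NLTK installed, the predicate could instead be `lambda w: bool(wordnet.synsets(w))`.

```python
def partition_by_dictionary(counts, is_known):
    """Split (word, count) pairs into dictionary words and leftovers.

    `is_known` is any callable that takes a lowercased word and
    returns True when the word should be kept.
    """
    kept, dropped = [], []
    for word, count in counts:
        target = kept if is_known(word.lower()) else dropped
        target.append((word, count))
    return kept, dropped


# Illustration-only vocabulary (an assumption, not a real dictionary).
DEMO_VOCAB = {"webdriver", "human", "culture"}

counts = [("webdriver", 1), ("qriruduif", 1), ("njnjn", 1), ("Human", 3)]
kept, dropped = partition_by_dictionary(counts, lambda w: w in DEMO_VOCAB)

print(kept)     # pairs whose word passed the dictionary check
print(dropped)  # pairs that would be discarded
```

Keeping the dropped pairs in a separate list rather than discarding them makes it easy to inspect what the filter throws away before running it over all 2 million reviews.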

If you need them, dictionaries for other languages are available as well.
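A plain dictionary lookup would also throw away the tokens the question wants to keep, namely real words run together (such as workplacegreat) and simple misspellings (such as trsnsport). One possible way to rescue those, sketched here with the standard library only, is to first try splitting a token into dictionary words and then fall back to fuzzy matching. `segment`, `rescue`, the `cutoff` value and the small `vocab` set are all assumptions made for illustration; in practice the vocabulary would come from a real word list.

```python
from difflib import get_close_matches


def segment(token, vocab):
    """Split `token` into the fewest vocabulary words, or return None."""
    # best[i] holds a segmentation of token[:i], or None if none exists.
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(token)]


def rescue(token, vocab, cutoff=0.8):
    """Return the word(s) a token most likely stands for, or [] for junk."""
    if token in vocab:
        return [token]
    parts = segment(token, vocab)
    if parts:
        return parts
    # Fuzzy fallback for misspellings; sorted() keeps the result deterministic.
    return get_close_matches(token, sorted(vocab), n=1, cutoff=cutoff)


# Illustration-only vocabulary (an assumption, not a real dictionary).
vocab = {"work", "place", "workplace", "great", "transport", "time", "night"}

for token in ["workplacegreat", "timenight", "trsnsport", "qriruduif"]:
    print(token, "->", rescue(token, vocab))
```

With a large real vocabulary, very short entries (single letters, two-letter words) can make almost any string segmentable, so it may be worth requiring a minimum piece length and tuning `cutoff` on a labelled sample before trusting the output.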
