Tokenizing a list of lists


I am trying to tokenize a CSV file of scraped tweets. I loaded the CSV file into a list:

import csv

with open('recent_tweet_purex.csv', 'r') as purex:
    reader_purex = csv.reader(purex)
    purex_list = list(reader_purex)

The tweets are now in a list that looks like this:

['I miss having someone to talk to all night..']
['Pergunte-me qualquer coisa']
['RT @Caracolinhos13: Tenho a tl cheia dessa merda de quem vos visitou nas \\xc3\\xbaltimas horas']
...

I have imported nltk along with the following packages:

 import nltk
 from nltk.tokenize import word_tokenize
 import string
 from nltk.corpus import stopwords
 from nltk.stem import WordNetLemmatizer
 from nltk.tokenize import sent_tokenize
 nltk.download('punkt')
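
A side note: nltk.download('punkt') only fetches the tokenizer models. If the stopwords and WordNetLemmatizer imports above are actually used later, their corpora have to be downloaded as well; a minimal addition (these are the standard NLTK resource names):

 nltk.download('stopwords')  # required by nltk.corpus.stopwords
 nltk.download('wordnet')    # required by WordNetLemmatizer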

I tried using

 purex_words = word_tokenize(purex_list)

to tokenize it, but I keep getting errors.

Any help?


Tags: file, csv, from, tokenize, import, list, with, tweets
1 Answer

You are passing a list to the word_tokenize function, which expects a string or bytes-like object. If you feed it a string, it will work. A simple example:

# the \x.. sequences are literal byte-escape text left over from scraping,
# so they are escaped here (\\x..) and survive tokenization as plain text
purex_words = [['I miss having someone to talk to all night..'],
               ['Pergunte-me qualquer coisa'],
               ['RT @Caracolinhos13: Tenho a tl cheia dessa merda de quem vos visitou nas \\xc3\\xbaltimas horas'],
               ['RT @B24pt: #CarlosHadADream'],
               ['Tudo tem um fim'],
               ['RT @thechgama: stalkear as curtidas \\xc3\\xa9 um caminho sem volta'],
               ['Como consegues fumar 3 purexs seguidas? \\xe2\\x80\\x94 Eram 2 purex e mix...']]

print(word_tokenize(purex_words[0][0]))
# ['I', 'miss', 'having', 'someone', 'to', 'talk', 'to', 'all', 'night..']

You can flatten the list first and then loop over the sentences. Note that I added an outer [] around your list.

# flatten the list of one-element lists, then tokenize each tweet on its own
flat_list = [item for sublist in purex_words for item in sublist]
for sentence in flat_list:
    print(word_tokenize(sentence))

The result looks like this:

['I', 'miss', 'having', 'someone', 'to', 'talk', 'to', 'all', 'night..']
['Pergunte-me', 'qualquer', 'coisa']
['RT', '@', 'Caracolinhos13', ':', 'Tenho', 'a', 'tl', 'cheia', 'dessa', 'merda', 'de', 'quem', 'vos', 'visitou', 'nas', '\\xc3\\xbaltimas', 'horas']
['RT', '@', 'B24pt', ':', '#', 'CarlosHadADream']
['Tudo', 'tem', 'um', 'fim']
['RT', '@', 'thechgama', ':', 'stalkear', 'as', 'curtidas', '\\xc3\\xa9', 'um', 'caminho', 'sem', 'volta']
['Como', 'consegues', 'fumar', '3', 'purexs', 'seguidas', '?', '\\xe2\\x80\\x94', 'Eram', '2', 'purex', 'e', 'mix', '...']
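
Going one step further: the question also imports stopwords and WordNetLemmatizer, so a minimal sketch of a full cleaning pipeline might look like the following. The English stopword list, the lowercasing, and the punctuation filter are my assumptions, since the tweets are mixed-language:

import csv
import string

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# load the scraped tweets: csv.reader yields one list per row
with open('recent_tweet_purex.csv', 'r') as purex:
    purex_list = list(csv.reader(purex))

# flatten the rows so each element is a single tweet string
flat_list = [item for sublist in purex_list for item in sublist]

stop_words = set(stopwords.words('english'))  # assumption: English stopwords
lemmatizer = WordNetLemmatizer()

tokenized_tweets = []
for sentence in flat_list:
    tokens = word_tokenize(sentence)
    # drop stopwords and bare punctuation, lemmatize what is left
    cleaned = [lemmatizer.lemmatize(tok.lower())
               for tok in tokens
               if tok.lower() not in stop_words and tok not in string.punctuation]
    tokenized_tweets.append(cleaned)

print(tokenized_tweets[0])

Each tweet then comes back as a list of lowercased, lemmatized content words.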
