How to tokenize all strings in a numpy array of strings

Posted 2024-05-17 04:36:26


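I am trying to find the positions of words from a given list of strings within a list of sentences. I am using numpy, sklearn, and nltk to do this. In my real code I have 10,000 sentences and an equally long word list, so I am trying to avoid loops and list/set operations, because they are not fast enough.

So far I have written the following code: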

from nltk.tokenize import TweetTokenizer
import nltk
import numpy as np
from sklearn import feature_extraction

sentences = ["Great place and so amazing", "I like doughnuts", "Mary had a little lamb"]

posWords = ["great","like","amazing","little lamb"]

# Here we see which words from the wordlist appear in the sentences.
cv = feature_extraction.text.CountVectorizer(vocabulary=posWords)
taggedSentences = cv.fit_transform(sentences).toarray() # This vector is of size (noOfSentences x noOfWordsInPoswords)

taggedSentencesCutDown = taggedSentences > 0
taggedSentencesCutDown = np.column_stack(np.where(taggedSentencesCutDown)) # This is a list of tuples (sentence, wordIndex)


sentencesIdentified = np.unique(taggedSentencesCutDown[:,0])


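tknzr = TweetTokenizer()
posWordsArr = np.array(posWords)  # an array is needed for integer-array indexing

for sentenceIdx in sentencesIdentified:
    # Tokenise the sentence into a NumPy array of tokens
    tokenisedSent = np.array(tknzr.tokenize(sentences[sentenceIdx]))
    # Rows of (sentence, wordIndex) that belong to this sentence
    wordsFoundSent = taggedSentencesCutDown[taggedSentencesCutDown[:, 0] == sentenceIdx]

    # Positions of the tokens that match one of the words found (case-insensitive)
    matches = np.where(np.isin(np.char.lower(tokenisedSent), posWordsArr[wordsFoundSent[:, 1]]))
    sent = tokenisedSent[matches]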
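
One possible way to drop the per-sentence NumPy work in the loop above is to tokenise every sentence once, flatten the tokens into a single array, and run one vectorised membership test. The sketch below is only an illustration under assumptions: `tokenize` is a hypothetical regex stand-in (any callable that maps a string to a list of tokens, such as `TweetTokenizer().tokenize`, could be swapped in), and the names `token_lists`, `sent_idx`, `positions` and `hits` are made up for this example. Tokenisation itself still runs in Python; only the matching is vectorised.

```python
import re
import numpy as np

sentences = ["Great place and so amazing", "I like doughnuts", "Mary had a little lamb"]
posWords = ["great", "like", "amazing", "little lamb"]

def tokenize(text):
    # Stand-in tokenizer (assumption): lowercased runs of word characters.
    return re.findall(r"\w+", text.lower())

# Tokenise each sentence once and record how many tokens it has.
token_lists = [tokenize(s) for s in sentences]
lengths = np.array([len(toks) for toks in token_lists])

# Flatten all tokens into one array and remember, for every token,
# which sentence it came from and its position inside that sentence.
tokens = np.array([tok for toks in token_lists for tok in toks])
sent_idx = np.repeat(np.arange(len(sentences)), lengths)
starts = np.cumsum(lengths) - lengths
positions = np.arange(len(tokens)) - np.repeat(starts, lengths)

# One vectorised membership test over all tokens at once.
mask = np.isin(tokens, posWords)
hits = np.column_stack((sent_idx[mask], positions[mask]))  # rows of (sentence, token position)
```

Because matching here is done token by token, a two-word entry such as "little lamb" cannot match in this sketch; only single-word entries are found.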

Ideally, what I want is the following array:

^{pr2}$

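I need two things:

1. Tokenise all the sentences in the taggedSentencesCutDown array with the NLTK tokeniser (preferably without a for loop like the one I currently use, because my real arrays have 10,000 sentences and words).
2. Can CountVectorizer handle strings such as "little lamb"? At the moment it does not find them. Is there a way to do this as efficiently and elegantly as CountVectorizer does?

Thanks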
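
On the second point, a minimal sketch, assuming the phrases in the word list are at most two words long: `CountVectorizer` only generates single-word features by default, so a two-word vocabulary entry is never produced. Passing `ngram_range=(1, 2)` makes it generate bigrams as well. The variable names `counts` and `pairs` are chosen for this example only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["Great place and so amazing", "I like doughnuts", "Mary had a little lamb"]
posWords = ["great", "like", "amazing", "little lamb"]

# ngram_range=(1, 2) generates two-word phrases in addition to single words,
# so the vocabulary entry "little lamb" can be matched.
cv = CountVectorizer(vocabulary=posWords, ngram_range=(1, 2))
counts = cv.transform(sentences)  # sparse matrix, shape (n_sentences, n_words)

# Read the non-zero entries from the sparse matrix directly instead of
# building a dense array with .toarray().
rows, cols = counts.nonzero()
pairs = np.column_stack((rows, cols))  # rows of (sentence, wordIndex)
```

Staying on the sparse matrix may matter at the stated scale, since a dense 10,000 x 10,000 array holds 100 million entries. Longer phrases would need a larger upper bound in `ngram_range`.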
