How to extract keywords from text in Python?

Posted 2024-09-30 05:14:59


I want to extract some keywords from a text and print them, but how do I do that?

Here is the sample text I want to extract from:

text = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."

And here are the keywords I want to extract:

keywords = ('bas agrisi', 'kurtulmak')

I want to detect these keywords and print them:

bas agrisi
kurtulmak

How can I achieve this in Python?


3 answers

Do you want Python to understand what a keyword is, or do you just want to treat certain words in a given text as tokens? For the first, you would need a machine-learning model or a neural network that can understand and extract keywords from text. For the second, you can tokenize the words in a few very simple steps.

For example:

 import nltk  # need to download the necessary dictionaries first
 nltk.download('punkt')
 nltk.download('stopwords')
 nltk.download('wordnet')

 # An example text
 text = ("I wonder if I have been changed in the night. Let me think. Was "
         "I the same when I got up this morning? I almost can remember feeling a "
         "little different. But if I am not the same, the next question is 'Who "
         "in the world am I?' Ah, that is the great puzzle!")

 tokens = nltk.word_tokenize(text)
 tokens  # punctuation is not removed; each mark becomes its own token
 # Output will look like the following:
 ['I',
  'wonder',
  'if',
  'I',
  'have',
  'been',
  'changed',
  'in',
  'the',
  'night',
  '.',
  'Let',
  'me',
  'think',
  '.',
  'Was',
  'I',
  'the',
  'same',
  'when',
  'I', ....]

 # First, clean the text by lowercasing and keeping only alphabetic tokens
 tokens2 = [word.lower() for word in tokens if word.isalpha()]

 # Second, remove stop words. Stop-word lists are available for various
 # languages; the English list is the most complete.
 from nltk.corpus import stopwords
 stop_words = stopwords.words("english")

 # Filter the stop words out of the previously created tokens2
 # (using a list comprehension)
 tokens3 = [word for word in tokens2 if word not in stop_words]

 # Tokenization is a prerequisite for lemmatization, which collapses
 # inflected forms of a word into its base form (the lemma).
 from nltk.stem import WordNetLemmatizer
 lemmatizer = WordNetLemmatizer()
 lemmatizer.lemmatize('stripes', pos='v')  # 'n' is for noun, 'v' is for verb
 print(lemmatizer.lemmatize('stripes', 'n'))
 # Output is 'stripe', because the lemma of the noun 'stripes' is 'stripe'

 # The following is an example of stemming
 from nltk.stem import PorterStemmer
 stemmer = PorterStemmer()
 [stemmer.stem(word) for word in tokens3]
 # Output will be:
 ['wonder',
  'chang',
  'night',
  'let',
  'think',
  'got',
  'morn',
  'almost',
  'rememb',
  'feel',
  'littl',
  'differ',
  'next',
  'question',
  'world',
  'ah',
  'great',
  'puzzl']
 # Stop words such as 'I', 'have', 'been' were eliminated, and the
 # stems of the remaining words were retrieved.

 # One last thing, to see how the lemmatizer works:
 tokens4 = [lemmatizer.lemmatize(word, pos='n') for word in tokens3]
 tokens4 = [lemmatizer.lemmatize(word, pos='v') for word in tokens4]
 print(tokens4)
 # Output will be:
 ['wonder', 'change', 'night', 'let', 'think', 'get', 'morning',
  'almost', 'remember', 'feel', 'little', 'different', 'next',
  'question', 'world', 'ah', 'great', 'puzzle']

I hope I explained it clearly. Also, if you want to go further and build a neural network or a similar model, you can use one-hot encoding.
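The one-hot encoding mentioned above can be sketched without any ML library. Here is a minimal example; the token list and the `one_hot` helper are made up for illustration, not part of NLTK:

```python
# One-hot encode tokens against a small vocabulary.
tokens = ['wonder', 'night', 'puzzle', 'night']
vocab = sorted(set(tokens))                    # ['night', 'puzzle', 'wonder']
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a 1 in the position of `word`, 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot('night'))   # [1, 0, 0]
print(one_hot('wonder'))  # [0, 0, 1]
```

Each token maps to a vector of vocabulary length, which is the usual input form for a simple neural network over text.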

Try this:

string = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."

keywords = ('bas agrisi', 'kurtulmak')

print(*[key for key in keywords if key in string], sep='\n')

Output:

bas agrisi
kurtulmak
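Note that `in` does plain substring matching, so a keyword would also match inside a longer word. If that matters, a word-boundary check with `re` is safer; this is a small sketch, and the sample word `kurtulmaktan` is my own illustration:

```python
import re

word = 'kurtulmaktan'  # contains 'kurtulmak' as a prefix, not as a whole word
substring_hit = 'kurtulmak' in word                     # plain substring test
boundary_hit = bool(re.search(r'\bkurtulmak\b', word))  # whole-word test
print(substring_hit, boundary_hit)  # True False
```

The `\b` anchors make the pattern match only where the keyword stands as a complete word.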

Use the re library to find all occurrences of the keywords:

import re

text = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."
keywords = ('bas agrisi', 'kurtulmak')

result = re.findall('|'.join(keywords), text)
for key in result:
    print(key)

Output:

bas agrisi
bas agrisi
kurtulmak
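One caveat with `'|'.join(keywords)`: if a keyword ever contains regex metacharacters (such as `.` or `+`), the joined pattern changes meaning. Passing each keyword through `re.escape` avoids that; this is a defensive variant of the answer above:

```python
import re

text = ("Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar "
        "gunlerinde baslayan bu bas agrisi insanin canini sikmakta. "
        "Bu durumdan kurtulmak icin neler yapmali.")
keywords = ('bas agrisi', 'kurtulmak')

# Escape each keyword so it is matched literally, then join with '|'
pattern = '|'.join(re.escape(k) for k in keywords)
result = re.findall(pattern, text)
print(result)  # ['bas agrisi', 'bas agrisi', 'kurtulmak']
```

For these particular keywords the output is the same as before, since neither contains a metacharacter.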
