How to extract keywords from text in Python?

Posted 2024-09-30 05:14:59


I want to extract some keywords from a text and print them, but how do I do that?

Here is the sample text I want to extract from:

text = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."

And here are the keywords I want to extract:

keywords = ('bas agrisi', 'kurtulmak')

I want to detect these keywords and print them:

bas agrisi
kurtulmak

How can I achieve this in Python?


3 answers

Do you want Python to understand what a keyword is, or do you just want to treat certain words in a given text as tokens? For the first, you would need a machine-learning model or a neural network that can understand and extract keywords from text. For the second, you can tokenize the words in a few very simple steps.

For example:

 import nltk  # need to download the necessary dictionaries first
 nltk.download('punkt')
 nltk.download('stopwords')
 nltk.download('wordnet')

 # An example text
 text = ("I wonder if I have been changed in the night. Let me think. Was "
         "I the same when I got up this morning? I almost can remember feeling a "
         "little different. But if I am not the same, the next question is 'Who "
         "in the world am I?' Ah, that is the great puzzle!")

 tokens = nltk.word_tokenize(text)
 tokens  # punctuation is not removed; each mark becomes its own token
 # Output will look like the following:
 ['I',
  'wonder',
  'if',
  'I',
  'have',
  'been',
  'changed',
  'in',
  'the',
  'night',
  '.',
  'Let',
  'me',
  'think',
  '.',
  'Was',
  'I',
  'the',
  'same',
  'when',
  'I', ....]

 # First, clean the text by lowercasing and keeping only alphabetic tokens
 tokens2 = [word.lower() for word in tokens if word.isalpha()]

 # Second, remove stop words. Stop-word lists are available for various
 # languages; the English list is the most complete.
 from nltk.corpus import stopwords
 stop_words = stopwords.words("english")

 # Filter the stop words out of the previously created tokens2
 # (using a list comprehension)
 tokens3 = [word for word in tokens2 if word not in stop_words]

 # Tokenization is a prerequisite for lemmatization, which collapses
 # inflected forms of a word into its base form (the lemma).
 from nltk.stem import WordNetLemmatizer
 lemmatizer = WordNetLemmatizer()
 lemmatizer.lemmatize('stripes', pos='v')  # 'n' is for noun, 'v' is for verb
 print(lemmatizer.lemmatize('stripes', 'n'))
 # Output is 'stripe', because the lemma of the noun 'stripes' is 'stripe'

 # The following is an example of stemming
 from nltk.stem import PorterStemmer
 stemmer = PorterStemmer()
 [stemmer.stem(word) for word in tokens3]
 # Output will be:
 ['wonder',
  'chang',
  'night',
  'let',
  'think',
  'got',
  'morn',
  'almost',
  'rememb',
  'feel',
  'littl',
  'differ',
  'next',
  'question',
  'world',
  'ah',
  'great',
  'puzzl']
 # Stop words such as 'I', 'have', 'been' were eliminated, and the
 # stems of the remaining words were retrieved.

 # One last thing, to see how the lemmatizer works:
 tokens4 = [lemmatizer.lemmatize(word, pos='n') for word in tokens3]
 tokens4 = [lemmatizer.lemmatize(word, pos='v') for word in tokens4]
 print(tokens4)
 # Output will be:
 ['wonder', 'change', 'night', 'let', 'think', 'get', 'morning',
  'almost', 'remember', 'feel', 'little', 'different', 'next',
  'question', 'world', 'ah', 'great', 'puzzle']

I hope I explained it clearly. Also, if you want to go further and build a neural network or a similar model, you can use one-hot encoding.
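The one-hot encoding mentioned above can be sketched without any ML library. Here is a minimal example; the token list and the `one_hot` helper are made up for illustration, not part of NLTK:

```python
# One-hot encode tokens against a small vocabulary.
tokens = ['wonder', 'night', 'puzzle', 'night']
vocab = sorted(set(tokens))                    # ['night', 'puzzle', 'wonder']
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a 1 in the position of `word`, 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot('night'))   # [1, 0, 0]
print(one_hot('wonder'))  # [0, 0, 1]
```

Each token maps to a vector of vocabulary length, which is the usual input form for a simple neural network over text.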

Try this:

string = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."

keywords = ('bas agrisi', 'kurtulmak')

print(*[key for key in keywords if key in string], sep='\n')

Output:

bas agrisi
kurtulmak
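Note that `in` does plain substring matching, so a keyword would also match inside a longer word. If that matters, a word-boundary check with `re` is safer; this is a small sketch, and the sample word `kurtulmaktan` is my own illustration:

```python
import re

word = 'kurtulmaktan'  # contains 'kurtulmak' as a prefix, not as a whole word
substring_hit = 'kurtulmak' in word                     # plain substring test
boundary_hit = bool(re.search(r'\bkurtulmak\b', word))  # whole-word test
print(substring_hit, boundary_hit)  # True False
```

The `\b` anchors make the pattern match only where the keyword stands as a complete word.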

Use the re library to find all occurrences of the keywords:

import re

text = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."
keywords = ('bas agrisi', 'kurtulmak')

result = re.findall('|'.join(keywords), text)
for key in result:
    print(key)

Output:

bas agrisi
bas agrisi
kurtulmak
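One caveat with `'|'.join(keywords)`: if a keyword ever contains regex metacharacters (such as `.` or `+`), the joined pattern changes meaning. Passing each keyword through `re.escape` avoids that; this is a defensive variant of the answer above:

```python
import re

text = ("Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar "
        "gunlerinde baslayan bu bas agrisi insanin canini sikmakta. "
        "Bu durumdan kurtulmak icin neler yapmali.")
keywords = ('bas agrisi', 'kurtulmak')

# Escape each keyword so it is matched literally, then join with '|'
pattern = '|'.join(re.escape(k) for k in keywords)
result = re.findall(pattern, text)
print(result)  # ['bas agrisi', 'bas agrisi', 'kurtulmak']
```

For these particular keywords the output is the same as before, since neither contains a metacharacter.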
