How do I extract keywords from text in Python?

Stack Overflow user
Asked 2021-09-08 13:09:25
3 answers, 1.2K views, 0 followers, 1 vote

I want to extract some keywords from a text and print them, but how do I do that?

This is the sample text I want to extract from.

text = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."

These are examples of the keywords to extract from the text.

keywords = ('bas agrisi', 'kurtulmak')

I want to detect these keywords and print something like:

bas agrisi
kurtulmak

How can I do this in Python?

3 Answers

Stack Overflow user

Accepted answer

Posted 2021-09-08 13:12:29

Try this:

string = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."

keywords = ('bas agrisi', 'kurtulmak')

print(*[key for key in keywords if key in string], sep='\n')

Output:

bas agrisi
kurtulmak
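
One caveat with the `in` test above is that it matches substrings, so a short keyword would also be reported when it only appears inside a longer word. The following is a minimal sketch of a whole-word, case-insensitive variant. The helper name `find_keywords` and the extra keyword `'gun'` are assumptions added for illustration, not part of the original answer, and the `\b` anchors assume each keyword starts and ends with a letter or digit.

```python
import re

def find_keywords(text, keywords):
    """Return the keywords that occur in text as whole words, ignoring case."""
    found = []
    for key in keywords:
        # re.escape keeps regex metacharacters in a keyword literal;
        # \b anchors the match at word boundaries.
        pattern = r"\b" + re.escape(key) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(key)
    return found

text = ("Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar "
        "gunlerinde baslayan bu bas agrisi insanin canini sikmakta. "
        "Bu durumdan kurtulmak icin neler yapmali.")

# 'gun' is a substring of 'bugun' and 'gunlerinde' but never a separate word,
# so it is not reported.
print(*find_keywords(text, ("bas agrisi", "kurtulmak", "gun")), sep="\n")
```

With a plain `key in text` check, `'gun'` would have been reported as well.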
Votes: 2

Stack Overflow user

Posted 2021-09-08 13:18:50

Use the re library to find every occurrence of the keywords.

import re

text = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."
keywords = ('bas agrisi', 'kurtulmak')

# re.escape keeps any regex metacharacters in a keyword from being interpreted
result = re.findall('|'.join(map(re.escape, keywords)), text)
for key in result:
    print(key)

Output:

bas agrisi
bas agrisi
kurtulmak
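
Because `re.findall` returns every occurrence, the matches can also be tallied. The sketch below is one possible way to do that with `collections.Counter`; sorting the alternatives longest-first is an added assumption so that a longer phrase is preferred over a shorter keyword that happens to be its prefix.

```python
import re
from collections import Counter

text = ("Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar "
        "gunlerinde baslayan bu bas agrisi insanin canini sikmakta. "
        "Bu durumdan kurtulmak icin neler yapmali.")
keywords = ("bas agrisi", "kurtulmak")

# Regex alternation tries alternatives left to right, so put longer keywords first.
alternatives = sorted(keywords, key=len, reverse=True)
pattern = "|".join(re.escape(key) for key in alternatives)

# Count how many times each keyword was matched.
counts = Counter(re.findall(pattern, text))
for key, count in counts.items():
    print(key, count)
```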
Votes: 1

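Stack Overflow user

Posted 2021-09-08 14:28:03

Do you want Python to understand what the keywords are, or do you just want to treat the words in a specific text as tokens? For the former, you would probably need to build a machine learning pipeline or a neural network that understands the text and extracts keywords from it. For the latter, you can tokenize the words with a few simple steps.

For example,
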
import nltk

# The necessary resources need to be downloaded once.
# Recent NLTK releases may also need nltk.download('punkt_tab') for word_tokenize.
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# This is an example of a text
text = ("I wonder if I have been changed in the night. Let me think. Was "
        "I the same when I got up this morning? I almost can remember feeling a "
        "little different. But if I am not the same, the next question is 'Who "
        "in the world am I?' Ah, that is the great puzzle!")

tokens = nltk.word_tokenize(text)
print(tokens)  # punctuation is not removed; each mark becomes its own token
# Output will look like the following:
# ['I', 'wonder', 'if', 'I', 'have', 'been', 'changed', 'in', 'the', 'night', '.',
#  'Let', 'me', 'think', '.', 'Was', 'I', 'the', 'same', 'when', 'I', ....]

# First, clean the text: lowercase the tokens and keep only alphabetic ones
tokens2 = [word.lower() for word in tokens if word.isalpha()]

# Second, remove the stop words. Stop-word lists are available for various
# languages, but admittedly the English one is the best
from nltk.corpus import stopwords
stop_words = stopwords.words("english")

# Filter the stop words out of the previously created tokens2
# (the [word for word in ...] form is called a list comprehension)
tokens3 = [word for word in tokens2 if word not in stop_words]

# Tokenization is a preparatory step for lemmatization, which is a way to
# collapse repeated word forms and recover the base form of each word
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('stripes', pos='v')  # 'n' is for noun, 'v' is for verb
print(lemmatizer.lemmatize("stripes", 'n'))
# Output is "stripe", because the base form of the noun "stripes" is "stripe"

# The following is an example of stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print([stemmer.stem(word) for word in tokens3])
# Output will be:
# ['wonder', 'chang', 'night', 'let', 'think', 'got', 'morn', 'almost', 'rememb',
#  'feel', 'littl', 'differ', 'next', 'question', 'world', 'ah', 'great', 'puzzl']
# The stop words (such as I, have, been, etc.) were eliminated from the text,
# and the stems of the remaining words were retrieved.

# One last thing, to see how the lemmatizer works
tokens4 = [lemmatizer.lemmatize(word, pos='n') for word in tokens3]
tokens4 = [lemmatizer.lemmatize(word, pos='v') for word in tokens4]
print(tokens4)
# Output will be:
# ['wonder', 'change', 'night', 'let', 'think', 'get', 'morning',
#  'almost', 'remember', 'feel', 'little', 'different', 'next',
#  'question', 'world', 'ah', 'great', 'puzzle']
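
Once the text has been reduced to content words, a very rough way to pick candidate keywords without any machine learning is to rank the remaining words by frequency. The sketch below illustrates that idea using only the standard library so it runs without the NLTK downloads. The function name `top_keywords`, the tiny `STOP_WORDS` set, and the regex tokenizer are all assumptions made for illustration; they stand in for the NLTK tokenizer and stop-word list used above.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only;
# in practice the NLTK list for the relevant language would be used instead.
STOP_WORDS = {"i", "if", "have", "been", "in", "the", "me", "was", "when",
              "up", "this", "a", "is", "am", "not", "but", "that", "can"}

def top_keywords(text, stop_words=STOP_WORDS, n=5):
    """Return the n most frequent non-stop-words as (word, count) pairs."""
    # Crude tokenizer: lowercase the text and keep runs of ASCII letters.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in stop_words)
    return counts.most_common(n)

sample = ("I wonder if I have been changed in the night. Let me think. "
          "Was I the same when I got up this morning?")
print(top_keywords(sample, n=3))
```

Raw frequency is only a heuristic: it favors repeated words and knows nothing about multi-word phrases such as 'bas agrisi'.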

I hope this explains it clearly. Also, if you want to go further and build a neural network or a similar model, you can use one-hot encoding.
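
As a rough illustration of what one-hot encoding means for tokens, the sketch below maps each distinct token to an index and represents a token as a vector with a single 1 at that index. The helper names `build_vocabulary` and `one_hot` and the short token list are assumptions for illustration; a real model would normally rely on a library encoder rather than hand-written lists.

```python
def build_vocabulary(tokens):
    """Map each distinct token to a fixed index (alphabetical order)."""
    return {token: index for index, token in enumerate(sorted(set(tokens)))}

def one_hot(token, vocabulary):
    """Return a list of zeros with a single 1 at the token's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[token]] = 1
    return vector

tokens = ["wonder", "change", "night", "wonder"]
vocabulary = build_vocabulary(tokens)
encoded = [one_hot(token, vocabulary) for token in tokens]

print(vocabulary)  # {'change': 0, 'night': 1, 'wonder': 2}
print(encoded)     # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```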

Votes: 1
The original content of this page is provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/69103712
