I want to extract some keywords from a text and print them, but how can I do that?
Here is the sample text I want to extract from:
text = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."
Here is an example of the keywords I want to extract from the text:
keywords = ('bas agrisi', 'kurtulmak')
I want to detect these keywords and print something like:
bas agrisi
kurtulmak
How can I do this in Python?
Posted on 2021-09-08 13:12:29
Try this:
string = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."
keywords = ('bas agrisi', 'kurtulmak')
print(*[key for key in keywords if key in string], sep='\n')
Output:
bas agrisi
kurtulmak
Posted on 2021-09-08 13:18:50
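Note that the plain substring check (key in string) is case-sensitive and will also match inside longer words. A minimal sketch (not part of the original answer) that at least makes the match case-insensitive, continuing with the string and keywords defined above:

# Lowercase the text once, then compare each keyword in lowercase form
lowered = string.lower()
print(*[key for key in keywords if key.lower() in lowered], sep='\n')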
Use the re library to find all occurrences of the keywords.
import re
text = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."
keywords = ('bas agrisi', 'kurtulmak')
result = re.findall('|'.join(keywords), text)
for key in result:
    print(key)
Output:
bas agrisi
bas agrisi
kurtulmak
Posted on 2021-09-08 14:28:03
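One caveat: re.findall interprets each keyword as a regular expression pattern, so keywords containing characters such as '.' or '?' could match unintended text. A hedged variant (not part of the original answer) that escapes the keywords before joining them:

import re
# re.escape neutralizes any regex metacharacters inside the keywords
pattern = '|'.join(map(re.escape, keywords))
result = re.findall(pattern, text)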
Do you want Python to understand the keywords, or do you want to treat the words in a given text as tokens? For the first case you would probably need to build a machine learning model or a neural network that understands the text and extracts keywords from it. For the second case, tokenizing the words takes only a few very simple steps.
For example,
import nltk #need to download necessary dictionaries
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
text = "I wonder if I have been changed in the night. Let me think. Was
I the same when I got up this morning? I almost can remember feeling a
little different. But if I am not the same, the next question is 'Who
in the world am I?' Ah, that is the great puzzle!" # This is an
#example of a text
tokens = nltk.word_tokenize(text)
tokens # punctuation is not removed here and is treated as part of each token
#Output will look like the following:
['I',
'wonder',
'if',
'I',
'have',
'been',
'changed',
'in',
'the',
'night',
'.',
'Let',
'me',
'think',
'.',
'Was',
'I',
'the',
'same',
'when',
'I',....]
#First, you can clean the text by lowercasing the tokens and keeping only alphabetic words
tokens2 = [ word.lower() for word in tokens if word.isalpha()]
#Second, you can remove stop words from the text. Stop word lists are
#available for various languages, and the English list is among the
#most complete
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
#You can filter out the stop words from the previously created tokens2
tokens3 = [word for word in tokens2 if word not in stop_words]
#the "word for word in ..." construct above is a list comprehension
#Tokenization is a preparation step for lemmatization, which collapses
#repeated word forms and recovers the base form (lemma) of each word
# lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('stripes', pos='v')  # 'n' is for noun, 'v' is for verb
print(lemmatizer.lemmatize("stripes", 'n'))
#output is stripe because the stem of the word stripes is stripe
# The following is an example for using stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
[stemmer.stem(word) for word in tokens3]
#output will be
['wonder',
'chang',
'night',
'let',
'think',
'got',
'morn',
'almost',
'rememb',
'feel',
'littl',
'differ',
'next',
'question',
'world',
'ah',
'great',
'puzzl'] # Stop words such as 'I', 'have' and 'been' were removed from
#the text, and the stems of the remaining words were retrieved.
#One last thing to see how lemmatizer works
tokens4 = [lemmatizer.lemmatize(word, pos='n') for word in tokens3]
tokens4 = [lemmatizer.lemmatize(word, pos='v') for word in tokens4]
print(tokens4)
#Output will be
['wonder', 'change', 'night', 'let', 'think', 'get', 'morning',
'almost', 'remember', 'feel', 'little', 'different', 'next',
'question', 'world', 'ah', 'great', 'puzzle']
I hope this explains it clearly. Also, if you want to go further and build a neural network or a similar mechanism, you can use one-hot encoding.
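For illustration, here is a minimal sketch of one-hot encoding the lemmatized tokens from above (this is not part of the original answer; the names vocab, index and one_hot are just for demonstration):

# Build a vocabulary from tokens4 and encode each token as a one-hot vector
vocab = sorted(set(tokens4))                       # unique tokens in a fixed order
index = {word: i for i, word in enumerate(vocab)}  # map word -> vector position
one_hot = []
for word in tokens4:
    vector = [0] * len(vocab)
    vector[index[word]] = 1
    one_hot.append(vector)
print(vocab)
print(one_hot[0])  # one-hot vector for the first token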
https://stackoverflow.com/questions/69103712