Hi, I am training a CRF with crfsuite on some sample data of Latin text. I tagged the training data with O, PERSON and PLACE. When I test my trained model, every prediction it makes is O. I suspect this is because I don't have enough training data: my training set is only 3760 bytes. (I know that is tiny! Could that alone stop the CRF from working?)
def word2features2(sent, i):
    word = sent[i][1]  # the word token
    # a list of feature strings for this word
    features = [
        # features of the current token
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],  # suffix substrings
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit()
    ]
    if i > 0:  # if this is not the first word in the sentence
        word1 = sent[i-1][1]  # get features of the previous word
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper()
        ])
    else:
        features.append('BOS')  # first word in the sentence - Beginning of Sentence
    if i < len(sent) - 1:  # if the end of the sentence is not reached
        word1 = sent[i+1][1]  # get the features of the next word
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper()
        ])
    else:
        features.append('EOS')  # last word in the sentence - End of Sentence
    return features
# each sentence is passed through the feature function
def get_features(sent):
    return [word2features2(sent, i) for i in range(len(sent))]

# get the POS/NER tag for each token in a sentence
def get_tags(sent):
    return [tag for tag, token in sent]

X_train = [get_features(s) for s in TRAIN_DATA]
y_train = [get_tags(s) for s in TRAIN_DATA]

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    all_possible_transitions=False
)
crf.fit(X_train, y_train)
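For reference, here is a minimal sketch of the (tag, token) structure the pipeline above appears to assume — the sentence below is illustrative, not the asker's actual corpus, and `get_tags` is repeated only to make the sketch self-contained:

```python
# Assumed data shape: each training sentence is a list of (tag, token)
# pairs, matching sent[i][1] (the token) in word2features2 and
# "for tag, token in sent" (the tag) in get_tags.
TRAIN_DATA = [
    [('PERSON', 'Azar'), ('PERSON', 'Nifusi'), ('O', 'Judeus'),
     ('O', 'de'), ('O', 'civitate'), ('PLACE', 'Malte')],
]

def get_tags(sent):
    return [tag for tag, token in sent]

print(get_tags(TRAIN_DATA[0]))
# ['PERSON', 'PERSON', 'O', 'O', 'O', 'PLACE']
```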
text4 = 'Azar Nifusi Judeus de civitate Malte presens etc. non vi sed sponte etc. incabellavit et ad cabellam habere concessit ac dedit, tradidit et assignavit Nicolao Delia et Lemo suo filio presentibus etc. terras ipsius Azar vocatas Ta Xellule et Ginen Chagem in contrata Deyr Issafisaf cum iuribus suis omnibus <etc.> pro annis decem continuo sequturis numerandis a medietate mensis Augusti primo preteriti in antea pro salmis octo frumenti <sue> pro qualibet ayra provenientibus ex dictis terris \ad racionem, videlicet, de salmis sexdecim/ quas salmas octo frumenti in qualibet ayra \dicti cabelloti/ promiserunt dare et assignare prefato <Nicol.> Azar et eciam dicti cabelloti anno quolibet promiserunt et tenentur eidem Azar dare et deferre cum eiusdem Azar somerio salmas decem spinarum ac eciam prefat cabelloti promiserunt eidem Azar in qualibet ayra provenient[ium] ex dictis terris dare duas salmas palie in ayra et dictus <cabellotus promisit> Azar promisit eisdem cabellotis suis non spoliantur de dicta cabella neque via alienacionis neque alia quavis via [f. 5v / p. 8] et eciam promisit suis expensis dictas terras circumdare muro et dicti cabellotis tenentur in medio ipsius dicte ingabellationis dare \dicto Azar pro causa predicta/ dimidiam salmam frumenti et eciam promisit durantibus dictis annis decem dictas terras non reincabellare alicui persone et eciam tenentur revidere et curatareb ad circumfaciendas dictas terras \muro/ ad expensas tamen dicti Judei. Que omnia etc. Promiserunt etc. Obligando etc. Renunciando etc. Unde etc.'
y_pred = crf.predict(text4)

Posted on 2018-08-14 00:52:22
Well, as with any machine-learning model, a very small training set will lead to underfitting, and that may be what is happening here. That said, the fact that every predicted value is identical suggests to me that there are also some bugs in the code itself.
def get_features(sent):
    return [word2features2(sent, i) for i in range(len(sent))]

X_train = [get_features(s) for s in TRAIN_DATA]

So here it looks as if `i` runs over the characters of each sentence rather than its words before being passed into the word2features2 function. I think you probably want to pass the sentence in as a list of words, so try:
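The character-versus-word mix-up is easy to demonstrate with plain Python (a toy string, not the asker's data):

```python
# If sent is a raw string, range(len(sent)) walks over *characters*:
sent = 'Azar Nifusi Judeus'
print(len(sent))          # 18 -- character count
print(len(sent.split()))  # 3  -- word count
print(sent[0])            # 'A' -- sent[i] is a single character,
                          #        so sent[i][1] would raise IndexError
```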
def get_features(sent):
    word_list = sent.split(" ")
    return [word2features2(word_list, i) for i in range(len(word_list))]

I am assuming your training data is a list of sentences, rather than a list of word lists like the second line below:
train_data = ['this is a sentence', 'this is also a sentence']  # <= yours
train_data = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'also', 'a', 'sentence']]  # <= not yours

To be fair, I don't really know what your training data actually looks like, so the line
word = sent[i][1]

also looks a bit suspicious to me.
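On a related note, `crf.predict(text4)` passes a raw string straight to the model, while sklearn_crfsuite's `predict` expects a list of feature sequences, one per sentence. A sketch of the missing shaping step, assuming simple whitespace tokenization and the same (tag, token) pair structure as training (the dummy tags exist only so `sent[i][1]` still indexes the token):

```python
text4 = 'Azar Nifusi Judeus de civitate Malte'  # truncated sample text

# Wrap each token in a (dummy_tag, token) pair so word2features2's
# sent[i][1] indexing works at prediction time as well.
test_sent = [('O', tok) for tok in text4.split()]
print(test_sent[:2])
# [('O', 'Azar'), ('O', 'Nifusi')]

# The prediction call would then be (note the outer list):
#   y_pred = crf.predict([get_features(test_sent)])
```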
https://stackoverflow.com/questions/51827004