Question: How do I use RandomizedSearchCV on a nested list that consists of a single list?

Stack Overflow user

Asked on 2022-01-31 09:42:58

Answers: 1 · Views: 116 · Followers: 0 · Votes: 2

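I have built a sentence boundary detection classifier. For sequence labeling I use a conditional random field (CRF). For hyperparameter optimization I would like to use RandomizedSearchCV. My training data consists of 6 annotated texts. I merged all 6 texts into a single list of tokens. For the implementation I followed an example from the documentation. Here is my simplified code: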

from sklearn_crfsuite import CRF
from sklearn_crfsuite import metrics
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
import scipy.stats

#my tokenlist has the length n
X_train = [feature_dict_token_1, ... , feature_dict_token_n]
# 3 types of tags, B-SEN for begin of sentence; E-SEN for end of sentence; O-Others
y_train = [tag_token_1, ..., tag_token_n]

# define fixed parameters and parameters to search
crf = CRF(  # CRF is imported by name above; the sklearn_crfsuite module itself is not
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True
)
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

labels = ['B-SEN', 'E-SEN', 'O']

# use F1-score for evaluation
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=labels)

# search
rs = RandomizedSearchCV(crf, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit([X_train], [y_train])

I use rs.fit([X_train], [y_train]) instead of rs.fit(X_train, y_train) because the CRF documentation for fit says it needs a list of lists:

fit(X, y, X_dev=None, y_dev=None)

Parameters: 
-X (list of lists of dicts) – Feature dicts for several documents (in a python-crfsuite format).
-y (list of lists of strings) – Labels for several documents.
-X_dev ((optional) list of lists of dicts) – Feature dicts used for testing.
-y_dev ((optional) list of lists of strings) – Labels corresponding to X_dev.

However, with a list of lists I get the following error:

ValueError: Cannot have number of splits n_splits=5 greater than the number of samples: n_samples=1

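I understand that this happens because I pass [X_train] and [y_train], and cross-validation (CV) cannot be applied to a list that consists of a single list. With the flat X_train and y_train, however, crf.fit cannot cope. How can I solve this?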
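To illustrate the point about sample counts: scikit-learn's splitters count the elements of the outermost list, so wrapping one flat token list yields exactly one sample. The sketch below uses made-up data and a hypothetical `token_features` helper (neither is from the question) to contrast that with keeping one inner list per annotated text.

```python
# Hypothetical stand-in for the six annotated texts: each text is a list of
# (token, tag) pairs. Real data would come from the annotated corpus.
texts = [
    [("This", "B-SEN"), ("works", "O"), (".", "E-SEN")]
    for _ in range(6)
]


def token_features(tokens, i):
    """Build a minimal, illustrative feature dict for the token at index i."""
    word = tokens[i]
    return {
        "bias": 1.0,
        "word.lower()": word.lower(),
        "word.istitle()": word.istitle(),
    }


# Keep one inner list per text instead of merging all tokens.
X_by_text, y_by_text = [], []
for text in texts:
    tokens = [token for token, _ in text]
    X_by_text.append([token_features(tokens, i) for i in range(len(tokens))])
    y_by_text.append([tag for _, tag in text])

# Merging everything into one flat token list and wrapping it again
# leaves a single sample for the cross-validation splitter.
X_flat = [features for sequence in X_by_text for features in sequence]

print(len([X_flat]))   # 1 sample: cannot be split into 3 folds
print(len(X_by_text))  # 6 samples: can be split into 3 folds
```

With six sequences, a 3-fold split has two sequences per fold; whether six texts are enough for a stable score estimate is a separate question.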

gridsearchcv

python-crfsuite

python

scikit-learn

nlp

1 Answer

Stack Overflow user

Accepted answer

Answered on 2022-01-31 13:24:46

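According to the official tutorial, your training/test sets (i.e. X_train, X_test) should be lists of lists of dictionaries, with one inner list per sequence. For example: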

[[{'bias': 1.0,
   'word.lower()': 'melbourne',
   'word[-3:]': 'rne',
   'word[-2:]': 'ne',
   'word.isupper()': False,
   'word.istitle()': True,
   'word.isdigit()': False,
   'postag': 'NP'},
  {'bias': 1.0,
   'word.lower()': '(',
   'word[-3:]': '(',
   'word[-2:]': '(',
   'word.isupper()': False,
   'word.istitle()': False,
   'word.isdigit()': False,
   'postag': 'Fpa'},
  ...],
 [{'bias': 1.0,
   'word.lower()': '-',
   'word[-3:]': '-',
   'word[-2:]': '-',
   'word.isupper()': False,
   'word.istitle()': False,
   'word.isdigit()': False,
   'postag': 'Fg',
   'postag[:2]': 'Fg'},
  {'bias': 1.0,
   'word.lower()': '25',
   'word[-3:]': '25',
   'word[-2:]': '25',
   'word.isupper()': False,
   'word.istitle()': False,
   'word.isdigit()': True,
   'postag': 'Z'}]]

The label sets (i.e. y_train and y_test) should be lists of lists of strings. For example:

[['B-LOC', 'I-LOC'], ['B-ORG', 'O']]
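A quick structural check can catch a flat token list before it reaches the search. The helper below, `check_sequences`, is a hypothetical name used for illustration and is not part of sklearn-crfsuite or scikit-learn; it only verifies the nesting described above.

```python
def check_sequences(X, y):
    """Return the number of sequences, or raise ValueError if the nesting is wrong.

    X is expected to be a list of lists of dicts, y a list of lists of strings,
    with one inner list per sequence and matching lengths.
    """
    if len(X) != len(y):
        raise ValueError("X and y must contain the same number of sequences")
    for index, (features, tags) in enumerate(zip(X, y)):
        if not isinstance(features, list) or not isinstance(tags, list):
            raise ValueError(f"sequence {index} is not a list; is the data flat?")
        if len(features) != len(tags):
            raise ValueError(f"sequence {index}: feature and tag counts differ")
        if not all(isinstance(item, dict) for item in features):
            raise ValueError(f"sequence {index}: every feature item must be a dict")
        if not all(isinstance(tag, str) for tag in tags):
            raise ValueError(f"sequence {index}: every tag must be a string")
    return len(X)


# Two short sequences in the expected nested shape.
X_ok = [[{"bias": 1.0}, {"bias": 1.0}], [{"bias": 1.0}, {"bias": 1.0}]]
y_ok = [["B-LOC", "I-LOC"], ["B-ORG", "O"]]
print(check_sequences(X_ok, y_ok))  # 2
```

The returned count is the number of samples the cross-validation splitter will see, so it should be at least as large as the number of folds.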

Then you fit the model as usual:

rs.fit(X_train, y_train)

Please refer to the tutorial mentioned above to see how it works.
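If the boundaries between the original texts are no longer available, one possible workaround (an assumption on my part, not something the answer or the tutorial proposes) is to cut the flat token list into consecutive fixed-size chunks and treat each chunk as one sequence. The `chunk` helper and the chunk size are hypothetical; note that a chunk boundary may fall in the middle of a sentence.

```python
def chunk(sequence, size):
    """Split a list into consecutive pieces of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [sequence[start:start + size] for start in range(0, len(sequence), size)]


# Flat, illustrative data: one feature dict and one tag per token.
flat_features = [{"bias": 1.0, "position": i} for i in range(7)]
flat_tags = ["B-SEN", "O", "E-SEN", "B-SEN", "O", "O", "E-SEN"]

# The same chunk size is applied to both lists so they stay aligned.
X_train = chunk(flat_features, 3)
y_train = chunk(flat_tags, 3)

print([len(sequence) for sequence in X_train])  # [3, 3, 1]
print(y_train[2])                               # ['E-SEN']
```

Keeping the six texts as six separate sequences is the simpler option when it is available, since it avoids cutting sentences at arbitrary positions.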

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/70923870
