Feature matrix from a pipeline

Stack Overflow user
Asked 2018-07-10 14:25:51

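Given the example below, taken from the scikit-learn examples, which uses a FeatureUnion inside a pipeline: how can I get the dimensions of the whole feature matrix after the pipeline has run?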

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

# SubjectBodyExtractor, ItemSelector and TextStats are custom transformers
# defined in the scikit-learn example this pipeline is taken from.

pipeline = Pipeline([
    # Extract the subject & body
    ('subjectbody', SubjectBodyExtractor()),

    # Use FeatureUnion to combine the features from subject and body
    ('union', FeatureUnion(
        transformer_list=[

            # Pipeline for pulling features from the post's subject line
            ('subject', Pipeline([
                ('selector', ItemSelector(key='subject')),
                ('tfidf', TfidfVectorizer(min_df=50)),
            ])),

            # Pipeline for standard bag-of-words model for body
            ('body_bow', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('tfidf', TfidfVectorizer()),
                ('best', TruncatedSVD(n_components=50)),
            ])),

            # Pipeline for pulling ad hoc features from post's body
            ('body_stats', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('stats', TextStats()),  # returns a list of dicts
                ('vect', DictVectorizer()),  # list of dicts -> feature matrix
            ])),

        ],

        # weight components in FeatureUnion
        transformer_weights={
            'subject': 0.8,
            'body_bow': 0.5,
            'body_stats': 1.0,
        },
    )),

    # Use a SVC classifier on the combined features
    ('svc', SVC(kernel='linear')),
])

1 Answer

Stack Overflow user

Answered 2018-07-10 16:41:33

FeatureUnion only changes the columns of the data, so the number of rows stays the same.
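
As a minimal sketch of that behaviour, the snippet below uses made-up random data and two stock transformers (StandardScaler and PCA, chosen only for illustration and not part of the question's pipeline). It shows that FeatureUnion stacks the outputs of its transformers side by side, so the column count is the sum of the parts while the row count is untouched.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 6 samples, 4 features.
rng = np.random.RandomState(0)
X = rng.rand(6, 4)

union = FeatureUnion([
    ("scaled", StandardScaler()),   # keeps all 4 columns
    ("pca", PCA(n_components=2)),   # produces 2 columns
])

X_union = union.fit_transform(X)
print(X_union.shape)  # (6, 6): 6 rows as before, 4 + 2 columns
```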

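Now, to get the number of columns after the pipeline has run, there are several options:

1) Your current pipeline has SVC as its final estimator. It does not change the shape of the data; it only fits on it. So you can use its attributes to find out how many features the previous step fed into it.

According to the documentation, you can use:

support_vectors_ : array-like, shape = [n_SV, n_features]

The second dimension is the n_features that was fed into the SVC. Once the pipeline has been fitted, you can access it with: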

pipeline.named_steps['svc'].support_vectors_.shape
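
A small self-contained sketch of this approach follows. The data and the two transformers inside the union are stand-ins invented for illustration (the question's custom transformers are not reproduced here); only the `named_steps['svc'].support_vectors_` access mirrors the line above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data: 20 samples, 4 features, two balanced classes.
rng = np.random.RandomState(0)
X = rng.rand(20, 4)
y = np.array([0, 1] * 10)

toy_pipeline = Pipeline([
    ("union", FeatureUnion([
        ("scaled", StandardScaler()),   # 4 columns
        ("pca", PCA(n_components=2)),   # 2 columns
    ])),
    ("svc", SVC(kernel="linear")),
])

# support_vectors_ only exists after fitting.
toy_pipeline.fit(X, y)

n_sv, n_features = toy_pipeline.named_steps["svc"].support_vectors_.shape
print(n_features)  # 6, the number of columns the union passed to the SVC
```

Note that the first dimension is the number of support vectors, not the number of training rows, so only the second dimension is useful here.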

2) (Simpler) You can make a copy of the pipeline that leaves out the last step (svc), and then call fit_transform() on it:

pipeline = Pipeline([
    # Extract the subject & body
    ('subjectbody', SubjectBodyExtractor()),

    # Use FeatureUnion to combine the features from subject and body
    ('union', FeatureUnion(
        transformer_list=[

            # Pipeline for pulling features from the post's subject line
            ('subject', Pipeline([
                ('selector', ItemSelector(key='subject')),
                ('tfidf', TfidfVectorizer(min_df=50)),
            ])),

            # Pipeline for standard bag-of-words model for body
            ('body_bow', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('tfidf', TfidfVectorizer()),
                ('best', TruncatedSVD(n_components=50)),
            ])),

            # Pipeline for pulling ad hoc features from post's body
            ('body_stats', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('stats', TextStats()),  # returns a list of dicts
                ('vect', DictVectorizer()),  # list of dicts -> feature matrix
            ])),

        ],

        # weight components in FeatureUnion
        transformer_weights={
            'subject': 0.8,
            'body_bow': 0.5,
            'body_stats': 1.0,
        },
    )),
])

Then:

X_transformed = pipeline.fit_transform(X)
print(X_transformed.shape)
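
If the full pipeline (including the classifier) has already been fitted, a possible variation that avoids redefining it is to slice off the last step. This is a sketch, assuming scikit-learn 0.21 or later, where Pipeline supports slicing; the toy data and transformers are again invented for illustration. The slice reuses the already-fitted steps, so only transform() is needed.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data: 20 samples, 4 features, two balanced classes.
rng = np.random.RandomState(0)
X = rng.rand(20, 4)
y = np.array([0, 1] * 10)

toy_pipeline = Pipeline([
    ("union", FeatureUnion([
        ("scaled", StandardScaler()),   # 4 columns
        ("pca", PCA(n_components=2)),   # 2 columns
    ])),
    ("svc", SVC(kernel="linear")),
])
toy_pipeline.fit(X, y)

# Everything except the final estimator; the steps are shared, not refitted.
features_only = toy_pipeline[:-1]
X_transformed = features_only.transform(X)
print(X_transformed.shape)  # (20, 6)
```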
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/51258466
