Q: Feature matrix from a pipeline

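Stack Overflow user

Asked 2018-07-10 14:25:51

Answers: 1 · Views: 113 · Followers: 0 · Votes: 0

The scikit-learn examples include the pipeline shown below, which uses a FeatureUnion. How can I get the dimensions of the full feature matrix after the pipeline has run?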

pipeline = Pipeline([
# Extract the subject & body
('subjectbody', SubjectBodyExtractor()),

# Use FeatureUnion to combine the features from subject and body
('union', FeatureUnion(
    transformer_list=[

        # Pipeline for pulling features from the post's subject line
        ('subject', Pipeline([
            ('selector', ItemSelector(key='subject')),
            ('tfidf', TfidfVectorizer(min_df=50)),
        ])),

        # Pipeline for standard bag-of-words model for body
        ('body_bow', Pipeline([
            ('selector', ItemSelector(key='body')),
            ('tfidf', TfidfVectorizer()),
            ('best', TruncatedSVD(n_components=50)),
        ])),

        # Pipeline for pulling ad hoc features from post's body
        ('body_stats', Pipeline([
            ('selector', ItemSelector(key='body')),
            ('stats', TextStats()),  # returns a list of dicts
            ('vect', DictVectorizer()),  # list of dicts -> feature matrix
        ])),

    ],

    # weight components in FeatureUnion
    transformer_weights={
        'subject': 0.8,
        'body_bow': 0.5,
        'body_stats': 1.0,
    },
)),

# Use a SVC classifier on the combined features
('svc', SVC(kernel='linear')),
])

python-3.x

scikit-learn

feature-extraction

1 Answer

Stack Overflow user

Answered 2018-07-10 16:41:33

FeatureUnion only changes the columns of the data, so the number of rows stays the same.
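As a minimal sketch of that point, the toy example below (random data and two stand-in transformers that are not part of the original question) shows FeatureUnion stacking the outputs of its transformers side by side: the row count is unchanged and the column count is the sum of the individual widths.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): 6 samples, 4 numeric features.
rng = np.random.RandomState(0)
X = rng.rand(6, 4)

union = FeatureUnion(transformer_list=[
    ("scaled", StandardScaler()),   # keeps all 4 columns
    ("pca", PCA(n_components=2)),   # reduces to 2 columns
])

X_union = union.fit_transform(X)
print(X_union.shape)  # (6, 6): same 6 rows, 4 + 2 columns
```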

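Now, to get the number of columns after the pipeline has run, there are several options:

1) Your current pipeline has SVC as its final estimator. SVC does not change the shape of the data; it only fits on it. So, once the pipeline has been fitted, you can use the SVC's attributes to find out how many features the previous step passed to it.

According to the documentation, you can use:

support_vectors_ : array-like, shape = [n_SV, n_features]

The second dimension is the n_features that were passed into the SVC. You can access it with: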

pipeline.named_steps['svc'].support_vectors_.shape
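A small self-contained sketch of this approach, assuming a toy corpus and a simplified two-step pipeline (the names `docs`, `labels`, and `toy_pipeline` are made up for illustration and stand in for the pipeline from the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus with two classes.
docs = [
    "free prize inside",
    "claim your free prize now",
    "meeting moved to friday",
    "see agenda for friday meeting",
]
labels = [1, 1, 0, 0]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", SVC(kernel="linear")),
])

# support_vectors_ only exists after the pipeline has been fitted.
toy_pipeline.fit(docs, labels)

# Second dimension of support_vectors_ = number of features fed to the SVC.
n_features = toy_pipeline.named_steps["svc"].support_vectors_.shape[1]

# Cross-check against the vocabulary learned by the vectorizer.
vocab_size = len(toy_pipeline.named_steps["tfidf"].vocabulary_)
print(n_features, vocab_size)
```

Here the feature count read from the SVC should match the size of the TF-IDF vocabulary, since the vectorizer is the only step in front of the classifier.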

2) (Simpler) You can make a copy of the pipeline that leaves out the last step (svc), and then call fit_transform() on it.

pipeline = Pipeline([
# Extract the subject & body
('subjectbody', SubjectBodyExtractor()),

# Use FeatureUnion to combine the features from subject and body
('union', FeatureUnion(
    transformer_list=[

        # Pipeline for pulling features from the post's subject line
        ('subject', Pipeline([
            ('selector', ItemSelector(key='subject')),
            ('tfidf', TfidfVectorizer(min_df=50)),
        ])),

        # Pipeline for standard bag-of-words model for body
        ('body_bow', Pipeline([
            ('selector', ItemSelector(key='body')),
            ('tfidf', TfidfVectorizer()),
            ('best', TruncatedSVD(n_components=50)),
        ])),

        # Pipeline for pulling ad hoc features from post's body
        ('body_stats', Pipeline([
            ('selector', ItemSelector(key='body')),
            ('stats', TextStats()),  # returns a list of dicts
            ('vect', DictVectorizer()),  # list of dicts -> feature matrix
        ])),

    ],

    # weight components in FeatureUnion
    transformer_weights={
        'subject': 0.8,
        'body_bow': 0.5,
        'body_stats': 1.0,
    },
)),
])

Then:

X_transformed = pipeline.fit_transform(X)
print(X_transformed.shape)
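If retyping the whole pipeline is inconvenient, one possible shortcut is to build the feature-only pipeline from the existing steps. The sketch below assumes a toy corpus and a reduced FeatureUnion (`docs`, `full_pipeline`, and `features_only` are illustrative names, not from the question). Note that `Pipeline(full_pipeline.steps[:-1])` reuses the same transformer objects rather than cloning them, so fitting it also fits the transformers inside the original pipeline.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus.
docs = [
    "free prize inside",
    "claim your free prize now",
    "meeting moved to friday",
    "see agenda for friday meeting",
]

full_pipeline = Pipeline([
    ("union", FeatureUnion(transformer_list=[
        ("tfidf", TfidfVectorizer()),
        ("svd", Pipeline([
            ("counts", CountVectorizer()),
            ("best", TruncatedSVD(n_components=2)),
        ])),
    ])),
    ("svc", SVC(kernel="linear")),
])

# Rebuild a pipeline from every step except the final estimator.
features_only = Pipeline(full_pipeline.steps[:-1])

X_transformed = features_only.fit_transform(docs)

# Rows = number of documents; columns = TF-IDF vocabulary size + 2 SVD components.
print(X_transformed.shape)
```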

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/51258466
