Q: Getting the variance of each column in pandas

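Stack Overflow user

Asked 2017-12-26 11:07:43

Answers: 1 · Views: 5.1K · Followers: 0 · Votes: 1

I want to compute the variance of features stored in a training file and a test file, which look like this: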

col1  Feature0  Feature1     Feature2   Feature3  Feature4  Feature5  Feature6  Feature7     Feature8     Feature9
col2     26658     40253.5  3.22115e+09  0.0277727   5.95939    266.56   734.248   307.364   0.000566779  0.000520574
col3     2658   4053.5     3.25e+09  0.0277   5.95939    266.56   734.248   307.364  0.000566779  0.000520574 
....

To do that, I wrote the following:

import numpy as np
from sklearn.decomposition import PCA
import pandas as pd
#from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from sklearn.feature_selection import VarianceThreshold
from matplotlib import pyplot as plt

# Reading csv file
training_file = 'Training.csv'
testing_file  = 'Test.csv'
Training_Frame = pd.read_csv(training_file)
Testing_Frame  = pd.read_csv(testing_file)
Training_Frame.shape
# Now that the feature values are loaded, we start
# by scaling those values to the [0, 1] range
stdsc = preprocessing.MinMaxScaler()
np_scaled_train = stdsc.fit_transform(Training_Frame.iloc[:,:-2])

sel = VarianceThreshold(threshold=(.2 * (1 - .2)))
sel.fit_transform(np_scaled_train)
pd_scaled_train = pd.DataFrame(data=np_scaled_train)
pd_scaled_train.to_csv('variance_result.csv',header=False, index=False)

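This obviously doesn't work. The result written to variance_result.csv is just the normalized training matrix. So my question is: how can I get the indices of the columns (features) whose variance is below 20%? Thanks in advance!

Update

I solved the variance computation this way:

import numpy as np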
from sklearn.decomposition import PCA
import pandas as pd
#from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from matplotlib import pyplot as plt
from sklearn.feature_selection import VarianceThreshold

# Reading csv file
training_file = 'Training.csv'
testing_file  = 'Test.csv'
Training_Frame = pd.read_csv(training_file)
Testing_Frame  = pd.read_csv(testing_file)

Training_Frame.shape
# Now that the feature values are loaded, we start
# by scaling those values to the [0, 1] range
stdsc = preprocessing.MinMaxScaler()
np_scaled_train = stdsc.fit_transform(Training_Frame.iloc[:,:-2])
pd_scaled_train = pd.DataFrame(data=np_scaled_train)
variance =pd_scaled_train.apply(np.var,axis=0) 
pd_scaled_train.to_csv('variance_result.csv',header=False, index=False)
temp_df = pd.DataFrame(variance.values,Training_Frame.columns.values[:-2])
temp_df.T.to_csv('Training_features_variance.csv',index=False)

Now, I still don't know how to get the features whose variance is larger than, say, 0.2 from variance, other than by running a loop. Thanks!
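A loop-free way to do this with pandas alone can be sketched as follows. This is a minimal illustration, not the asker's data: the small frame df and its values are hypothetical stand-ins for pd_scaled_train, and the 0.2 cutoff is taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scaled training frame (values already in [0, 1]).
df = pd.DataFrame({
    "Feature0": [0.0, 1.0, 0.0, 1.0],
    "Feature1": [0.4, 0.5, 0.5, 0.6],
    "Feature2": [0.0, 0.0, 1.0, 1.0],
})

cutoff = 0.2  # cutoff mentioned in the question

# ddof=0 gives the population variance, matching np.var's default.
variance = df.var(ddof=0)

# Boolean indexing on the Series replaces the explicit loop.
high = variance[variance > cutoff]
high_names = high.index.tolist()

# Positional indices of the same columns, if integer positions are needed.
high_positions = np.flatnonzero(variance.to_numpy() > cutoff).tolist()

print(high_names)
print(high_positions)
```

Flipping the comparison to variance < cutoff gives the low-variance columns instead. Note that if the data was scaled with MinMaxScaler's default range of [0, 1], a column's variance cannot exceed 0.25, so a cutoff of 0.2 only selects columns whose values sit almost entirely at the two extremes.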

pandas

scikit-learn

python

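1 Answer

Stack Overflow user

Accepted answer

Posted 2017-12-26 11:35:19

Just set the threshold to 0.0, then use the variances_ attribute of the fitted VarianceThreshold object to get the variance of every feature. From those values you can identify which features have low variance.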

from sklearn.feature_selection import VarianceThreshold
X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
selector = VarianceThreshold()
selector.fit_transform(X)

selector.variances_
#Output: array([ 0.        ,  0.22222222,  2.88888889,  0.        ])
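Building on the snippet above, the indices themselves can be read off with a comparison on variances_. This is a sketch under assumptions: the 0.2 cutoff comes from the question, and the names cutoff, low_idx, high_idx and kept are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Same toy matrix as in the answer above.
X = np.array([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]], dtype=float)
cutoff = 0.2  # cutoff taken from the question

# Fit with threshold 0.0 only to obtain the per-feature variances.
selector = VarianceThreshold(threshold=0.0)
selector.fit(X)
variances = selector.variances_

# Column indices below and at-or-above the cutoff, without a loop.
low_idx = np.flatnonzero(variances < cutoff)
high_idx = np.flatnonzero(variances >= cutoff)

# Alternative: let the selector apply the cutoff and report what it keeps.
kept = VarianceThreshold(threshold=cutoff).fit(X).get_support(indices=True)

print(low_idx, high_idx, kept)
```

get_support(indices=True) returns the positions of the features whose variance is strictly greater than the threshold, so for this matrix it agrees with high_idx.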

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/47977694
