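I have time series data that is not monotonically increasing, so calling sort/shuffle is out of the question.
I want to randomly extract n% of the data, while keeping the relative order of the data, to use as a validation or test set. It could look like this: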
my_ndarray = [ 1, 20, 10, 3, 90, 5, 80, 50, 4, 1] # (number of samples = 1645, number of timesteps = 10, number of features = 7)
# custom_train_test_split()
train = [1, 20, 90, 5, 50, 4, 1]
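valid = [10, 3, 80]
I would appreciate some guidance on how to do this efficiently. As I understand it, Java-style iteration is inefficient in Python. I suspect a 3D boolean mask would be the Pythonic, vectorized way to do it.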
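The boolean-mask idea can be sketched roughly as follows. This is only an illustration under a few assumptions: the split is made over the samples axis (axis 0), so a 1D mask is enough rather than a 3D one; `data` is a made-up stand-in with the shape given in the question; and `masked_split` is a hypothetical helper name, not a library function.

```python
import numpy as np

# Hypothetical stand-in for the real data: (samples, timesteps, features)
data = np.arange(1645 * 10 * 7).reshape(1645, 10, 7)

def masked_split(arr, valid_fraction=0.3, seed=0):
    """Split arr along axis 0 into (train, valid), preserving row order."""
    rng = np.random.default_rng(seed)
    n = arr.shape[0]
    n_valid = int(n * valid_fraction)
    # Boolean mask with exactly n_valid True entries at random positions
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_valid, replace=False)] = True
    # Boolean indexing returns the selected rows in their original order
    return arr[~mask], arr[mask]

train, valid = masked_split(data, valid_fraction=0.3)
```

Because boolean indexing always walks the array front to back, no sorting step is needed afterwards.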
Posted on 2020-04-09 20:25:54
Here is what a solution can look like, using plain Python lists:
import numpy as np

my_ndarray = [1, 20, 10, 3, 90, 5, 80, 50, 4, 1]
# Add a temporary dimension by converting each item
# to a sublist whose first element is the item's original index
nda = [[i, my_ndarray[i]] for i in range(len(my_ndarray))]
# Shuffle the [index, value] pairs in place
np.random.shuffle(nda)
# Training data is the first 7 items, sorted by index to restore the original order
traindata = nda[0:7]
traindata.sort()
traindata = [x[1] for x in traindata]
# Test data is the rest, restored to the original order the same way
testdata = nda[7:10]
testdata.sort()
testdata = [x[1] for x in testdata]
Posted on 2020-04-09 19:59:13
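This works well. I set test_size=0.4 so that 40% of the rows end up in test_df. It assumes your dataframe has all the feature columns on the left and the response column on the right.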
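import pandas as pd
from sklearn.model_selection import train_test_split

x = df[features_columns_names_list]
y = df[response_column_name]
# Random split; the rows come back shuffled
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
# Re-join features and response, then sort by index to restore the original row order
train_df = pd.concat([X_train, y_train], axis=1).sort_index(axis=0)
test_df = pd.concat([X_test, y_test], axis=1).sort_index(axis=0)
https://stackoverflow.com/questions/61129122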
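Both answers follow the same pattern: pick random positions, then put each subset back into its original order. A minimal standard-library sketch of that pattern is shown here, assuming the data fits in a plain list. `ordered_split` and its parameters are hypothetical names chosen for illustration.

```python
import random

def ordered_split(items, test_fraction, seed=None):
    """Randomly assign items to (train, test) while keeping original order."""
    rng = random.Random(seed)
    n = len(items)
    n_test = round(n * test_fraction)
    # Pick n_test distinct positions at random; a set gives fast lookup
    test_positions = set(rng.sample(range(n), n_test))
    # Walk the data once per subset, in order, so relative order is kept
    train = [x for i, x in enumerate(items) if i not in test_positions]
    test = [x for i, x in enumerate(items) if i in test_positions]
    return train, test

series = [1, 20, 10, 3, 90, 5, 80, 50, 4, 1]
train, test = ordered_split(series, test_fraction=0.3, seed=42)
```

Selecting positions rather than values avoids the shuffle-then-sort round trip and also behaves correctly when the series contains duplicate values.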