Q: Generating a numpy ndarray from DataFrame data

Stack Overflow user

Asked 2020-08-09 19:39:12

Answers: 1, Views: 91, Followers: 0, Votes: 0

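This is a task I have been thinking about for a while. I have a DataFrame that contains motion features of users (keyed by user id), similar to the one below: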

>>> df
   id  speed1  speed2  acc1  acc2  label
0   1      19      12     5     2      0
1   1      10      11     9     3      0
2   1      12      10     4    -1      0
3   1      29      13     8     4      0
4   1      30      23     9    10      0
5   1      18      11     2    -1      0
6   1      10       6    -3    -2      0
7   2       5       1     0     0      1
8   2       7       2     1     3      1
9   2       6       2     1     0      1

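From this data, I want to generate a numpy ndarray (or should I say a list of arrays?) by splitting each user's records (i.e. each id) into fixed-length segments, so that each segment has shape (1, 5, 4) and can be fed to a neural network.

Here each segment (the 1) consists of five arrays (the 5) of the motion features speed1, speed2, acc1, acc2 from the dataframe above (the 4). Where the remaining rows cannot make up five arrays, the rest of the segment is filled with zeros (i.e. zero-padded).

The label column should then also become a separate array whose size matches the new array, by duplicating the label value at the positions of the zero-padded arrays of the padded segments.

For the example df given above, the expected output is: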

>>>input_array
[
   [
     [19 12 5 2]
     [10 11 9 3]
     [12 10 4 -1]
     [29 13 8 4]
     [30 23 9 10]
   ]
 
   [
     [18 11 2 -1]
     [10 6 -3 -2]
     [0  0  0  0]
     [0  0  0  0]
     [0  0  0  0]
   ]
 
   [
     [5  1  0 0]
     [7  2  1 3]
     [6  2  1 0]
     [0  0  0 0]
     [0  0  0 0]
   ]
]

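id=1 has 7 rows, so the last 3 rows of its second segment are zero-padded. Similarly, id=2 has 3 rows, so the last 2 rows of its segment are zero-padded.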
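The segment and padding counts quoted above follow from a ceiling division of each id's row count by the segment length. A minimal sketch of that arithmetic, where `segment_plan` is a hypothetical helper name that does not appear in the question:

```python
import math

def segment_plan(n_rows, chunk_size=5):
    """Return (number of segments, number of zero-padded rows) for one id."""
    n_segments = math.ceil(n_rows / chunk_size)
    n_padding = n_segments * chunk_size - n_rows
    return n_segments, n_padding

print(segment_plan(7))  # id=1 -> (2, 3)
print(segment_plan(3))  # id=2 -> (1, 2)
```

An id whose row count is an exact multiple of the segment length needs no padding at all, which is the case that matters for the edit further down.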

Edit

I noticed two bugs in the function given in the answer.

1. The function introduces an all-zero array in some cases.

For example, with this input:

df2 = {
    'id': [1,1,1,1,1,1,1,1,1,1,1,1],
    'speed1': [17.63,17.63,0.17,1.41,0.61,0.32,0.18,0.43,0.30,0.46,0.75,0.37],
    'speed2': [0.00,-0.09,1.24,-0.80,-0.29,-0.14,0.25,-0.13,0.16,0.29,-0.38,0.27],
    'acc1': [0.00,0.01,-2.04,0.51,0.15,0.39,-0.38,0.29,0.13,-0.67,0.65,0.52],
    'acc2': [29.03,56.12,18.49,11.85,36.75,27.52,81.08,51.06,19.85,10.76,14.51,24.27],
    'label': [3,3,3,3,3,3,3,3,3,3,3,3],
}

df2 = pd.DataFrame.from_dict(df2)

X , y = transform(df2[:10])
X
array([[[[ 1.763e+01,  0.000e+00,  0.000e+00,  2.903e+01],
         [ 1.763e+01, -9.000e-02,  1.000e-02,  5.612e+01],
         [ 1.700e-01,  1.240e+00, -2.040e+00,  1.849e+01],
         [ 1.410e+00, -8.000e-01,  5.100e-01,  1.185e+01],
         [ 6.100e-01, -2.900e-01,  1.500e-01,  3.675e+01]]],


       [[[ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00]]],


       [[[ 3.200e-01, -1.400e-01,  3.900e-01,  2.752e+01],
         [ 1.800e-01,  2.500e-01, -3.800e-01,  8.108e+01],
         [ 4.300e-01, -1.300e-01,  2.900e-01,  5.106e+01],
         [ 3.000e-01,  1.600e-01,  1.300e-01,  1.985e+01],
         [ 4.600e-01,  2.900e-01, -6.700e-01,  1.076e+01]]]])

Notice how the function introduced an all-zero array as the second element. Ideally, the output should contain only the first and the last array.
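The earlier version of the function is not reproduced on this page, so the exact cause cannot be confirmed here. One plausible mechanism, shown as a sketch with made-up split indices, is that `np.array_split` returns an empty sub-array when two split indices coincide, and padding that empty chunk to five rows produces a block of zeros:

```python
import numpy as np

rows = np.arange(40, dtype=float).reshape(10, 4)  # 10 rows, 4 features

# Hypothetical split indices with a repeated value: the middle chunk is empty.
chunks = np.array_split(rows, [5, 5])
print([len(c) for c in chunks])  # [5, 0, 5]

# Padding the empty chunk up to 5 rows yields a segment made only of zeros.
padded = [
    np.pad(c, [(0, 5 - len(c)), (0, 0)], mode='constant')
    for c in chunks
]
print(padded[1])

# Dropping empty chunks before padding avoids the spurious segment.
kept = [c for c in chunks if len(c) > 0]
print(len(kept))  # 2
```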

2. When a df with more than 10 rows is passed, the function fails with an `index can't contain negative values` error.

So if you pass the full df2, you get this:

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-71-743489875901> in <module>()
----> 1 X , y = transform(df2)
      2 X

2 frames

<ipython-input-55-f6e028a2e8b8> in transform(dataframe, chunk_size)
     24             inpt = np.pad(
     25                 inpt, [(0, chunk_size-len(inpt)),(0, 0)],
---> 26                 mode='constant')
     27             # add each inputs split to accumulators
     28             X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)

<__array_function__ internals> in pad(*args, **kwargs)

/usr/local/lib/python3.6/dist-packages/numpy/lib/arraypad.py in pad(array, pad_width, mode, **kwargs)
    746 
    747     # Broadcast to shape (array.ndim, 2)
--> 748     pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
    749 
    750     if callable(mode):

/usr/local/lib/python3.6/dist-packages/numpy/lib/arraypad.py in _as_pairs(x, ndim, as_index)
    517 
    518     if as_index and x.min() < 0:
--> 519         raise ValueError("index can't contain negative values")
    520 
    521     # Converting the array with `tolist` seems to improve performance

ValueError: index can't contain negative values
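The traceback points at `np.pad` receiving a negative pad width, which happens when a chunk holds more rows than `chunk_size`, so that `chunk_size - len(inpt)` drops below zero. A minimal reproduction, assuming a 7-row chunk purely for illustration:

```python
import numpy as np

chunk_size = 5
too_long = np.ones((7, 4))  # a chunk with more rows than chunk_size

pad_rows = chunk_size - len(too_long)  # negative: -2
raised = False
try:
    np.pad(too_long, [(0, pad_rows), (0, 0)], mode='constant')
except ValueError as err:
    raised = True
    print('np.pad rejected the negative width:', err)
```

The error therefore indicates that the splitting step left at least one chunk longer than five rows, rather than a problem in the padding call itself.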

Tags: numpy-ndarray, python, pandas, keras, deep-learning

1 Answer

Stack Overflow user

Accepted answer

Posted 2020-08-09 20:25:47

Fixed the bugs. The implementation below should now give the desired output:

import pandas as pd
import numpy as np

df = {
    'id': [1,1,1,1,1,1,1,1,1,1,1,1],
    'speed1': [17.63,17.63,0.17,1.41,0.61,0.32,0.18,0.43,0.30,0.46,0.75,0.37],
    'speed2': [0.00,-0.09,1.24,-0.80,-0.29,-0.14,0.25,-0.13,0.16,0.29,-0.38,0.27],
    'acc1': [0.00,0.01,-2.04,0.51,0.15,0.39,-0.38,0.29,0.13,-0.67,0.65,0.52],
    'acc2': [29.03,56.12,18.49,11.85,36.75,27.52,81.08,51.06,19.85,10.76,14.51,24.27],
    'label': [3,3,3,3,3,3,3,3,3,3,3,3],
}

df = pd.DataFrame.from_dict(df)

def transform(dataframe, chunk_size=5):
    
    grouped = dataframe.groupby('id')

    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,])

    # loop over each group (df[df.id==1] and df[df.id==2])
    for _, group in grouped:

        inputs = group.loc[:, 'speed1':'acc2'].values
        label = group.loc[:, 'label'].values[0]

        # calculate number of splits
        N = (len(inputs)-1) // chunk_size

        if N > 0:
            inputs = np.array_split(
                 inputs, [chunk_size + (chunk_size*i) for i in range(N)])
        else:
            inputs = [inputs]

        # loop over splits
        for inpt in inputs:
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)),(0, 0)], 
                mode='constant')
            # add each inputs split to accumulators
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0) 

    return X, y

X, y = transform(df)

print('X shape =', X.shape)
print('X =', X)
print('Y shape =', y.shape)
print('Y =', y)

# >> out:
# X shape = (3, 1, 5, 4)
# X = [[[[17.63  0.    0.   29.03]
#    [17.63 -0.09  0.01 56.12]
#    [ 0.17  1.24 -2.04 18.49]
#    [ 1.41 -0.8   0.51 11.85]
#    [ 0.61 -0.29  0.15 36.75]]]
#
#
#  [[[ 0.32 -0.14  0.39 27.52]
#    [ 0.18  0.25 -0.38 81.08]
#    [ 0.43 -0.13  0.29 51.06]
#    [ 0.3   0.16  0.13 19.85]
#    [ 0.46  0.29 -0.67 10.76]]]
#
#
#  [[[ 0.75 -0.38  0.65 14.51]
#    [ 0.37  0.27  0.52 24.27]
#    [ 0.    0.    0.    0.  ]
#    [ 0.    0.    0.    0.  ]
#    [ 0.    0.    0.    0.  ]]]]
# Y shape = (3,)
# Y = [3. 3. 3.]
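For comparison, here is an alternative sketch of the same idea that slices each group in steps of `chunk_size` and collects the segments in lists, instead of computing split indices and concatenating inside the loop. It is not the answerer's code: `transform_sketch` and its `feature_cols` parameter are names assumed for illustration, and the data is the 10-row example from the question.

```python
import numpy as np
import pandas as pd

def transform_sketch(dataframe, chunk_size=5,
                     feature_cols=('speed1', 'speed2', 'acc1', 'acc2')):
    """Split each id's rows into zero-padded segments of chunk_size rows."""
    n_features = len(feature_cols)
    segments, labels = [], []
    for _, group in dataframe.groupby('id', sort=False):
        values = group[list(feature_cols)].to_numpy(dtype=float)
        label = group['label'].iloc[0]  # assumes one label per id
        # Stepping by chunk_size never yields an empty or oversized chunk.
        for start in range(0, len(values), chunk_size):
            chunk = values[start:start + chunk_size]
            padded = np.zeros((chunk_size, n_features))
            padded[:len(chunk)] = chunk
            segments.append(padded[np.newaxis])  # (1, chunk_size, n_features)
            labels.append(label)
    if not segments:
        return np.zeros((0, 1, chunk_size, n_features)), np.asarray(labels)
    return np.stack(segments), np.asarray(labels)

example = pd.DataFrame({
    'id':     [1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    'speed1': [19, 10, 12, 29, 30, 18, 10, 5, 7, 6],
    'speed2': [12, 11, 10, 13, 23, 11, 6, 1, 2, 2],
    'acc1':   [5, 9, 4, 8, 9, 2, -3, 0, 1, 1],
    'acc2':   [2, 3, -1, 4, 10, -1, -2, 0, 3, 0],
    'label':  [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})
X, y = transform_sketch(example)
print(X.shape)  # (3, 1, 5, 4)
print(y)        # [0 0 1]
```

Building Python lists and stacking once at the end avoids reallocating the accumulator on every segment, which may matter for large DataFrames.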

Votes: 1

Page content originally provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/63330598
