
How to keep major-order when copying or groupby-ing a pandas DataFrame?

Stack Overflow user

Asked 2019-05-23 12:08:02

1 answer, 403 views, 0 followers, 1 vote

How can I use or manipulate (monkey-patch) pandas so that the resulting object always keeps the same major order on copy and on groupby aggregations?

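I use pandas.DataFrame as the data structure in a business application (a risk model) and need fast aggregation of multidimensional data. Aggregation with pandas depends crucially on the major-ordering scheme of the underlying numpy array.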

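Unfortunately, pandas (version 0.23.4) changes the major order of the underlying numpy array when I create a copy or when I perform an aggregation with groupby and sum.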

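The impact is:

Case 1: 17.2 seconds

Case 2: 5 minutes 46 seconds

on a DataFrame and its copy with 45023 rows and 100000 columns. The aggregation was performed on the index. The index is a pd.MultiIndex with 15 levels. The aggregation keeps three levels and leads to about 239 groups.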

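I typically work with DataFrames of 45000 rows and 100000 columns. On the rows I have a pandas.MultiIndex with about 15 levels. To compute statistics for various hierarchy nodes, I need to aggregate (sum) along the index dimension.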

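Aggregation is fast if the underlying numpy array is c_contiguous, i.e. held in row-major order (C order). It is very slow if the array is f_contiguous, i.e. held in column-major order (F order).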
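The difference between the two layouts can be made visible with a small array. The following is a minimal sketch using only numpy, with a tiny stand-in shape instead of the 45000 x 100000 matrix; it illustrates the layouts and does not reproduce the timings reported above. In C order each row is one contiguous run of memory, so summing whole rows into groups reads memory sequentially, which is a plausible reason for the speed gap.

```python
import numpy as np

# Small stand-in for the large matrix from the question.
rows, cols = 4, 6
c_arr = np.arange(rows * cols, dtype=np.float64).reshape(rows, cols)  # C order (row-major)
f_arr = np.asfortranarray(c_arr)                                      # F order (column-major)

# Same logical values, different physical layout.
same_values = bool(np.array_equal(c_arr, f_arr))

# Strides are the byte steps needed to move by one index along each axis.
# C order: a row is contiguous, so a step along axis 1 is one element (8 bytes).
# F order: a column is contiguous, so a step along axis 0 is one element (8 bytes).
c_strides = c_arr.strides
f_strides = f_arr.strides

print("C order:", c_arr.flags.c_contiguous, c_arr.flags.f_contiguous, c_strides)
print("F order:", f_arr.flags.c_contiguous, f_arr.flags.f_contiguous, f_strides)
```

Checking `array.flags` and `array.strides` like this is a quick way to see which layout an operation has produced.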

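Unfortunately, pandas changes the major order from C to F when

creating a copy of a DataFrame, or even when
performing an aggregation via groupby and taking the sum on the grouper. Hence the resulting DataFrame has a different major order (!)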

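Sure, I could switch to another "datamodel" by simply keeping the MultiIndex on the columns. Then the current pandas version would always work in my favour. But this is a no-go. I think it is reasonable to expect that the major order does not change for the two operations under consideration (groupby-sum and copy).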

import numpy as np
import pandas as pd

print("pandas version: ", pd.__version__)

array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
array.flags
print("Numpy array is C-contiguous: ", array.flags.c_contiguous)

dataframe = pd.DataFrame(array, index = pd.MultiIndex.from_tuples([('A', 'U'), ('A', 'V'), ('B', 'W')], names=['dim_one', 'dim_two']))
print("DataFrame is C-contiguous: ", dataframe.values.flags.c_contiguous)

dataframe_copy = dataframe.copy()
print("Copy of DataFrame is C-contiguous: ", dataframe_copy.values.flags.c_contiguous)

aggregated_dataframe = dataframe.groupby('dim_one').sum()
print("Aggregated DataFrame is C-contiguous: ", aggregated_dataframe.values.flags.c_contiguous)


## Output in Jupyter Notebook
# pandas version:  0.23.4
# Numpy array is C-contiguous:  True
# DataFrame is C-contiguous:  True
# Copy of DataFrame is C-contiguous:  False
# Aggregated DataFrame is C-contiguous:  False

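The major order of the data should be preserved. If pandas prefers to switch to an implicit layout, it should allow this to be overridden. Numpy allows the order to be passed when creating a copy.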
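For reference, this is what the numpy side looks like. The sketch below uses only `ndarray.copy` and `np.asfortranarray`; the variable names are illustrative.

```python
import numpy as np

c_arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # C-contiguous by default
f_arr = np.asfortranarray(c_arr)                       # F-contiguous, same values

# order='K' keeps the layout of the source as closely as possible.
keep_c = c_arr.copy(order='K')
keep_f = f_arr.copy(order='K')

# order='C' or order='F' forces a layout regardless of the source.
forced_c = f_arr.copy(order='C')
forced_f = c_arr.copy(order='F')

# Note: ndarray.copy() defaults to order='C', while np.copy() defaults to order='K'.
for name, arr in [("keep_c", keep_c), ("keep_f", keep_f),
                  ("forced_c", forced_c), ("forced_f", forced_f)]:
    print(name, arr.flags.c_contiguous, arr.flags.f_contiguous)
```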

A patched version of pandas should result in

## Output in Jupyter Notebook
# pandas version:  0.23.4
# Numpy array is C-contiguous:  True
# DataFrame is C-contiguous:  True
# Copy of DataFrame is C-contiguous:  True
# Aggregated DataFrame is C-contiguous:  True

for the example code snippet above.

python

pandas

performance

pandas-groupby

column-major-order

1 Answer

Stack Overflow user

Accepted answer

Answered 2020-04-29 09:20:21

Pandas monkey patch (0.23.4, maybe other versions too)

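I created a patch that I would like to share with you. It yields the performance improvement mentioned in the question above.

It works for pandas version 0.23.4. For other versions you need to check whether it still works.

The following two modules are required. You may need to adjust the imports depending on where you put them.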

memory_layout.py   
memory.py

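To patch your code, import the following at the very beginning of your program or notebook and set the memory layout parameter. It monkey-patches pandas and makes sure that copies of DataFrames have the requested layout.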

from memory_layout import memory_layout
# memory_layout.order = 'F'  # assert F-order on copy
# memory_layout.order = 'K'  # Keep given layout on copy 
memory_layout.order = 'C'  # assert C-order on copy

memory_layout.py

Create a file memory_layout.py with the following content.

import numpy as np
from pandas.core.internals import Block
from memory import memory_layout

# memory_layout.order = 'F'  # set memory layout order to 'F' for np.ndarrays in DataFrame copies (Fortran/column-major order)
# memory_layout.order = 'K'  # keep memory layout order for np.ndarrays in DataFrame copies (order out is order in)
memory_layout.order = 'C'  # set memory layout order to 'C' for np.ndarrays in DataFrame copies (C/row-major order)


def copy(self, deep=True, mgr=None):
    """
    Copy patch on Blocks to set or keep the memory layout
    on copies.

    :param self: `pandas.core.internals.Block`
    :param deep: `bool`
    :param mgr: `BlockManager`
    :return: copy of `pandas.core.internals.Block`
    """
    values = self.values
    if deep:
        if isinstance(values, np.ndarray):
            # Copy so that the transpose of the block values has the requested layout.
            values = memory_layout.copy_transposed(values)
        else:
            values = values.copy()
    return self.make_block_same_class(values)


Block.copy = copy  # Block for pandas 0.23.4: in pandas.core.internals.Block
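The patch calls copy_transposed rather than a plain copy. The reason given in memory.py is that the transpose of the copied array should carry the requested layout. A likely background, stated here as an assumption about pandas 0.23.4 internals rather than something the answer spells out, is that a Block holds its 2-D values transposed relative to what DataFrame.values returns. The numpy-only sketch below shows the underlying fact with stand-in arrays (`block_values`, `desired` and `frame_view` are hypothetical names): copying in the opposite order makes the transposed view come out in the desired order.

```python
import numpy as np

desired = 'C'  # layout wanted for the transposed view (hypothetical variable)

block_values = np.arange(6).reshape(2, 3)        # stand-in for the values held by a Block
opposite = {'C': 'F', 'F': 'C'}[desired]
block_copy = block_values.copy(order=opposite)   # F-contiguous copy of the block values

# Transposing only swaps shape and strides; no data is copied.
frame_view = block_copy.T

print(block_copy.flags.f_contiguous)   # layout of the copy
print(frame_view.flags.c_contiguous)   # layout seen through the transpose
```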

memory.py

Create a file memory.py with the following content.

"""
Implements MemoryLayout copy factory to change memory layout
of `numpy.ndarrays`.
Depending on the use case, operations on DataFrames can be much
faster if the appropriate memory layout is set and preserved.

The implementation allows for changing the desired layout. Changes apply when
copies or new objects are created, as for example, when slicing or aggregating
via groupby ...

This implementation tries to solve the issue raised on GitHub
https://github.com/pandas-dev/pandas/issues/26502

"""
import numpy as np

_DEFAULT_MEMORY_LAYOUT = 'K'


class MemoryLayout(object):
    """
    Memory layout management for numpy.ndarrays.

    Singleton implementation.

    Example:
    >>> from memory import memory_layout
    >>> memory_layout.order = 'K'  #
    >>> # K ... keep array layout from input
    >>> # C ... set to c-contiguous / row-major order
    >>> # F ... set to f-contiguous / column-major order
    >>> array = memory_layout.apply(array)
    >>> array = memory_layout.apply(array, 'C')
    >>> array = memory_layout.copy(array)
    >>> array = memory_layout.copy_transposed(array)

    """

    _order = _DEFAULT_MEMORY_LAYOUT
    _instance = None

    @property
    def order(self):
        """
        Return memory layout ordering.

        :return: `str`
        """
        if self.__class__._order is None:
            raise AssertionError("Array layout order not set.")
        return self.__class__._order

    @order.setter
    def order(self, order):
        """
        Set memory layout order.
        Allowed values are 'C', 'F', and 'K'. Raises AssertionError
        when trying to set other values.

        :param order: `str`
        :return: `None`
        """
        assert order in ['C', 'F', 'K'], "Only 'C', 'F' and 'K' supported."
        self.__class__._order = order

    def __new__(cls):
        """
        Create only one instance throughout the lifetime of this process.

        :return: `MemoryLayout` instance as singleton
        """
        if cls._instance is None:
            cls._instance = super(MemoryLayout, cls).__new__(MemoryLayout)
        return cls._instance

    @staticmethod
    def get_from(array):
        """
        Get memory layout from array

        Possible values:
           'C' ... only C-contiguous (row-major order)
           'F' ... only F-contiguous (column-major order)
           'O' ... other: either both C- and F-contiguous, or
           neither of the two (as on empty arrays).

        :param array: `numpy.ndarray`
        :return: `str`
        """
        if array.flags.c_contiguous == array.flags.f_contiguous:
            return 'O'
        return {True: 'C', False: 'F'}[array.flags.c_contiguous]

    def apply(self, array, order=None):
        """
        Apply the order set or the order given as input on the array
        given as input.

        Possible values:
           'C' ... apply C-contiguous layout (row-major order)
           'F' ... apply F-contiguous layout (column-major order)
           'K' ... keep the given layout

        :param array: `numpy.ndarray`
        :param order: `str`
        :return: `np.ndarray`
        """
        order = self.__class__._order if order is None else order

        if order == 'K':
            return array

        array_order = MemoryLayout.get_from(array)
        if array_order == order:
            return array

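        # Convert the layout without changing the logical values.
        # (Flattening in C order and reshaping with order='F' would reorder the elements.)
        return np.asarray(array, order=order)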

    def copy(self, array, order=None):
        """
        Return a copy of the input array with the memory layout set.
        Layout set:
           'C' ... return C-contiguous copy
           'F' ... return F-contiguous copy
           'K' ... return copy with same layout as
           given by the input array.

        :param array: `np.ndarray`
        :return: `np.ndarray`
        """
        order = order if order is not None else self.__class__._order
        # ndarray.copy accepts 'C', 'F' and 'K' directly; 'K' keeps the input layout.
        return array.copy(order=order)

    def copy_transposed(self, array):
        """
        Return a copy of the input array in order that its transpose
        has the memory layout set.

        Note: transposing in numpy only swaps the strides, so a row-major
        array becomes a column-major view (and vice versa) without
        reshuffling the data in memory.

        Layout set:
           'C' ... return F-contiguous copy
           'F' ... return C-contiguous copy
           'K' ... return copy with the same layout as the input array,
           so that its transpose keeps its layout as well.

        :param array: `np.ndarray`
        :return: `np.ndarray`
        """
        if self.__class__._order == 'K':
            return array.copy(
                order={'C': 'C', 'F': 'F', 'O': None}[self.get_from(array)])
        else:
            return array.copy(
                order={'C': 'F', 'F': 'C'}[self.__class__._order])

    def __str__(self):
        return str(self.__class__._order)


memory_layout = MemoryLayout()  # Singleton
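When an existing array is converted to another layout, the logical values must stay exactly the same and only the arrangement in memory may change. The sketch below, using only numpy, shows one way to do this with `np.asarray(..., order=...)`. It also shows a pitfall worth checking for in any such helper: flattening in C order and refilling with order='F' moves elements to different positions.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])   # C-contiguous input

# Same values in the requested layout (copies only if the layout has to change).
as_f = np.asarray(a, order='F')

# Pitfall: flatten in C order, then refill the shape in F order.
# The result is F-contiguous, but the elements have moved.
reshuffled = np.reshape(np.ravel(a), a.shape, order='F')

print(as_f.tolist(), as_f.flags.f_contiguous)
print(reshuffled.tolist())
```

Comparing the converted array to the input with `np.array_equal` is a cheap safeguard when changing layouts.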

0 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/56274882
