pandas merge fails inside a parallel loop: "ValueError: buffer source array is read-only"

Stack Overflow user
Asked 2019-05-08 16:27:35
1 answer, 780 views, 0 followers, 3 votes

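I am writing a bootstrap algorithm that uses a parallel loop and pandas. The problem is that the merge call inside the parallel loop raises "ValueError: buffer source array is read-only", but only when I merge against the full dataset (120k rows). Any subset of fewer than 12k rows works fine, so I conclude this is not a syntax problem. What can I do?

The current pandas version is 0.24.2 and the Cython version is 0.29.7.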

_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 418, in _process_worker
    r = call_item()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 272, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 567, in __call__
    return self.func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/joblib/parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/joblib/parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "<ipython-input-72-cdb83eaf594c>", line 12, in bootstrap
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 6868, in merge
    copy=copy, indicator=indicator, validate=validate)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 48, in merge
    return op.get_result()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 546, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 756, in _get_join_info
    right_indexer) = self._get_join_indexers()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 735, in _get_join_indexers
    how=self.how)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1130, in _get_join_indexers
    llab, rlab, shape = map(list, zip(* map(fkeys, left_keys, right_keys)))
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1662, in _factorize_keys
    rlab = rizer.factorize(rk)
  File "pandas/_libs/hashtable.pyx", line 111, in pandas._libs.hashtable.Int64Factorizer.factorize
  File "stringsource", line 653, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 348, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-73-652c1db5701b> in <module>()
      1 num_cores = multiprocessing.cpu_count()
----> 2 results = Parallel(n_jobs=num_cores, prefer='processes', verbose = 5)(delayed(bootstrap)() for i in range(n_trials))
      3 #pd.DataFrame(results[0])

~/.local/lib/python3.6/site-packages/joblib/parallel.py in __call__(self, iterable)
    932 
    933             with self._backend.retrieval_context():
--> 934                 self.retrieve()
    935             # Make sure that we get a last message telling us we are done
    936             elapsed_time = time.time() - self._start_time

~/.local/lib/python3.6/site-packages/joblib/parallel.py in retrieve(self)
    831             try:
    832                 if getattr(self._backend, 'supports_timeout', False):
--> 833                     self._output.extend(job.get(timeout=self.timeout))
    834                 else:
    835                     self._output.extend(job.get())

~/.local/lib/python3.6/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    519         AsyncResults.get from multiprocessing."""
    520         try:
--> 521             return future.result(timeout=timeout)
    522         except LokyTimeoutError:
    523             raise TimeoutError()

/usr/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

/usr/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

ValueError: buffer source array is read-only

The code is:

def bootstrap():

    df_resample_ids = skl.utils.resample(ob_ids)
    df_resample_ids = pd.DataFrame(df_resample_ids).sort_values(by=0).reset_index(drop=True)
    df_resample_ids.columns = [ob_id_field]

    df_resample = pd.DataFrame(df_resample_ids.merge(df, on = ob_id_field))

    return df_resample

num_cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=num_cores, prefer='processes', verbose = 5)(delayed(bootstrap)() for i in range(n_trials))

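The algorithm draws a resample of the IDs, with replacement, from the ID variable, then uses merge to build a new dataset from the resampled IDs and the original data stored in df. If I cut any subset of fewer than 12k rows out of the original dataset, the parallel loop raises no error and runs as expected.

As requested, here is a new snippet that recreates the data structure and reflects the main approach I am currently using: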
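
The resample-then-merge step described above can be sketched sequentially on a toy frame. This is only an illustration under assumptions: the frame, the helper name `bootstrap_once`, and the use of NumPy's `Generator.choice` in place of `skl.utils.resample` are not from the question.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the original dataset: one row per observation ID.
df = pd.DataFrame({"ID": [0, 1, 2, 3], "value": [10.0, 11.0, 12.0, 13.0]})
ob_ids = df["ID"].tolist()


def bootstrap_once(data, ids, rng):
    """Hypothetical helper: draw IDs with replacement and merge them back."""
    # Draw len(ids) IDs with replacement, then sort them.
    sampled = np.sort(rng.choice(ids, size=len(ids), replace=True))
    sampled_ids = pd.DataFrame({"ID": sampled})
    # Merging on ID repeats each original row once per time its ID was drawn.
    return sampled_ids.merge(data, on="ID")


rng = np.random.default_rng(0)
resampled = bootstrap_once(df, ob_ids, rng)
# Same number of rows as df, because every ID appears exactly once in df.
print(len(resampled))
```

Because the merge is keyed on the resampled IDs, an ID drawn twice contributes its row twice, which is what makes the result a bootstrap sample of the original rows.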

import multiprocessing

import numpy as np
import pandas as pd
import sklearn as skl
from joblib import Parallel, delayed

# Synthetic stand-in for the real data: 200,000 rows and 24 float columns.
df = pd.DataFrame(np.random.randn(200000, 24), columns=list('ABCDEFGHIJKLMNOPQRSTUVWX'))
df["ID"] = df.index.drop_duplicates().tolist()
ob_ids = df.index.drop_duplicates().tolist()

n_trials = 10  # assumed value for illustration; the question does not state n_trials

def bootstrap2():
    # Resample the IDs with replacement and sort them.
    df_resample_ids = skl.utils.resample(ob_ids)
    df_resample_ids = pd.DataFrame(df_resample_ids).sort_values(by=0).reset_index(drop=True)
    df_resample_ids.columns = ['ID']
    # Merge the resampled IDs against the full dataset.
    df_resample = pd.DataFrame(df.merge(df_resample_ids, on='ID'))

    result = df_resample

    return result

num_cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=num_cores, prefer='processes', verbose=5)(delayed(bootstrap2)() for i in range(n_trials))

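However, I noticed that the loop does not raise the error when the data consist entirely of np.random numbers. The dtypes of the original DataFrame are: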

start_rtg                        int64
end_rtg                        float64
days_diff                      float64
ultimate_customer_system_id      int64
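
A listing like the one above is presumably the output of `df.dtypes`. As a small sketch, the int64 columns can be picked out with `select_dtypes`; the two-row frame below is hypothetical and only mimics the column names and dtypes shown.

```python
import pandas as pd

# Hypothetical two-row frame that only mimics the column names and dtypes listed above.
df = pd.DataFrame({
    "start_rtg": pd.Series([1, 2], dtype="int64"),
    "end_rtg": pd.Series([1.5, 2.5], dtype="float64"),
    "days_diff": pd.Series([30.0, 45.0], dtype="float64"),
    "ultimate_customer_system_id": pd.Series([101, 102], dtype="int64"),
})

# One dtype per column, in column order.
print(df.dtypes)

# Names of the columns stored as int64.
int64_cols = df.select_dtypes(include=["int64"]).columns.tolist()
print(int64_cols)
```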

How can I avoid the read-only error?

1 Answer

Stack Overflow user

Answered 2019-05-09 17:18:48

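I am posting an answer to my own question because I found that one of the variables had the int64 dtype. Once I converted all variables to float64, the error disappeared. So the problem is limited to certain data types...

Cheers, Stephen

Votes: 1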
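
The conversion described in this answer could look roughly like the sketch below. The helper name `int64_to_float64` and the sample frame are assumptions for illustration, not the poster's actual code.

```python
import pandas as pd


def int64_to_float64(frame):
    """Hypothetical helper: return a copy with every int64 column cast to float64."""
    int_cols = frame.select_dtypes(include=["int64"]).columns
    # float64 represents integers exactly only up to 2**53, so very large IDs
    # could lose precision in this cast.
    return frame.astype({col: "float64" for col in int_cols})


# Illustrative frame with an int64 key column.
df = pd.DataFrame({
    "ID": pd.Series([1, 2, 3], dtype="int64"),
    "value": [0.1, 0.2, 0.3],
})
df_float = int64_to_float64(df)
print(df_float.dtypes)
```

The cast would be applied to both frames before the parallel loop so that the merge keys share a dtype. One possible explanation, which this thread does not confirm: joblib memory-maps large NumPy arrays in read-only mode (`mmap_mode='r'`) once they exceed `max_nbytes` (1 MB by default) when sending them to worker processes, which would match the size dependence reported in the question.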
Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/56036527
