I have two CSV files and load them into pandas DataFrames. One file is large, about 10M rows and 20 columns (all string type), roughly 1 GB in size; the other is small, about 5k rows and 5 columns, roughly 1 MB. I want to do an inner join on a column the two DataFrames have in common.
This is how I do the join:

mergedDataSet = pd.merge(smallDataFrame, largeDataFrame, on='uid', how='inner')

I found that if I sample 1% of the large dataset, the program runs smoothly without any problem and finishes within 5 seconds, so I verified that the merge code itself should be correct.
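The sampling experiment described above can be sketched as follows. This is a minimal, self-contained reproduction: the two toy DataFrames here are made-up stand-ins for the real CSV files, and `frac=0.01` mirrors the 1% sample mentioned in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real CSV-backed frames
largeDataFrame = pd.DataFrame({
    'uid': np.arange(1000).astype(str),  # string-typed key, as in the question
    'val': 'x',
})
smallDataFrame = pd.DataFrame({'uid': ['1', '2', '3'], 'meta': 'y'})

# Sample 1% of the large frame, then run the same inner join
sampled = largeDataFrame.sample(frac=0.01, random_state=0)
mergedDataSet = pd.merge(smallDataFrame, sampled, on='uid', how='inner')
print(len(sampled), list(mergedDataSet.columns))
```

If the join completes quickly on the sample but is killed on the full data, the bottleneck is memory in the full-size merge, not a logic error in this code.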
But if I join the real large dataset, the program is terminated after 20 to 30 seconds with the error message Process finished with exit code 137 (interrupted by signal 9: SIGKILL). I am using Python 2.7 on a Mac, running in PyCharm. My machine has 16 GB of memory, far more than the 1 GB file size.
I would like to know if there are any ideas for tuning DataFrame performance in pandas, or any other fast solution for an inner join.
My other confusion is: why was the program KILLed? By whom, and for what reason?
Edit 1: errors captured in /var/log/system.log while the inner join was running:
Aug 27 11:00:18 foo-laptop com.apple.CDScheduler[702]: Thermal pressure state: 1 Memory pressure state: 0
Aug 27 11:00:18 foo-laptop com.apple.CDScheduler[47]: Thermal pressure state: 1 Memory pressure state: 0
Aug 27 11:00:33 foo-laptop iTerm2[43018]: Time to encode state for window <PseudoTerminal: 0x7fb3659d3960 tabs=1 window=<PTYWindow: 0x7fb3637c0c80 frame=NSRect: {{0, 0}, {1280, 800}} title=5. tail alpha=1.000000 isMain=1 isKey=1 isVisible=1 delegate=0x7fb3659d3960>>: 0.02136099338531494
Aug 27 11:00:41 foo-laptop iTerm2[43018]: Time to encode state for window <PseudoTerminal: 0x7fb3659d3960 tabs=1 window=<PTYWindow: 0x7fb3637c0c80 frame=NSRect: {{0, 0}, {1280, 800}} title=5. tail alpha=1.000000 isMain=0 isKey=0 isVisible=1 delegate=0x7fb3659d3960>>: 0.01138699054718018
Aug 27 11:00:46 foo-laptop kernel[0]: low swap: killing pid 92118 (python2.7)
Aug 27 11:00:46 foo-laptop kernel[0]: memorystatus_thread: idle exiting pid 789 [CallHistoryPlugi]
Aug 27 11:00:56 foo-laptop iTerm2[43018]: Time to encode state for window <PseudoTerminal: 0x7fb3659d3960 tabs=1 window=<PTYWindow: 0x7fb3637c0c80 frame=NSRect: {{0, 0}, {1280, 800}} title=5. tail alpha=1.000000 isMain=0 isKey=0 isVisible=1 delegate=0x7fb3659d3960>>: 0.01823097467422485
Aug 27 11:00:58 foo-laptop kernel[0]: process WeChat[85077] caught causing excessive wakeups. Observed wakeups rate (per sec): 184; Maximum permitted wakeups rate (per sec): 150; Observation period: 300 seconds; Task lifetime number of wakeups: 2193951
Aug 27 11:00:58 foo-laptop com.apple.xpc.launchd[1] (com.apple.ReportCrash[92123]): Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.ReportCrash
Aug 27 11:00:58 foo-laptop ReportCrash[92123]: Invoking spindump for pid=85077 wakeups_rate=184 duration=245 because of excessive wakeups
Aug 27 11:01:03 foo-laptop com.apple.CDScheduler[702]: Thermal pressure state: 0 Memory pressure state: 0
Aug 27 11:01:03 foo-laptop com.apple.CDScheduler[47]: Thermal pressure state: 0 Memory pressure state: 0

Hello, Lin,
Posted on 2016-08-28 15:25:35
Check the cardinality of the 'uid' column on both sides. Most likely, your join is multiplying the data. For example, if one uid value appears in 100 records of dataframe1 and in 10 records of dataframe2, the join will produce 1000 records for that value.
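The multiplication effect described above can be demonstrated with a small sketch (the frame and column names here are toy examples, not the asker's real data):

```python
import pandas as pd

# One uid value 'a' appears 100 times on one side and 10 times on the other
df1 = pd.DataFrame({'uid': ['a'] * 100, 'x': range(100)})
df2 = pd.DataFrame({'uid': ['a'] * 10, 'y': range(10)})

# An inner join pairs every matching row on the left with every matching
# row on the right, so the result has 100 * 10 = 1000 rows
merged = pd.merge(df1, df2, on='uid', how='inner')
print(len(merged))  # 1000
```

Scaled up to a 10M-row frame with duplicated keys, this kind of blow-up can easily exhaust memory even though each input file is small.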
To check the cardinality, I would do the following:
df1[df1.uid.isin(df2.uid.unique())]['uid'].value_counts()
df2[df2.uid.isin(df1.uid.unique())]['uid'].value_counts()

This code checks which 'uid' values also exist in the other frame's uid column and how many duplicates they have.
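One possible follow-up, sketched below with toy data: after running the cardinality check, deduplicate the keys before merging so the join cannot multiply rows. The `validate='one_to_one'` argument (available in pandas 0.21+) makes pandas raise an error if the key is still duplicated; the frames and values here are assumptions for illustration.

```python
import pandas as pd

# Toy frames: 'a' is duplicated in df1, which would multiply rows on merge
df1 = pd.DataFrame({'uid': ['a', 'a', 'b'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'uid': ['a', 'b'], 'y': [10, 20]})

# The cardinality check from the answer: counts > 1 mean duplicated keys
counts1 = df1[df1.uid.isin(df2.uid.unique())]['uid'].value_counts()
counts2 = df2[df2.uid.isin(df1.uid.unique())]['uid'].value_counts()
print(counts1.max(), counts2.max())  # 2 1

# Drop duplicate keys first, then merge; validate= guards against blow-up
merged = pd.merge(df1.drop_duplicates('uid'), df2.drop_duplicates('uid'),
                  on='uid', how='inner', validate='one_to_one')
print(len(merged))  # 2
```

Whether dropping duplicates is acceptable depends on the data, of course; if the duplication is legitimate, the merge result genuinely is that large and needs to be computed differently (e.g. in chunks).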
https://stackoverflow.com/questions/39178533