I have two CSV files and load them into pandas DataFrames. One file is large, about 10M rows and 20 columns (all string type), roughly 1 GB in size; the other is small, about 5k rows and 5 columns, roughly 1 MB. I want to do an inner join on a column the two DataFrames have in common.
This is how I do the join:

mergedDataSet = pd.merge(smallDataFrame, largeDataFrame, on='uid', how='inner')

I found that if I sample 1% of the large dataset, the program runs smoothly without any problem and finishes within 5 seconds, so I verified that the merge code itself should be correct.
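The sampling experiment described above can be sketched as follows. This is a minimal, self-contained reproduction: the two toy DataFrames here are made-up stand-ins for the real CSV files, and `frac=0.01` mirrors the 1% sample mentioned in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real CSV-backed frames
largeDataFrame = pd.DataFrame({
    'uid': np.arange(1000).astype(str),  # string-typed key, as in the question
    'val': 'x',
})
smallDataFrame = pd.DataFrame({'uid': ['1', '2', '3'], 'meta': 'y'})

# Sample 1% of the large frame, then run the same inner join
sampled = largeDataFrame.sample(frac=0.01, random_state=0)
mergedDataSet = pd.merge(smallDataFrame, sampled, on='uid', how='inner')
print(len(sampled), list(mergedDataSet.columns))
```

If the join completes quickly on the sample but is killed on the full data, the bottleneck is memory in the full-size merge, not a logic error in this code.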
But if I join the real large dataset, the program is terminated after 20 to 30 seconds with the error message Process finished with exit code 137 (interrupted by signal 9: SIGKILL). I am using Python 2.7 on a Mac, running in PyCharm. My machine has 16 GB of memory, far more than the 1 GB file size.
I would like to know if there are any ideas for tuning DataFrame performance in pandas, or any other fast solution for an inner join.
My other confusion is: why was the program KILLed? By whom, and for what reason?
Edit 1: errors captured in /var/log/system.log while the inner join was running:
Aug 27 11:00:18 foo-laptop com.apple.CDScheduler[702]: Thermal pressure state: 1 Memory pressure state: 0
Aug 27 11:00:18 foo-laptop com.apple.CDScheduler[47]: Thermal pressure state: 1 Memory pressure state: 0
Aug 27 11:00:33 foo-laptop iTerm2[43018]: Time to encode state for window <PseudoTerminal: 0x7fb3659d3960 tabs=1 window=<PTYWindow: 0x7fb3637c0c80 frame=NSRect: {{0, 0}, {1280, 800}} title=5. tail alpha=1.000000 isMain=1 isKey=1 isVisible=1 delegate=0x7fb3659d3960>>: 0.02136099338531494
Aug 27 11:00:41 foo-laptop iTerm2[43018]: Time to encode state for window <PseudoTerminal: 0x7fb3659d3960 tabs=1 window=<PTYWindow: 0x7fb3637c0c80 frame=NSRect: {{0, 0}, {1280, 800}} title=5. tail alpha=1.000000 isMain=0 isKey=0 isVisible=1 delegate=0x7fb3659d3960>>: 0.01138699054718018
Aug 27 11:00:46 foo-laptop kernel[0]: low swap: killing pid 92118 (python2.7)
Aug 27 11:00:46 foo-laptop kernel[0]: memorystatus_thread: idle exiting pid 789 [CallHistoryPlugi]
Aug 27 11:00:56 foo-laptop iTerm2[43018]: Time to encode state for window <PseudoTerminal: 0x7fb3659d3960 tabs=1 window=<PTYWindow: 0x7fb3637c0c80 frame=NSRect: {{0, 0}, {1280, 800}} title=5. tail alpha=1.000000 isMain=0 isKey=0 isVisible=1 delegate=0x7fb3659d3960>>: 0.01823097467422485
Aug 27 11:00:58 foo-laptop kernel[0]: process WeChat[85077] caught causing excessive wakeups. Observed wakeups rate (per sec): 184; Maximum permitted wakeups rate (per sec): 150; Observation period: 300 seconds; Task lifetime number of wakeups: 2193951
Aug 27 11:00:58 foo-laptop com.apple.xpc.launchd[1] (com.apple.ReportCrash[92123]): Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.ReportCrash
Aug 27 11:00:58 foo-laptop ReportCrash[92123]: Invoking spindump for pid=85077 wakeups_rate=184 duration=245 because of excessive wakeups
Aug 27 11:01:03 foo-laptop com.apple.CDScheduler[702]: Thermal pressure state: 0 Memory pressure state: 0
Aug 27 11:01:03 foo-laptop com.apple.CDScheduler[47]: Thermal pressure state: 0 Memory pressure state: 0

Hello, Lin,
Posted on 2016-08-28 15:25:35
Check the cardinality of the 'uid' column on both sides. Most likely, your join is multiplying the data. For example, if one uid value appears in 100 records of dataframe1 and in 10 records of dataframe2, the join will produce 1000 records for that value.
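The multiplication effect described above can be demonstrated with a small sketch (the frame and column names here are toy examples, not the asker's real data):

```python
import pandas as pd

# One uid value 'a' appears 100 times on one side and 10 times on the other
df1 = pd.DataFrame({'uid': ['a'] * 100, 'x': range(100)})
df2 = pd.DataFrame({'uid': ['a'] * 10, 'y': range(10)})

# An inner join pairs every matching row on the left with every matching
# row on the right, so the result has 100 * 10 = 1000 rows
merged = pd.merge(df1, df2, on='uid', how='inner')
print(len(merged))  # 1000
```

Scaled up to a 10M-row frame with duplicated keys, this kind of blow-up can easily exhaust memory even though each input file is small.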
To check the cardinality, I would do the following:
df1[df1.uid.isin(df2.uid.unique())]['uid'].value_counts()
df2[df2.uid.isin(df1.uid.unique())]['uid'].value_counts()

This code checks which 'uid' values also exist in the other frame's uid column and how many duplicates they have.
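One possible follow-up, sketched below with toy data: after running the cardinality check, deduplicate the keys before merging so the join cannot multiply rows. The `validate='one_to_one'` argument (available in pandas 0.21+) makes pandas raise an error if the key is still duplicated; the frames and values here are assumptions for illustration.

```python
import pandas as pd

# Toy frames: 'a' is duplicated in df1, which would multiply rows on merge
df1 = pd.DataFrame({'uid': ['a', 'a', 'b'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'uid': ['a', 'b'], 'y': [10, 20]})

# The cardinality check from the answer: counts > 1 mean duplicated keys
counts1 = df1[df1.uid.isin(df2.uid.unique())]['uid'].value_counts()
counts2 = df2[df2.uid.isin(df1.uid.unique())]['uid'].value_counts()
print(counts1.max(), counts2.max())  # 2 1

# Drop duplicate keys first, then merge; validate= guards against blow-up
merged = pd.merge(df1.drop_duplicates('uid'), df2.drop_duplicates('uid'),
                  on='uid', how='inner', validate='one_to_one')
print(len(merged))  # 2
```

Whether dropping duplicates is acceptable depends on the data, of course; if the duplication is legitimate, the merge result genuinely is that large and needs to be computed differently (e.g. in chunks).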
https://stackoverflow.com/questions/39178533