Q: Speeding up DataFrame.loc()

Stack Overflow user

Asked 2017-02-16 20:39:48

2 answers · 3.7K views · 0 followers · 2 votes

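I have a list of about 400k IPs (stored in a pandas DataFrame df_IP) to geolocate using the MaxMind GeoIP database. I use the City version and retrieve the city, latitude, longitude and county code (the département in France), because some cities share the same name but are in very different places.

Here is my working code: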

import geoip2.database
import pandas as pd

reader = geoip2.database.Reader('path/to/GeoLite2-City.mmdb')
results = pd.DataFrame(columns=('IP',
                                'city',
                                'latitude',
                                'longitude',
                                'dept_code'))

for i, IP in enumerate(df_IP["IP"]):
    try:
        response = reader.city(IP)
        results.loc[i] = [IP,
                          response.city.name,
                          response.location.latitude,
                          response.location.longitude,
                          response.subdivisions.most_specific.iso_code]
    except Exception as e:
        # Call format() on the string itself so this runs on Python 2 and 3
        print("error with line {}, IP {}: {}".format(i, IP, e))

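It works fine, but it gets slower and slower with every loop iteration. If I time it on the first 1,000 IPs it takes 4.7 s, so the whole 400k should take about 30 minutes, yet it ran for nearly 4 hours.

The only thing that, IMO, can slow down over time is the filling of the DataFrame results: what alternatives do I have that don't use .loc and could be faster? In the end I still need the same DataFrame.

I would also be interested in an explanation of why loc is so slow on large DataFrames.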
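One likely explanation, offered as an assumption rather than a confirmed diagnosis of this exact run: assigning to a row label that does not exist yet makes `.loc` enlarge the frame, and each enlargement copies the rows already stored, so the total cost grows roughly quadratically with the number of rows. The sketch below uses synthetic rows (the `make_row` helper and its values are hypothetical stand-ins for the GeoIP lookup) to compare that pattern with collecting rows in a list and building the frame once. Timings are printed rather than stated, since they depend on the machine and pandas version.

```python
import time

import pandas as pd

COLUMNS = ["IP", "city", "latitude", "longitude", "dept_code"]
N = 500  # kept small so the sketch runs quickly; raise it to see the gap widen


def make_row(i):
    # Hypothetical stand-in for the values returned by a GeoIP lookup.
    return ["10.0.{}.{}".format(i // 256, i % 256),
            "city{}".format(i),
            48.0 + i * 1e-4,
            2.0 + i * 1e-4,
            "75"]


def fill_with_loc(n):
    # Pattern from the question: enlarge the frame one row at a time.
    results = pd.DataFrame(columns=COLUMNS)
    for i in range(n):
        results.loc[i] = make_row(i)
    return results


def fill_with_list(n):
    # Alternative: accumulate plain Python lists, build the frame once.
    rows = [make_row(i) for i in range(n)]
    return pd.DataFrame(rows, columns=COLUMNS)


start = time.perf_counter()
df_loc = fill_with_loc(N)
loc_seconds = time.perf_counter() - start

start = time.perf_counter()
df_list = fill_with_list(N)
list_seconds = time.perf_counter() - start

print("row-by-row .loc: {:.4f} s".format(loc_seconds))
print("list then DataFrame: {:.4f} s".format(list_seconds))
```

Both functions produce the same rows; only the way the frame is filled differs.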

python

pandas

geoip

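2 Answers

Stack Overflow user

Answered 2018-02-23 10:26:11

I faced a similar situation, where loc was making the runtime blow up for me. After a lot of fiddling, I found a simple solution that is also very fast: use set_value instead of loc.

Here is what the sample code looks like; you can adapt it to your own use case. Suppose your DataFrame looks like this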

Index  'A'  'B' 'Label'
23      0    1    Y
45      3    2    N

self.data.set_value(45, 'Label', 'NA')

This sets the value of the 'Label' column in the second row (index 45) to NA.

For more information on set_value, see the following link:

http://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.set_value.html
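Note that `set_value` was deprecated in pandas 0.21 and removed in pandas 1.0; the scalar accessor `.at` is the replacement. A minimal sketch, assuming a recent pandas and reusing the small frame shown above (the variable name `data` is illustrative, not from the answer):

```python
import pandas as pd

# Same toy frame as in the answer: index labels 23 and 45.
data = pd.DataFrame(
    {"A": [0, 3], "B": [1, 2], "Label": ["Y", "N"]},
    index=[23, 45],
)

# .at addresses exactly one cell by row label and column label.
data.at[45, "Label"] = "NA"

print(data)
```

`.at` is fast for overwriting cells that already exist; it does not by itself avoid the cost of growing a frame one row at a time.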

1 vote

Stack Overflow user

Answered 2017-07-25 00:39:46

I ran into the same problem and, as @oliversm suggested, I built a list and then added it to the original dataset. Here is what the code looks like:

……

results_list=[]

for i, IP in enumerate(df_IP["IP"]):
    try:
        response = reader.city(IP)
        # append() takes a single argument, so pass one tuple per IP
        results_list.append((response.city.name,
                             response.location.latitude,
                             response.location.longitude,
                             response.subdivisions.most_specific.iso_code))
    except Exception as e:
        print("error with line {}, IP {}: {}".format(i, IP, e))

results_array=np.asarray(results_list) #list to array to add to the dataframe as a new column

results['results_column']=pd.Series(results_array,index=results.index)
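If the goal is the same five-column frame as in the question, one option is to collect one record per IP and call the DataFrame constructor a single time at the end. The sketch below is an illustration under stated assumptions, not the answerer's code: `lookup` is a hypothetical stand-in for `reader.city(IP)` so the example runs without a GeoIP database, and the sample IPs are made up.

```python
import pandas as pd


def lookup(ip):
    # Hypothetical stand-in for reader.city(ip); raises on malformed input.
    parts = ip.split(".")
    if len(parts) != 4:
        raise ValueError("not an IPv4 address: {}".format(ip))
    return {"city": "city" + parts[3],
            "latitude": 48.0,
            "longitude": 2.0,
            "dept_code": "75"}


df_IP = pd.DataFrame({"IP": ["10.0.0.1", "bad-ip", "10.0.0.3"]})

rows = []
for i, ip in enumerate(df_IP["IP"]):
    try:
        record = lookup(ip)
    except Exception as e:
        print("error with line {}, IP {}: {}".format(i, ip, e))
        continue  # skip IPs that could not be resolved
    record["IP"] = ip  # keep the IP so each row is self-describing
    rows.append(record)

# One constructor call instead of one frame enlargement per IP.
results = pd.DataFrame(
    rows, columns=["IP", "city", "latitude", "longitude", "dept_code"]
)
print(results)
```

Because failed lookups are skipped, the resulting index is a fresh 0..n-1 range rather than the original loop counter; storing `i` in each record would preserve it if needed.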

0 votes

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/42274253
