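I have a list of IP addresses in a df. Each IP address is sent to the ARIN database in a GET request using requests, and I am interested in getting the organization or customer for that IP address. I am using a requests Session() inside a requests-futures FuturesSession() in the hope of speeding up the API calls. Here is the code: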

import requests
from requests_futures.sessions import FuturesSession

s = requests.Session()
session = FuturesSession(session=s, max_workers=10)

def getIPAddressOrganization(IP_Address):
    url = 'https://whois.arin.net/rest/ip/' + IP_Address + '.json'
    request = session.get(url)
    response = request.result().json()
    try:
        organization = response['net']['orgRef']['@name']
    except KeyError:
        organization = response['net']['customerRef']['@name']
    return organization

df['organization'] = df['IP'].apply(getIPAddressOrganization)

Adding the regular requests Session() improved performance considerably, but the requests-futures FuturesSession() did not help (probably due to my lack of knowledge).
How can I combine pandas apply() with requests-futures, and/or is there another, more effective way to speed up the API calls?
Posted on 2021-12-15 20:13:27
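This does not answer the question directly, but it shows that the pandas apply() function does wait for the result of each API call and does not parallelize or otherwise optimize the IO time: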

import time
import pandas as pd

df = pd.DataFrame(data=range(10))
start = time.perf_counter()
# Each row sleeps for 5 seconds; apply() runs the rows one after another
df.apply(lambda r: time.sleep(5), axis=1)
end = time.perf_counter() - start
print(f'total time: {end}')

Output:

total time: 50.05315346799034
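
For contrast, here is a minimal sketch of a similar sleep-based workload dispatched through a thread pool from the standard library's concurrent.futures. The `slow_call` and `run_parallel` helpers and the 0.2-second delay are stand-ins invented for illustration, not part of the original code. Because every call is scheduled before any result is read, the waits overlap instead of adding up. The same reasoning applies to requests-futures: calling .result() right after session.get(), as the question's function does, blocks on each request in turn.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_call(value, delay=0.2):
    """Stand-in for a blocking API call (hypothetical helper)."""
    time.sleep(delay)
    return value * 2


def run_parallel(values, max_workers=10):
    """Schedule every call first, then collect results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map submits all calls up front and yields results in order
        return list(executor.map(slow_call, values))


if __name__ == "__main__":
    start = time.perf_counter()
    results = run_parallel(range(10))
    elapsed = time.perf_counter() - start
    print(f"results: {results}")
    # With 10 workers this should take roughly one delay, not ten
    print(f"total time: {elapsed:.2f}")
```

Since the returned list keeps the input order, it could be assigned directly to a new DataFrame column.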

Conclusion: it may be best to consider an asynchronous IO approach.
Tentative direction:

import asyncio
from typing import List

import aiohttp

async def parallel_rest_calls(data: List[str]):
    # One shared ClientSession reuses connections across all requests
    async with aiohttp.ClientSession() as session:
        tasks = []
        for ip in data:
            tasks.append(getIPAddressOrganization(session, ip))
        # return_exceptions=True returns a failed lookup's exception in its slot
        enriched_data_col = await asyncio.gather(*tasks, return_exceptions=True)
    return enriched_data_col

async def getIPAddressOrganization(session: aiohttp.ClientSession, IP_Address: str):
    url = 'https://whois.arin.net/rest/ip/' + IP_Address + '.json'
    async with session.get(url) as response:
        payload = await response.json()
    try:
        organization = payload['net']['orgRef']['@name']
    except KeyError:
        # Some records carry a customer reference instead of an organization
        organization = payload['net']['customerRef']['@name']
    return (IP_Address, organization)

https://stackoverflow.com/questions/48325908
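
The async direction above still needs to be driven from synchronous code, and its results attached back to the data. The following is a rough sketch of that step under stated assumptions: `fake_lookup` is a hypothetical stand-in for the real ARIN coroutine so the example runs without network access, and `gather_lookups` and `to_mapping` are illustrative helper names, not part of the original answer.

```python
import asyncio


async def fake_lookup(ip):
    """Hypothetical stand-in for the real ARIN lookup coroutine."""
    await asyncio.sleep(0.01)
    if ip == "bad":
        raise ValueError(f"invalid address: {ip}")
    return (ip, f"org-for-{ip}")


async def gather_lookups(ips):
    # return_exceptions=True yields exception objects in place of results,
    # so one failed lookup does not discard the others
    return await asyncio.gather(*(fake_lookup(ip) for ip in ips),
                                return_exceptions=True)


def to_mapping(ips, results):
    """Build {ip: organization}, using None where a lookup failed."""
    mapping = {}
    for ip, result in zip(ips, results):
        if isinstance(result, Exception):
            mapping[ip] = None
        else:
            mapping[ip] = result[1]
    return mapping


if __name__ == "__main__":
    ips = ["192.0.2.1", "bad", "198.51.100.7"]
    results = asyncio.run(gather_lookups(ips))
    print(to_mapping(ips, results))
```

With pandas, a mapping like this could then presumably be applied with df['organization'] = df['IP'].map(mapping), since Series.map accepts a dict.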