
How can I improve the performance of a Python loop?

Stack Overflow user
Asked on 2021-03-22 19:09:39
3 answers · 118 views · 0 followers · score 0

I have a DataFrame with almost 14 million rows. I am working with financial option data and, ideally, I need to assign an interest rate (the so-called risk-free rate) to each option according to its time to maturity. Following the literature I am working from, one way to do this is to take US Treasury rates and, for each option, pick the Treasury maturity that is closest (in absolute value) to the option's time to maturity. To achieve this I wrote a loop that fills a DataFrame with these differences. My code is not elegant and somewhat messy, because some combinations of date and maturity have no rate available, hence the conditionals inside the loop. Once the loop finishes, I can look up the maturity with the smallest absolute difference and select the rate for that maturity. The script was taking far too long to run, so I added tqdm to get some feedback on what was happening.

I tried running the code. It would take days to finish, and it slows down as the iterations progress (I know this from tqdm). At first I used DataFrame.loc to add rows to differences, but since I suspected that was what made the code slow down over time, I switched to DataFrame.append. The code is still slow, and it still gets slower as it runs.

I searched for ways to improve performance and found this question: How to speed up python loop. Someone suggested Cython, but honestly I still consider myself a beginner in Python, and judging from the examples it does not look trivial. Is that my best option? If it takes a lot of time to learn, I could also do what others in the literature do and simply use the 3-month rate for all options, but I would rather not go there. Perhaps there is another (simple) answer to my problem; please let me know. I am providing a reproducible code example (albeit with only 2 rows of data):

from tqdm import tqdm
import pandas as pd


# Treasury maturities, in years
treasury_maturities = [1/12, 2/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30]

# Useful lists
treasury_maturities1 = [3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30]
treasury_maturities2 = [1/12]
treasury_maturities3 = [6/12, 1, 2, 3, 5, 7, 10, 20, 30]
treasury_maturities4 = [1, 2, 3, 5, 7, 10, 20, 30]
treasury_maturities5 = [1/12, 2/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20]

# Dataframe that will contain the difference between the time to maturity of option and the different maturities
differences = pd.DataFrame(columns = treasury_maturities)


# Options Dataframe sample
options_list = [[pd.to_datetime("2004-01-02"), pd.to_datetime("2004-01-17"), 800.0, "c",    309.1, 311.1, 1108.49, 1108.49, 0.0410958904109589, 310.1], [pd.to_datetime("2004-01-02"), pd.to_datetime("2004-01-17"), 800.0, "p", 0.0, 0.05, 1108.49, 1108.49, 0.0410958904109589, 0.025]]

options = pd.DataFrame(options_list, columns = ['QuoteDate', 'expiration', 'strike', 'OptionType', 'bid_eod', 'ask_eod', 'underlying_bid_eod', 'underlying_ask_eod', 'Time_to_Maturity', 'Option_Average_Price'])


# Loop
for index, row in tqdm(options.iterrows()):
    if pd.to_datetime("2004-01-02") <= row.QuoteDate <= pd.to_datetime("2018-10-15"):
        if pd.to_datetime("2004-01-02") <= row.QuoteDate <= pd.to_datetime("2006-02-08") and row.Time_to_Maturity > 25:
            list_s = ([abs(maturity - row.Time_to_Maturity) for maturity in 
              treasury_maturities5])
            list_s = [list_s + [40]] # 40 is an arbitrary number bigger than 30
            differences = differences.append(pd.DataFrame(list_s, 
                        columns = treasury_maturities), ignore_index = True) 
        elif (pd.to_datetime("2008-12-10") or pd.to_datetime("2008-12-18") or pd.to_datetime("2008-12-24")) == row.QuoteDate and 1.5/12 <= row.Time_to_Maturity <= 3.5/12:
            list_s = [0, 40, 40]
            list_s = [list_s + [abs(maturity - row.Time_to_Maturity) for 
                                   maturity in treasury_maturities3]]
            differences = differences.append(pd.DataFrame(list_s, 
                        columns = treasury_maturities), ignore_index = True)
        elif (pd.to_datetime("2008-12-10") or pd.to_datetime("2008-12-18") or pd.to_datetime("2008-12-24")) == row.QuoteDate and 3.5/12 < row.Time_to_Maturity <= 4.5/12:    
            list_s = ([abs(maturity - row.Time_to_Maturity) for maturity in 
                           treasury_maturities2])
            list_s = list_s + [40, 40, 0]
            list_s = [list_s + [abs(maturity - row.Time_to_Maturity) for 
                                   maturity in treasury_maturities4]]
            differences = differences.append(pd.DataFrame(list_s, 
                        columns = treasury_maturities), ignore_index = True)
        else:
            if 1.5/12 <= row.Time_to_Maturity <= 2/12:
                list_s = [0, 40]
                list_s = [list_s + [abs(maturity - row.Time_to_Maturity) for maturity in 
              treasury_maturities1]]
                differences = differences.append(pd.DataFrame(list_s, 
                        columns = treasury_maturities), ignore_index = True)
            elif 2/12 < row.Time_to_Maturity <= 2.5/12:
                list_s = ([abs(maturity - row.Time_to_Maturity) for maturity in 
              treasury_maturities2])
                list_s = list_s + [40, 0]
                list_s = [list_s + [abs(maturity - row.Time_to_Maturity) for maturity in 
              treasury_maturities3]]
                differences = differences.append(pd.DataFrame(list_s, 
                        columns = treasury_maturities), ignore_index = True)
            else:
                list_s = [[abs(maturity - row.Time_to_Maturity) for maturity in 
              treasury_maturities]]
                differences = differences.append(pd.DataFrame(list_s, 
                        columns = treasury_maturities), ignore_index = True)
    else:        
        list_s = [[abs(maturity - row.Time_to_Maturity) for maturity in 
              treasury_maturities]]
        differences = differences.append(pd.DataFrame(list_s, 
                        columns = treasury_maturities), ignore_index = True)

3 Answers

Stack Overflow user

Answered on 2021-04-10 20:53:36

Short answer

Loops and if statements are both computationally expensive operations, so look for ways to reduce how often you use them.

Loop optimization: the best way to speed up a loop is to move as much of the computation as possible out of the loop.
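As a concrete illustration of hoisting work out of a loop (a toy sketch, not the asker's full logic): in the question's loop, constants such as `pd.to_datetime("2004-01-02")` are re-parsed on every row; converting them once beforehand removes that repeated cost.

```python
import pandas as pd

# Toy data standing in for the options DataFrame
options = pd.DataFrame({"QuoteDate": pd.to_datetime(["2004-01-02", "2019-05-01"])})

# Slow: the string-to-Timestamp conversions re-run on every iteration
hits_slow = sum(
    pd.to_datetime("2004-01-02") <= row.QuoteDate <= pd.to_datetime("2018-10-15")
    for _, row in options.iterrows()
)

# Faster: hoist the constant conversions out of the loop
start, end = pd.Timestamp("2004-01-02"), pd.Timestamp("2018-10-15")
hits_fast = sum(start <= row.QuoteDate <= end for _, row in options.iterrows())
```

Both versions count the same rows; only the second stops re-parsing the date strings 14 million times.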

DRY: Don't Repeat Yourself. You have several redundant if conditions; look at the nested ifs and follow the DRY principle.

Use pandas and numpy

One of the main advantages of libraries like pandas and numpy is that they are designed to make mathematical operations on arrays efficient (see Why are numpy arrays so fast?). This means you usually do not need loops at all. Instead of creating a new DataFrame inside a loop, create a new column for each value you want to compute.
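For instance, all the absolute differences the question's loop builds row by row can be produced in one vectorised operation (a minimal sketch using the question's `treasury_maturities` list and toy `Time_to_Maturity` values):

```python
import numpy as np
import pandas as pd

# The question's Treasury maturities, in years
treasury_maturities = [1/12, 2/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30]

# Toy options with only the column that matters here
options = pd.DataFrame({"Time_to_Maturity": [0.041, 1.25, 27.0]})

# Broadcasting an (n_rows, 1) column against an (n_maturities,) vector
# yields every absolute difference at once, with no Python-level loop
diffs = np.abs(
    options["Time_to_Maturity"].to_numpy()[:, None] - np.array(treasury_maturities)
)
differences = pd.DataFrame(diffs, columns=treasury_maturities)
```

One broadcasted subtraction replaces 14 million loop iterations; the special-case dates can then be handled with masks as described below.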

To deal with the different logic for different dates, filter the rows first and then apply the logic: use masks/filters to select only the rows you need to operate on, rather than if statements (see the pandas filtering tutorial).

Code example

This code is not a copy of your logic but an example of how to implement it. It is not perfect, but it should provide some major efficiency improvements.

import pandas as pd
import numpy as np

# Maturity periods, months and years
month_periods = np.array([1, 2, 3, 6, ], dtype=np.float64)
year_periods = np.array([1, 2, 3, 4, 5, 7, 10, 20, 30, ], dtype=np.float64)

# Create column names for maturities
maturity_cols = [f"month_{m:02.0f}" for m in month_periods] + [f"year_{y:02.0f}" for y in year_periods]

# Normalise months  & concatenate into single array
month_periods = month_periods / 12
maturities = np.concatenate((month_periods, year_periods))

# Create some dummy data
np.random.seed(seed=42)  # Seed PRN generator
n_records = 100_000  # Number of dummy rows to generate
date_range = pd.date_range(start="2004-01-01", end="2021-01-30", freq='D')  # Dates to sample from
dates = np.random.choice(date_range, size=n_records, replace=True)
maturity_times = np.random.random(size=n_records)
options = pd.DataFrame(list(zip(dates, maturity_times)), columns=['QuoteDate', 'Time_to_Maturity', ])

# Create date masks
after = options['QuoteDate'] >= pd.to_datetime("2008-01-01")
before = options['QuoteDate'] <= pd.to_datetime("2015-01-01")

# Combine date masks / create flipped version
between = after & before
outside = np.logical_not(between)

# Select data with masks
df_outside = options[outside].copy()
df_between = options[between].copy()

# Smaller dataframes
df_a = df_between[df_between['Time_to_Maturity'] > 25].copy()
df_b = df_between[df_between['Time_to_Maturity'] <= 3.5 / 12].copy()
df_c = df_between[df_between['Time_to_Maturity'] <= 4.5 / 12].copy()
df_d = df_between[
    (df_between['Time_to_Maturity'] >= 2 / 12) & (df_between['Time_to_Maturity'] <= 4.5 / 12)].copy()

# For each maturity period, add difference column using different formula
for i, col in enumerate(maturity_cols):
    # Add a line here for each subset / chunk of data which requires a different formula
    df_a[col] = ((maturities[i] - df_outside['Time_to_Maturity']) + 40).abs()
    df_b[col] = ((maturities[i] - df_outside['Time_to_Maturity']) / 2) .abs()
    df_c[col] = (maturities[i] - df_outside['Time_to_Maturity'] + 1).abs()
    df_d[col] = (maturities[i] - df_outside['Time_to_Maturity'] * 0.8).abs()
    df_outside[col] = (maturities[i] - df_outside['Time_to_Maturity']).abs()

# Concatenate dataframes back to one dataset
frames = [df_outside, df_a, df_b, df_c, df_d, ]
output = pd.concat(frames).dropna(how='any')

output.head()
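The sketch above stops at building the difference columns; to get from there to a rate, you still need the maturity with the smallest difference per row, which `idxmin(axis=1)` gives without another loop (a self-contained toy, with hypothetical column names):

```python
import pandas as pd

# Toy difference table: one column per maturity, one row per option
differences = pd.DataFrame(
    {"month_01": [0.04, 2.10], "month_03": [0.21, 1.90], "year_01": [0.96, 0.25]}
)

# Column label of the smallest absolute difference in each row
nearest_maturity = differences.idxmin(axis=1)
```

The resulting labels can then be used to look up the matching Treasury rate for every option in one step.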

Average execution time by number of records

Even millions of records can be processed quickly (memory permitting):

| Records | Old time (s) | New time (s) | Improvement |
|---|---|---|---|
| 10 | 0.0105 | 0.0244 | -132.38% |
| 100 | 0.1078 | 0.0249 | 76.90% |
| 1,000 (1k) | 1.03 | 0.0249 | 97.58% |
| 10,000 (10k) | 15 | 0.0322 | 99.79% |
| 100,000 (100k) | 182.014 | 0.065 | 99.96% |
| 1,000,000 (1m) | ? | 0.4014 | ? |
| 10,000,000 (10m) | ? | 4.7488 | ? |
| 14,000,000 (14m) | ? | 6.0172 | ? |
| 100,000,000 (100m) | ? | 83.286 | ? |

Further optimisation

Once you have optimised and profiled your basic code, you can also look into multithreading, parallelising your code, or using a different language. Also, 14 million records will consume a lot of memory, well beyond what most workstations can handle. To get around this limitation, you can read the file itself in chunks and perform the computation one chunk at a time:

result_frames = []
# Read the file in chunks to bound memory usage
for chunk in pd.read_csv("voters.csv", chunksize=10000):
    # Do things here
    result = chunk
    result_frames.append(result)

# Recombine the processed chunks into a single DataFrame
output = pd.concat(result_frames, ignore_index=True)

Terms to Google: multiprocessing / threading / tasks / PySpark

Score 2

Stack Overflow user

Answered on 2021-03-23 00:06:40

For your problem, "divide and conquer" can lead you to a solution. I suggest splitting your code into chunks and profiling each part, because I see some redundancy such as the following:

(pd.to_datetime("2008-12-10") or pd.to_datetime("2008-12-18") or pd.to_datetime("2008-12-24"))
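Beyond the redundancy, this expression has a logic problem: in Python, `a or b` returns the first truthy operand, so the parenthesised chain always evaluates to `Timestamp("2008-12-10")` and the other two dates are never compared. A membership test against a pre-built set fixes the logic and performs the conversions once (a sketch, not the asker's exact code):

```python
import pandas as pd

# Convert the constant dates once, up front
special_dates = {
    pd.Timestamp("2008-12-10"),
    pd.Timestamp("2008-12-18"),
    pd.Timestamp("2008-12-24"),
}

quote_date = pd.Timestamp("2008-12-18")

# Buggy original pattern: `or` returns the first timestamp,
# so only 2008-12-10 is ever compared
buggy = (pd.to_datetime("2008-12-10") or pd.to_datetime("2008-12-18")
         or pd.to_datetime("2008-12-24")) == quote_date

# Correct and faster: a single membership test
fixed = quote_date in special_dates
```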

The conversion from string to datetime appears to be done on every row. You should profile the code, either with profile or with a more specific tool such as perf_tool*. It helps by placing markers in your code and reporting all the intermediate times, call counts, and methods.

*I am the main developer
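A minimal profiling run with the standard library's cProfile (perf_tool's own API would differ; this is just the generic approach, with a stand-in function for the slow code):

```python
import cProfile
import io
import pstats


def slow_part():
    # Stand-in for the expensive loop being profiled
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
slow_part()
profiler.disable()

# Report the five most time-consuming calls
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

The report shows per-function call counts and cumulative times, which points directly at the lines worth optimising first.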

Score 1

Stack Overflow user

Answered on 2021-04-10 20:03:47

As others have already pointed out, profile your code to find the slowest parts.

Some possible speed-ups:

Consider using generators instead of lists where possible. Also, list.extend may be faster than list concatenation.

list_s = [abs(maturity - row.Time_to_Maturity) for maturity in 
                           treasury_maturities2]

can become

list_s = (abs(maturity - row.Time_to_Maturity) for maturity in 
                           treasury_maturities2)

list_s = list_s + [foo, bar, baz]

can become

list_s.extend([foo, bar, baz])  # extend mutates in place and returns None, so do not reassign
Score 0
Original page content provided by Stack Overflow; translation supported by Tencent Cloud's engine.
Original link:

https://stackoverflow.com/questions/66744864
