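I am trying to count how many times a company appears in the news during the 12 months leading up to its earnings date, and to compare that number against other companies over the same window. I have two pandas DataFrames, one with earnings dates and one with news. My approach is slow. Is there a better pandas/numpy way to do this?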
import pandas as pd
companies = pd.DataFrame({'CompanyName': ['A', 'B', 'C'], 'EarningsDate': ['2013/01/15', '2015/03/25', '2017/05/03']})
companies['EarningsDate'] = pd.to_datetime(companies.EarningsDate)
news = pd.DataFrame({'CompanyName': ['A', 'A', 'A', 'B', 'B', 'C'],
                     'NewsDate': ['2012/02/01', '2013/01/10', '2015/05/13', '2012/05/23', '2013/01/03', '2017/05/01']})
news['NewsDate'] = pd.to_datetime(news.NewsDate)
companies looks like:
  CompanyName EarningsDate
0           A   2013-01-15
1           B   2015-03-25
2           C   2017-05-03
news looks like:
  CompanyName   NewsDate
0           A 2012-02-01
1           A 2013-01-10
2           A 2015-05-13
3           B 2012-05-23
4           B 2013-01-03
5           C 2017-05-01
How can I rewrite this code? It works, but it is very slow because each DataFrame has more than 500k rows.
company_count = []
other_count = []
for _, company in companies.iterrows():
    # window: the 12 months before the earnings date (both ends exclusive)
    end_date = company.EarningsDate
    start_date = end_date - pd.DateOffset(years=1)
    subset = news[(news.NewsDate > start_date) & (news.NewsDate < end_date)]
    mask = subset.CompanyName == company.CompanyName
    # news items about this company inside the window
    company_count.append(subset[mask].shape[0])
    # mean news count per other company inside the window
    other_count.append(subset[~mask].groupby('CompanyName').size().mean())
companies['12MonCompanyNewsCount'] = pd.Series(company_count)
companies['12MonOtherNewsCount'] = pd.Series(other_count).fillna(0)
The final result, companies, looks like this:
  CompanyName EarningsDate  12MonCompanyNewsCount  12MonOtherNewsCount
0           A   2013-01-15                      2                    2
1           B   2015-03-25                      0                    0
2           C   2017-05-03                      1                    0
Answered 2017-08-29 23:33:48
OK, here we go.
To get 12MonCompanyNewsCount, you can use merge_asof, which is very concise:
companies['12MonCompanyNewsCount'] = pd.merge_asof(
    news,
    companies,
    by='CompanyName',
    left_on='NewsDate',
    right_on='EarningsDate',
    tolerance=pd.Timedelta('365D'),
    direction='forward'
).groupby('CompanyName').count().NewsDate
It runs about twice as fast as your current implementation (and should scale better).
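A few details of the merge_asof call may need adjusting on real data: merge_asof expects both frames to be sorted by their date keys, unmatched news rows still have a non-null NewsDate (so counting that column also counts news outside the window), and the grouped result is indexed by CompanyName rather than by the integer index of companies. The following is a minimal sketch of one way to handle those points, not the answerer's exact method. It assumes each news item should count only toward the nearest following earnings date of its company, and it keeps the 365-day tolerance, which is not exactly a calendar year. The names `matched`, `counts` and `result` are illustrative.

```python
import pandas as pd

companies = pd.DataFrame({
    'CompanyName': ['A', 'B', 'C'],
    'EarningsDate': pd.to_datetime(['2013/01/15', '2015/03/25', '2017/05/03']),
})
news = pd.DataFrame({
    'CompanyName': ['A', 'A', 'A', 'B', 'B', 'C'],
    'NewsDate': pd.to_datetime(['2012/02/01', '2013/01/10', '2015/05/13',
                                '2012/05/23', '2013/01/03', '2017/05/01']),
})

# merge_asof requires both inputs to be sorted by their "on" keys.
matched = pd.merge_asof(
    news.sort_values('NewsDate'),
    companies.sort_values('EarningsDate'),
    by='CompanyName',
    left_on='NewsDate',
    right_on='EarningsDate',
    tolerance=pd.Timedelta('365D'),
    direction='forward',
    allow_exact_matches=False,  # assumption: news on the earnings date itself is excluded
)

# Rows with no earnings date within the tolerance get NaT; drop them before counting.
counts = (
    matched.dropna(subset=['EarningsDate'])
    .groupby(['CompanyName', 'EarningsDate'])
    .size()
    .rename('12MonCompanyNewsCount')
    .reset_index()
)

# Join the counts back by key so they line up with the rows of `companies`.
result = companies.merge(counts, on=['CompanyName', 'EarningsDate'], how='left')
result['12MonCompanyNewsCount'] = result['12MonCompanyNewsCount'].fillna(0).astype(int)
print(result)
```

If a company has several earnings dates less than a year apart, this sketch assigns each news item to only one of them, which differs from the loop in the question.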
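For 12MonOtherNewsCount, I really couldn't think of a way to do it without iterating over everything. Still, I think this is a bit more concise:
companies['12MonOtherNewsCount'] = companies.apply(
    lambda x: len(
        news[
            # inclusive='neither' needs pandas >= 1.3; older versions used inclusive=False
            (news.NewsDate.between(x.EarningsDate - pd.Timedelta('365D'), x.EarningsDate, inclusive='neither'))
            & (news.CompanyName != x.CompanyName)
        ]
    ),
    axis=1
)
And it does seem to be a bit faster.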
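One possible way to avoid scanning news once per row is to sort the news dates and use binary search: the number of other-company items in a window equals all items in the window minus the company's own items in that window. The sketch below uses numpy's searchsorted for both counts. It is an illustration under stated assumptions, not part of the original answer: it follows this answer's definition (total count of other companies' news, not the per-company mean from the question), it uses the question's window (one calendar year, both ends exclusive), and the helper `count_in_window` is a hypothetical name.

```python
import numpy as np
import pandas as pd

companies = pd.DataFrame({
    'CompanyName': ['A', 'B', 'C'],
    'EarningsDate': pd.to_datetime(['2013/01/15', '2015/03/25', '2017/05/03']),
})
news = pd.DataFrame({
    'CompanyName': ['A', 'A', 'A', 'B', 'B', 'C'],
    'NewsDate': pd.to_datetime(['2012/02/01', '2013/01/10', '2015/05/13',
                                '2012/05/23', '2013/01/03', '2017/05/01']),
})

def count_in_window(sorted_dates, starts, ends):
    """Count dates d with start < d < end, for each (start, end) pair."""
    below_end = np.searchsorted(sorted_dates, ends, side='left')        # d < end
    up_to_start = np.searchsorted(sorted_dates, starts, side='right')  # d <= start
    return below_end - up_to_start

ends = companies['EarningsDate'].to_numpy()
starts = (companies['EarningsDate'] - pd.DateOffset(years=1)).to_numpy()

# All news in each window, regardless of company.
all_dates = np.sort(news['NewsDate'].to_numpy())
total = count_in_window(all_dates, starts, ends)

# Own news in each window: one sorted date array per company.
dates_by_company = {
    name: np.sort(group.to_numpy())
    for name, group in news.groupby('CompanyName')['NewsDate']
}
own = np.zeros(len(companies), dtype=np.int64)
for name, idx in companies.groupby('CompanyName').indices.items():
    dates = dates_by_company.get(name)
    if dates is not None:
        own[idx] = count_in_window(dates, starts[idx], ends[idx])

companies['12MonCompanyNewsCount'] = own
companies['12MonOtherNewsCount'] = total - own
print(companies)
```

The remaining Python loop runs once per distinct company rather than once per row, and each lookup is a binary search rather than a full scan of news.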
Answered 2017-08-29 23:14:46
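I couldn't find a way to avoid iterating over the rows of companies. However, you can add a start-date column to companies, iterate over its rows, and build boolean indexes over news for the dates and company names that meet your criteria. Then just combine them with a boolean `and` and sum the resulting boolean array.
I swear it will make more sense once you see the code.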
# create the start date column and the 12 month columns,
# fill the 12 month columns with zeros for now
companies['startdate'] = companies.EarningsDate - pd.DateOffset(years=1)
companies['12MonCompanyNewsCount'] = 0
companies['12MonOtherNewsCount'] = 0
# iterate the rows of companies and hold the index
for i, row in companies.iterrows():
    # create a boolean index when the news date is after the start date
    # and when the news date is before the end date
    # and when the company names match
    ix_start = news.NewsDate >= row.startdate
    ix_end = news.NewsDate <= row.EarningsDate
    ix_samename = news.CompanyName == row.CompanyName
    # set the news count value for the current row of `companies` using
    # boolean `and` operations on the indices. first when the names match
    # and again when the names don't match.
    companies.loc[i, '12MonCompanyNewsCount'] = (ix_start & ix_end & ix_samename).sum()
    companies.loc[i, '12MonOtherNewsCount'] = (ix_start & ix_end & ~ix_samename).sum()
companies
#returns:
CompanyName EarningsDate startdate 12MonCompanyNewsCount \
0 A 2013-01-15 2012-01-15 1
1 B 2015-03-25 2014-03-25 0
2 C 2017-05-03 2016-05-03 1
12MonOtherNewsCount
0 2
1 1
2 0
https://stackoverflow.com/questions/45940584