Q: A faster way to create a pandas DataFrame from another DataFrame

Stack Overflow user

Asked 2019-10-04 07:26:26

1 answer, 520 views, 0 followers, 0 votes

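I have a DataFrame with more than 41,500 records and 3 fields: ID, start_date, and end_date.

I want to create a separate DataFrame from it with only two fields, id and active_years. For each identifier it should contain one record for every year in the range from the start year to the end year (the end year of the range included).

This is what I am doing now, but it takes more than two hours to finish for 41,500 rows.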

df = pd.DataFrame(columns=['id', 'active_years'])
ix = 0

for _, row in raw_dataset.iterrows():

    st_yr = int(row['start_date'].split('-')[0]) # because dates are in the format yyyy-mm-dd
    end_yr = int(row['end_date'].split('-')[0])

    for year in range(st_yr, end_yr+1):

        df.loc[ix, 'id'] = row['ID']
        df.loc[ix, 'active_years'] = year
        ix = ix + 1

So, is there a faster way to do this?

Edit: some sample data to try it on,

raw_dataset = pd.DataFrame({'ID':['a121','b142','cd3'],'start_date':['2019-10-09','2017-02-06','2012-12-05'],'end_date':['2020-01-30','2019-08-23','2016-06-18']})

print(raw_dataset)
     ID  start_date    end_date
0  a121  2019-10-09  2020-01-30
1  b142  2017-02-06  2019-08-23
2   cd3  2012-12-05  2016-06-18

# the desired dataframe should look like this
print(desired_df)
     id  active_years
0  a121  2019
1  a121  2020
2  b142  2017
3  b142  2018
4  b142  2019
5   cd3  2012
6   cd3  2013
7   cd3  2014
8   cd3  2015
9   cd3  2016

python

pandas

dataframe

1 Answer

Stack Overflow user

Accepted answer

Posted 2019-10-04 08:07:26

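Growing a Python list dynamically is much faster than growing a numpy array dynamically (numpy arrays are the data structure underlying pandas DataFrames). A list over-allocates, so appends are cheap on average, whereas a numpy array has a fixed size and enlarging it means copying its contents. The original answer links to a short explanation. With that in mind: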

import pandas as pd

# Initialize input dataframe
raw_dataset = pd.DataFrame({
    'ID':['a121','b142','cd3'],
    'start_date':['2019-10-09','2017-02-06','2012-12-05'],
    'end_date':['2020-01-30','2019-08-23','2016-06-18'],
})

# Create integer columns for start year and end year
raw_dataset['start_year'] = pd.to_datetime(raw_dataset['start_date']).dt.year
raw_dataset['end_year'] = pd.to_datetime(raw_dataset['end_date']).dt.year

# Iterate over input dataframe rows and individual years
id_list = []
active_years_list = []
for row in raw_dataset.itertuples():
    for year in range(row.start_year, row.end_year+1):
        id_list.append(row.ID)
        active_years_list.append(year)

# Create result dataframe from lists
desired_df = pd.DataFrame({
    'id': id_list,
    'active_years': active_years_list,
})

print(desired_df)
# Output:
#     id  active_years
# 0  a121          2019
# 1  a121          2020
# 2  b142          2017
# 3  b142          2018
# 4  b142          2019
# 5   cd3          2012
# 6   cd3          2013
# 7   cd3          2014
# 8   cd3          2015
# 9   cd3          2016
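
As a possible variation on the loop above, the per-row year ranges can be built as lists and then expanded with `DataFrame.explode`. This is a different technique from the one in the answer, offered only as a minimal sketch. It assumes pandas 0.25 or later and that every end year is greater than or equal to its start year (an empty range would produce a missing value and make the integer cast fail). The name `exploded_df` is illustrative and does not come from the original post.

```python
import pandas as pd

# Same sample input as in the question
raw_dataset = pd.DataFrame({
    'ID': ['a121', 'b142', 'cd3'],
    'start_date': ['2019-10-09', '2017-02-06', '2012-12-05'],
    'end_date': ['2020-01-30', '2019-08-23', '2016-06-18'],
})

# Extract the start and end years as integer Series
start_year = pd.to_datetime(raw_dataset['start_date']).dt.year
end_year = pd.to_datetime(raw_dataset['end_date']).dt.year

# Build one list of years per input row, then expand each list into rows
exploded_df = (
    pd.DataFrame({
        'id': raw_dataset['ID'],
        'active_years': [
            list(range(s, e + 1)) for s, e in zip(start_year, end_year)
        ],
    })
    .explode('active_years')
    .reset_index(drop=True)
)

# explode leaves an object-dtype column, so cast it back to integers
exploded_df['active_years'] = exploded_df['active_years'].astype(int)

print(exploded_df)
```

Whether this beats the list-append version on 41,500 rows is not something the original post measured, so it is worth timing both on the real data.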

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/58231762
