So, I have the following:
import pandas as pd

# This is the data we have.
df = pd.DataFrame(data=(['A','1-50', 10],['B','25-200', 15],['C','25-300', 5]), columns=['Category','Range', 'Qty'])
# These are the different range categories we need to have.
list_of_ranges = ['1-10', '10-25', '25-50', '50-100', '100-200', '200-300', '300-400']
# insert magic spells here
# This is what the result needs to look like.
results = pd.DataFrame(data=(['A','1-10', 10],['A','10-25', 10],['A','25-50', 10],['B','25-50', 15],['B','50-100', 15],['B','100-200', 15],['C','25-50', 5],['C','50-100', 5],['C','100-200', 5],['C','200-300', 5]), columns=['Category','Range', 'Qty'])

As the example above shows:
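I have a df whose ranges need to be split into sub-ranges, and every column except the new range needs to be repeated for each sub-range. How can I do this?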
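
Edit1: an example of the logic
In zone "A", the temperature falls in the 1-50 degrees Celsius range on 10 days each year. This is a single row that reads:
1: A,1-50,10
This row can be interpreted as: in zone "A", the temperature can fall in the 1-10, 10-25, or 25-50 range on 10 days each year. So I want three rows:
1: A,1-10,10
2: A,10-25,10
3: A,25-50,10

Answered on 2020-12-12 17:13:08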
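We need a couple of helper functions to handle "ranges" as you define them. Beyond that, the task is to build a list of "small ranges" for each 'Range' in df and then explode df on that list.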
def split_range(r):
    """
    Split a range into a tuple of ints. A range is a string 'xx-yy'.
    """
    tokens = r.split('-')
    return (int(tokens[0]), int(tokens[1]))

def is_inside(r1, r2):
    """
    True if range r1 is inside r2. A range is a string 'xx-yy'.
    """
    t1, t2 = split_range(r1), split_range(r2)
    return (t1[0] >= t2[0]) and (t1[1] <= t2[1])

# For each row, collect the entries of list_of_ranges that lie inside the row's Range.
df['small_ranges'] = df.apply(lambda row: [rng for rng in list_of_ranges if is_inside(rng, row['Range'])], axis=1)

This produces
    Category    Range      Qty  small_ranges
--  ----------  -------  -----  -----------------------------------------
 0  A           1-50        10  ['1-10', '10-25', '25-50']
 1  B           25-200      15  ['25-50', '50-100', '100-200']
 2  C           25-300       5  ['25-50', '50-100', '100-200', '200-300']

Now we explode:
df.explode('small_ranges')

Output:
    Category    Range      Qty  small_ranges
--  ----------  -------  -----  --------------
 0  A           1-50        10  1-10
 0  A           1-50        10  10-25
 0  A           1-50        10  25-50
 1  B           25-200      15  25-50
 1  B           25-200      15  50-100
 1  B           25-200      15  100-200
 2  C           25-300       5  25-50
 2  C           25-300       5  50-100
 2  C           25-300       5  100-200
 2  C           25-300       5  200-300

Answered on 2020-12-12 17:47:13
Here is a solution using pandas.Interval, which seems very useful in this case. First, we convert the strings to pd.Interval:
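# pd.Interval is closed on the right by default, so '25-50' becomes (25, 50].
list_of_ranges = [pd.Interval(*map(int, r.split('-'))) for r in list_of_ranges]
df['Range'] = df['Range'].apply(lambda r: pd.Interval(*map(int, r.split('-'))))

We then build a new DataFrame that contains, for each original range, all the sub-ranges it needs: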
my_temps = []
for idx, row in df.iterrows():
    _df = pd.DataFrame(columns=df.columns)
    # Keep every sub-range that overlaps this row's range.
    _df['Range'] = [r for r in list_of_ranges if r.overlaps(row['Range'])]
    # Repeat the remaining columns on each new row.
    _df['Category'], _df['Qty'] = row['Category'], row['Qty']
    my_temps.append(_df)
final_df = pd.concat(my_temps).reset_index(drop=True)

Finally, we convert the ranges back to their original string format:
final_df['Range'] = final_df['Range'].apply(lambda r: '{}-{}'.format(r.left, r.right))

This yields the following data:
   Category    Range  Qty
0         A     1-10   10
1         A    10-25   10
2         A    25-50   10
3         B    25-50   15
4         B   50-100   15
5         B  100-200   15
6         C    25-50    5
7         C   50-100    5
8         C  100-200    5
9         C  200-300    5

Let me know if you have any further questions!
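
For comparison, here is a minimal end-to-end sketch that combines the two ideas in this thread: the `overlaps` test on `pd.Interval` and `DataFrame.explode` (available from pandas 0.25). It starts from the question's original string data. The helper names `to_interval` and `split_ranges` are made up for this illustration and are not part of pandas, and the sketch assumes every range string has the form 'xx-yy' with integer bounds.

```python
import pandas as pd

df = pd.DataFrame(
    data=(['A', '1-50', 10], ['B', '25-200', 15], ['C', '25-300', 5]),
    columns=['Category', 'Range', 'Qty'],
)
list_of_ranges = ['1-10', '10-25', '25-50', '50-100', '100-200', '200-300', '300-400']


def to_interval(r):
    """Hypothetical helper: parse 'xx-yy' into a pd.Interval (closed on the right by default)."""
    left, right = map(int, r.split('-'))
    return pd.Interval(left, right)


def split_ranges(frame, ranges):
    """Hypothetical helper: return one row per sub-range that overlaps the row's Range."""
    intervals = [(label, to_interval(label)) for label in ranges]
    out = frame.copy()
    # Replace each wide range with the list of sub-range labels it overlaps.
    out['Range'] = out['Range'].apply(
        lambda wide: [label for label, iv in intervals if iv.overlaps(to_interval(wide))]
    )
    # explode gives each list element its own row and repeats the other columns.
    return out.explode('Range').reset_index(drop=True)


result = split_ranges(df, list_of_ranges)
print(result)
```

Because the intervals are closed only on the right, neighbours such as 25-50 and 50-100 do not count as overlapping, so a range like 1-50 does not pick up 50-100. Note that a row whose range overlaps none of the sub-ranges would come out of `explode` with NaN in 'Range'; whether to drop such rows depends on the data.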
https://stackoverflow.com/questions/65266290