In Python, my data looks like this, with 500,000 rows:
TIME                 count
1-1-1900 10:41:00    1
3-1-1900 09:54:00    1
4-1-1900 15:45:00    1
5-1-1900 18:41:00    1
4-1-1900 15:45:00    1
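I want to make a new column with bins of a quarter of an hour, like this:
bins           count
9:00-9:15      2
9:15-9:30      4
9:30-9:45      4
10:00-10:15    4
I know how to make bins, but the timestamps are giving me trouble. Can anybody help me with this? Thanks in advance!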
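Posted on 2019-10-12 03:57:24
I know it's late, but better late than never. I ran into a similar requirement and solved it with the pandas library.
First, load the data into a pandas [1] DataFrame.
Second, check that the TIME column has a datetime type and not object (for example str). You can check it with:
df.info()
For example, in my case the TIME column was initially of object type, i.e. a string: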
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17640 entries, 0 to 17639
Data columns (total 3 columns):
TIME 17640 non-null object
value 17640 non-null int64
dtypes: int64(1), object(2)
memory usage: 413.5+ KB
Convert the TIME column to datetime with df['TIME'] = pd.to_datetime(df['TIME']). Skip this step if the column is already in datetime format.
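For timestamps shaped like the ones in the question, passing an explicit format makes the conversion unambiguous. This is a minimal sketch, assuming the strings are day-month-year (the question does not say) and using a made-up three-row frame:

```python
import pandas as pd

# Sample rows copied from the question; reading them as day-month-year
# is an assumption.
df = pd.DataFrame({
    "TIME": ["1-1-1900 10:41:00", "3-1-1900 09:54:00", "4-1-1900 15:45:00"],
    "count": [1, 1, 1],
})

# An explicit format avoids guessing between day-first and month-first.
df["TIME"] = pd.to_datetime(df["TIME"], format="%d-%m-%Y %H:%M:%S")

print(df.dtypes)
```

If the real data is month-first, swapping %d and %m in the format string should be the only change needed.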
After the conversion, df.info() shows the updated dtype:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17640 entries, 0 to 17639
Data columns (total 3 columns):
TIME 17640 non-null datetime64[ns]
value 17640 non-null int64
dtypes: datetime64[ns](2), int64(1)
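memory usage: 413.5 KB
Now count the rows that fall into each 15-minute bucket:
counts = pd.Series(index=df['TIME'], data=np.array(df['value'])).resample('15T').count()
print(counts[:3])
TIME
2017-07-01 00:00:00    3
2017-07-01 00:15:00    3
2017-07-01 00:30:00    3
Freq: 15T, dtype: int64
In the command above, 15T means 15-minute buckets. You can replace it with D for daily buckets, 2D for 2-day buckets, M for monthly buckets, 2M for 2-month buckets, and so on. Recent pandas releases deprecate T and M in favor of min and ME, so '15min' is the safer spelling there. You can read the details of these aliases at this link [2].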
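To see how the frequency string changes the buckets, here is a small sketch on made-up data (one event every 5 minutes for two hours). The variable names are only for illustration, and it uses the 'min' spelling, which as far as I know is accepted by both older and newer pandas:

```python
import pandas as pd

# Hypothetical sample: 24 events, one every 5 minutes from midnight.
index = pd.date_range("2017-07-01", periods=24, freq="5min")
events = pd.Series(1, index=index)

# Same data, three bucket sizes.
per_quarter_hour = events.resample("15min").count()
per_half_hour = events.resample("30min").count()
per_day = events.resample("D").count()

print(per_quarter_hour.head(3))
print(per_half_hour.head(3))
print(per_day)
```

Each 15-minute bucket holds 3 events, each 30-minute bucket holds 6, and the single daily bucket holds all 24.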
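Our bucketed counts are now done, as the counts output shows. Next, build the start and end of every bucket with a date range that covers the same period as the data: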
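import numpy as np
import pandas as pd

# 15-minute boundaries covering the same period as the data
r = pd.date_range('2017-07', '2017-09', freq='15T')
# repeat every boundary and drop the first and last copy, so that
# consecutive pairs are the (start, end) of each bucket
x = np.repeat(np.array(r), 2, axis=0)[1:-1]
# reshape into one (start, end) row per bucket
x = x.reshape(-1, 2)
# put the pairs in a DataFrame and print the first rows
final_df = pd.DataFrame(x, columns=['start', 'end'])
print(final_df[:3])
                start                 end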
0 2017-07-01 00:00:00 2017-07-01 00:15:00
1 2017-07-01 00:15:00 2017-07-01 00:30:00
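2 2017-07-01 00:30:00 2017-07-01 00:45:00
The date range is done as well. Now attach the counts to it:
# assumes counts has exactly one entry per row of final_df
final_df['count'] = np.array(counts)
print(final_df[:3])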
start end count
0 2017-07-01 00:00:00 2017-07-01 00:15:00 3
1 2017-07-01 00:15:00 2017-07-01 00:30:00 3
2 2017-07-01 00:30:00 2017-07-01 00:45:00 3
Hope someone finds it useful.
[1]: https://pypi.org/project/pandas/
[2]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.resample.html#pandas.Series.resample
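A note on applying this to the data in the question: resample buckets by full timestamp, while the question seems to want bins by time of day across all dates. One possible way to get that is sketched below. It assumes the timestamps are day-month-year and that counts in the same quarter of an hour should be summed. The bins column name is taken from the question, and the other variable names are made up for the example:

```python
import pandas as pd

# Sample rows from the question; the day-month-year reading is an assumption.
df = pd.DataFrame({
    "TIME": pd.to_datetime(
        ["1-1-1900 10:41:00", "3-1-1900 09:54:00", "4-1-1900 15:45:00",
         "5-1-1900 18:41:00", "4-1-1900 15:45:00"],
        format="%d-%m-%Y %H:%M:%S",
    ),
    "count": [1, 1, 1, 1, 1],
})

# Round each timestamp down to the start of its 15-minute interval.
start = df["TIME"].dt.floor("15min")
end = start + pd.Timedelta(minutes=15)

# Label rows by time of day only, so different dates share a bin.
df["bins"] = start.dt.strftime("%H:%M") + "-" + end.dt.strftime("%H:%M")

# Sum the counts per bin; groupby sorts the labels alphabetically,
# which matches chronological order because hours are zero-padded.
binned = df.groupby("bins")["count"].sum()
print(binned)
```

Unlike resample, this only lists bins that actually contain data, which matches the sparse output shown in the question.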
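Posted on 2015-05-11 06:29:33
I'm not sure this is what you are asking for. If it is not, I suggest you improve your question, because it is hard to understand. In particular, it would be nice to see what you have already tried.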
from collections import namedtuple
from datetime import time
from io import StringIO
from itertools import product

MAX_HOURS = 23
MAX_MINUTES = 59


def process_data_file(data_file):
    """
    The data_file is supposed to be an opened file object
    """
    time_entry = namedtuple("time_entry", ["time", "count"])
    data_to_bin = []
    for line in data_file:
        t, count = line.rstrip().split("\t")
        # keep only the hours and minutes of the time part; the date is ignored
        hour, minute = map(int, t.split()[-1].split(":")[:2])
        data_to_bin.append(time_entry(time(hour, minute), int(count)))
    return data_to_bin


def make_milestones(min_hour=0, max_hour=MAX_HOURS, interval=15):
    # every bin boundary of the day: 00:00, 00:15, ..., 23:45
    minutes = [minute for minute in range(MAX_MINUTES + 1) if not minute % interval]
    hours = range(min_hour, max_hour + 1)
    return [time(*milestone) for milestone in product(hours, minutes)]


def bin_time(data_to_bin, milestones):
    time_entry = namedtuple("time_entry", ["time", "count"])
    # walk from the latest entry to the earliest, popping milestones from the end
    data_to_bin = sorted(data_to_bin, key=lambda entry: entry.time, reverse=True)
    milestones = list(milestones)  # work on a copy of the caller's list
    binned_data = []
    current_count = 0
    upper = milestones.pop()
    lower = milestones.pop()
    for entry in data_to_bin:
        # note: times later than the last milestone (23:45) are not handled
        while not lower <= entry.time <= upper:
            if current_count:
                binned_data.append(time_entry("{}-{}".format(str(lower)[:-3], str(upper)[:-3]), current_count))
                current_count = 0
            upper, lower = lower, milestones.pop()
        current_count += entry.count
    # store the last bin, which the loop above never writes out
    if current_count:
        binned_data.append(time_entry("{}-{}".format(str(lower)[:-3], str(upper)[:-3]), current_count))
    return binned_data


data_file = StringIO("""1-1-1900 10:41:00\t1
3-1-1900 09:54:00\t1
4-1-1900 15:45:00\t1
5-1-1900 18:41:00\t1
4-1-1900 15:45:00\t1""")
binned_time = bin_time(process_data_file(data_file), make_milestones())
for entry in binned_time:
    print(entry.time, entry.count, sep="\t")
Output:
18:30-18:45    1
15:45-16:00    2
10:30-10:45    1
09:45-10:00    1
Posted on 2022-01-30 15:47:38
Trying it without pandas:
from collections import defaultdict
import datetime as dt
from pprint import pprint


def bin_ts(dtime, delta):
    # round dtime down to a multiple of delta, counted from datetime.min;
    # this avoids dtime.timestamp(), which depends on the local time zone
    # and can fail for dates as early as 1900 on some platforms
    modulo = (dtime - dt.datetime.min) % delta
    return dtime - modulo


src_data = [
    ('1-1-1900 10:41:00', 1),
    ('3-1-1900 09:54:00', 1),
    ('4-1-1900 15:45:00', 1),
    ('5-1-1900 18:41:00', 1),
    ('4-1-1900 15:45:00', 1)
]
ts_data = [(dt.datetime.strptime(ts, '%d-%m-%Y %H:%M:%S'), count) for ts, count in src_data]
bin_size = dt.timedelta(minutes=15)
binned = [(bin_ts(ts, bin_size), count) for ts, count in ts_data]


def time_fmt(ts):
    # label a bin by its start and end time of day, e.g. '10:30 - 10:45'
    return "%s - %s" % (ts.strftime('%H:%M'), (ts + bin_size).strftime('%H:%M'))


binned_time = [(time_fmt(ts), count) for ts, count in binned]
# sum the counts per label; the same time of day on different dates shares a bin
cnts = defaultdict(int)
for label, count in binned_time:
    cnts[label] += count
output = sorted(cnts.items())
pprint(output)
The result is:
[('09:45 - 10:00', 1),
('10:30 - 10:45', 1),
('15:45 - 16:00', 2),
('18:30 - 18:45', 1)]
https://stackoverflow.com/questions/30151552