从 SQL 到 SPL:按时间窗口统计
某库表的 Time 字段是时间,时间的间隔有时大于 1 分钟。
Time | Value |
10:10:00 | 3 |
10:11:00 | 4 |
10:13:00 | 5 |
10:13:00 | 9 |
10:13:00 | 8 |
10:14:00 | 2 |
现在要将数据每分钟分成一个窗口,补上缺失的窗口,统计出每个窗口的 4 个值,分别是:start_value,前一个窗口的最后一条;end_value,本窗口的最后一条;min,本窗口的最小值;max,本窗口的最大值。第一分钟的 start_value 用本窗口的第一条记录;如果缺少某窗口的数据,则用前一个窗口的最后一条代替(同本窗口的 start_value)。
start | end | start_value | end_value | min | max |
10:10:00 | 10:11:00 | 3 | 3 | 3 | 3 |
10:11:00 | 10:12:00 | 3 | 4 | 4 | 4 |
10:12:00 | 10:13:00 | 4 | 4 | 4 | 4 |
10:13:00 | 10:14:00 | 4 | 8 | 5 | 9 |
10:14:00 | 10:15:00 | 8 | 2 | 2 | 2 |
SQL:
with overview as (
SELECT
distinct on (a.time) a.id, a.time, b.time as "end", a.value,
date_trunc('minute', a.time) as minute_start,
date_trunc('minute', b.time) as minute_end
FROM
main a
left join
main b
on
a."time"<b."time" and a.id = b.id
order by
a.time, b.time asc
),
overview2 as (
select
id, value, true as backfill,
date_trunc('minute', "end") as time,
date_trunc('minute', "end") as minute
from
overview
where
minute_start <> minute_end
UNION ALL
select
id, time, value, false as backfill,
date_trunc('minute', time) as minute
from
overview
),
overview3 as (
select
*
from
overview2
UNION ALL (
Select
distinct on (a.missingminute)
c.id,
a.missingminute as time,
a.missingminute as minute,
c.value,
true as backfill
from (
SELECT
date_trunc('minute', time.time) as missingminute
FROM
generate_series((select min(minute) from overview2),(select max(minute) from overview2),'1 minute'::interval) time
left join (
select distinct
minute
from
overview2
) b
on
date_trunc('minute', time) = b.minute
where
b.minute isnull
) a
left join
main c
on
a.missingminute > c.time
order by
a.missingminute,
c.time desc
)
order by
time
)
select
t1.id,
t1.minute as minute_start,
t1.minute + interval '1 minute' as minute_end,
t1.backfill as start_backfill,
t1.start,
t2.end,
coalesce(t3.min, t1.start) as min,
coalesce(t3.max, t1.start) as max
from
(select distinct on (id, minute) id, minute, value as start, backfill from overview3 order by id, minute, time asc) t1
left join
(select distinct on (id, minute) id, minute, value as end from overview3 order by id, minute, time desc) t2 on t1.id = t2.id and t1.minute = t2.minute
left join
(select id, minute, min(value) min, max(value) max from overview2 group by id,minute) t3 on t1.id = t3.id and t1.minute = t3.minute
SQL要用多层的嵌套子查询和多个复杂的 join 语句实现,代码冗长难懂。SPL 直接提供了时间序列函数,按序列对齐的函数、位置相关的语法:https://try.esproc.com/splx?55o
A | |
1 | $select * from main.txt |
2 | =A1.run(Time=time@m(Time)) |
3 | =list=periods@s(A2.min(Time),A2.max(Time),60) |
4 | =A2.align@a(list,Time) |
5 | =A4.new(list(#):start, elapse@s(start,60):end, sv=ifn(end_value[-1],~.Value):start_value, ifn(~.m(-1).Value, sv):end_value, ifn(~.min(Value),sv):min, ifn(~.max(Value),sv):max) |
A1:加载数据。
A2:将 Time 字段修改为整分钟数。
A3:生成连续分钟的时间序列 list。
A4:将数据按 list 对齐,每组是一个窗口的数据,有些窗口为空。
A5:生成新的二维表,用原每组记录生成一条新记录,start 按序号从 list 取值,start_value 取自上一条新记录的 end_value,该值为空(第一分钟)时取自当前组的第 1 条。end_value 取自上一组记录的最后一条,min 取自本组最小值,max 取自本组最大值。如果后三项缺失,则用 start_value(sv) 代替。[-1] 表示相对上一条,m(-1) 表示集合中的倒数第一个成员。
英文版 https://c.esproc.com/article/1742378003626