从 SQL 到 SPL:按时间窗口统计

某库表的 Time 字段是时间,时间的间隔有时大于 1 分钟。

Time Value
10:10:00 3
10:11:00 4
10:13:00 5
10:13:00 9
10:13:00 8
10:14:00 2

现在要将数据每分钟分成一个窗口,补上缺失的窗口,统计出每个窗口的 4 个值,分别是:start_value,前一个窗口的最后一条;end_value,本窗口的最后一条;min,本窗口的最小值;max,本窗口的最大值。第一分钟的 start_value 用本窗口的第一条记录;如果缺少某窗口的数据,则用前一个窗口的最后一条代替(同本窗口的 start_value)。

start end start_value end_value min max
10:10:00 10:11:00 3 3 3 3
10:11:00 10:12:00 3 4 4 4
10:12:00 10:13:00 4 4 4 4
10:13:00 10:14:00 4 8 5 9
10:14:00 10:15:00 8 2 2 2

SQL

with overview as (
    SELECT 
        distinct on (a.time) a.id, a.time, b.time as "end", a.value, 
        date_trunc('minute', a.time) as minute_start, 
        date_trunc('minute', b.time) as minute_end 
    FROM 
        main a 
    left join 
        main b 
    on 
        a."time"<b."time" and a.id = b.id 
    order by 
        a.time, b.time asc
    ),
overview2 as (
    select 
        id, value, true as backfill,
        date_trunc('minute', "end") as time, 
        date_trunc('minute', "end") as minute
    from 
        overview 
    where 
        minute_start <> minute_end
    UNION ALL
    select 
        id, time, value, false as backfill,
        date_trunc('minute', time) as minute
    from 
        overview
    ),  
overview3 as (
    select 
        * 
    from 
        overview2 
    UNION ALL (
        Select 
            distinct on (a.missingminute) 
            c.id, 
            a.missingminute as time, 
            a.missingminute as minute, 
            c.value, 
            true as backfill 
        from (
            SELECT 
                date_trunc('minute', time.time) as missingminute
            FROM 
                generate_series((select min(minute) from overview2),(select max(minute) from overview2),'1 minute'::interval) time 
            left join (
                select distinct 
                    minute 
                from 
                    overview2
                ) b 
            on 
                date_trunc('minute', time) = b.minute 
            where 
                b.minute isnull
            ) a 
        left join 
            main c 
        on 
            a.missingminute > c.time 
        order by 
            a.missingminute, 
            c.time desc
        ) 
    order by 
        time
    )
select 
    t1.id, 
    t1.minute as minute_start, 
    t1.minute + interval '1 minute' as minute_end, 
    t1.backfill as start_backfill,
    t1.start, 
    t2.end, 
    coalesce(t3.min, t1.start) as min, 
    coalesce(t3.max, t1.start) as max 
from 
    (select distinct on (id, minute) id, minute, value as start, backfill from overview3 order by id, minute, time asc) t1 
left join
    (select distinct on (id, minute) id, minute, value as end from overview3 order by id, minute, time desc) t2 on t1.id = t2.id and t1.minute = t2.minute 
left join
    (select id, minute, min(value) min, max(value) max from overview2 group by id,minute) t3 on t1.id = t3.id and t1.minute = t3.minute

SQL要用多层的嵌套子查询和多个复杂的 join 语句实现,代码冗长难懂。SPL 直接提供了时间序列函数,按序列对齐的函数、位置相关的语法:https://try.esproc.com/splx?55o


 A
1 $select * from main.txt
2 =A1.run(Time=time@m(Time))
3 =list=periods@s(A2.min(Time),A2.max(Time),60)
4 =A2.align@a(list,Time)
5 =A4.new(list(#):start, elapse@s(start,60):end, sv=ifn(end_value[-1],~.Value):start_value,
ifn(~.m(-1).Value, sv):end_value,
ifn(~.min(Value),sv):min,
ifn(~.max(Value),sv):max)

A1:加载数据。

A2:将 Time 字段修改为整分钟数。

A3:生成连续分钟的时间序列 list。

A4:将数据按 list 对齐,每组是一个窗口的数据,有些窗口为空。

A5:生成新的二维表,用原每组记录生成一条新记录,start 按序号从 list 取值,start_value 取自上一条新记录的 end_value,该值为空(第一分钟)时取自当前组的第 1 条。end_value 取自上一组记录的最后一条,min 取自本组最小值,max 取自本组最大值。如果后三项缺失,则用 start_value(sv) 代替。[-1] 表示相对上一条,m(-1) 表示集合中的倒数第一个成员。

问题来源:https://stackoverflow.com/questions/78102526/postgressql-backfill-data-if-doesnt-exist-for-a-start-period