从 SQL 到 SPL:Fetch values from previous non-null value rows

数据库表 organisation_user_link 存储着账户的当前状态,其中 dossier_created 是账户创建时间。

id

user_id

organisation_id

status_id

stopped_reason_id

dossier_created

1

3

73

2


2021-10-29 07:50:21

2

9

1199

4

5

2021-05-19 17:44:07

库表 organisation_user_link_status_history 存储着账户状态的变更历史。

timestamp

user_id

organisation_id

status_id

stopped_reason_id

2024-03-11 12:05:30

3

73

1


2024-03-08 11:15:35

3

73

3


2024-03-05 13:25:40

3

73

4

3

2024-03-13 02:07:10

9

1199

1


2024-03-11 02:07:10

9

1199

2


现在要根据指定的查询日期,列出今天(2024-03-14)到查询日期(2024-03-01)这段日期区间内,每天的账户状态。要求:合理地补充上空白日期的状态,比如,今天到最近的变更日期,要补成今天的,最近的变更日期到次近的变更日期,要补成最近的;补充账户创建日期;最后按账户和日期逆序进行排序。

date

user_id

organisation_id

status_id

stopped_reason_id

dossier_created

2024-03-14

3

73

2


2021-10-29 07:50:21

2024-03-14

9

1199

4

5

2021-05-19 17:44:07

2024-03-13

3

73

2


2021-10-29 07:50:21

2024-03-13

9

1199

1


2021-05-19 17:44:07

2024-03-12

3

73

2


2021-10-29 07:50:21

2024-03-12

9

1199

1


2021-05-19 17:44:07

2024-03-11

3

73

1


2021-10-29 07:50:21

2024-03-11

9

1199

2


2021-05-19 17:44:07

2024-03-10

3

73

1


2021-10-29 07:50:21

2024-03-10

9

1199

2


2021-05-19 17:44:07

2024-03-09

3

73

1


2021-10-29 07:50:21

2024-03-09

9

1199

2


2021-05-19 17:44:07

2024-03-08

3

73

3


2021-10-29 07:50:21

2024-03-08

9

1199

2


2021-05-19 17:44:07

2024-03-07

3

73

3


2021-10-29 07:50:21

2024-03-07

9

1199

2


2021-05-19 17:44:07

2024-03-06

3

73

3


2021-10-29 07:50:21

2024-03-06

9

1199

2


2021-05-19 17:44:07

2024-03-05

3

73

4

3

2021-10-29 07:50:21

2024-03-05

9

1199

2


2021-05-19 17:44:07

2024-03-04

3

73

4

3

2021-10-29 07:50:21

2024-03-04

9

1199

2


2021-05-19 17:44:07

2024-03-03

3

73

4

3

2021-10-29 07:50:21

2024-03-03

9

1199

2


2021-05-19 17:44:07

2024-03-02

3

73

4

3

2021-10-29 07:50:21

2024-03-02

9

1199

2


2021-05-19 17:44:07

2024-03-01

3

73

4

3

2021-10-29 07:50:21

2024-03-01

9

1199

2


2021-05-19 17:44:07

SQL:

WITH RECURSIVE dates ( date ) AS (
    SELECT DATE('2024-03-01')
    UNION ALL
    SELECT DATE(date) + INTERVAL 1 DAY
    FROM dates
    WHERE DATE(DATE) < (NOW() - INTERVAL 1 DAY)
),
current_history_data_query AS (
    SELECT 
        current_history_data.*
    FROM (
        SELECT
           DATE(timestamp) AS date,
           user_id,
           organisation_id,
           status_id,
           stopped_reason_id,
           dossier_created,
           'history-data' AS src
         FROM (
           SELECT
               oulsh.user_id,
               oulsh.organisation_id,
               oulsh.timestamp,
               oulsh.status_id,
               oulsh.stopped_reason_id,
               oul.dossier_created,
               ROW_NUMBER() OVER (PARTITION BY oulsh.user_id, oulsh.organisation_id, DATE(oulsh.timestamp) ORDER BY oulsh.timestamp DESC) AS row_num
           FROM organisation_user_link_status_history AS oulsh
           INNER JOIN organisation_user_link AS oul ON oulsh.user_id = oul.user_id AND oulsh.organisation_id = oul.organisation_id
         ) AS numbered_rows
         WHERE row_num = 1 AND DATE(timestamp) != DATE(NOW())
        
         UNION ALL
        
         SELECT CURRENT_DATE AS date, oul.user_id, oul.organisation_id, oul.status_id, oul.stopped_reason_id, oul.dossier_created, 'current-data' AS src
         FROM organisation_user_link AS oul
    ) AS current_history_data
    ORDER BY DATE DESC
)
SELECT d.date, u.user_id, u.organisation_id,
  (
    SELECT status_id
    FROM current_history_data_query
    WHERE user_id = u.user_id
    AND organisation_id = u.organisation_id
    AND date >= d.date
    ORDER BY date ASC
    LIMIT 1
  ) AS status_id,
  (
    SELECT stopped_reason_id
    FROM current_history_data_query
    WHERE user_id = u.user_id
    AND organisation_id = u.organisation_id
    AND date >= d.date
    ORDER BY date ASC
    LIMIT 1
  ) AS stopped_reason_id,
  (
    SELECT dossier_created
    FROM current_history_data_query
    WHERE user_id = u.user_id
    AND organisation_id = u.organisation_id
    AND date >= d.date
    ORDER BY date ASC
    LIMIT 1
  ) AS dossier_created
FROM dates d
JOIN (SELECT DISTINCT user_id, organisation_id FROM organisation_user_link) u
ORDER BY d.date DESC, u.user_id;

SQL 使用了递归子查询造出日期序列,结构复杂,使用多层嵌套查询和窗口函数在状态变更时打标记,再用 join 语句填充空白日期的数据,代码繁琐。 SPL 对账户分组后可以不汇总,而是对分组子集继续计算,SPL 提供了生成日期序列的函数、按日期序列生成记录的函数:https://try.esproc.com/splx?3Li


 A

1

$select date('2024-03-14') as date,user_id,organisation_id,status_id,stopped_reason_id,dossier_created from organisation_user_link.txt

2

$select timestamp as date,user_id,organisation_id,status_id,stopped_reason_id,null as dossier_created from organisation_user_link_status_history.txt order by date desc

3

=(A1|A2).group(user_id)

4

=A3.conj(~.news(periods@x(~.date,ifn(~[1].date,date("2024-02-29")),-1);date(~):date,user_id,organisation_id,status_id,stopped_reason_id,A3.dossier_created))

5

=A4.sort(-date,user_id)

A1, A2:加载数据。当前状态表追加了 data 字段,值为今天的日期。历史状态表追加了账户创建日期,值为空。

A3:合并当前状态和历史状态,按账户分组,但不汇总。

A4:处理每组数据:根据每条记录和下一条记录的日期,生成一个日期序列,如果是最后一条记录,则和查询日期的前一天生成日期序列;对每个日期序列生成一批新记录,其中创建日期来自于本组的第一条记录。函数 periods 生成日期序列,@x 表示不含后端点。函数 news 对当前组的每条记录生成一批新记录。

问题来源:https://stackoverflow.com/questions/78159534/fetch-values-from-previous-non-null-value-rows