从 SQL 到 SPL:Fetch values from previous non-null value rows
数据库表 organisation_user_link 存储着账户的当前状态,其中 dossier_created 是账户创建时间。
id |
user_id |
organisation_id |
status_id |
stopped_reason_id |
dossier_created |
1 |
3 |
73 |
2 |
2021-10-29 07:50:21 |
|
2 |
9 |
1199 |
4 |
5 |
2021-05-19 17:44:07 |
库表 organisation_user_link_status_history 存储着账户状态的变更历史。
timestamp |
user_id |
organisation_id |
status_id |
stopped_reason_id |
2024-03-11 12:05:30 |
3 |
73 |
1 |
|
2024-03-08 11:15:35 |
3 |
73 |
3 |
|
2024-03-05 13:25:40 |
3 |
73 |
4 |
3 |
2024-03-13 02:07:10 |
9 |
1199 |
1 |
|
2024-03-11 02:07:10 |
9 |
1199 |
2 |
现在要根据指定的查询日期,列出今天(2024-03-14)到查询日期(2024-03-01)这段日期区间内,每天的账户状态。要求:合理地补充上空白日期的状态,比如,今天到最近的变更日期,要补成今天的,最近的变更日期到次近的变更日期,要补成最近的;补充账户创建日期;最后按账户和日期逆序进行排序。
date |
user_id |
organisation_id |
status_id |
stopped_reason_id |
dossier_created |
2024-03-14 |
3 |
73 |
2 |
2021-10-29 07:50:21 |
|
2024-03-14 |
9 |
1199 |
4 |
5 |
2021-05-19 17:44:07 |
2024-03-13 |
3 |
73 |
2 |
2021-10-29 07:50:21 |
|
2024-03-13 |
9 |
1199 |
1 |
2021-05-19 17:44:07 |
|
2024-03-12 |
3 |
73 |
2 |
2021-10-29 07:50:21 |
|
2024-03-12 |
9 |
1199 |
1 |
2021-05-19 17:44:07 |
|
2024-03-11 |
3 |
73 |
1 |
2021-10-29 07:50:21 |
|
2024-03-11 |
9 |
1199 |
2 |
2021-05-19 17:44:07 |
|
2024-03-10 |
3 |
73 |
1 |
2021-10-29 07:50:21 |
|
2024-03-10 |
9 |
1199 |
2 |
2021-05-19 17:44:07 |
|
2024-03-09 |
3 |
73 |
1 |
2021-10-29 07:50:21 |
|
2024-03-09 |
9 |
1199 |
2 |
2021-05-19 17:44:07 |
|
2024-03-08 |
3 |
73 |
3 |
2021-10-29 07:50:21 |
|
2024-03-08 |
9 |
1199 |
2 |
2021-05-19 17:44:07 |
|
2024-03-07 |
3 |
73 |
3 |
2021-10-29 07:50:21 |
|
2024-03-07 |
9 |
1199 |
2 |
2021-05-19 17:44:07 |
|
2024-03-06 |
3 |
73 |
3 |
2021-10-29 07:50:21 |
|
2024-03-06 |
9 |
1199 |
2 |
2021-05-19 17:44:07 |
|
2024-03-05 |
3 |
73 |
4 |
3 |
2021-10-29 07:50:21 |
2024-03-05 |
9 |
1199 |
2 |
2021-05-19 17:44:07 |
|
2024-03-04 |
3 |
73 |
4 |
3 |
2021-10-29 07:50:21 |
2024-03-04 |
9 |
1199 |
2 |
2021-05-19 17:44:07 |
|
2024-03-03 |
3 |
73 |
4 |
3 |
2021-10-29 07:50:21 |
2024-03-03 |
9 |
1199 |
2 |
2021-05-19 17:44:07 |
|
2024-03-02 |
3 |
73 |
4 |
3 |
2021-10-29 07:50:21 |
2024-03-02 |
9 |
1199 |
2 |
2021-05-19 17:44:07 |
|
2024-03-01 |
3 |
73 |
4 |
3 |
2021-10-29 07:50:21 |
2024-03-01 |
9 |
1199 |
2 |
2021-05-19 17:44:07 |
SQL:
WITH RECURSIVE dates ( date ) AS (
SELECT DATE('2024-03-01')
UNION ALL
SELECT DATE(date) + INTERVAL 1 DAY
FROM dates
WHERE DATE(DATE) < (NOW() - INTERVAL 1 DAY)
),
current_history_data_query AS (
SELECT
current_history_data.*
FROM (
SELECT
DATE(timestamp) AS date,
user_id,
organisation_id,
status_id,
stopped_reason_id,
dossier_created,
'history-data' AS src
FROM (
SELECT
oulsh.user_id,
oulsh.organisation_id,
oulsh.timestamp,
oulsh.status_id,
oulsh.stopped_reason_id,
oul.dossier_created,
ROW_NUMBER() OVER (PARTITION BY oulsh.user_id, oulsh.organisation_id, DATE(oulsh.timestamp) ORDER BY oulsh.timestamp DESC) AS row_num
FROM organisation_user_link_status_history AS oulsh
INNER JOIN organisation_user_link AS oul ON oulsh.user_id = oul.user_id AND oulsh.organisation_id = oul.organisation_id
) AS numbered_rows
WHERE row_num = 1 AND DATE(timestamp) != DATE(NOW())
UNION ALL
SELECT CURRENT_DATE AS date, oul.user_id, oul.organisation_id, oul.status_id, oul.stopped_reason_id, oul.dossier_created, 'current-data' AS src
FROM organisation_user_link AS oul
) AS current_history_data
ORDER BY DATE DESC
)
SELECT d.date, u.user_id, u.organisation_id,
(
SELECT status_id
FROM current_history_data_query
WHERE user_id = u.user_id
AND organisation_id = u.organisation_id
AND date >= d.date
ORDER BY date ASC
LIMIT 1
) AS status_id,
(
SELECT stopped_reason_id
FROM current_history_data_query
WHERE user_id = u.user_id
AND organisation_id = u.organisation_id
AND date >= d.date
ORDER BY date ASC
LIMIT 1
) AS stopped_reason_id,
(
SELECT dossier_created
FROM current_history_data_query
WHERE user_id = u.user_id
AND organisation_id = u.organisation_id
AND date >= d.date
ORDER BY date ASC
LIMIT 1
) AS dossier_created
FROM dates d
JOIN (SELECT DISTINCT user_id, organisation_id FROM organisation_user_link) u
ORDER BY d.date DESC, u.user_id;
SQL 使用了递归子查询造出日期序列,结构复杂,使用多层嵌套查询和窗口函数在状态变更时打标记,再用 join 语句填充空白日期的数据,代码繁琐。 SPL 对账户分组后可以不汇总,而是对分组子集继续计算,SPL 提供了生成日期序列的函数、按日期序列生成记录的函数:https://try.esproc.com/splx?3Li
A |
|
1 |
$select date('2024-03-14') as date,user_id,organisation_id,status_id,stopped_reason_id,dossier_created from organisation_user_link.txt |
2 |
$select timestamp as date,user_id,organisation_id,status_id,stopped_reason_id,null as dossier_created from organisation_user_link_status_history.txt order by date desc |
3 |
=(A1|A2).group(user_id) |
4 |
=A3.conj(~.news(periods@x(~.date,ifn(~[1].date,date("2024-02-29")),-1);date(~):date,user_id,organisation_id,status_id,stopped_reason_id,A3.dossier_created)) |
5 |
=A4.sort(-date,user_id) |
A1, A2:加载数据。当前状态表追加了 data 字段,值为今天的日期。历史状态表追加了账户创建日期,值为空。
A3:合并当前状态和历史状态,按账户分组,但不汇总。
A4:处理每组数据:根据每条记录和下一条记录的日期,生成一个日期序列,如果是最后一条记录,则和查询日期的前一天生成日期序列;对每个日期序列生成一批新记录,其中创建日期来自于本组的第一条记录。函数 periods 生成日期序列,@x 表示不含后端点。函数 news 对当前组的每条记录生成一批新记录。
问题来源:https://stackoverflow.com/questions/78159534/fetch-values-from-previous-non-null-value-rows
英文版 https://c.esproc.com/article/1741080653789