从 SQL 到 SPL:Find consecutive alphabetical characters in string
某库表的字段是字符串。
VALUE |
Test |
Test1 |
Tesssst |
TTTTest |
ABCDTest |
Testuvwxyz |
现在要找出至少含有 5 个连续按字母表排序的字母的字符串,或者说,连续升序排列的子串中,最长的子串的长度大于等于 5 的字符串。注意,排除掉非字母的字符。
VALUE |
ABCDTest |
Testuvwxyz |
SQL:
SELECT value
FROM table_name t
CROSS JOIN LATERAL (
SELECT MIN(lvl) AS start_pos
FROM (
SELECT LEVEL AS lvl,
SUBSTR(t.value, LEVEL, 1) AS ch
FROM DUAL
CONNECT BY LEVEL <= LENGTH(t.value)
)
MATCH_RECOGNIZE(
ORDER BY lvl
MEASURES
FIRST(lvl) AS lvl
PATTERN (first_row consecutive{3,})
DEFINE first_row AS (ch BETWEEN 'A' AND 'Z' OR ch BETWEEN 'a' AND 'z'),
consecutive AS (ch BETWEEN 'A' AND 'Z' OR ch BETWEEN 'a' AND 'z')
AND ASCII(PREV(ch)) + 1 = ASCII(ch)
)
HAVING MIN(lvl) > 0
)
SQL要用嵌套子查询 +CONNECT BY 模拟序列,再用 MATCH_RECOGNIZE 处理序列,代码繁琐难懂。SPL 直接提供了序列计算函数:https://try.esproc.com/splx?4yy
A |
|
1 |
$select * from table_name.txt |
2 |
=A1. select(VALUE.split().group@i(~<=~[-1] || !isalpha(~[-1])).max(~.len())>=5) |
A1:加载数据。
A2:将字符串按字符拆成序列,对序列分组,当当前成员小于等于上一个成员或者上个成员非字母时,开始新的分组,选出最长的组的长度大于等于 5 的字符串。函数 group 在分组后可以保持分组子集用于后续处理,而不是立即计算,@i 表示按条件分组。
问题来源:https://stackoverflow.com/questions/78151968/find-consecutive-alphabetical-characters-in-string
英文版 https://c.esproc.com/article/1741759442693