Spark SQL: Find the number of extensions for a record


Question

https://stackoverflow.com/questions/70470272/spark-sql-find-the-number-of-extensions-for-a-record

I have a dataset as below

col1 extension_col1
2345 2246
2246 2134
2134 2091
2091 Null
1234 1111
1111 Null

I need to find the number of extensions available for each record in col1, where the records are already sorted and form contiguous sets, each terminated by a null.

The final result is as below:

col1 extension_col1 No_Of_Extensions
2345 2246 3
2246 2134 2
2134 2091 1
2091 Null 0
1234 1111 1
1111 Null 0

Value 2345 extends as 2345 > 2246 > 2134 > 2091 > null and hence it has 3 extension relations, excluding the null.

How to get the 3rd column (No_Of_Extensions) using Spark SQL/Scala?

Answer

For data that is ordered on the first column, split it into groups, starting a new group whenever the previous record's second column is null, and then add a sequence-number column within each group as required. Solving this in plain SQL is quite cumbersome: you first have to generate a row number, then build a flag column from it, and finally use the flag column together with the row number to perform the conditional grouping. A common alternative is to read the data out and process it with Python or SPL. SPL (an open-source Java package) is easier to integrate into a Java application and the code is simpler, only two lines:


A
1 =MYSQL.query("select * from t4")
2 =A1.group@i(#2[-1]==null).run(len=~.len(),~=~.derive(len-#:No_Of_Extensions)).conj()

SPL source code: https://github.com/SPLWare/esProc
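Since the question itself asks for Spark SQL/Scala, here is a minimal sketch of the same row-number / flag / conditional-grouping idea using the Spark DataFrame API. The seq ordering column, the object name, and the single-partition window are assumptions introduced for illustration (Spark has no intrinsic row order, so the "already sorted" order has to be made explicit); this is a sketch, not the original answer's code.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object ExtensionCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("extensions").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical "seq" column to pin down the given row order.
    val df = Seq(
      (1L, "2345", Some("2246")),
      (2L, "2246", Some("2134")),
      (3L, "2134", Some("2091")),
      (4L, "2091", None),
      (5L, "1234", Some("1111")),
      (6L, "1111", None)
    ).toDF("seq", "col1", "extension_col1")

    // Window over the whole dataset (single partition; fine for a small example).
    val byOrder = Window.orderBy("seq")

    val result = df
      // flag = 1 when the previous row ended a set (its extension was null);
      // lag() is also null on the very first row, which starts the first set.
      .withColumn("flag",
        when(lag("extension_col1", 1).over(byOrder).isNull, 1).otherwise(0))
      // Running sum of the flag gives a group id per contiguous set.
      .withColumn("grp", sum("flag").over(byOrder))
      // Number of rows after the current one within its set = No_Of_Extensions.
      .withColumn("No_Of_Extensions",
        count(lit(1)).over(
          Window.partitionBy("grp").orderBy("seq")
            .rowsBetween(1, Window.unboundedFollowing)))
      .drop("flag", "grp")

    result.orderBy("seq").show()
  }
}

With the sample data this should reproduce the expected No_Of_Extensions values (3, 2, 1, 0, 1, 0); the same logic can also be written as a Spark SQL query with LAG, a windowed SUM, and a windowed COUNT.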

Q&A collection