Spark SQL: find the number of extensions for a record
Question
https://stackoverflow.com/questions/70470272/spark-sql-find-the-number-of-extensions-for-a-record
I have a dataset as below:

col1 | extension_col1
2345 | 2246
2246 | 2134
2134 | 2091
2091 | Null
1234 | 1111
1111 | Null
I need to find the number of extensions available for each record in col1. The records are already sorted, and they form contiguous sets, each terminated by a Null.
The final result should be as below:

col1 | extension_col1 | No_Of_Extensions
2345 | 2246 | 3
2246 | 2134 | 2
2134 | 2091 | 1
2091 | Null | 0
1234 | 1111 | 1
1111 | Null | 0
Value 2345 extends as 2345 > 2246 > 2134 > 2091 > Null, so it has 3 extension relations, excluding the Null.
How to get the 3rd column (No_Of_Extensions) using Spark SQL/Scala?
Answer
The data is ordered on the first column; whenever the previous record's second field is null, a new group starts, and within each group a sequence number is added as required. Solving this in pure SQL is quite cumbersome: you have to create row numbers first, then build a flag column from them, and finally combine the flag column with the row numbers to group the rows conditionally. The usual approach is to read the data out and process it with Python or SPL. SPL (an open-source Java package) is easier to integrate into a Java application, and the code is simpler, taking only the two lines below:
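For reference, the SQL route sketched above (row numbers, a flag column, conditional grouping) can be written with standard window functions. The sketch below runs on SQLite purely for illustration; Spark SQL supports the same window constructs. The table name `t4` is taken from the SPL code, while the explicit ordering column `rn` is an assumption of this sketch, since a SQL table has no inherent row order.

```python
import sqlite3

# Sample data from the question. rn is an assumed ordering column: the
# original row order must be materialized before window functions can use it.
rows = [
    (1, 2345, 2246),
    (2, 2246, 2134),
    (3, 2134, 2091),
    (4, 2091, None),
    (5, 1234, 1111),
    (6, 1111, None),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t4 (rn INTEGER, col1 INTEGER, extension_col1 INTEGER)")
con.executemany("INSERT INTO t4 VALUES (?, ?, ?)", rows)

# Step 1: a running count of the NULL terminators in *preceding* rows gives
# a group id that changes right after each NULL-terminated set.
# Step 2: within a group, the distance to the group's last row (the one
# holding the NULL) is exactly the number of extensions.
query = """
WITH numbered AS (
    SELECT rn, col1, extension_col1,
           COALESCE(SUM(CASE WHEN extension_col1 IS NULL THEN 1 END)
               OVER (ORDER BY rn
                     ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS grp
    FROM t4
)
SELECT col1, extension_col1,
       MAX(rn) OVER (PARTITION BY grp) - rn AS No_Of_Extensions
FROM numbered
ORDER BY rn
"""
result = list(con.execute(query))
for row in result:
    print(row)
```

This is the "flag column plus row number" pattern the text describes; it needs two nested window passes, which is why the SPL one-liner below is more direct.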
A
1  =MYSQL.query("select * from t4")
2  =A1.group@i(#2[-1]==null).run(len=~.len(),~=~.derive(len-#:No_Of_Extensions)).conj()
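To make the two SPL lines concrete, here is a plain-Python sketch of the same logic (my reading of the SPL semantics: `group@i(#2[-1]==null)` starts a new group after each row whose second field is null, and `derive(len-#)` numbers each group's rows from `len-1` down to 0):

```python
rows = [
    (2345, 2246), (2246, 2134), (2134, 2091), (2091, None),
    (1234, 1111), (1111, None),
]

result = []
group = []
for row in rows:
    group.append(row)
    if row[1] is None:  # a null second field terminates the current set
        n = len(group)
        # Number the rows of this set from n-1 down to 0, like derive(len-#)
        for i, (c1, ext) in enumerate(group):
            result.append((c1, ext, n - 1 - i))
        group = []

for r in result:
    print(r)
```

This reproduces the expected No_Of_Extensions column: the head of each set gets the count of its extensions, and the Null-terminated row gets 0.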