Spark SQL: Find the number of extensions for a record

 

Question

https://stackoverflow.com/questions/70470272/spark-sql-find-the-number-of-extensions-for-a-record

I have a dataset as below

| col1 | extension_col1 |
|------|----------------|
| 2345 | 2246           |
| 2246 | 2134           |
| 2134 | 2091           |
| 2091 | Null           |
| 1234 | 1111           |
| 1111 | Null           |

I need to find the number of extensions available for each record in col1, whereby the records are already sorted and contiguous in sets, each set terminated by a null.

The final result is as below:

| col1 | extension_col1 | No_Of_Extensions |
|------|----------------|------------------|
| 2345 | 2246           | 3                |
| 2246 | 2134           | 2                |
| 2134 | 2091           | 1                |
| 2091 | Null           | 0                |
| 1234 | 1111           | 1                |
| 1111 | Null           | 0                |

Value 2345 extends as 2345 > 2246 > 2134 > 2091 > null, and hence it has 3 extension relations excluding the null.

How do I get the 3rd column (No_Of_Extensions) using Spark SQL/Scala?

Answer

The data is already ordered on the first column, so the records only need to be split into groups whenever the previous record's second field is null, and a countdown column added within each group. Solving this in SQL alone is quite cumbersome: you first have to create row numbers, then build flag columns as required, and only then can the flags and row numbers be combined to perform the conditional grouping. The usual approach is to read the data out and process it with Python or SPL. SPL (an open-source Java package) is easier to integrate into a Java application, and the code is a little simpler, needing only two statements:


|   | A |
|---|---|
| 1 | =MYSQL.query("select * from t4") |
| 2 | =A1.group@i(#2[-1]==null).run(len=~.len(),~=~.derive(len-#:No_Of_Extensions)).conj() |
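Roughly, A1 reads the table, and A2 does the work: group@i(#2[-1]==null) starts a new group whenever the previous row's second field is null, run() stores each group's length in len, derive(len-#:No_Of_Extensions) appends a No_Of_Extensions column equal to the group length minus the row's position within the group, and conj() concatenates the groups back into a single table.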

SPL source code: https://github.com/SPLWare/esProc
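Since the asker wants Spark SQL/Scala, the following is a minimal sketch of the window-function route described above. It assumes the data is registered as a view named t4 (the same name used in the SPL example), that the displayed "Null" is a real SQL NULL, and that the data is read in the displayed order; monotonically_increasing_id() is used only to capture that physical order, which is reliable only for a single-partition read.

```sql
WITH numbered AS (
  SELECT col1, extension_col1,
         -- capture the incoming physical order of the rows
         monotonically_increasing_id() AS rn
  FROM t4
),
grouped AS (
  SELECT *,
         -- start a new set whenever the previous row's extension_col1 is NULL;
         -- the running sum of these flags becomes the set (group) id
         SUM(CASE WHEN LAG(extension_col1) OVER (ORDER BY rn) IS NULL THEN 1 ELSE 0 END)
           OVER (ORDER BY rn) AS grp
  FROM numbered
)
SELECT col1, extension_col1,
       -- group size minus position within the group = remaining extensions
       COUNT(*) OVER (PARTITION BY grp)
         - ROW_NUMBER() OVER (PARTITION BY grp ORDER BY rn) AS No_Of_Extensions
FROM grouped
ORDER BY rn;
```

The running SUM of the flag plays the role of SPL's group@i(#2[-1]==null), and COUNT(*) minus ROW_NUMBER() within each group reproduces the len - # countdown.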
