Java Stream - Retrieving repeated records from CSV

 

Question

https://stackoverflow.com/questions/68651921/java-stream-retrieving-repeated-records-from-csv

I searched the site and didn't find anything similar. I'm new to Java streams, but I understand that they can replace loops. I would like to know whether there is a way to filter a CSV file with a stream, as shown below, so that only the repeated records are included in the result, grouped by the Center field.

Initial CSV file

Id,Name,Mother,Birth,Center

1,A,A,2000-01-01,1

2,C,A,2000-01-02,1

3,P,M,2000-01-03,2

4,D,S,2000-01-04,3

5,R,H,2000-01-05,4

6,P,M,2000-01-03,2

7,A,A,2000-01-01,1

8,P,C,2000-01-08,2

9,R,I,2000-01-07,3

10,P,M,2000-01-03,2

Final result

Id,Name,Mother,Birth,Center

1,A,A,2000-01-01,1

7,A,A,2000-01-01,1

3,P,M,2000-01-03,2

6,P,M,2000-01-03,2

10,P,M,2000-01-03,2

In addition, the same pair must not appear a second time in the final result in reverse order, as shown in the table below:

This shouldn't happen

Id,Name,Mother,Birth,Center

1,A,A,2000-01-01,1

7,A,A,2000-01-01,1

7,A,A,2000-01-01,1

1,A,A,2000-01-01,1

Is there a way to do it using stream and grouping at the same time, since theoretically, two loops would be needed to perform the task?

Thanks in advance.

Answer

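The task is to find the records in the CSV that repeat on every field except Id, keep each set of duplicates together, and group them by the Center field. Implemented in Java, the code is fairly long.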
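For comparison, a plain Java Stream version might look like the sketch below. It is only an illustration, not code from the original answer: the class name `RepeatedRecords` and the method `findRepeated` are hypothetical, the rows are assumed to be simple comma-separated lines with Id as the first field and no quoted commas, and the header is assumed to have been dropped already. Each row is keyed on everything after the Id, and only groups with more than one member are kept, so a pair can never show up twice in reverse order.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.stream.Collectors;

public class RepeatedRecords {

    // Hypothetical helper: returns the rows whose non-Id fields occur more
    // than once, with duplicates kept together in order of first appearance.
    public static List<String> findRepeated(List<String> rows) {
        return rows.stream()
                // Key = everything after the first comma (Name,Mother,Birth,Center).
                // LinkedHashMap preserves the order in which keys first appear.
                .collect(Collectors.groupingBy(
                        row -> row.substring(row.indexOf(',') + 1),
                        LinkedHashMap::new,
                        Collectors.toList()))
                .values().stream()
                // Keep only the groups that contain repeated records.
                .filter(group -> group.size() > 1)
                // Flatten the surviving groups back into a single list of rows.
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample data from the question, header row omitted.
        List<String> rows = Arrays.asList(
                "1,A,A,2000-01-01,1",
                "2,C,A,2000-01-02,1",
                "3,P,M,2000-01-03,2",
                "4,D,S,2000-01-04,3",
                "5,R,H,2000-01-05,4",
                "6,P,M,2000-01-03,2",
                "7,A,A,2000-01-01,1",
                "8,P,C,2000-01-08,2",
                "9,R,I,2000-01-07,3",
                "10,P,M,2000-01-03,2");

        System.out.println("Id,Name,Mother,Birth,Center");
        findRepeated(rows).forEach(System.out::println);
    }
}
```

With the sample data, the groups come out in order of first appearance, which happens to match the Center order in the expected result. For other data, an explicit sort of the groups by Center would be needed. To read from a real file, the rows could be loaded with `Files.readAllLines` and the header line dropped before the call.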

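With SPL, an open-source package for Java, this is easy to write and takes a single statement, which imports the CSV, groups the rows by the non-Id fields, keeps the groups with more than one member, and concatenates them: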


    A
1   =file("repeated.csv").import@ct().group(Name,Mother,Birth,Center).select(~.len()>1).conj()

SPL provides a JDBC driver that Java can call. Save the script above as repeated.splx and call the script file from Java as if it were a stored procedure:

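// Load the esProc JDBC driver and open a local connection
Class.forName("com.esproc.jdbc.InternalDriver");
Connection con = DriverManager.getConnection("jdbc:esproc:local://");
// Call repeated.splx as a stored procedure
PreparedStatement st = con.prepareCall("call repeated()");
st.execute();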

Or execute the SPL string directly in Java, SQL-style:

st = con.prepareStatement("==file(\"repeated.csv\").import@ct().group(Name,Mother,Birth,Center).select(~.len()>1).conj()");
st.execute();

SPL source code: https://github.com/SPLWare/esProc
