How to Compare Two Large CSV Files in Java

 

Question

https://stackoverflow.com/questions/69357566/how-to-compare-two-large-csv-file-in-java

I need to compare two large CSV files and find the differences.

The first CSV file looks like this:

c71f55b6c18248b8915d8a26

64b7d2d4eab74d7999a967c0

ceb792ad21054fe0a27ec410

95319566f9424c57ba2145f9

682a4fe26c154050b8f5c6f1

88e0209e2af74049ad9bf2bd

5c462b42763d41d7bb67029f

0ee74c227fc84e39a9ecc1da

66f7ab6f56374ba08d2fb92d

3ed793e35f9441b58562c9ba

baad81ac8ba54188afe63fb8

...

Each row has just one ID, and the total row count is approximately 5 million. The second CSV file looks like the first one, with a total row count of 3 million.

I need to remove the IDs of the second CSV from the first CSV and put them into MongoDB. When I load all lines into memory and then compare the two CSV files, I get an out-of-memory error. I have 512 MB of memory and will get at least 30 requests a day. The row count of each CSV varies from 1 million to 10 million. I can receive two requests at the same time and have to process both simultaneously.

Is there any other way to do this?

Thanks.

Answer

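The task is to remove from the first CSV every row that also appears in the second CSV, and the files are too large to fit in memory. Implementing this in plain Java takes fairly long code.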
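To give a sense of what the plain Java route involves: when the IDs do not fit in memory, a common first step is an external merge sort, which sorts fixed-size chunks in memory, spills each chunk to a temporary file, and then merges the sorted runs. The following is only a minimal sketch under stated assumptions, not the way SPL implements it. The class name `ExternalSort`, the method `sort`, and the `chunkSize` parameter are hypothetical names chosen for illustration, and lines are compared with plain `String.compareTo`.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper: sorts a line-oriented file with bounded memory. */
final class ExternalSort {

    /** Sorts the lines of input into output, holding at most chunkSize lines in memory. */
    static void sort(Path input, Path output, int chunkSize) throws IOException {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Path> runs = new ArrayList<>();
        try {
            // Phase 1: cut the input into sorted runs, each stored in a temp file.
            try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                List<String> chunk = new ArrayList<>(chunkSize);
                String line;
                while ((line = in.readLine()) != null) {
                    chunk.add(line);
                    if (chunk.size() == chunkSize) {
                        runs.add(writeRun(chunk));
                        chunk.clear();
                    }
                }
                if (!chunk.isEmpty()) {
                    runs.add(writeRun(chunk));
                }
            }
            // Phase 2: k-way merge of all runs into the output file.
            mergeRuns(runs, output);
        } finally {
            for (Path run : runs) {
                Files.deleteIfExists(run);
            }
        }
    }

    /** Sorts one chunk in memory and writes it to a new temp file. */
    private static Path writeRun(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path run = Files.createTempFile("run-", ".txt");
        Files.write(run, chunk, StandardCharsets.UTF_8);
        return run;
    }

    /** The current line of one run together with the reader it came from. */
    private static final class Head {
        final String line;
        final BufferedReader source;

        Head(String line, BufferedReader source) {
            this.line = line;
            this.source = source;
        }
    }

    /** Merges sorted runs; only one line per run is in memory at a time. */
    private static void mergeRuns(List<Path> runs, Path output) throws IOException {
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            PriorityQueue<Head> heap = new PriorityQueue<>((x, y) -> x.line.compareTo(y.line));
            for (Path run : runs) {
                BufferedReader reader = Files.newBufferedReader(run, StandardCharsets.UTF_8);
                readers.add(reader);
                String first = reader.readLine();
                if (first != null) {
                    heap.add(new Head(first, reader));
                }
            }
            while (!heap.isEmpty()) {
                Head smallest = heap.poll();
                out.write(smallest.line);
                out.newLine();
                String next = smallest.source.readLine();
                if (next != null) {
                    heap.add(new Head(next, smallest.source));
                }
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
        }
    }
}
```

A call such as `ExternalSort.sort(Paths.get("csv1.csv"), Paths.get("csv1.sorted"), 500_000)` would keep at most 500,000 lines in memory at once. The chunk size here is an arbitrary example and would need tuning against the 512 MB limit and the two concurrent requests mentioned in the question.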

With SPL, an open-source package for Java, this is easy to write and takes only one statement:


    A
1   =file("result.csv").export([file("csv1.csv").cursor@i().sortx(~),file("csv2.csv").cursor@i().sortx(~)].mergex@d())
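The script sorts both files with `sortx` and then uses `mergex@d` to keep what is in the first sorted cursor but not in the second. For readers who want to see the idea behind that last step in plain Java, here is a rough sketch of a streaming difference over two inputs that are already sorted. It is an illustration only, not SPL's implementation. The class name `SortedDiff` and the method `difference` are hypothetical, the ordering is assumed to be `String.compareTo`, and every line of the first input that matches a line of the second is dropped, including repeated ones.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;

/** Hypothetical helper: writes the lines of sorted input A that do not occur in sorted input B. */
final class SortedDiff {

    /**
     * Both readers must yield lines in ascending String.compareTo order.
     * Only one line per input is held in memory at a time.
     * Returns the number of lines written.
     */
    static long difference(BufferedReader a, BufferedReader b, Writer out) throws IOException {
        long written = 0;
        String lineA = a.readLine();
        String lineB = b.readLine();
        while (lineA != null) {
            // Advance B until it reaches or passes the current line of A.
            while (lineB != null && lineB.compareTo(lineA) < 0) {
                lineB = b.readLine();
            }
            // Keep the line of A only if B has no equal line.
            if (lineB == null || !lineB.equals(lineA)) {
                out.write(lineA);
                out.write('\n');
                written++;
            }
            lineA = a.readLine();
        }
        return written;
    }
}
```

Because each input is read once from front to back, memory use stays constant regardless of the row count. Both inputs must be sorted beforehand, for example with an external sort, otherwise the result is wrong.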

SPL provides a JDBC driver that Java can call. Save the script above as diff.splx, then call the script file from Java the way a stored procedure is called:

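// Needs java.sql.Connection, java.sql.DriverManager and java.sql.CallableStatement
Class.forName("com.esproc.jdbc.InternalDriver");
Connection con = DriverManager.getConnection("jdbc:esproc:local://");
// Run diff.splx as if it were a stored procedure
CallableStatement st = con.prepareCall("call diff()");
st.execute();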

Alternatively, execute the SPL string directly from Java, the way an SQL statement is run:

PreparedStatement ps = con.prepareStatement("==file(\"result.csv\").export([file(\"csv1.csv\").cursor@i().sortx(~),file(\"csv2.csv\").cursor@i().sortx(~)].mergex@d())");
ps.execute();

SPL source code: https://github.com/SPLWare/esProc
