How to Compare Two Large CSV Files in Java
Question
https://stackoverflow.com/questions/69357566/how-to-compare-two-large-csv-file-in-java
I need to compare two large CSV files and find the differences.
The first CSV file looks like this:
c71f55b6c18248b8915d8a26
64b7d2d4eab74d7999a967c0
ceb792ad21054fe0a27ec410
95319566f9424c57ba2145f9
682a4fe26c154050b8f5c6f1
88e0209e2af74049ad9bf2bd
5c462b42763d41d7bb67029f
0ee74c227fc84e39a9ecc1da
66f7ab6f56374ba08d2fb92d
3ed793e35f9441b58562c9ba
baad81ac8ba54188afe63fb8
...
Each row has just one id, and the total row count is approximately 5 million. The second CSV file has the same format as the first, with a total row count of about 3 million.
I need to remove the ids in the second CSV from the first CSV and put the result into MongoDB. When I load all lines into memory and then compare the two CSV files, I get an out-of-memory error. I have 512 MB of memory and will receive at least 30 requests a day. The row count of each CSV varies between 1 million and 10 million. Two requests can arrive at the same time, and both have to be processed simultaneously.
Is there any other way to do this?
Thanks.
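Answer
The task is to remove from the first CSV every row that also appears in the second CSV, and the files are too large to fit in memory. Implementing this in plain Java takes a fairly long program.
With SPL, an open-source Java package, it is easy to write and takes only one statement: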
|   | A |
| 1 | =file("result.csv").export([file("csv1.csv").cursor@i().sortx(~),file("csv2.csv").cursor@i().sortx(~)].mergex@d()) |
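The script sorts each file with sortx and then takes the difference of the two sorted streams with mergex@d. For comparison, the same sort-then-merge idea can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions, not how SPL implements it: the class name CsvDiffSketch, its method names, and the chunkSize parameter are all hypothetical, each input is assumed to hold one id per line, and ids are compared as plain strings.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper class; none of these names come from SPL or the question.
class CsvDiffSketch {

    /** Current line of one sorted run together with the reader it came from. */
    private static final class Head {
        final String line;
        final BufferedReader reader;
        Head(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
    }

    /** Sorts a one-id-per-line file on disk, holding at most chunkSize lines in memory. */
    static Path externalSort(Path input, int chunkSize) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) continue;              // skip blank lines
                chunk.add(line);
                if (chunk.size() >= chunkSize) { runs.add(writeRun(chunk)); chunk.clear(); }
            }
            if (!chunk.isEmpty()) runs.add(writeRun(chunk));
        }
        return mergeRuns(runs);
    }

    /** Sorts one chunk in memory and writes it to a temporary run file. */
    static Path writeRun(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path run = Files.createTempFile("run", ".txt");
        Files.write(run, chunk);
        return run;
    }

    /** K-way merge of the sorted runs into a single sorted temporary file. */
    static Path mergeRuns(List<Path> runs) throws IOException {
        Path out = Files.createTempFile("sorted", ".txt");
        List<BufferedReader> readers = new ArrayList<>();
        PriorityQueue<Head> heap = new PriorityQueue<>(Comparator.comparing((Head h) -> h.line));
        try (BufferedWriter w = Files.newBufferedWriter(out)) {
            for (Path run : runs) {
                BufferedReader r = Files.newBufferedReader(run);
                readers.add(r);
                String first = r.readLine();
                if (first != null) heap.add(new Head(first, r));
            }
            while (!heap.isEmpty()) {
                Head head = heap.poll();                   // smallest pending line
                w.write(head.line);
                w.newLine();
                String next = head.reader.readLine();
                if (next != null) heap.add(new Head(next, head.reader));
            }
        } finally {
            for (BufferedReader r : readers) r.close();
            for (Path run : runs) Files.deleteIfExists(run);
        }
        return out;
    }

    /** Writes every line of sortedA that does not occur in sortedB; both must be sorted. */
    static void mergeDifference(Path sortedA, Path sortedB, Path result) throws IOException {
        try (BufferedReader a = Files.newBufferedReader(sortedA);
             BufferedReader b = Files.newBufferedReader(sortedB);
             BufferedWriter w = Files.newBufferedWriter(result)) {
            String x = a.readLine();
            String y = b.readLine();
            while (x != null) {
                int c = (y == null) ? -1 : x.compareTo(y);
                if (c < 0) { w.write(x); w.newLine(); x = a.readLine(); } // x is not in B
                else if (c > 0) { y = b.readLine(); }                     // B is behind, advance it
                else { x = a.readLine(); }  // match: drop x, keep y so repeated x are dropped too
            }
        }
    }

    /** Sorts both inputs on disk, then streams the difference csv1 minus csv2 into result. */
    static void diff(Path csv1, Path csv2, Path result, int chunkSize) throws IOException {
        Path s1 = externalSort(csv1, chunkSize);
        Path s2 = externalSort(csv2, chunkSize);
        try {
            mergeDifference(s1, s2, result);
        } finally {
            Files.deleteIfExists(s1);
            Files.deleteIfExists(s2);
        }
    }
}
```

A call such as CsvDiffSketch.diff(Paths.get("csv1.csv"), Paths.get("csv2.csv"), Paths.get("result.csv"), 500000) would keep at most chunkSize ids in memory at a time. The value 500000 is an arbitrary example and would need tuning against the 512 MB limit and the number of concurrent requests. The result file comes out sorted, and an id that occurs in the second file is removed from the output however many times it occurs in the first.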
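SPL provides a JDBC driver for Java to call. Save the script above as diff.splx, then call the script file from Java the same way you would call a stored procedure: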
…
Class.forName("com.esproc.jdbc.InternalDriver");
con = DriverManager.getConnection("jdbc:esproc:local://");
st = con.prepareCall("call diff()");
st.execute();
…
Alternatively, execute the SPL string directly from Java as if it were a SQL statement:
…
st = con.prepareStatement("==file(\"result.csv\").export([file(\"csv1.csv\").cursor@i().sortx(~),file(\"csv2.csv\").cursor@i().sortx(~)].mergex@d())");
st.execute();
…