二分法在有序文件中查找

【问题】

I am working with a relatively large text file (70GB uncompressed, 15GB gzipped) which contains 3 columns. The lines of the file are of the form:

x1 | y1 | a1

x1 | y2 | a2

x2 | y3 | a3

x3 | y4 | a4

The x and y are sequences of words that can contain between 1 and 4 words. The strings in the first column are sorted and not unique. The strings in the second column are not unique, and sorted for a same element in the first column.

There are about 700,000,000 lines in the uncompressed text file, and what I want to do is, for a query of tuples (x, y), get the value a in the third column. I need to access it in a time as short as possible.

What I tried to do to create 2 Dictionaries of (strings, List of integers), so that the first dictionary maps a string to the index of the lines that contain this string in the first column, and the same for the second dictionary and the second column. Then for a query (x, y) I can intersect these two lists and get the line that contains “x | y | a”. I can then use a dictionary that maps a line number to its offset in the file and use random access file to read the line.

The problem is that this requires way too much memory (maybe it’s also because I’m using Java!). I am looking for a solution that can query the text file very quickly, but doesn’t require more than 20 / 30 GB of ram.

I guess there are methods to do this kind of things but I’m not familiar with them. Any ideas?

【回答】

可以使用二分法快速定位已排序大文件中的记录,但 Java 直接编写比较麻烦,要处理半条记录的情况,比如文件有 70G,在 35G 的位置上可能是记录的一部,这时要进行去头补尾的操作才能正确定位记录。可以用 SPL 辅助实现,代码更直观易懂:


A

B

1

=file("D:\\filebig.txt")

=[argX,argY]

2

=A1.iselect@t([B1],[#1,#2];;"|").fetch@x()



其中,argX 和 argY 是输入参数,代表前两个字段。#1,#2 表示记录的第 1 个和第 2 个字段。

然后通过 iselect 函数使用二分法定位文件中的记录。

关于文件游标,参考【集算器游标处理大文本文件

之后可用 JDBC 读取集算器的计算结果,详见【集算器集成与应用之被 JAVA 调用】。