Aggregate queries over large text files

【Question】

I want to:

· read a few hundred GB from CSV files > convert some columns to a specific format > save to a DB. This should run a few times a day.

· query the DB with 4 parameters and order the results by an attribute (eg PRICE ASC), then pick the TOP entry.

Postgres performs fine in general, meaning 1-2 GB/min (using Java). Selects are also quite fast when I put an index on the 4 parameters used for the select.

Still, it will take a “long” time when importing a few hundred GBs.

Question: could it be worthwhile to try the same using a NoSQL engine like Apache Cassandra? I mean:

· could a NoSQL DB be faster for mass inserts of data?

· can I retrieve a row from NoSQL based on a few parameters as quickly as from an indexed Postgres DB?

【Answer】

This 100 GB of data changes several times a day. Re-importing it into a database every time is very time-consuming, and NoSQL is no different: the import time will far exceed the query time.

With esProc SPL you can compute directly against the CSV files. It supports multithreaded computation and cursor-based computation for handling large files, for example:

 	A
1	=file("D:/data.csv").cursor@tmc()
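
The cursor idea, translated to plain Java, is to read the file lazily rather than loading it all at once. A minimal sketch (the file path and header layout are assumptions for illustration):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

public class CsvCursor {
    // Cursor-style reading: Files.lines streams the file on demand, so it
    // is never held fully in memory; .parallel() spreads the downstream
    // per-line work across threads, loosely analogous to the multithreaded
    // cursor described above.
    public static long countDataRows(Path csv) throws IOException {
        try (Stream<String> lines = Files.lines(csv)) {
            return lines.skip(1)              // skip the header row
                        .parallel()
                        .filter(l -> !l.isBlank())
                        .count();
        }
    }
}
```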

It also offers rich structured/semi-structured computation capabilities, enough to implement the type conversions and the algorithm you listed, for example:

A2=A1.select(file1>1 && like(field2, "*some*")).total(top(4; PRICE))

This first filters the records that meet the conditions, then aggregates over the records in the cursor and takes the top 4.
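
The same filter-then-top-4 algorithm can be done in plain Java with a bounded heap, so only 4 rows are ever kept in memory regardless of file size. A sketch, assuming a hypothetical column layout of (field1, field2, price):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.Stream;

public class CsvTopN {
    // Stream a large CSV line by line, filter rows, and keep only the
    // 4 cheapest by price (the "ORDER BY PRICE ASC ... pick TOP" query).
    public static List<String[]> top4ByPrice(Path csv) throws IOException {
        // Max-heap on price: the priciest of the kept 4 is evicted first.
        PriorityQueue<String[]> heap = new PriorityQueue<>(
            Comparator.comparingDouble(
                (String[] r) -> Double.parseDouble(r[2])).reversed());
        try (Stream<String> lines = Files.lines(csv)) {
            lines.skip(1)                        // skip the header row
                 .map(l -> l.split(","))
                 .filter(r -> Double.parseDouble(r[0]) > 1
                           && r[1].contains("some"))
                 .forEach(r -> {
                     heap.offer(r);
                     if (heap.size() > 4) heap.poll();  // drop the priciest
                 });
        }
        List<String[]> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble(r -> Double.parseDouble(r[2])));
        return result;                           // ascending by price
    }
}
```

The heap keeps memory use constant, so this scales to files of any size; the first element of the result is the TOP entry by PRICE ASC.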

Of course, besides the style above, esProc SPL can also query text files directly with SQL, e.g.:

 	A
1	$select * from data.csv where file1>1 and field2 like "*some*" order by PRICE limit 4