Aggregate queries over large text files
【Question】
I want to:
· read a few hundred GB from CSV files > convert some columns to a specific format > save to a DB. This should run a few times a day.
· query the DB with 4 parameters and order the results by an attribute (e.g. PRICE ASC), then pick the TOP entry.
Postgres performs fine in general, meaning roughly 1-2 GB/min (using Java). Selects are also quite fast once I put an index on the 4 parameters used for the select.
Still, importing a few hundred GB takes a "long" time.
Question: could it be worthwhile to try the same with a NoSQL engine like Apache Cassandra? I mean:
· could a NoSQL DB be faster for mass inserts of data?
· can I retrieve a row from a NoSQL DB based on a few parameters as quickly as from an indexed Postgres DB?
【Answer】
This 100 GB of data changes several times a day. Re-importing it into a database each time is extremely time-consuming, and a NoSQL database would be no different: the import time far exceeds the query time.
With esProc (集算器) you can compute directly against the CSV files instead. It supports multi-threaded computation and cursor-based computation for processing large files, for example:
|  | A |
| --- | --- |
| 1 | =file("D:/data.csv").cursor@tmc() |
esProc also offers rich structured and semi-structured computing capabilities, which can handle the type conversion and the algorithm you listed, for example:
A2=A1.select(file1>1 && like(field2, "*some*")).total(top(4; PRICE))
This first filters the records that meet the conditions, then aggregates over the records in the cursor, taking the top 4 by PRICE.
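If a type conversion has to happen before the aggregation, it can be chained in as another step on the cursor. This is a sketch reusing the hypothetical field names from above, and it assumes run() can be attached to a cursor as a deferred computation, so nothing is read until the final total() actually traverses the file:

|  | A |
| --- | --- |
| 1 | =file("D:/data.csv").cursor@tmc() |
| 2 | =A1.run(PRICE=number(PRICE)) |
| 3 | =A2.select(file1>1 && like(field2, "*some*")).total(top(4; PRICE)) |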
Of course, besides the style above, esProc can also run SQL directly against text files, for example:
|  | A |
| --- | --- |
| 1 | $select * from data.csv where file1>1 and field2 like "*some*" order by PRICE limit 4 |
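Since the stated goal is to pick only the TOP entry after ordering by PRICE ASC, the same query with limit 1 would return exactly that one row:

|  | A |
| --- | --- |
| 1 | $select * from data.csv where file1>1 and field2 like "*some*" order by PRICE limit 1 |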