Aggregate queries over large text files
【Question】
I want to:
· read a few hundred GB from CSV files > convert some columns to a specific format > save to a DB. This should run a few times a day.
· query the DB with 4 parameters and order the results by an attribute (e.g. PRICE ASC), then pick the TOP entry.
Postgres performs fine in general, meaning roughly 1-2 GB/min (using Java). Selects are also quite fast once I put an index on the 4 parameters used for the select.
Still, importing a few hundred GB takes a "long" time.
Question: could it be worthwhile to try the same with a NoSQL engine like Apache Cassandra? I mean:
· could a NoSQL DB be faster for mass inserts of data?
· can I retrieve a row from a NoSQL DB based on a few parameters as quickly as from an indexed Postgres DB?
【Answer】
This 100 GB of data changes several times a day. Re-importing it into a database each time is extremely time-consuming, and a NoSQL database would be no different: the import time far exceeds the query time.
With esProc (集算器) you can compute directly against the CSV files instead. It supports multi-threaded computation and cursor-based computation for processing large files, for example:
|  | A |
| --- | --- |
| 1 | =file("D:/data.csv").cursor@tmc() |
esProc also offers rich structured and semi-structured computing capabilities, which can handle the type conversion and the algorithm you listed, for example:
A2=A1.select(file1>1 && like(field2, "*some*")).total(top(4; PRICE))
This first filters the records that meet the conditions, then aggregates over the records in the cursor, taking the top 4 by PRICE.
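If a type conversion has to happen before the aggregation, it can be chained in as another step on the cursor. This is a sketch reusing the hypothetical field names from above, and it assumes run() can be attached to a cursor as a deferred computation, so nothing is read until the final total() actually traverses the file:

|  | A |
| --- | --- |
| 1 | =file("D:/data.csv").cursor@tmc() |
| 2 | =A1.run(PRICE=number(PRICE)) |
| 3 | =A2.select(file1>1 && like(field2, "*some*")).total(top(4; PRICE)) |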
Of course, besides the style above, esProc can also run SQL directly against text files, for example:
|  | A |
| --- | --- |
| 1 | $select * from data.csv where file1>1 and field2 like "*some*" order by PRICE limit 4 |
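Since the stated goal is to pick only the TOP entry after ordering by PRICE ASC, the same query with limit 1 would return exactly that one row:

|  | A |
| --- | --- |
| 1 | $select * from data.csv where file1>1 and field2 like "*some*" order by PRICE limit 1 |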