How to read and compute on large text data files
【Question】
I am new to Revolution R, so I have a basic question. I am trying to open a large CSV file (13 GB), a dataset from a Kaggle competition.
R is not able to open it, so I turned to Revolution R Enterprise. How can I read a CSV file on my system, convert it to XDF format, and load it in Revolution R Enterprise for further analysis?
My file path is "C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv".
I tried something like this but got an error:
sampleDataDir <- rxGetOption("Kaggle")
inputFile <- file.path("C:\\Users\\admin\\Desktop\\Kaggle\\dog_1_both_marked.csv", "dog_1_both_marked.csv")
outputFile <- file.path(tempdir(), "basicClaims.xdf")
rxTextToXdf(inFile = inputFile, outFile = outputFile, overwrite = TRUE)
rxGetInfo(data = outputFile, getVarInfo = TRUE, numRows = 100000)
file.remove(outputFile)
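【Answer】
R can read a large file in segments and can also process it in parallel, but the code is cumbersome and the performance is poor. R's strength is mathematical and statistical computation; for operations on a large structured text file like this one it is not a good tool, and SPL is more convenient. For example: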
1. Open the large text file with a cursor:
|   | A |
|---|---|
| 1 | =file("C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv").cursor@t() |
2. Query:
|   | A |
|---|---|
| 1 | =file("C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv").cursor@t() |
| 2 | =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F") |
3. Group and aggregate:
|   | A |
|---|---|
| 1 | =file("C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv").cursor@t() |
| 2 | =A1.groups(DEPT:dept;count(~):count,sum(SALARY):salary) |
4. Sort:
|   | A |
|---|---|
| 1 | =file("C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv").cursor@t() |
| 2 | =A1.sortx(BIRTHDAY) |
··· ···
For details, see the "Text Data" section of the esProc tutorial.
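For comparison, the segment-by-segment approach in R that the answer describes as cumbersome might look roughly like the sketch below. It uses only base R and mirrors the group-and-aggregate step: it reads a fixed number of lines at a time and keeps running totals, so only one chunk is in memory at once. The function name `summarise_by_group_chunked`, the chunk size, and the `DEPT` and `SALARY` columns are assumptions taken from the illustrative SPL examples, not from the actual Kaggle file. The sketch also assumes a comma-separated file with a header row, no line breaks inside quoted fields, and no commas inside column names.

```r
# Count rows and sum a numeric column per group, reading the CSV in chunks.
# All names here are hypothetical; adapt group_col/value_col to the real file.
summarise_by_group_chunked <- function(path, group_col, value_col,
                                       chunk_rows = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Read the header line once and reuse the column names for every chunk.
  header <- strsplit(readLines(con, n = 1L), ",", fixed = TRUE)[[1]]
  header <- gsub('"', "", header, fixed = TRUE)

  # Add the per-chunk totals in `part` to the running totals in `total`.
  add_into <- function(total, part) {
    for (g in names(part)) {
      old <- if (g %in% names(total)) total[[g]] else 0
      total[g] <- old + part[[g]]
    }
    total
  }

  counts <- numeric(0)  # named vector: group -> number of rows
  sums <- numeric(0)    # named vector: group -> running sum of value_col

  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break        # end of file
    lines <- lines[nzchar(lines)]         # drop empty lines
    if (length(lines) == 0L) next

    chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                      check.names = FALSE, stringsAsFactors = FALSE)

    parts <- split(chunk[[value_col]], as.character(chunk[[group_col]]))
    counts <- add_into(counts, lengths(parts))
    sums <- add_into(sums, vapply(parts, function(v) sum(as.numeric(v)),
                                  numeric(1)))
  }

  groups <- sort(as.character(names(counts)))
  data.frame(group = groups,
             count = unname(counts[groups]),
             sum = unname(sums[groups]),
             stringsAsFactors = FALSE)
}

# Small self-contained demonstration on a temporary file.
demo <- tempfile(fileext = ".csv")
writeLines(c("DEPT,SALARY", "HR,100", "IT,200", "HR,50", "IT,25", "OPS,10"),
           demo)
result <- summarise_by_group_chunked(demo, "DEPT", "SALARY", chunk_rows = 2L)
print(result)
# HR: count 2, sum 150; IT: count 2, sum 225; OPS: count 1, sum 10
```

Even this simple case needs manual header handling and merge logic, and a sort over the whole file would need considerably more code, which is the gap the cursor functions above are meant to close.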