How to read and compute on a large text file

【Question】

I am new to Revolution R, so I have a basic question. I am trying to open a large CSV file (13 GB). It is a dataset from a Kaggle competition.

R is not able to open it, so I turned to Revolution R Enterprise. Can you please help me read a CSV file on my system, convert it to XDF format, and load it into Revolution R Enterprise for further analysis?

My file path is “C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv”

I tried something like this but got an error.

sampleDataDir <- rxGetOption("Kaggle")
inputFile <- file.path("C:\\Users\\admin\\Desktop\\Kaggle\\dog_1_both_marked.csv", "dog_1_both_marked.csv")
outputFile <- file.path(tempdir(), "basicClaims.xdf")
rxTextToXdf(inFile = inputFile, outFile = outputFile, overwrite = TRUE)
rxGetInfo(data = outputFile, getVarInfo = TRUE, numRows = 100000)
file.remove(outputFile)

【Answer】

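R can read a large file in chunks and can also process the chunks in parallel, but the code is cumbersome and the performance is very poor. R's strength is mathematical and statistical computation; for computations over a large structured text file like this one, R is not a good tool, and SPL is more convenient. For example: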

1. Open the large text file with a cursor:


|   | A |
|---|---|
| 1 | =file("C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv").cursor@tc() |
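To make the cursor idea concrete for readers who do not use SPL, here is a rough Python analogue using only the standard library. It is a sketch of the concept (lazy, row-at-a-time reading with the first line as column names), not how SPL implements it. The function names `open_cursor` and `fetch` are hypothetical, and a comma delimiter and UTF-8 encoding are assumed.

```python
import csv
from itertools import islice


def open_cursor(path, delimiter=","):
    """Lazily yield one dict per data row; the first line supplies the field names.

    Assumption: the file is UTF-8 text with a header line. Only the row
    currently being read is held in memory, never the whole file.
    """
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f, delimiter=delimiter)


def fetch(cursor, n):
    """Pull at most n rows from the cursor; returns an empty list when exhausted."""
    return list(islice(cursor, n))
```

Calling `fetch(cur, 1000)` repeatedly on `cur = open_cursor(path)` walks through the file in batches, so a 13 GB file never has to fit in memory at once.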


2. Query:


|   | A |
|---|---|
| 1 | =file("C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv").cursor@tc() |
| 2 | =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F") |
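The same filter can be sketched as a streaming pass in Python, shown here only to illustrate what the query does row by row. The column names `BIRTHDAY` and `GENDER` come from the example above; the function name `select_rows` is hypothetical, and the sketch assumes dates are stored as ISO `YYYY-MM-DD` strings.

```python
import csv
from datetime import date


def select_rows(rows, since=date(1981, 1, 1), gender="F"):
    """Yield only rows born on or after `since` with the given gender.

    `rows` is any iterable of dicts, e.g. a csv.DictReader over the file.
    Assumption: BIRTHDAY is an ISO-formatted date string (YYYY-MM-DD).
    """
    for row in rows:
        born = date.fromisoformat(row["BIRTHDAY"])
        if born >= since and row["GENDER"] == gender:
            yield row
```

Because `select_rows` is a generator, wrapping it around `csv.DictReader(open(path, newline=""))` filters the file without loading it; matching rows can be written out or counted as they arrive.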


3. Group and aggregate:


|   | A |
|---|---|
| 1 | =file("C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv").cursor@tc() |
| 2 | =A1.groups(DEPT:dept;count(~):count,sum(SALARY):salary) |
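As an illustration of what this grouping computes, the following Python sketch makes a single pass over the rows and keeps one running count and salary total per department. It is an analogue, not SPL's implementation; the function name `group_salary_by_dept` is hypothetical, `DEPT` and `SALARY` are taken from the example above, and `SALARY` is assumed to be numeric text.

```python
import csv
from collections import defaultdict


def group_salary_by_dept(rows):
    """Return {dept: {"count": n, "salary": total}} in one pass over `rows`.

    Only one small accumulator per distinct DEPT is kept in memory, so the
    input can be far larger than RAM as long as the number of groups is small.
    Assumption: SALARY parses as a float.
    """
    totals = defaultdict(lambda: [0, 0.0])  # dept -> [row count, salary sum]
    for row in rows:
        acc = totals[row["DEPT"]]
        acc[0] += 1
        acc[1] += float(row["SALARY"])
    return {
        dept: {"count": count, "salary": salary}
        for dept, (count, salary) in sorted(totals.items())
    }
```

Passing `csv.DictReader(open(path, newline=""))` as `rows` streams the file; the result is ordered by department name.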


4. Sort:

|   | A |
|---|---|
| 1 | =file("C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv").cursor@tc() |
| 2 | =A1.sortx(BIRTHDAY) |
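Sorting a file that does not fit in memory is usually done as an external merge sort: sort fixed-size chunks, spill each to a temporary file, then merge the sorted runs. The Python sketch below illustrates that general technique under stated assumptions; it is not a description of how `sortx` works internally. The names `external_sort`, `_write_run`, and `_read_run` and the `run_size` parameter are hypothetical, and the example key assumes `BIRTHDAY` is an ISO `YYYY-MM-DD` string, which sorts correctly as plain text.

```python
import csv
import heapq
import os
import tempfile
from itertools import islice


def _write_run(rows, fieldnames):
    """Write one already-sorted chunk to a temporary CSV file; return its path."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path


def _read_run(path):
    """Lazily read rows back from a run file."""
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)


def external_sort(rows, fieldnames, key, run_size=100_000):
    """Yield `rows` in sorted order while holding at most `run_size` rows in memory."""
    rows = iter(rows)
    run_paths = []
    try:
        while True:
            chunk = list(islice(rows, run_size))
            if not chunk:
                break
            chunk.sort(key=key)
            run_paths.append(_write_run(chunk, fieldnames))
        # k-way merge of the sorted runs, reading one row per run at a time
        yield from heapq.merge(*(_read_run(p) for p in run_paths), key=key)
    finally:
        for p in run_paths:
            os.remove(p)
```

A call such as `external_sort(reader, reader.fieldnames, key=lambda r: r["BIRTHDAY"])` with `reader = csv.DictReader(...)` returns a generator, so the sorted rows can be written straight to an output file.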


··· ···

For details, see the 【Text Data】 section of the esProc tutorial.