Grouping after reading, and doing it fast
【Question】
I am writing a script in Perl but got stuck on one part. Below is a sample of my CSV file.
"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"
"GJ","919904303790","20150806125002","prepaid","prepaid","2G","3G"
"MH","919921990805","20150806125003","prepaid","prepaid","2G",
"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"
"GJ","919904303790","20150806125002","prepaid","prepaid","2G","3G"
"MH","919921990805","20150806125003","prepaid","prepaid","2G",
"MP","918120197922","20150806125004","prepaid","prepaid","2G",
"MUM","919904303790","20150806125005","prepaid","prepaid","2G","3G"
"MUM","918652624178","20150806125005","prepaid","prepaid","2G","3G"
"MP","918120197922","20150806125005","prepaid","prepaid","2G","3G"
Now I need to take unique records based on the 2nd column (i.e. mobile number), keeping only the record with the latest value in the 3rd column (i.e. timestamp). For example, for mobile number "918120197922":
"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"
"MP","918120197922","20150806125004","prepaid","prepaid","2G"
"MP","918120197922","20150806125005","prepaid","prepaid","2G","3G"
It should select the 3rd record, as it has the latest timestamp (20150806125005). Please help.
Additional info: Sorry for the inconsistency in the data; I have rectified it now. Yes, the data is in order, which means the latest timestamp appears in the latest rows. One more thing: my file is more than 1 GB in size, so is there any way to do this efficiently? Will awk work faster than Perl in this case?
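【Answer】
The algorithm is not hard: it is just finding the maximum within each group. Text parsing is always slow, though, so multithreading should be used wherever possible. The grouping should also use a hash method, because simple traversal and comparison is slow. Writing this in Perl is somewhat tedious; I suggest SPL, where the script is simpler: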
|   | A |
|---|---|
| 1 | =file("d:\\source.csv").cursor@qmc() |
| 2 | =A1.groups(#2;top(-1;#3):a) |
| 3 | =A2.(a).conj() |
| 4 | =file("d:\\result.csv").export@c(A3) |
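A1: Read the contents of source.csv, strip the quotation marks, and return a multi-path cursor.
A2: On the multi-path cursor, each path first groups by the 2nd column, then selects from each group the record with the largest value in the 3rd column.
A3: Concatenate the selected records.
A4: Export the A3 result to the file result.csv.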
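For readers who want to see the hash-grouping idea outside SPL, here is a minimal Python sketch. It is not how SPL implements it internally. The function names `latest_per_key` and `merge_partials` are made up for illustration. The sketch assumes the timestamp is a fixed-width digit string, so plain string comparison matches chronological order. `merge_partials` only mimics the "group each path, then combine" step, and the chunks here are processed one after another rather than in parallel.

```python
import csv
import io

def latest_per_key(rows, key_col=1, ts_col=2):
    """Single pass with a dict (hash table): keep, per key, the row with the largest timestamp."""
    best = {}
    for row in rows:
        if len(row) <= max(key_col, ts_col):
            continue  # skip blank or truncated lines
        kept = best.get(row[key_col])
        # ">=" lets a later row win on equal timestamps
        if kept is None or row[ts_col] >= kept[ts_col]:
            best[row[key_col]] = row
    return best

def merge_partials(partials, ts_col=2):
    """Combine per-chunk results (hypothetical helper): again keep the largest timestamp per key."""
    merged = {}
    for part in partials:
        for key, row in part.items():
            kept = merged.get(key)
            if kept is None or row[ts_col] >= kept[ts_col]:
                merged[key] = row
    return merged

# The sample rows from the question
SAMPLE = '''"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"
"GJ","919904303790","20150806125002","prepaid","prepaid","2G","3G"
"MH","919921990805","20150806125003","prepaid","prepaid","2G",
"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"
"GJ","919904303790","20150806125002","prepaid","prepaid","2G","3G"
"MH","919921990805","20150806125003","prepaid","prepaid","2G",
"MP","918120197922","20150806125004","prepaid","prepaid","2G",
"MUM","919904303790","20150806125005","prepaid","prepaid","2G","3G"
"MUM","918652624178","20150806125005","prepaid","prepaid","2G","3G"
"MP","918120197922","20150806125005","prepaid","prepaid","2G","3G"
'''

rows = list(csv.reader(io.StringIO(SAMPLE)))  # csv.reader strips the quotes
result = latest_per_key(rows)

# Same answer when the rows are split into two chunks and the partial results are merged
halves = [latest_per_key(rows[:5]), latest_per_key(rows[5:])]
merged = merge_partials(halves)
```

For a real file, the same function can consume `csv.reader(open(path, newline=""))` directly, so only one row per distinct mobile number stays in memory rather than the whole 1 GB file.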