Structuring text whose groups follow a pattern but vary in line count
【Question】
I have a CSV file with non-standardized content. It looks something like this:
John, 001
01/01/2015, hamburger
02/01/2015, pizza
03/01/2015, ice cream
Mary, 002
01/01/2015, hamburger
02/01/2015, pizza
John, 003
04/01/2015, chocolate
Now, what I'm trying to do is write some logic in Java to separate them. I would like "John, 001" to be the header, and all the rows under John (before Mary) to belong to John.
Will this be possible? Or should I just do it manually?
Edit:
For the input, even though it is not standardized, a noticeable pattern is that rows without a name always start with a date.
My output goal is a Java object that I can eventually store in the database in the format below.
Name, hamburger, pizza, ice cream, chocolate
John, 01/01/2015, 02/01/2015, 03/01/2015, NA
Mary, 01/01/2015, 02/01/2015, NA, NA
John, NA, NA, NA, 04/01/2015
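【Answer】
This problem needs a fair amount of structured computation. Java has no built-in class library for this kind of grouping and alignment, so a hand-written solution tends to be long and hard to read. SPL can be used to help here, and the resulting code is more intuitive: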
     A                                        B
1    =file("D:\\noneStand.csv").cursor@c()    =["hamburger","pizza","ice cream","chocolate"]
2    =create(name,${foodlist})
3    for A1;!isdigit(left(#1,1))              =A3.to(2,).align(B1,#2)
4                                             =A2.record(A3.#1 | B3.(#1))
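A1: Open the file noneStand.csv as a cursor, using a comma as the field separator.
A2: Create the two-dimensional table that holds the result. ${foodlist} is a macro that is expanded into an expression at run time. foodlist is a parameter whose value is hamburger,pizza,'ice cream',chocolate
A3: Loop over A1, loading one complete group into A3 on each iteration. A new group begins whenever the first character of a row's first field is not a digit (that is, the row starts with a name), so each group holds a name row plus the date rows that follow it. B3 and B4 form the loop body.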
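For readers who want to stay in plain Java, the grouping rule described for A3 can be sketched roughly as follows. This is only an illustration, not the SPL implementation: the class name GroupByHeader and the method group are hypothetical, and it assumes, as the question states, that every detail row starts with a date (a digit) and every header row does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GroupByHeader {
    // Splits the lines into groups. A new group starts at every line whose
    // first character is not a digit (assumed to be a "name, id" header row).
    static List<List<String>> group(List<String> lines) {
        List<List<String>> groups = new ArrayList<>();
        for (String line : lines) {
            String text = line.trim();
            if (text.isEmpty()) {
                continue; // ignore blank lines
            }
            boolean isHeader = !Character.isDigit(text.charAt(0));
            // Also open a group if the file unexpectedly starts with a detail row.
            if (isHeader || groups.isEmpty()) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(text);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "John, 001", "01/01/2015, hamburger", "02/01/2015, pizza",
            "Mary, 002", "01/01/2015, hamburger");
        for (List<String> g : group(lines)) {
            System.out.println(String.join(" | ", g));
        }
    }
}
```

Each inner list then plays the role of one value of the loop variable A3: the first element is the header row and the rest are its detail rows.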
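B3: Take the records of A3 (the loop variable) from the second one onward and align them to the food list. For example, the aligned result for the Mary group is: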
01/01/2015, hamburger
02/01/2015, pizza
NA,NA
NA,NA
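The alignment step can be approximated in Java with a lookup map. The sketch below is a hedged illustration rather than an equivalent of SPL's align: FoodAligner and its align method are hypothetical names, missing foods are filled with the string "NA" as in the question's desired output, and if a food appears twice in one group the later date is kept (an assumption the source does not address).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FoodAligner {
    // Returns one date per entry of foods, in the order of foods.
    // Foods that do not occur in detailRows are represented by "NA".
    static List<String> align(List<String> detailRows, List<String> foods) {
        Map<String, String> dateByFood = new HashMap<>();
        for (String row : detailRows) {
            String[] parts = row.split(",", 2); // "date, food"
            if (parts.length < 2) {
                continue; // skip malformed rows
            }
            dateByFood.put(parts[1].trim(), parts[0].trim());
        }
        List<String> aligned = new ArrayList<>();
        for (String food : foods) {
            aligned.add(dateByFood.getOrDefault(food, "NA"));
        }
        return aligned;
    }

    public static void main(String[] args) {
        List<String> foods = Arrays.asList("hamburger", "pizza", "ice cream", "chocolate");
        List<String> mary = Arrays.asList("01/01/2015, hamburger", "02/01/2015, pizza");
        System.out.println(align(mary, foods)); // prints [01/01/2015, 02/01/2015, NA, NA]
    }
}
```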
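B4: Append a record to A2. A3.#1 returns the first field of the first record in A3 (for example, Mary). B3.(#1) is the sequence formed from the first field of each record in B3, that is, [01/01/2015, 02/01/2015, NA, NA]. The "|" operator concatenates the two.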
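If a pure-Java version is preferred, the whole procedure can also be done in a single pass over the lines, without building intermediate groups. The following is a minimal sketch under stated assumptions: MealPivot, FOODS and pivot are hypothetical names, the food list is hard-coded from the question, header rows are recognized only by a non-digit first character, and foods outside the list are silently ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MealPivot {
    static final List<String> FOODS =
        Arrays.asList("hamburger", "pizza", "ice cream", "chocolate");

    // Each result row is [name, date for FOODS.get(0), date for FOODS.get(1), ...].
    static List<String[]> pivot(List<String> lines) {
        List<String[]> result = new ArrayList<>();
        String[] current = null; // the row of the group being filled
        for (String line : lines) {
            String[] parts = line.split(",", 2);
            String first = parts[0].trim();
            if (first.isEmpty()) {
                continue; // ignore blank lines
            }
            if (!Character.isDigit(first.charAt(0))) {
                // Header row: start a new output row pre-filled with "NA".
                current = new String[FOODS.size() + 1];
                Arrays.fill(current, "NA");
                current[0] = first;
                result.add(current);
            } else if (current != null && parts.length == 2) {
                // Detail row: put the date into the column of its food.
                int col = FOODS.indexOf(parts[1].trim());
                if (col >= 0) {
                    current[col + 1] = first;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "John, 001", "01/01/2015, hamburger", "02/01/2015, pizza", "03/01/2015, ice cream",
            "Mary, 002", "01/01/2015, hamburger", "02/01/2015, pizza",
            "John, 003", "04/01/2015, chocolate");
        System.out.println("Name, " + String.join(", ", FOODS));
        for (String[] row : pivot(lines)) {
            System.out.println(String.join(", ", row));
        }
    }
}
```

For a real file, the list of lines could be obtained with java.nio.file.Files.readAllLines before calling pivot. The two John groups stay as separate rows, matching the output format requested in the question.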