按标记分组再结构化
【问题】
using the below awk and Perl solutions: I need to understand how to add additional field that don't repeat their output much like"name"in these examples. ie devtwr1 only occurs once in the sample data, If i need to add additional fields that also only occur once, how is this done? Sorry Tried extensively but can't work it out...
here is the source:
have a file: Ntab.txt which is a sample file of two hosts, there is numerous hosts in the real file one after each other.
Each host has multiple "displayName"s (addresses) with corresponding figures for each 'displayName'.
>cat Ntab.txt
name devtwr1
displayName 00:67:BB
capacityInKB 104,857,600
consumedCapacityInKB 4,042,752
dpPoolID 20
displayName 00:7B:FD
capacityInKB 52,428,800
consumedCapacityInKB 14,880,768
dpPoolID 10
displayName 00:7C:28
capacityInKB 34,179,712
consumedCapacityInKB 29,804,544
dpPoolID 20
displayName 00:7C:29
capacityInKB 34,179,712
consumedCapacityInKB 5,462,016
dpPoolID 20
name devtwr2
displayName 00:67:BB
capacityInKB 104,857,600
consumedCapacityInKB 4,042,752
dpPoolID 20
displayName 00:7B:FD
capacityInKB 52,428,800
consumedCapacityInKB 14,880,768
dpPoolID 10
displayName 00:7C:28
capacityInKB 34,179,712
consumedCapacityInKB 29,804,544
dpPoolID 20
displayName 00:7C:29
capacityInKB 34,179,712
consumedCapacityInKB 5,462,016
dpPoolID 20
I need to be able to produce the $2 data from the rows into columns after each 'name' (host) and in csv type format, headings optional. I cant use (,:) as separators as the data includes them so either (tab or ;).
Like:
name;displayName;capacityInKB;consumedCapacityInKB;dpPoolID
devtwr1;00:67:BB;104,857,600;4,042,752;20
devtwr1;00:7B:FD;52,428,800;14,880,768;10
devtwr1;00:7C:28;34,179,712;29,804,544;20
devtwr1;00:7C:29;34,179,712;5,462,016;20
devtwr2;00:67:BB;104,857,600;4,042,752;20
devtwr1;00:7B:FD;52,428,800;14,880,768;10
devtwr1;00:7C:28;34,179,712;29,804,544;20
devtwr1;00:7C:29;34,179,712;5,462,016;20
Here is a Awk solution but cant work out how to add additional field like "name" that dosn't have data in all fields.
$ awk '$1=="name"{name=$2}
$1 ~/^(displayName|capacityInKB|consumedCapacityInKB)$/{out=out";"$2}
$1=="dpPoolID"{print name out";"$2; out=""}' tmp2.txt
devtwr1;00:67:BB;104,857,600;4,042,752;20
devtwr1;00:7B:FD;52,428,800;14,880,768;10
devtwr1;00:7C:28;34,179,712;29,804,544;20
devtwr1;00:7C:29;34,179,712;5,462,016;20
devtwr2;00:67:BB;104,857,600;4,042,752;20
devtwr2;00:7B:FD;52,428,800;14,880,768;10
devtwr2;00:7C:28;34,179,712;29,804,544;20
devtwr2;00:7C:29;34,179,712;5,462,016;20
And here is a Perl solution but once again couldn't work out how to add additional fields like the"name"field that dosn't always have data.
perl -lane '
BEGIN {
@fields = qw(name displayName capacityInKB consumedCapacityInKB dpPoolID);
print join ";", @fields;
}
if (/^(name|displayName)/ && $data{displayName}) {
print join ";", @data{@fields};
%data = ( name => $data{name} );
}
$data{$F[0]} = $F[1];
END {
print join ";",@data{@fields};
}' tmp2.txt > Report.csvv
【回答】
按照第1列是否等于"name"将数据分为2N组(N是name的数量),然后从奇数组的单条记录取出name,从偶数组的M条记录中拼出M/4条记录。上述算法比较直观,但涉及有序计算,用perl实现很麻烦,建议用SPL来实现,简单很多:
A |
B |
C |
|
1 |
=file("/Ntab.txt").cursor() |
>file("/Result.txt").write("name;displayName;capacityInKB;consumedCapacityInKB;dpPoolID" |
|
2 |
for A1;#1=="name" |
if A2.#1=="name" |
=A2.#2 |
3 |
=A2.(#2).to(2,).group((#-1)\4).new(C2,~(1),~(2),~(3),~(4)) |
||
4 |
>file("/Result.txt").export@a(B3;";") |
A1:以游标方式读入文件,默认分隔符是tab。文件大小不受限制。
B1:在第一行写表头。
A2:循环A1,按照第1列是否等于"name"分组读入数据,奇数次循环读入的是
"name devtwr1"和"name devtwr2",偶数次循环读入的是剩余数据。
B2-C2:如果循环变量的第1个字段等于name,则将第2个字段临时存入C2(比如devtwr1)待用,然后执行B3-B4。
B3:取出A2的第2列(此时为多值),按序号分组,并将每组拼成name相同的记录。A2.(#2)表示A2的第2列,group中的#表示序号
B4:将B3追加写入文件。