How to split, parse, and structure strings
【Question】
I have the following code, but there seems to be an error somewhere in it. I get output (b) but need output (c); both are shown below. Can anyone see where I am going wrong? All files are tab-delimited.
Code:
import sys
outfile_name = sys.argv[-1]
filename1 = sys.argv[-2]
filename2 = sys.argv[-3]
fileIn1 = open(filename1, "r")
fileIn2 = open(filename2, "r")
fileOut = open(outfile_name, "w")
dict = {}
a = open(filename1)
b = open(filename2)
for line in a:
    words = line.split("\t")
    if len(words) != 1:
        target = words[0]
        for word in words[1:]:
            dict[word] = target
for line in b:
    words = line.split("\t")
    if words[0] in dict.keys() and words[1] in dict.keys():
        fileOut.write(dict[words[0]] + "\t" + dict[words[1]] + "\n")
    elif words[0] in dict.keys() and words[1] not in dict.keys():
        fileOut.write(dict[words[0]] + "\t" + words[1] + "\n")
    elif words[0] not in dict.keys() and words[1] in dict.keys():
        fileOut.write(words[0] + "\t" + dict[words[1]] + "\n")
    elif words[0] not in dict.keys() and words[1] not in dict.keys():
        fileOut.write(words[0] + "\t" + words[1] + "\n")
fileOut.close()
filename1:
Area_1 Area_2
A B
A C
A D
D B
D C
L B
L C
L A
D L
K A
K B
K C
K D
K L
D P
D R
L P
L R
K P
K R
A H
D H
L H
K H
B P
B R
R P
A I
D I
I L
I K
C H
I H
C H
J K
J X
J Y
J Z
K X
K Y
Y Z
K Z
X Y
X Z
M G
N T
O S
S Q
filename2:
Incident_00000001 A D L K
Incident_00000002 B P R
Incident_00000003 C F W
Incident_00000004 J I
M
N
O
Incident_00000005 Q S
X
Y
Z
G
T
output (b) - undesired output that I am getting:
Area_1 Area_2
Incident_00000001 B
Incident_00000001 C
Incident_00000001 D
Incident_00000001 B
Incident_00000001 C
Incident_00000001 B
Incident_00000001 C
Incident_00000001 A
Incident_00000001 L
K A
K B
K C
K D
K L
Incident_00000001 P
Incident_00000001 Incident_00000002
Incident_00000001 P
Incident_00000001 Incident_00000002
K P
K Incident_00000002
Incident_00000001 H
Incident_00000001 H
Incident_00000001 H
K H
Incident_00000002 P
Incident_00000002 Incident_00000002
R P
Incident_00000001 Incident_00000003
Incident_00000001 Incident_00000003
I L
I Incident_00000004
Incident_00000003 H
I H
Incident_00000003 H
Incident_00000004 Incident_00000004
Incident_00000004 X
Incident_00000004 Y
Incident_00000004 Z
K X
K Y
Y Z
K Z
X Y
X Z
M G
N T
O S
Incident_00000005 Incident_00000005
What I am looking to get (output (c)) is:
Area_1 Area_2
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000003
Incident_00000001 Incident_00000001
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000003
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000003
Incident_00000001 Incident_00000001
Incident_00000001 Incident_00000001
Incident_00000001 Incident_00000001
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000003
Incident_00000001 Incident_00000001
Incident_00000001 Incident_00000001
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000002
Incident_00000001 H
Incident_00000001 H
Incident_00000001 H
Incident_00000001 H
Incident_00000002 Incident_00000002
Incident_00000002 Incident_00000002
Incident_00000002 Incident_00000002
Incident_00000001 Incident_00000004
Incident_00000001 Incident_00000004
Incident_00000004 Incident_00000001
Incident_00000004 Incident_00000001
Incident_00000003 H
Incident_00000004 H
Incident_00000003 H
Incident_00000004 Incident_00000001
Incident_00000004 X
Incident_00000004 Y
Incident_00000004 Z
Incident_00000001 X
Incident_00000001 Y
Y Z
Incident_00000001 Z
X Y
X Z
M G
N T
O Incident_00000005
Incident_00000005 Incident_00000005
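【Answer】
Filter file2 and reorganize it into a two-dimensional table that is easy to query: the field incident holds the value to display, and the field code is a set used to look up the codes from file1. After that it is a simple substitution of codes by labels: if a code in file1 belongs to the code set of some record in file2, output that record's incident; otherwise output the original code.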
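The lookup-and-substitute idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `records` list stands in for the filtered file2 table, and the helper names `build_lookup` and `replace_code` are hypothetical, not part of the original answer. If a code were listed under more than one incident, this sketch keeps the first one it encounters.

```python
# Hypothetical in-memory form of the filtered file2 table:
# each record pairs an incident label with its set of codes.
records = [
    ("Incident_00000001", {"A", "D", "L", "K"}),
    ("Incident_00000002", {"B", "P", "R"}),
]


def build_lookup(records):
    """Map each code to the incident of the first record that contains it."""
    lookup = {}
    for incident, codes in records:
        for code in codes:
            # setdefault keeps an earlier incident if the code was already seen
            lookup.setdefault(code, incident)
    return lookup


def replace_code(code, lookup):
    """Return the incident for a known code, otherwise the code unchanged."""
    return lookup.get(code, code)


lookup = build_lookup(records)
pairs = [("A", "B"), ("K", "H")]  # sample rows in the shape of file1
result = [(replace_code(a, lookup), replace_code(b, lookup)) for a, b in pairs]
```

Here "H" is not in any code set, so it passes through unchanged, while "A", "K", and "B" are replaced by their incident labels.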
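This algorithm involves structured computation, set operations, and ordered string splitting. The logic is fairly simple, but Python has to assemble it from lower-level pieces, so the code takes more effort to write. Unless there is a specific requirement otherwise, it can be implemented in SPL, where the code is short and easy to follow: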
  | A
1 | =file("d:/file1.txt").import@t()
2 | =file("d:/file2.txt").read@n()
3 | =A2.select(pos(~,"Incident"))
4 | =A3.new((t=~.split("\t"))(1):incident,t.to(2,):code)
5 | =A1.new(ifn(A4.select@1(code.pos(A1.Area_1)).incident,Area_1):Area_1,ifn(A4.select@1(code.pos(A1.Area_2)).incident,Area_2):Area_2)
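A1: Read the file1 text.
A2: Read the file2 text line by line.
A3: Select from A2 the lines that contain the substring "Incident".
A4: Reorganize A3 into a table sequence with two fields, incident and code.
~.split("\t") splits each line of A3 into a sequence by the delimiter. The incident field takes the first member of that sequence, and the code field takes the sequence formed by the remaining members.
A5: Generate from table sequence A1 a new table sequence with the fields Area_1 and Area_2. If a code in file1 belongs to the code set of some record in file2, the new field value is that record's incident; otherwise it is the original code.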
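For readers who want to see the filtering and splitting in steps A2 to A4 outside SPL, here is a minimal Python sketch. The function name `parse_incident_lines` and the sample lines are assumptions made for this example. Stripping the line ending before splitting is also my addition: the script in the question splits each line without stripping it, so the last field on a line keeps its newline, which may be one reason some codes fail to match.

```python
def parse_incident_lines(lines):
    """Keep lines containing 'Incident' and split each into (incident, codes)."""
    table = []
    for line in lines:
        if "Incident" not in line:
            continue  # skip lines without an incident label
        # Drop the line ending first, so the last code is "K" rather than "K\n".
        fields = line.rstrip("\r\n").split("\t")
        # First member is the incident, the remaining members are its codes.
        table.append((fields[0], fields[1:]))
    return table


# Hypothetical sample lines in the tab-delimited layout of file2
sample = [
    "Incident_00000001\tA\tD\tL\tK\n",
    "M\n",
    "Incident_00000004\tJ\tI\n",
]
table = parse_incident_lines(sample)
```

The line "M" carries no incident label, so it is filtered out, which corresponds to the selection in A3.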