How to Split, Parse, and Structure Strings

【Question】

I have the following code, but there seems to be an error somewhere in it. I get output (b) but need output (c); both are shown below. Can anyone see where I am going wrong? All files are tab-delimited.

Code:

import sys

outfile_name = sys.argv[-1]
filename1 = sys.argv[-2]
filename2 = sys.argv[-3]

fileIn1 = open(filename1, "r")
fileIn2 = open(filename2, "r")
fileOut = open(outfile_name, "w")

dict = {}

a = open(filename1)
b = open(filename2)

for line in a:
    words = line.split("\t")
    if len(words) != 1:
        target = words[0]
        for word in words[1:]:
            dict[word] = target

for line in b:
    words = line.split("\t")
    if words[0] in dict.keys() and words[1] in dict.keys():
        fileOut.write(dict[words[0]] + "\t" + dict[words[1]] + "\n")
    elif words[0] in dict.keys() and words[1] not in dict.keys():
        fileOut.write(dict[words[0]] + "\t" + words[1] + "\n")
    elif words[0] not in dict.keys() and words[1] in dict.keys():
        fileOut.write(words[0] + "\t" + dict[words[1]] + "\n")
    elif words[0] not in dict.keys() and words[1] not in dict.keys():
        fileOut.write(words[0] + "\t" + words[1] + "\n")

fileOut.close()

filename1:

Area_1 Area_2

A   B

A   C

A   D

D   B

D   C

L   B

L   C

L   A

D   L

K   A

K   B

K   C

K   D

K   L

D   P

D   R

L   P

L   R

K   P

K   R

A   H

D   H

L   H

K   H

B   P

B   R

R   P

A   I

D   I

I   L

I   K

C   H

I   H

C   H

J   K

J   X

J   Y

J   Z

K   X

K   Y

Y   Z

K   Z

X   Y

X   Z

M   G

N   T

O   S

S   Q

filename2:

Incident_00000001       A       D       L       K

Incident_00000002       B       P       R

Incident_00000003       C       F       W

Incident_00000004       J       I

M

N

O

Incident_00000005       Q       S

X

Y

Z

G

T

output (b) - undesired output that I am getting:

Area_1  Area_2

Incident_00000001   B

Incident_00000001   C

Incident_00000001   D

Incident_00000001   B

Incident_00000001   C

Incident_00000001   B

Incident_00000001   C

Incident_00000001   A

Incident_00000001   L

K   A

K   B

K   C

K   D

K   L

Incident_00000001   P

Incident_00000001   Incident_00000002

Incident_00000001   P

Incident_00000001   Incident_00000002

K   P

K   Incident_00000002

Incident_00000001   H

Incident_00000001   H

Incident_00000001   H

K   H

Incident_00000002   P

Incident_00000002   Incident_00000002

R   P

Incident_00000001   Incident_00000003

Incident_00000001   Incident_00000003

I   L

I   Incident_00000004

Incident_00000003   H

I   H

Incident_00000003   H

Incident_00000004   Incident_00000004

Incident_00000004   X

Incident_00000004   Y

Incident_00000004   Z

K   X

K   Y

Y   Z

K   Z

X   Y

X   Z

M   G

N   T

O   S

Incident_00000005   Incident_00000005

What I am looking to get (output (c)) is:

Area_1  Area_2

Incident_00000001   Incident_00000002

Incident_00000001   Incident_00000003

Incident_00000001   Incident_00000001

Incident_00000001   Incident_00000002

Incident_00000001   Incident_00000003

Incident_00000001   Incident_00000002

Incident_00000001   Incident_00000003

Incident_00000001   Incident_00000001

Incident_00000001   Incident_00000001

Incident_00000001   Incident_00000001

Incident_00000001   Incident_00000002

Incident_00000001   Incident_00000003

Incident_00000001   Incident_00000001

Incident_00000001   Incident_00000001

Incident_00000001   Incident_00000002

Incident_00000001   Incident_00000002

Incident_00000001   Incident_00000002

Incident_00000001   Incident_00000002

Incident_00000001   Incident_00000002

Incident_00000001   Incident_00000002

Incident_00000001   H

Incident_00000001   H

Incident_00000001   H

Incident_00000001   H

Incident_00000002   Incident_00000002

Incident_00000002   Incident_00000002

Incident_00000002   Incident_00000002

Incident_00000001   Incident_00000004

Incident_00000001   Incident_00000004

Incident_00000004   Incident_00000001

Incident_00000004   Incident_00000001

Incident_00000003   H

Incident_00000004   H

Incident_00000003   H

Incident_00000004   Incident_00000001

Incident_00000004   X

Incident_00000004   Y

Incident_00000004   Z

Incident_00000001   X

Incident_00000001   Y

Y   Z

Incident_00000001   Z

X   Y

X   Z

M   G

N   T

O   Incident_00000005

Incident_00000005   Incident_00000005

【Answer】

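file2 can be filtered and then organized into a two-dimensional table that is easy to query: the incident field holds the value to display, and the code field is a set used for lookups against file1. After that, the task is a simple code-to-label substitution: if a code in file1 belongs to the code set of some record in file2, output that record's incident; otherwise output the original code.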
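In Python, this membership test only matches when codes are compared without their trailing newline. The sketch below illustrates the pitfall on one line taken from the sample file2. Linking it to the mismatches in output (b) is an inference from the posted script, which splits each line on tabs without stripping it, and is not something the original answer states. The variable names are illustrative.

```python
# Lines read from a text file keep their trailing "\n", and split("\t")
# leaves that newline attached to the last field.
raw_line = "Incident_00000002\tB\tP\tR\n"

naive_fields = raw_line.split("\t")
clean_fields = raw_line.rstrip("\n").split("\t")

print(naive_fields)  # the last field still ends with a newline
print(clean_fields)  # the last field is a bare code

# A lookup keyed on the naive fields stores "R\n" instead of "R",
# so a later test for "R" fails even though R belongs to the incident.
naive_lookup = {code: naive_fields[0] for code in naive_fields[1:]}
clean_lookup = {code: clean_fields[0] for code in clean_fields[1:]}

print("R" in naive_lookup, "R" in clean_lookup)
```

Stripping the newline before splitting, for example with rstrip("\n"), keeps the last field on each line comparable with the others.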

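The algorithm above involves structured computation, set operations, and ordered string splitting. The logic is fairly simple, but Python has to build it from lower-level operations, so the code takes more effort to write. If there are no special requirements, SPL can be used instead, and the code is short and easy to follow: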


     A
1    =file("d:/file1.txt").import@t()
2    =file("d:/file2.txt").read@n()
3    =A2.select(pos(~,"Incident"))
4    =A3.new((t=~.split("\t"))(1):incident,t.to(2,):code)
5    =A1.new(ifn(A4.select@1(code.pos(A1.Area_1)).incident,Area_1):Area_1,ifn(A4.select@1(code.pos(A1.Area_2)).incident,Area_2):Area_2)

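A1: Read the file1 text.

A2: Read the file2 text line by line.

A3: Select from A2 the lines that contain the substring "Incident".

A4: Organize A3 into a table sequence with two fields, incident and code.

~.split("\t") splits each line of A3 on the delimiter into a sequence; the incident field takes the first member of that sequence, and the code field takes the remaining members as a sequence.

A5: Generate a new table sequence with the fields Area_1 and Area_2 from table sequence A1. If a code in file1 belongs to the code set of some record in file2, the new field value is that record's incident value; otherwise it is the original code.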
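For comparison, steps A3 to A5 can be sketched in Python. This is a minimal illustration under stated assumptions, not a translation endorsed by the SPL script's author: the helper names parse_incidents and relabel are hypothetical, the input is a few in-memory lines from the sample data rather than the real files, and a linear scan stands in for select@1.

```python
def parse_incidents(lines):
    """A3-A4: keep lines containing "Incident" and split each one
    into an incident label plus the list of codes that follow it."""
    records = []
    for line in lines:
        if "Incident" not in line:
            continue
        fields = line.rstrip("\n").split("\t")
        records.append({"incident": fields[0], "code": fields[1:]})
    return records


def relabel(code, records):
    """A5: return the incident of the first record whose codes contain
    `code`, or the code itself when no record matches."""
    for record in records:
        if code in record["code"]:
            return record["incident"]
    return code


# Hypothetical in-memory stand-ins for a few lines of file2 and file1.
incident_lines = [
    "Incident_00000001\tA\tD\tL\tK\n",
    "Incident_00000004\tJ\tI\n",
    "M\n",
    "Incident_00000005\tQ\tS\n",
]
pairs = [("J", "K"), ("O", "S"), ("M", "G")]

records = parse_incidents(incident_lines)
relabeled = [(relabel(a, records), relabel(b, records)) for a, b in pairs]

for row in relabeled:
    print("\t".join(row))
```

For larger inputs, a dictionary from code to incident would avoid rescanning the records for every lookup.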