yangcl
乾学院 member no. 22

1,367 views • 6 years ago

How to split and parse a string into structured data

【Question】

I have the following code, but there seems to be an error somewhere in it. I get output (b) but need output (c); see below. Can anyone see where I am going wrong? All files are tab-delimited.

Code:

import sys

outfile_name = sys.argv[-1]
filename1 = sys.argv[-2]
filename2 = sys.argv[-3]

fileIn1 = open(filename1, "r")
fileIn2 = open(filename2, "r")
fileOut = open(outfile_name, "w")

dict = {}

a = open(filename1)
b = open(filename2)

for line in a:
    words = line.split("\t")
    if len(words) != 1:
        target = words[0]
        for word in words[1:]:
            dict[word] = target

for line in b:
    words = line.split("\t")
    if words[0] in dict.keys() and words[1] in dict.keys():
        fileOut.write(dict[words[0]] + "\t" + dict[words[1]] + "\n")
    elif words[0] in dict.keys() and words[1] not in dict.keys():
        fileOut.write(dict[words[0]] + "\t" + words[1] + "\n")
    elif words[0] not in dict.keys() and words[1] in dict.keys():
        fileOut.write(words[0] + "\t" + dict[words[1]] + "\n")
    elif words[0] not in dict.keys() and words[1] not in dict.keys():
        fileOut.write(words[0] + "\t" + words[1] + "\n")

fileOut.close()

filename1:

Area_1 Area_2

A B

A C

A D

D B

D C

L B

L C

L A

D L

K A

K B

K C

K D

K L

D P

D R

L P

L R

K P

K R

A H

D H

L H

K H

B P

B R

R P

A I

D I

I L

I K

C H

I H

C H

J K

J X

J Y

J Z

K X

K Y

Y Z

K Z

X Y

X Z

M G

N T

O S

S Q

filename2:

Incident_00000001 A D L K

Incident_00000002 B P R

Incident_00000003 C F W

Incident_00000004 J I

Incident_00000005 Q S

output (b) - undesired output that I am getting:

Area_1 Area_2

Incident_00000001 B

Incident_00000001 C

Incident_00000001 D

Incident_00000001 B

Incident_00000001 C

Incident_00000001 B

Incident_00000001 C

Incident_00000001 A

Incident_00000001 L

K A

K B

K C

K D

K L

Incident_00000001 P

Incident_00000001 Incident_00000002

Incident_00000001 P

Incident_00000001 Incident_00000002

K P

K Incident_00000002

Incident_00000001 H

K H

Incident_00000002 P

Incident_00000002 Incident_00000002

R P

Incident_00000001 Incident_00000003

I L

I Incident_00000004

Incident_00000003 H

I H

Incident_00000003 H

Incident_00000004 Incident_00000004

Incident_00000004 X

Incident_00000004 Y

Incident_00000004 Z

K X

K Y

Y Z

K Z

X Y

X Z

M G

N T

O S

Incident_00000005 Incident_00000005

What I am looking to get (output (c)) is:

Area_1 Area_2

Incident_00000001 Incident_00000002

Incident_00000001 Incident_00000003

Incident_00000001 Incident_00000001

Incident_00000001 Incident_00000002

Incident_00000001 Incident_00000003

Incident_00000001 Incident_00000002

Incident_00000001 Incident_00000003

Incident_00000001 Incident_00000001

Incident_00000001 Incident_00000002

Incident_00000001 Incident_00000003

Incident_00000001 Incident_00000001

Incident_00000001 Incident_00000002

Incident_00000001 H

Incident_00000002 Incident_00000002

Incident_00000001 Incident_00000004

Incident_00000004 Incident_00000001

Incident_00000003 H

Incident_00000004 H

Incident_00000003 H

Incident_00000004 Incident_00000001

Incident_00000004 X

Incident_00000004 Y

Incident_00000004 Z

Incident_00000001 X

Incident_00000001 Y

Y Z

Incident_00000001 Z

X Y

X Z

M G

N T

O Incident_00000005

Incident_00000005 Incident_00000005

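【Answer】

Filter file2 and reorganize it into a two-dimensional table that is easy to query: the field incident holds the value to display, and the field code is a set used to look up the codes that appear in file1. What remains is a simple code-to-display substitution: if a code in file1 belongs to the code set of some record from file2, output that record's incident; otherwise output the original code.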
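For readers who want to stay in Python, the idea above can be sketched roughly as follows. This is a minimal sketch, not the answer's method: the helper names `build_lookup` and `translate_pairs` are hypothetical, the input is given as in-memory lines rather than files, and the set of codes per incident is flattened into a single code-to-incident dictionary on the assumption that a code listed under several incidents should resolve to the first one.

```python
def build_lookup(incident_lines):
    """Map each code to the incident whose line lists it (first listing wins)."""
    lookup = {}
    for line in incident_lines:
        if "Incident" not in line:
            continue  # keep only incident rows, as the answer's filter step does
        # Strip the line ending first; otherwise the last field keeps its "\n".
        fields = line.rstrip("\r\n").split("\t")
        incident, codes = fields[0], fields[1:]
        for code in codes:
            lookup.setdefault(code, incident)
    return lookup


def translate_pairs(pair_lines, lookup):
    """Replace every known code in a tab-delimited row with its incident."""
    rows = []
    for line in pair_lines:
        fields = line.rstrip("\r\n").split("\t")
        # Unknown codes (and the header names) fall back to themselves.
        rows.append("\t".join(lookup.get(field, field) for field in fields))
    return rows


# Hypothetical sample taken from the first rows of the question's data.
incidents = ["Incident_00000001\tA\tD\tL\tK\n", "Incident_00000002\tB\tP\tR\n"]
pairs = ["Area_1\tArea_2\n", "A\tB\n", "K\tR\n", "A\tH\n"]

result = translate_pairs(pairs, build_lookup(incidents))
print("\n".join(result))
```

Note the `rstrip` before `split`. The script in the question splits each line without removing the line ending, so the last field of every row still carries "\n". That may well be why codes in the second column, and codes listed last on an incident line, are so rarely replaced in output (b).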

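The algorithm above involves structured computation, set operations, and order-preserving string splitting. The logic is fairly simple, but in Python it has to be built from low-level primitives, so the code is relatively hard to write. If there are no special requirements, it can be implemented in SPL, where the code is simple and easy to follow: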

	A
1	=file("d:/file1.txt").import@t()
2	=file("d:/file2.txt").read@n()
3	=A2.select(pos(~,"Incident"))
4	=A3.new((t=~.split("\t"))(1):incident,t.to(2,):code)
5	=A1.new(ifn(A4.select@1(code.pos(A1.Area_1)).incident,Area_1):Area_1,ifn(A4.select@1(code.pos(A1.Area_2)).incident,Area_2):Area_2)

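A1: Read the text of file1.

A2: Read the text of file2 line by line.

A3: Select from A2 the lines that contain the substring "Incident".

A4: Reorganize A3 into a table sequence with two fields, incident and code.

~.split("\t") splits each line of A3 into a sequence on the delimiter; the incident field takes the first member of that sequence, and the code field takes the sequence made up of the remaining members.

A5: Generate from table sequence A1 a new table sequence with the fields Area_1 and Area_2. If a code in file1 belongs to the code set of some record from file2, the new field value is that record's incident; otherwise it is the original code.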
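The record structure built in A4 and the lookup in A5 can be mirrored in Python as a rough sketch. It assumes, as the A5 description suggests, that the first matching record wins and that the original code is the fallback when nothing matches. The names `to_record` and `display_value` are hypothetical and do not come from the answer.

```python
def to_record(line):
    """Split one tab-delimited line into an incident and its ordered codes."""
    first, *rest = line.rstrip("\r\n").split("\t")
    return {"incident": first, "code": rest}


def display_value(code, records):
    """Return the incident of the first record listing `code`, else `code` itself."""
    return next((r["incident"] for r in records if code in r["code"]), code)


# Hypothetical sample: two incident rows from the question's filename2.
lines = ["Incident_00000004\tJ\tI\n", "Incident_00000005\tQ\tS\n"]
records = [to_record(line) for line in lines if "Incident" in line]

# The row "O S" from filename1: O is unknown, S belongs to Incident_00000005.
mapped = [display_value(code, records) for code in ("O", "S")]
print(mapped)
```

Scanning the records on every lookup keeps the file order of the incidents and stays close to the A5 expression; for large inputs a dictionary keyed by code would avoid the repeated scan.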
