Find rows that share members with a given set

【Question】

I needed to extract all hits from one list (List.txt) which can be found in one of the columns of another (here in Data.txt) into a third (output.txt).

Data.txt (tab delimited)

some_data more_data other_data here yet_more_data etc

A B 2 Gee;Whiz;Hello 13 12

A B 2 Gee;Whizz;Hi 56 32

E 4 Btm;Lol 16 2

T 3 Whizz 13 3

List.txt

Gee

Whiz

Lol

Ideally output.txt looks like

some_data more_data other_data here yet_more_data etc

A B 2 Gee;Whiz;Hello 13 12

A B 2 Gee;Whizz;Hi 56 32

E 4 Btm;Lol 16 2

So I tried a shell script

for ids in List.txt

do

grep $ids Data.txt >> output.txt

done

except I typed out everything (cut and paste actually) in List.txt in said script.

Unfortunately it gave me an output.txt including the last line, I assume as ‘Whizz’ contains ‘Whiz’.

I also tried cat Data.txt | egrep -F "List.txt" and that resulted in grep: conflicting matchers specified -- I suppose that was too naive of me. The actual files: List.txt contains a sorted list of 985 words, Data.txt has 115576 rows with 17 columns.

【Answer】

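If you turn the here field of each Data.txt row into a set and intersect it with the contents of List.txt, the requirement becomes simple and intuitive: keep every row whose intersection is not empty. Set operations are awkward in Shell, though, so consider using SPL instead: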


     A
1    =file("/Data.txt").import@t()
2    =file("/List.txt").read@n()
3    =A1.select(here.array(";")^A2!=[])


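In the code, "^" computes the intersection and "[]" denotes the empty set. For more on set operations, see: 【Examples of set operations in esProc】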
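
For readers without SPL, the same idea (split the here field on ";", intersect it with the list, keep rows where the intersection is non-empty) can be sketched in Python. This is only an illustration, not part of the original answer. The function name select_rows and the in-memory sample are hypothetical, and where the question's rows show fewer values than headers, the position of the empty field is an assumption.

```python
import csv
import io

# Sample modeled on the question. Rows with fewer visible values than headers
# are assumed to have an empty second field (an assumption, not from the source).
DATA = (
    "some_data\tmore_data\tother_data\there\tyet_more_data\tetc\n"
    "A\tB\t2\tGee;Whiz;Hello\t13\t12\n"
    "A\tB\t2\tGee;Whizz;Hi\t56\t32\n"
    "E\t\t4\tBtm;Lol\t16\t2\n"
    "T\t\t3\tWhizz\t13\t3\n"
)
WANTED = ["Gee", "Whiz", "Lol"]


def select_rows(data_text, wanted, column="here", sep=";"):
    """Return the header plus every row whose `column` shares a member with `wanted`."""
    wanted_set = set(wanted)
    reader = csv.reader(io.StringIO(data_text), delimiter="\t")
    header = next(reader)
    idx = header.index(column)
    kept = [header]
    for row in reader:
        if len(row) <= idx:
            continue  # skip short or blank lines
        # Whole-token comparison: "Whizz" is not equal to "Whiz", so it does not match.
        if wanted_set.intersection(row[idx].split(sep)):
            kept.append(row)
    return kept


if __name__ == "__main__":
    for row in select_rows(DATA, WANTED):
        print("\t".join(row))
```

Because members are compared as whole tokens rather than substrings, the row containing only Whizz is dropped, while the row with Gee;Whizz;Hi is kept because of Gee. For the real files, the text of Data.txt and the lines of List.txt would be read from disk and passed in the same way.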