Find rows that share members with a specified set
【Question】
I needed to extract every row of one file (Data.txt) in which one column (the "here" column) contains an entry from a list (List.txt), and write those rows into a third file (output.txt).
Data.txt (tab delimited)
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
T 3 Whizz 13 3
List.txt
Gee
Whiz
Lol
Ideally output.txt looks like
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
So I tried a shell script
for ids in List.txt
do
grep $ids Data.txt >> output.txt
done
except I typed out everything (cut and paste actually) in List.txt in said script.
Unfortunately, it gave me an output.txt that includes the last line, I assume because 'Whizz' contains 'Whiz'.
I also tried cat Data.txt | egrep -F "List.txt", which resulted in "grep: conflicting matchers specified". I suppose that was too naive of me. The actual files: List.txt contains a sorted list of 985 words, and Data.txt has 115576 rows with 17 columns.
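【Answer】
If you turn the here field of each row in Data.txt into a set and intersect it with the contents of List.txt, the requirement can be met simply and intuitively: a row is kept when the intersection is not empty. Set operations are somewhat cumbersome in Shell, though, so consider implementing this in SPL: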
| | A |
|---|---|
| 1 | =file("/Data.txt").import@t() |
| 2 | =file("/List.txt").read@n() |
| 3 | =A1.select(here.array(";")^A2!=[]) |
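In the code, "^" computes the intersection of two sets and "[]" denotes the empty set, so A3 keeps only the rows whose here field shares at least one member with List.txt. For more on set operations, see: 【Examples of set operations in esProc (集算器的集合运算举例)】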
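For readers without SPL at hand, the same set-intersection idea can be sketched in Python. This is only an illustration, not the answer's SPL code: the function name rows_sharing_members and the SAMPLE list are hypothetical, and the sample assumes that the short rows in the question have an empty second column, which the question does not confirm.

```python
def rows_sharing_members(data_lines, wanted, column="here", sep=";"):
    """Keep the header plus every row whose `column` shares a member with `wanted`.

    data_lines: tab-delimited lines, the first one being the header.
    wanted:     iterable of whole words to look for (the contents of List.txt).
    """
    wanted = set(wanted)
    header = data_lines[0].rstrip("\n")
    idx = header.split("\t").index(column)  # position of the column to test
    kept = [header]
    for line in data_lines[1:]:
        line = line.rstrip("\n")
        fields = line.split("\t")
        # Split the field into whole members, so "Whizz" never matches "Whiz".
        if len(fields) > idx and wanted & set(fields[idx].split(sep)):
            kept.append(line)
    return kept


# Hypothetical sample mirroring the question; the empty second column in the
# last two rows is an assumption about the original tab layout.
SAMPLE = [
    "some_data\tmore_data\tother_data\there\tyet_more_data\tetc",
    "A\tB\t2\tGee;Whiz;Hello\t13\t12",
    "A\tB\t2\tGee;Whizz;Hi\t56\t32",
    "E\t\t4\tBtm;Lol\t16\t2",
    "T\t\t3\tWhizz\t13\t3",
]

result = rows_sharing_members(SAMPLE, ["Gee", "Whiz", "Lol"])
print("\n".join(result))
```

With real files, the two arguments could be read with open("Data.txt").readlines() and open("List.txt").read().split(). Because the comparison is done on whole members rather than substrings, the row containing only Whizz is dropped, which is the behaviour the grep loop could not provide.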