两文件关联过滤生成新文件

【问题】
I am trying to make a comparison between two large files, tab delimited. I have been trying to use awk & bash (Ubuntu 15.10), python (v3.5) and powershell (windows 10). My only background is Java but my field tends to stick with the scripting languages.

I am trying to see

File 1 A[ ]

1 gramene gene 4854 9652 - ID=gene:GRMZM2G059865;biotype=protein_coding;description=Uncharacterized protein \[Source:UniProtKB/TrEMBL%3BAcc:C0P8I2\];gene_id=GRMZM2G059865;logic_name=genebuilder;version=1
1 gramene gene 9882 10387 - ID=gene:GRMZM5G888250;biotype=protein_coding;gene_id=GRMZM5G888250;logic_name=genebuilder;version=1
1 gramene gene 109519 111769 - ID=gene:GRMZM2G093344;biotype=protein_coding;gene_id=GRMZM2G093344;logic_name=genebuilder;version=1
1 gramene gene 136307 138929 + ID=gene:GRMZM2G093399;biotype=protein_coding;gene_id=GRMZM2G093399;logic_name=genebuilder;version=1

File 2 B [ ]

S1_6370 T/C 1 6370 +
S1_8210 T 1 8210 +
S1_8376 A 1 8376 +
S1_9889 A 1 9889 +

Output

1 ID=gene:GRMZM2G059865 4857 9652 - S1_6370 T/C 6370 + 
1 ID=gene:GRMZM2G059865 4857 9652 -S1_8210 T 8210 +
1 ID=gene:GRMZM2G059865 4857 9652 -S1_8376 A 8376 +
1 ID=gene:GRMZM5G888250 9882 10387 -S1_9889 A 9889 +

My general logic

loop (until  end of A\[  \]  and B\[  \])``if``B\[$4\]>A\[$4\]  && B\[$4\]<A\[$5\] #if the value in B column 4 is in between the values in A columns 4 & 5.``then``-F”\\t”  print  {A\[1\], A\[9(filtered)\], A\[$4FS$5\], B\[$1\], B\[$2\], B\[$3\], B\[$4\], B\[$5\]} #hopefully reflects awk column calls if the two files were able to have their columns defined that way.``movea++  \# to see if the next set of B column 4 values is in between the values in A columns 4 & 5 ``else``moveb++  #to see if the next set of A columns 4&5 values contain the current vales of B column 4 in them.

I know this logic doesn’t follow any language that I am aware of but is similar in parts. It seems like NR and FNR are two built in running values in awk. Awk helped me split up File 2 that had 10 values in B[$1] into 10 files quite easily and also cut helped with cutting out the few hundred columns (~255+) beyond the 5 you see here. Now I am working File 2 sizes around a couple MB instead of 1 file of 1.6 GB. Other than cutting down loading times, I wanted to simplify the loops. I haven’t backtracked to my previous attempts of python or powershell since I trimmed the file sizes down. I convinced myself they just weren’t going to read my files with their built in libraries or cmdlets. Which I’ll try sometime soon if I am unable to figure out an awk solution.

comparing multiple files and columns using awk #referenced Awk greater than less than but within a set range #referenced efficiently splitting one file into several files by value of column #the one thing that worked Using awk to get a specific string in line #might be able to filter column 9 How to check value of a column lies between values of two columns in other file and print corresponding value from column in Unix? #this seemed the closest but without all the printing out in a third file I wanted, still not able to figure out the syntax completely

python bash powershell

别人解答:

awk '

BEGIN{

x=getline s <"B"

split(s,b,"\\t")

}

!x {

exit

}

{

sub(/;.*/,"",$9)

while (x && $4<b\[4\] && b\[4\]<$5) {

print $1,$9,$4,$5,$7,b\[1\],b\[2\],b\[4\],b\[5\]

x=getline s <"B"; split(s,b,"\\t")

}

}

' OFS='\\t' A

【回答】
File2 较小,可读入内存加快查询速度,file1 较大,可用游标批量读取。按行读入后使用 for 循环,按列号查询并将结果追加到文件中。

JAVA 直接编码会相当繁琐。Awk 的代码很简单,但效率较差,不适合处理太大的文件。这类问题还可以用 SPL 实现,代码简单,性能也很好,且可用命令行执行,也可集成 JAVA:



A

B

1

=file("file2.txt").import()

2

=file("file1.txt").cursor()

3

for A2

=A1.select(#4>A3.#4 && #4> A3.#5)

4

=B3.new(A3.#1,A3.#7.split(";")(1),A3.#4,A3.#5,"-",#1,#2,#4,#5)

5

=file("D:/result.txt").export@a(B4)

A3:循环 A2 中的游标记录

B3:查询 file1 和 file2 中满足条件的记录

B4:按列表获取字段值,生成新的序表

B5:将查询结果追加导出到文件中

1png