文本文件间的关联计算

【问题】

I am very new to this kind of work so bear with me please :) I am trying to calculate means over ranges of patterns. E.g. I have two files which are tab delimited:

The file coverage.txt contains two colums. The first colum indicates the position and the second the value assigned to that postion. There are ca. 4*10^6 positions.

coverage.txt

1  10

2  30

3  5

4  10

The second file “patterns.txt” contains three columns 1. the name of the pattern, 2. the starting position of the pattern and 3. end position of the pattern. The pattern ranges do not overlap. There are ca. 3000 patterns.

patterns.txt

rpoB 1  2

gyrA 3  4

Now I want to calculate the mean of the values assigned to the positions of the different patterns and write the output to a new file containing the first colum of the patterns.txt as an identifier.

output.txt

rpoB 20

gyrA 7.5

I think this can be accomplished usingawkbut I do not know where to start. Your help would be greatly appreciated!

别人解答:

With four million positions, it might be time to reach for a more substantial programming language than shell/awk, but you can do it in a single pass with something like this:

awk '{

 if (FILENAME ~ "patterns.txt") {

 min[$1]=$2

 max[$1]=$3

 } else {

 for (pat in min) {

 if ($1 >= min[pat] && $1 <= max[pat]) {

 total[pat] += $2

 count[pat] += 1

 }

 }

 }

}

END {

 for (pat in total) {

 print pat,total[pat]count[pat]

 }

}' patterns.txt coverage.txt

This omits any patterns that don’t have any data in the coverage file; you can change the loop in theENDto loop over everything in the patterns file instead and just output 0s for the ones that didn’t show up.

【回答】

除了 awk,还可以用集算器实现,代码更简短,执行速度也快。

SPL 脚本如下:



A

1

=file("coverage.txt").import()

2

=file("patterns.txt").import()

3

=A2.new(#1,A1.select(#1>=A2.#2 && #1<=A2.#3).avg(#2))

4

=file("coverage.txt").export(A3)

集算器提供 JDBC 接口,可以像数据库一样嵌入到 JAVA 应用程序中,用起来很简单,可参考Java 如何调用 SPL 脚本