Excel 工作表做 JOIN

【问题】
I’m working on combining two excel worksheets. Before I start, I’d like to mention that I also have mysql workbench, so I’m open to working on this issue in either sql or vba (I should learn both). I’m working with .bed files, which are lists of genomic coordinates. In short, the data is indexed by chromosome number (ie:chr2) and then has a numerical start and stop location on the chromosome. These numerical locations can span a large range (ie:100-10,000) or be a single position (ie: 999-1000). I have a list of coordinates that cover a large range, and in a separate file I have a list of single positions.

Example of a file with ranges:

chromosome    start   stop

chr1          45616321

chr3          984211253

Example of file with single positions:

chromosome   start   stop

chr1          52135214

chr3          1025410255

I would like to combine these worksheets such that if a location in my list of single positions is found within the range in my list of ranges, the locations for both are listed in the same row. The lists are 1000s of locations long, so I’d also like this program to loop through every row. Using the example data listed above, I’d like my output to look like the following:

Example of desired output:

chromosome  start   stop  chromosome  start  stop

chr1         45616321    chr1      52135214

chr3         984211253    chr3     1025410255

There is a high probability that multiple single positions will fall within a single range, and I would like these to be listed as separate rows.

有人给出 SQL: 正确,只是要出库入库,稍显麻烦。

select a.chromosome, a.start, a stop, b.chromosome, b.start, b.stop from ranges_table a, positions_table b where b.start >= a.start and b.stop <= a.stop

有人给出 VBA:正确,SQL+ 出库入库代码。

SubSqlJoin()

Dim oConnAsNew ADODB.Connection

Dim oRSAsNew ADODB.Recordset

Dim sPath

Dim sSQLAsString, wbAsWorkbook

Set wb =ThisWorkbook

    sSQL ="select a.chromosome, a.start, a stop," & _

"b.chromosome, b.start, b.stop" & _

"from <ranges_table> a, <positions_table> b" & _

"where b.start >= a.start and b.stop <= a.stop"

    sSQL =Replace(sSQL,“<ranges_table>”, _

Rangename(wb.Worksheets(“Ranges”).Range(“A1”).CurrentRegion))

    sSQL =Replace(sSQL,“<positions_table>”, _

Rangename(wb.Worksheets(“Positions”).Range(“A1”).CurrentRegion))

If wb.Path <>""Then

      sPath = wb.FullName

Else

MsgBox"The workbook must be saved first!"

ExitSub

EndIf

    oConn.Open"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=‘" & sPath &"’;" & _

“Extended Properties=‘Excel 12.0;HDR=Yes;IMEX=1’;”

    oRS.Open sSQL, oConn

IfNot oRS.EOFThen

        wb.Worksheets(“Results”).Range(“A2”).CopyFromRecordset oRS

Else

MsgBox"No records found"

EndIf

    oRS.Close

    oConn.Close

EndSub

FunctionRangename(rAsRange)AsString

Rangename =“[” & r.Parent.Name &“$” & _

                r.Address(False,False) &“]”

EndFunction

【回答】
用 VBA 太麻烦了,用 SPL 几句就算出来了

A
1 =file(“D:\\range.xlsx”).xlsimport@t()
2 =file(“D:\\position.xlsx”).xlsimport@t()
3 =xjoin(A1;A2,start>A1.start && stop<A1.stop)
4 =file(“D:\\result.xlsx”).xlsexport(A3,#1.chromosome,#1.start,#1.stop,#1.chromosome,#2.start,#2.stop)

运行结果:

A1:读取 excel 文件返回序表

A2:读取 excel 文件返回序表

A3:将两个序列进行叉乘

A4:将序表输出到新 excel 文件中