"一、 SQL 及分析查询SQL语句如下： select * from ( select s_name, count(*) as numwait from supplier,lineitem .."

sjr 北京
乾学院 32 号会员
1 回帖 • 1,436 浏览 • 3 年前

用 TPCH 练习性能优化 Q21

计算＆AI

一、 SQL及分析

查询SQL语句如下：

select * from (
    select
        s_name,
        count(*) as numwait
    from
        supplier,lineitem l1,orders,nation
    where
        s_suppkey = l1.l_suppkey
        and o_orderkey = l1.l_orderkey
        and o_orderstatus = 'F'
        and l1.l_receiptdate > l1.l_commitdate
        and exists (
            select
                *
            from
                lineitem l2
            where
                l2.l_orderkey = l1.l_orderkey
                and l2.l_suppkey <> l1.l_suppkey
        )
        and not exists (
            select
                *
            from
                lineitem l3
            where
                l3.l_orderkey = l1.l_orderkey
                and l3.l_suppkey <> l1.l_suppkey
                and l3.l_receiptdate > l3.l_commitdate
        )
        and s_nationkey = n_nationkey
        and n_name = 'CHINA'
    group by
        s_name
    order by
        numwait desc,
        s_name
) where rownum<=100;

这主查询比较简单，就是主子表和外键表的连接分组聚合运算，过滤条件比较复杂，有两个exists的子查询。

二、 SPL实现

两个exists子查询都是针对同一个l_orderkey下的lineitem记录的一些运算。

我们知道，lineitem已经按l_orderkey排序，而且子表记录可以看作是主表的集合字段，如果我们将orders和lineitem的连接结果集按orderkey做有序分组（但先不聚合），可以得到一个个由相同l_orderkey值的lineitem记录的小集合，再在这个小集合中去计算上述那两个exists条件就会比较简单一些。

	A
1	=now()
2	>name="CHINA"
3	=file("nation.btx").import@b().select@1(N_NAME==name).N_NATIONKEY
4	=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NAME;S_NATIONKEY==A3).fetch().keys@im(S_SUPPKEY)
5	=file("orders.ctx").open().cursor@m(O_ORDERKEY;O_ORDERSTATUS=="F")
6	=file("lineitem.ctx").open().news@r(A5,L_ORDERKEY:ORDERKEY;A4.find(L_SUPPKEY))
7	=file("lineitem.ctx").open().news(A6:ORDERKEY,L_ORDERKEY,L_SUPPKEY,L_RECEIPTDATE,L_COMMITDATE)
8	=A7.group(L_ORDERKEY)
9	=A8.conj(if((t=~.select(L_RECEIPTDATE>L_COMMITDATE)) && (s=t.#2) && !t.pselect(#2!=s) && ~.pselect(#2!=s),t,null) )
10	=A9.switch@i(L_SUPPKEY,A4)
11	=A10.groups@u(L_SUPPKEY.S_NAME:s_name;count(1):numwait)
12	=A11.top(100;-numwait,s_name)
13	=interval@ms(A1,now())

A6中对orders和lineitem做连接过滤出满足条件的ORDERKEY，A7再与A6连接选出需要的字段，A8再对其做有序分组，分成由相同l_orderkey值的lineitem记录集合，A9中对这些小集合实现exists判断，过滤掉不满足条件的。后面的代码把其它外键表关联上再做常规分组运算即可。

三、进一步优化

1. 优化方法

本例中要使用Q1中介绍的日期整数化优化方法，lineitem中的L_COMMITDATE和L_RECEIPTDATE已在前面例子中转换过了。要使用Q1中介绍的字符串整数化方法，本例要对orders中的O_ORDERSTATUS进行转换。还要使用Q2中介绍的维表主键序号化方法，supplier中的S_SUPPKEY、lineitem中的L_SUPPKEY都已在之前的例子中转换过了。

2. 数据转换代码

2.1 nation、supplier、lineitem转换

复制nation_20.btx、supplier_20.ctx、lineitem_20.ctx分别重命名为nation_21.btx、supplier_21.ctx、lineitem_21.ctx。

2.2 orders转换

	A
1	=file("orders.ctx").open().cursor(O_ORDERSTATUS)
2	=A1.id(O_ORDERSTATUS).sort()
3	=file("o_orderstatus.txt").export(A2)
4	=file("orders_18.ctx").open().cursor()
5	=A4.run(O_ORDERSTATUS=A2.pos@b(O_ORDERSTATUS))
6	=file("orders_21.ctx").create(#O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE,O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT)
7	>A6.append(A5)

3. 数据转换后的计算代码

维表和字符串字段取值列表需要预加载，加载代码如下：

	A
1	>env(o_orderstatus,file("o_orderstatus.txt").import@si())
2	>env(nation, file("nation_21.btx").import@b())
3	>env(supplier, file("supplier_21.ctx").open().import())

在查询计算之前，需要先运行预加载代码，将小维表读入内存。

计算代码：

	A
1	=now()
2	>name="CHINA"
3	=o_orderstatus.pos@b("F")
4	=nation.select@1(N_NAME==name).N_NATIONKEY
5	=supplier.@m(if(S_NATIONKEY==A4,S_NAME,null))
6	=file("orders_21.ctx").open().cursor@m(O_ORDERKEY;O_ORDERSTATUS==A3)
7	=file("lineitem_21.ctx").open().news@r(A6,L_ORDERKEY:ORDERKEY;A5(L_SUPPKEY))
8	=file("lineitem_21.ctx").open().news(A7:ORDERKEY,L_ORDERKEY,L_SUPPKEY,L_COMMITDATE,L_RECEIPTDATE)
9	=A8.group(L_ORDERKEY)
10	=A9.conj(if((t=~.select(L_RECEIPTDATE>L_COMMITDATE)) && (s=t.#2) && !t.pselect(#2!=s) && ~.pselect(#2!=s),t.select(A5(L_SUPPKEY)),null) )
11	=A10.groups@u(L_SUPPKEY:s_name;count(1):numwait).run@m(s_name=A5(s_name))
12	=A11.top(100;-numwait,s_name)
13	=interval@ms(A1,now())

四、使用企业版列式计算

1. 原始数据

	A
1	=now()
2	>name="CHINA"
3	=file("nation.btx").import@b().select@1(N_NAME==name).N_NATIONKEY
4	=file("supplier.ctx").open().cursor@mv(S_SUPPKEY,S_NAME;S_NATIONKEY==A3).fetch().keys@im(S_SUPPKEY)
5	=file("orders.ctx").open().cursor@mv(O_ORDERKEY;O_ORDERSTATUS=="F")
6	=file("lineitem.ctx").open().news@r(A5,L_ORDERKEY:ORDERKEY,min(L_SUPPKEY):SK,count(1):cnt; A4.find(L_SUPPKEY) && L_RECEIPTDATE>L_COMMITDATE)
7	=file("lineitem.ctx").open().news@r(A6:ORDERKEY,L_ORDERKEY,SK,cnt,count(L_RECEIPTDATE>L_COMMITDATE && SK!=L_SUPPKEY):c1,count(SK!=L_SUPPKEY):c2).select@v(c1==0 && c2!=0)
8	=A7.groups@u(SK;sum(cnt):numwait).join(SK,A4,S_NAME).new(S_NAME:s_name,numwait)
9	=A8.top(100;-numwait,s_name)
10	=interval@ms(A1,now())

2. 优化后数据