"一、 SQL 及分析查询SQL语句如下： select cntrycode, count(*) as numcust, sum(c_acctbal) as totacctbal from ( .."

sjr 北京
乾学院 32 号会员
1 回帖 • 966 浏览 • 2 年前

用 TPCH 练习性能优化 Q22

计算＆AI

一、 SQL及分析

查询SQL语句如下：

select
    cntrycode,
    count(*) as numcust,
    sum(c_acctbal) as totacctbal
from (
    select
        substr(c_phone,1,2) as cntrycode,
        c_acctbal
    from
        customer
    where
        substr(c_phone,1,2) in ('11', '14', '15', '19', '20', '21', '23')
        and c_acctbal > (
            select
                avg(c_acctbal)
            from
                customer
            where
                c_acctbal > 0.00
                and substr(c_phone,1,2) in    ('11', '14', '15', '19', '20', '21', '23')
        )
        and not exists (
            select
                *
            from
                orders
            where
                o_custkey = c_custkey
        )
    ) custsale
group by
    cntrycode
order by
    cntrycode;

这个主查询外层是个常规的分组聚合，内层有两个用于条件的子查询。

二、 SPL实现

这两个子查询能对应一些满足条件的customer记录，后一个子查询是用了not exist修饰，最后结果应当是这两个子查询对应的customer记录集合的差集。

	A
1	=now()
2	=["11","14","15","19","20","21","23"]
3	=file("customer.ctx").open().cursor@m(C_CUSTKEY,C_PHONE,C_ACCTBAL;C_ACCTBAL>0.0 && A2.contain(left(C_PHONE,2))).fetch()
4	=A3.avg(C_ACCTBAL)
5	=A3.select@m(C_ACCTBAL>A4).derive@o().keys@im(C_CUSTKEY)
6	=file("orders.ctx").open().cursor@m(O_CUSTKEY;O_CUSTKEY:A5)
7	=A6.run(O_CUSTKEY.C_CUSTKEY=null).skip()
8	=A5.select@m(C_CUSTKEY)
9	=A8.groups(left(C_PHONE,2):cntrycode;count(1):numcust, sum(C_ACCTBAL):totacctbal)
10	=interval@ms(A1,now())

A5选出满足第一个子查询条件的记录，A6中与A5关联后选出第二个子查询的记录，A7中令这些记录的C_CUSTKEY为null，意即从A5中删除它们，从而实现第二个子查询的条件not exists，A8中选出的即是满足两个子查询的记录。

三、进一步优化

1. 优化方法

本例中要使用Q2中介绍的维表主键序号化方法，customer中的C_CUSTKEY、orders中的O_CUSTKEY都已在之前的例子中转换过了。

2. 数据转换代码

复制customer_18.ctx、orders_21.ctx分别重命名为customer_22.ctx、orders_22.ctx。

3. 数据转换后的计算代码

维表需要预加载，加载代码如下：

	A
1	>env(customer, file("customer_22.ctx").open().import())

在查询计算之前，需要先运行预加载代码，将小维表读入内存。

计算代码：

	A
1	=now()
2	=["11","14","15","19","20","21","23"]
3	=customer.@m(if(C_ACCTBAL>0.0 && A2.contain(left(C_PHONE,2)),C_ACCTBAL,null))
4	=A3.avg()
5	=A3.@m(~>A4)
6	=file("orders_22.ctx").open().cursor@m(O_CUSTKEY;A5(O_CUSTKEY)).run(A5(O_CUSTKEY)=false).skip()
7	=customer(A5.pselect@am(~)).groups@m(left(C_PHONE,2):cntrycode;count(1):numcust, sum(C_ACCTBAL):totacctbal)
8	=interval@ms(A1,now())

四、使用企业版列式计算

1. 原始数据

	A
1	=now()
2	=["11","14","15","19","20","21","23"]
3	=file("customer.ctx").open().cursor@mv(C_CUSTKEY,C_PHONE,C_ACCTBAL;C_ACCTBAL>0.0 && A2.contain(left(C_PHONE,2))).fetch()
4	=A3.avg(C_ACCTBAL)
5	=A3.select@mv(C_ACCTBAL>A4).derive@o().keys@im(C_CUSTKEY)
6	=file("orders.ctx").open().cursor@mv(O_CUSTKEY;O_CUSTKEY:A5)
7	=A6.run(O_CUSTKEY.C_CUSTKEY=null).skip()
8	=A5.select@mv(C_CUSTKEY)
9	=A8.groups@m(left(C_PHONE,2):cntrycode;count(1):numcust, sum(C_ACCTBAL):totacctbal)
10	=interval@ms(A1,now())

2. 优化后数据