用 TPCH 练习性能优化 Q22
一、 SQL及分析
查询SQL语句如下:
select
cntrycode,
count(*) as numcust,
sum(c_acctbal) as totacctbal
from (
select
substr(c_phone,1,2) as cntrycode,
c_acctbal
from
customer
where
substr(c_phone,1,2) in ('11', '14', '15', '19', '20', '21', '23')
and c_acctbal > (
select
avg(c_acctbal)
from
customer
where
c_acctbal > 0.00
and substr(c_phone,1,2) in ('11', '14', '15', '19', '20', '21', '23')
)
and not exists (
select
*
from
orders
where
o_custkey = c_custkey
)
) custsale
group by
cntrycode
order by
cntrycode;
这个主查询外层是个常规的分组聚合,内层有两个用于条件的子查询。
二、 SPL实现
这两个子查询能对应一些满足条件的customer记录,后一个子查询是用了not exist修饰,最后结果应当是这两个子查询对应的customer记录集合的差集。
A | |
1 | =now() |
2 | =["11","14","15","19","20","21","23"] |
3 | =file("customer.ctx").open().cursor@m(C_CUSTKEY,C_PHONE,C_ACCTBAL;C_ACCTBAL>0.0 && A2.contain(left(C_PHONE,2))).fetch() |
4 | =A3.avg(C_ACCTBAL) |
5 | =A3.select@m(C_ACCTBAL>A4).derive@o().keys@im(C_CUSTKEY) |
6 | =file("orders.ctx").open().cursor@m(O_CUSTKEY;O_CUSTKEY:A5) |
7 | =A6.run(O_CUSTKEY.C_CUSTKEY=null).skip() |
8 | =A5.select@m(C_CUSTKEY) |
9 | =A8.groups(left(C_PHONE,2):cntrycode;count(1):numcust, sum(C_ACCTBAL):totacctbal) |
10 | =interval@ms(A1,now()) |
A5选出满足第一个子查询条件的记录,A6中与A5关联后选出第二个子查询的记录,A7中令这些记录的C_CUSTKEY为null,意即从A5中删除它们,从而实现第二个子查询的条件not exists,A8中选出的即是满足两个子查询的记录。
三、 进一步优化
1. 优化方法
本例中要使用Q2中介绍的维表主键序号化方法,customer中的C_CUSTKEY、orders中的O_CUSTKEY都已在之前的例子中转换过了。
2. 数据转换代码
复制customer_18.ctx、orders_21.ctx分别重命名为customer_22.ctx、orders_22.ctx。
3. 数据转换后的计算代码
维表需要预加载,加载代码如下:
A | |
1 | >env(customer, file("customer_22.ctx").open().import()) |
在查询计算之前,需要先运行预加载代码,将小维表读入内存。
计算代码:
A | |
1 | =now() |
2 | =["11","14","15","19","20","21","23"] |
3 | =customer.@m(if(C_ACCTBAL>0.0 && A2.contain(left(C_PHONE,2)),C_ACCTBAL,null)) |
4 | =A3.avg() |
5 | =A3.@m(~>A4) |
6 | =file("orders_22.ctx").open().cursor@m(O_CUSTKEY;A5(O_CUSTKEY)).run(A5(O_CUSTKEY)=false).skip() |
7 | =customer(A5.pselect@am(~)).groups@m(left(C_PHONE,2):cntrycode;count(1):numcust, sum(C_ACCTBAL):totacctbal) |
8 | =interval@ms(A1,now()) |
四、 使用企业版列式计算
1. 原始数据
A | |
1 | =now() |
2 | =["11","14","15","19","20","21","23"] |
3 | =file("customer.ctx").open().cursor@mv(C_CUSTKEY,C_PHONE,C_ACCTBAL;C_ACCTBAL>0.0 && A2.contain(left(C_PHONE,2))).fetch() |
4 | =A3.avg(C_ACCTBAL) |
5 | =A3.select@mv(C_ACCTBAL>A4).derive@o().keys@im(C_CUSTKEY) |
6 | =file("orders.ctx").open().cursor@mv(O_CUSTKEY;O_CUSTKEY:A5) |
7 | =A6.run(O_CUSTKEY.C_CUSTKEY=null).skip() |
8 | =A5.select@mv(C_CUSTKEY) |
9 | =A8.groups@m(left(C_PHONE,2):cntrycode;count(1):numcust, sum(C_ACCTBAL):totacctbal) |
10 | =interval@ms(A1,now()) |
2. 优化后数据
维表需要预加载,加载代码如下:
A | |
1 | >env(customer, file("customer_22.ctx").open().import@v()) |
在查询计算之前,需要先运行预加载代码,将小维表读入内存。
计算代码:
A | |
1 | =now() |
2 | =["11","14","15","19","20","21","23"] |
3 | =customer.@m(if(C_ACCTBAL>0.0 && A2.contain(left(C_PHONE,2)),C_ACCTBAL,null)) |
4 | =A3.avg() |
5 | =A3.@m(~>A4) |
6 | =file("orders_22.ctx").open().cursor@mv(O_CUSTKEY;A5(O_CUSTKEY)).run(A5(O_CUSTKEY)=false).skip() |
7 | =customer(A5.pselect@am(~)).groups@m(left(C_PHONE,2):cntrycode;count(1):numcust, sum(C_ACCTBAL):totacctbal) |
8 | =interval@ms(A1,now()) |
五、 测试结果
单位:秒
常规 | 列式 | |
优化前 | 7.5 | 4.2 |
优化后 | 3.9 | 2.2 |
英文版