用 TPCH 练习性能优化 Q4
一、 SQL及分析
查询SQL语句如下:
select
o_orderpriority,
count(*) as order_count
from
orders
where
o_orderdate >= date '1995-10-01'
and o_orderdate < date '1995-10-01' + interval '3' month
and exists (
select * from lineitem
where l_orderkey = o_orderkey and l_commitdate < l_receiptdate
)
group by
o_orderpriority
order by
o_orderpriority;
它有一个exists子查询,找出子表lineitem中有满足条件l_commitdate < l_receiptdate的记录。exists子句都可以改写成join,将exists改写成join之后是这样:
select
o_orderpriority,
count(*) as order_count
from (
select
distinct l_orderkey,o_orderpriority
from orders
join lineitem on
l_orderkey = o_orderkey
where
o_orderdate >= date '1995-10-01'
and o_orderdate < date '1995-10-01' + interval '3' month
and l_commitdate < l_receiptdate
)
group by
o_orderpriority
order by
o_orderpriority;
改写之后可以发现,这仍然是个主子表连接的问题,而且这里的连接运算是为了生成针对主表的某种过滤条件,中间结果集实际上是和orders表对应的,所以还要对连接结果做针对orderkey的去重运算。
二、 SPL实现
采用Q3所说的有序归并方法实现连接,同时统计出针对主表的过虑条件。
A |
|
1 |
=now() |
2 |
1995-10-01 |
3 |
=elapse@m(A2,3) |
4 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A2 && O_ORDERDATE<A3) |
5 |
=file("lineitem.ctx").open().news@r(A4,O_ORDERPRIORITY;L_COMMITDATE<L_RECEIPTDATE) |
6 |
=A5.groups(O_ORDERPRIORITY;count(1):order_count) |
7 |
=interval@ms(A1,now()) |
A4、A5使用了前面解释过的技巧,A5中用news@r对子表lineitem按L_ORDERKEY分组,过滤掉不满足L_COMMITDATE<L_RECEIPTDATE条件的组,每组保留一条记录。
A6中对O_ORDERPRIORITY分组,统计各组的记录数。
三、 进一步优化
1. 优化方法
本例中要使用Q1中介绍的日期整数化优化方法,orders中的O_ORDERDATE已在Q3中转换过了,只须对lineitem中的L_COMMITDATE和L_RECEIPTDATE进行转化。
还要使用Q1中介绍的字符串整数化方法,对O_ORDERPRIORITY进行转化。
2. 数据转换代码
2.1 orders转换
A |
|
1 |
=file("orders.ctx").open().cursor(O_ORDERPRIORITY) |
2 |
=A1.id(O_ORDERPRIORITY).sort() |
3 |
=file("o_orderpriority.txt").export(A2) |
4 |
=file("orders_3.ctx").open().cursor() |
5 |
=A4.run(O_ORDERPRIORITY=A2.pos@b(O_ORDERPRIORITY)) |
6 |
=file("orders_4.ctx").create(#O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE, O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT) |
7 |
>A6.append(A5) |
2.2 lineitem转换
A |
|
1 |
=file("lineitem_3.ctx").open().cursor() |
2 |
=A1.run(L_COMMITDATE=days@o(L_COMMITDATE), L_RECEIPTDATE=days@o(L_RECEIPTDATE)) |
3 |
=file("lineitem_4.ctx").create@p(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT) |
4 |
>A3.append(A2) |
A2 将L_COMMITDATE和L_RECEIPTDATE转换为小整数
3. 数据转换后的计算代码
A |
|
1 |
>o_orderpriority=file("o_orderpriority.txt").import@si() |
2 |
=now() |
3 |
1995-10-01 |
4 |
=days@o(elapse@m(A3,3)) |
5 |
=days@o(A3) |
6 |
=file("orders_4.ctx").open().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A5 && O_ORDERDATE<A4) |
7 |
=file("lineitem_4.ctx").open().news@r(A6,O_ORDERPRIORITY;L_COMMITDATE<L_RECEIPTDATE) |
8 |
=A7.groups(O_ORDERPRIORITY;count(1):order_count) |
9 |
=A8.run(O_ORDERPRIORITY=o_orderpriority(O_ORDERPRIORITY)) |
10 |
=interval@ms(A2,now()) |
A9 将整数化后的O_ORDERPRIORITY再还原成原始字符串。
四、 使用企业版列式计算
1. 原始数据
A |
|
1 |
=now() |
2 |
1995-10-01 |
3 |
=elapse@m(A2,3) |
4 |
=file("orders.ctx").open().cursor@mv(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A2 && O_ORDERDATE<A3) |
5 |
=file("lineitem.ctx").open().news@r(A4,O_ORDERPRIORITY;L_COMMITDATE<L_RECEIPTDATE) |
6 |
=A5.groups(O_ORDERPRIORITY;count(1):order_count) |
7 |
=interval@ms(A1,now()) |
A4中cursor函数加选项@v,表示使用列式游标读数。A5中使用news函数与A4创建同步游标,它也与同步的A4一样是列式游标。
2. 优化后数据
A |
|
1 |
>o_orderpriority=file("o_orderpriority.txt").import@si() |
2 |
=now() |
3 |
1995-10-01 |
4 |
=days@o(elapse@m(A3,3)) |
5 |
=days@o(A3) |
6 |
=file("orders_4.ctx").open().cursor@mv(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A5 && O_ORDERDATE<A4) |
7 |
=file("lineitem_4.ctx").open().news@r(A6,O_ORDERPRIORITY;L_COMMITDATE<L_RECEIPTDATE) |
8 |
=A7.groups(O_ORDERPRIORITY;count(1):order_count) |
9 |
=A8.new(o_orderpriority(O_ORDERPRIORITY):O_ORDERPRIORITY,order_count) |
10 |
=interval@ms(A2,now()) |
五、 测试结果
单位:秒
常规 |
列式 |
|
优化前 |
6.5 |
5.2 |
优化后 |
4.7 |
3.3 |
英文版