"一、 SQL 及分析查询SQL语句如下： select c_count, count(*) as custdist from ( select c_custkey, count(o_orde .."

sjr 北京
乾学院 32 号会员
1 回帖 • 881 浏览 • 3 年前

用 TPCH 练习性能优化 Q13

计算＆AI

一、 SQL及分析

查询SQL语句如下：

select
    c_count,
    count(*) as custdist
from (
    select
        c_custkey,
        count(o_orderkey) c_count
    from
        customer left outer join orders on
            c_custkey = o_custkey
            and o_comment not like '%special%accounts%'
    group by
        c_custkey
) c_orders
group by
    c_count
order by
    custdist desc,
    c_count desc;

这个查询简单看是对orders做两轮常规分组，第一轮按custkey分组计算出每个顾客的下单数，第二轮再按下单数分组计算出每种下单数各有多少顾客。

二、 SPL实现

注意到原SQL中有个左连接，需要将没有下单过的顾客（下单数为0）也统计在内，而上述二轮分组过程则会遗漏掉这些数据，需要事后再补充一下。

	A
1	=now()
2	>filter="specialaccounts*"
3	=file("orders.ctx").open().cursor@m(O_CUSTKEY;!like(O_COMMENT,filter))
4	=A3.groups@u(O_CUSTKEY;count(1):c_count)
5	=A4.len()
6	=A4.groups@um(c_count;count(1):custdist)
7	=file("customer.ctx").open().cursor@m().skip()
8	=A6.insert(0,0,A7-A5)
9	=A6.sort@z(custdist,c_count)
10	=interval@ms(A1,now())

A4做第一轮分组，A6做第二轮；A7计算所有客户数，减去已下单的就是没下单的客户数，补充到A6上再一起排序。

A4中用groups@u分组，@u表示分组后不用按分组字段O_CUSTKEY排序。

三、进一步优化

1. 优化方法

SPL中groups函数在分组时，如果分组字段是序号，那么可以用@n选项直接定位，避免hash计算。本例使用Q2中介绍的维表主键序号化方法，customer中的C_CUSTKEY和orders中的O_CUSTKEY都已在前面的例子中转换过了。

2. 数据转换代码

复制 customer_10.ctx、orders_12.ctx分别重命名为customer_13.ctx、orders_13.ctx。

3. 数据转换后的计算代码

计算代码：

	A
1	=now()
2	>filter="specialaccounts*"
3	=file("customer_13.ctx").open().cursor().skip()
4	=file("orders_13.ctx").open().cursor@m(O_CUSTKEY;!like(O_COMMENT,filter))
5	=A4.groups@n0(O_CUSTKEY;count(1):c_count;A3)
6	=A5.len()
7	=A5.groups@um(c_count;count(1):custdist)
8	=A7.insert(0,0,A3-A6)
9	=A7.sort@z(custdist,c_count)
10	=interval@ms(A1,now())

A5中groups@n0表示直接按O_CUSTKEY的值定位到第几组，@0表示去掉空组。

上述采用序表游标分组的方法占用内存较大，序号化后直接使用序列来进行分组统计，可以少占内存，提高性能。

计算代码：

	A	B
1	=now()
2	>filter="specialaccounts*"
3	=file("customer_13.ctx").open().cursor().skip()
4	=file("orders_13.ctx").open().cursor@m(O_CUSTKEY;!like(O_COMMENT,filter))
5	fork A4	=A3.(0)
6		=A5.run(B5(O_CUSTKEY)+=1).skip()
7		return B5
8	=transpose(A5).(~.sum()).groups@m(~:c_count;count(1):custdist)
9	=A8.sort@z(custdist,c_count)
10	=interval@ms(A1,now())

A3 求出customer表的记录数(为便于下文描述，记为n)

A5 对A4游标进行多线程分段计算，每个线程在B5中定义一个有n个成员的序列，赋初值为0。在B6中循环当前线程对应的orders游标分段，让序列的第O_CUSTKEY个成员值加1。假如并行线程数为m，多线程运行完后，会在A5得到m个长度为n的序列，分别表示各分段中O_CUSTKEY的下单数。

A8 对A5进行行列转置后，将每行加起来，即得到总的orders表中各O_CUSTKEY的下单数。然后用groups计算出每种下单数对应的顾客数。

四、使用企业版列式计算

1. 原始数据

	A
1	=now()
2	>filter="specialaccounts*"
3	=file("orders.ctx").open().cursor@mv(O_CUSTKEY;!like(O_COMMENT,filter))
4	=A3.groups@uz(O_CUSTKEY;count(1):c_count;10000000)
5	=A4.len()
6	=A4.groups@um(c_count;count(1):custdist).o()
7	=file("customer.ctx").open().cursor().skip()
8	=A6.insert(0,0,A7-A5)
9	=A6.sort@z(custdist,c_count)
10	=interval@ms(A1,now())

A4中groups加选项@z，表示多路并行分组时各路共享结果集，最后一个参数是分组比对时的HASH空间长度，一般应设为分组结果集的分组数，本例中分组后有9999818组，将参数设成了10000000。选项@z在分组结果集很大时(一般百万组以上)使用。

A6中的o()函数将列式序表转换成常规序表，以便在A8中调用insert方法

2. 优化后数据

	A
1	=now()
2	>filter="specialaccounts*"
3	=file("orders_13.ctx").open().cursor@mv(O_CUSTKEY;!like(O_COMMENT,filter))
4	=A3.groups@uz(O_CUSTKEY;count(1):c_count;10000000)
5	=A4.len()
6	=A4.groups@um(c_count;count(1):custdist).o()
7	=file("customer_13.ctx").open().cursor().skip()
8	=A6.insert(0,0,A7-A5)
9	=A6.sort@z(custdist,c_count)
10	=interval@ms(A1,now())

列式计算还未提供groups@n方法，所以与原始数据的脚本是一样的

五、测试结果

单位：秒

	常规	列式
优化前	132	8.4
序号化后用groups@n	80	8.4
序号化后用序列	44	-

tpch(25) 性能优化(158)