"select sum(l_extendedprice) / 7.0 as avg_yearly from lineitem,part where p_partkey = l_partkey a .."

terminator 北京
陈虎。真心想用集算器这样的工具降低程序员们的劳动强度，真心想成为高性能复杂计算、大数据计算编程难题的终结者。
1 回帖 • 472 浏览 • 1 年前

性能优化案例课程 TPCH-Q17

计算＆AI

集算器(535)

select
    sum(l_extendedprice) / 7.0 as avg_yearly
from
    lineitem,part
where
    p_partkey = l_partkey
    and p_brand = 'Brand#33'
    and p_container = 'LG DRUM'
    and l_quantity < (
        select
            0.2 * avg(l_quantity) 
        from
            lineitem
        where
            l_partkey = p_partkey
);

这种在子查询中用等值条件与主表关联的情况，都可以转换成 JOIN 来计算，相当于如下 SQL：

select
    sum(l_extendedprice) / 7.0 as avg_yearly
from
    lineitem,part,
    (select 
        l_partkey lp, 
        0.2*avg(l_quantity) lq
    from 
        lineitem
    group by 
        l_partkey
    ) lpq
where
    p_partkey = l_partkey
    and p_brand = 'Brand#33'
    and p_container = 'LG DRUM'
    and l_partkey = lp
    and l_quantity < lq;

子查询 lpq 是分组聚合运算，结果可以看成以 lp 为主键的表。这样，主查询就相当于是事实表 lineitem 关联两个维表 lpq 和 part 后的聚合运算。

1. 数据存储

lineitem、part 表没有什么特殊要求，按照主键有序存储。

继续使用题目 Q3 中的 lineitem.ctx 和 Q2 中的 part.ctx，复制到本题的主目录中。

2. 一般实现

计算代码：

	A	B
1	=now()
2	=brand="Brand#33"	=container="LG DRUM"
3	=file("part.ctx").open().cursor@m(P_PARTKEY;P_BRAND==brand && P_CONTAINER==container).fetch().keys@im(P_PARTKEY)
4	=file("lineitem.ctx").open()
5	=A4.cursor@m(L_PARTKEY,L_QUANTITY;A3.find(L_PARTKEY))
6	=A5.groups@u(L_PARTKEY;avg(L_QUANTITY):avg).run(avg/=5).keys@im(L_PARTKEY)
7	=A4.cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;L_PARTKEY:A6)
8	=A7.total(sum(if(L_QUANTITY<L_PARTKEY.avg,L_EXTENDEDPRICE,0)))/7.0
9	=interval@ms(A1,now())

A5、A7 采用了游标前过滤，过滤条件是主键查找。

测试结果：

测试项目	执行时间（秒）
一般实现	10

3. 数据转换

本题要使用维表主键序号化方法，Q8 中的 lineitem_4.ctx 和 Q16 中的 part_2.ctx 复制到本题的主目录。

在 Q16 的 part_2.ctx 已经完成了枚举型字符串字段 p_brand 的数字化，还要把枚举型字符串字段 p_container 数字化。

数据转换代码：

	A
1	=file("part_16_3.ctx").open().cursor().fetch()
2	=A1.id(P_CONTAINER).sort()
3	=file("p_container.btx").export@b(A2)
4	=A1.run(P_CONTAINER=A2.pos@b(P_CONTAINER))
5	=file("part_17_3.ctx").create(#P_PARTKEY, P_NAME,P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE, P_COMMENT)
6	>A5.append(A4.cursor())

计算代码：

	A	B
1	=now()
2	=file("p_brand.btx").import@b().(_1)
3	=file("p_container.btx").import@b().(_1)
4	=brand=A2.pos@b("Brand#33")	=container=A3.pos@b("LG DRUM")
5	=file("part_17_3.ctx").open()	=A5.cursor().skip().(false)
6	=A5.cursor@m(P_PARTKEY;P_BRAND==A4 && P_CONTAINER==B4).fetch().(B5(P_PARTKEY)=true)
7	=file("lineitem_4.ctx").open()
8	=A7.cursor@m(L_PARTKEY,L_QUANTITY;B5(L_PARTKEY))
9	=A8.groups@u(L_PARTKEY;avg(L_QUANTITY):avg).run(B5(L_PARTKEY)=avg*0.2)
10	=A7.cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;L_PARTKEY:B5:#)
11	=A10.total(sum(if(L_QUANTITY<L_PARTKEY,L_EXTENDEDPRICE,0)))/7.0
12	=interval@ms(A1,now())

A6 是对位序列，值是 true、false。

A9 中把对位序列 A6 的值更新为 false 或者 avg*0.2，这样在 A10 中，符合条件记录的 L_PARTKEY 会被赋值为 avg*0.2，就可以在 A11 中参加后续计算。

测试结果：

测试项目	执行时间（秒）
一般实现	10
数据变换	9

数据变换后提速不明显，是因为上一节的游标前过滤条件是主键查找计算，性能和本节使用的对位序列相差不是很大。

4. 列式计算

计算代码：

	A	B
1	=now()
2	=file("p_brand.btx").import@b().(_1)
3	=file("p_container.btx").import@b().(_1)
4	=brand=A2.pos@b("Brand#33")	=container=A3.pos@b("LG DRUM")
5	=file("part_17_3.ctx").open()	=A5.cursor().skip().(false)
6	=A5.cursor@mv(P_PARTKEY;P_BRAND==A4 && P_CONTAINER==B4).fetch().(B5(P_PARTKEY)=true)
7	=file("lineitem_4.ctx").open()
8	=A7.cursor@mv(L_PARTKEY,L_QUANTITY;B5(L_PARTKEY))
9	=A8.groups@u(L_PARTKEY;avg(L_QUANTITY):avg).run(B5(L_PARTKEY)=avg*0.2)
10	=A7.cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;B5(L_PARTKEY))
11	=A10.total(sum(if(L_QUANTITY<B5(L_PARTKEY),L_EXTENDEDPRICE,0)))/7.0
12	=interval@ms(A1,now())

列式游标不能用 switch 给字段赋值，所以 A10 的过滤条件由上一节的 L_PARTKEY:B5:# 变为 B5(L_PARTKEY)，只过滤不赋值，在 A11 中再用 B5(L_PARTKEY) 计算一次。

测试结果：

测试项目	执行时间（秒）
一般实现	10
数据转换	9
列式计算	6

集算器(535)