"select * from ( select c_custkey,c_name, sum(l_extendedprice * (1 - l_discount)) as revenue, c_a .."

terminator 北京
陈虎。真心想用集算器这样的工具降低程序员们的劳动强度，真心想成为高性能复杂计算、大数据计算编程难题的终结者。
1 回帖 • 619 浏览 • 1 年前

性能优化案例课程 TPCH-Q10

计算＆AI

集算器(535)

select * from (
    select
        c_custkey,c_name,
        sum(l_extendedprice * (1 - l_discount)) as revenue,
        c_acctbal,n_name,c_address,c_phone,c_comment
    from
        customer,orders,lineitem,nation
    where
        c_custkey = o_custkey
        and l_orderkey = o_orderkey
        and o_orderdate >= date '1993-05-01'
        and o_orderdate < date '1993-05-01' + interval '3' month
        and l_returnflag = 'R'
        and c_nationkey = n_nationkey
    group by
        c_custkey,
        c_name,
        c_acctbal,
        c_phone,
        n_name,
        c_address,
        c_comment
    order by
        revenue desc
) where rownum <=20;

这是多表关联（其中包括主子表 orders 和 lineitem 关联）过滤后的分组聚合运算，然后按分组聚合值排序后取前 20 条记录。

1. 数据存储

大表 orders、lineitem 按照主键 orderkey 有序存储，连接的时候可以有序归并。其他表没有特殊的排序要求，就按照主键有序存储。

这样，就可以继续使用题目 Q3 中的 orders.ctx、lineitem.ctx、customer.ctx。

nation 表直接使用题目 Q2 的 nation.btx。

将这些表复制到本题的主目录中。

2. 一般实现

	A
1	=now()
2	1993-5-1
3	=elapse@m(A2,3)
4	=file("nation.btx").import@b().keys@i(N_NATIONKEY)
5	=file("customer.ctx").open().cursor@m(C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_COMMENT;C_NATIONKEY:A4).fetch().keys@i(C_CUSTKEY)
6	=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A2 && O_ORDERDATE<A3,O_CUSTKEY:A5)
7	=file("lineitem.ctx").open().news@r(A6,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;L_RETURNFLAG=="R")
8	=A7.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue)
9	=A8.top(-20;revenue)
10	=A9.new(c_custkey.C_CUSTKEY,c_custkey.C_NAME,revenue,c_custkey.C_ACCTBAL,c_custkey.C_NATIONKEY.N_NAME,c_custkey.C_ADDRESS,c_custkey.C_PHONE,c_custkey.C_COMMENT)
11	=A10.sort@z(revenue)
12	=interval@ms(A1,now())

代码中用到了前面讲到过的游标前过滤、主子表有序归并、外键字段属性化等优化方法。

测试结果：

测试项目	执行时间（秒）
一般实现	17

3. 大维表查找

我们发现，这个运算可以改造成两步，第一步把 orders 和 lineitem 表关联，完成针对 custkey 的分组汇总运算，然后再基于这个结果集继续和维表 customer 做关联运算。用 SQL 大致可以改写成这样：

select c_custkey,c_name,revenue,c_acctbal,n_name,c_address,c_phone,c_comment
from (
    select o_custkey, revenue
    from (
    	select o_custkey,
       	    sum(l_extendedprice * (1 - l_discount)) as revenue
    	from
            orders,lineitem
    	where
       	    l_orderkey = o_orderkey
        	    and o_orderdate >= date '1993-05-01'
        	    and o_orderdate < date '1993-05-01' + interval '3' month
        	    and l_returnflag = 'R'
            group by o_custkey
            order by revenue desc
    ) orders_lineitem 
    where rownum <=20
) orders_lineitem20,customer,nation
where
    c_custkey = o_custkey
    and c_nationkey = n_nationkey
order by revenue desc;

因为最终结果只要取 20 条记录，子查询 orders_lineitem20 也只要返回 20 条记录，然后用这 20 条记录和 cutstomer 表做后续的关联计算即可，而原来的方法相当于子查询 orders_lineitem 的每一条记录都要和 customer 表关联。

这是个小事实表 orders_lineitem20 和大维表 customer 连接的场景，可以采用大维表查找的方法，将连接转换为对 customer 表主键的批量查找。customer 表对主键 c_custkey 有序，可以快速的完成查找计算。

计算代码：

	A
1	=now()
2	1993-5-1
3	=elapse@m(A2,3)
4	=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A2 && O_ORDERDATE<A3)
5	=file("lineitem.ctx").open().news@r(A4,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;L_RETURNFLAG=="R")
6	=A5.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue)
7	=A6.top(-20;revenue)
8	=file("nation.btx").import@b().keys@i(N_NATIONKEY)
9	=file("customer.ctx").open()
10	=A7.joinx@q(c_custkey,A9:C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_COMMENT)
11	=A10.switch(C_NATIONKEY,A8:N_NATIONKEY)
12	=A11.new(c_custkey:C_CUSTKEY,C_NAME,revenue,C_ACCTBAL,C_NATIONKEY.N_NAME,C_ADDRESS,C_PHONE,C_COMMENT)
13	=A12.sort@z(revenue)
14	=interval@ms(A1,now())

A7 算出来之后，用 joinx@q 从 customer 表中有序匹配快速地把相关记录取出来，不必将 customer 装入内存，也不必用 orders 表中的大量数据去连接 customer 表，减少数据读取量、计算量。

测试结果：

测试项目	执行时间（秒）
一般实现	17
大维表查找	7

4. 数据变换

数据变换要采用前面题目提到的枚举型字符串字段数字化、维表主键序号化、日期整数化手段。

nation 表，可以直接使用以前的题目 Q2 转换的 nation_2.btx。

customer 使用 Q5 中的 customer_3.ctx。

orders 表使用 Q3 中的 orders_5.ctx。

将这些表复制到本题的主目录中。

Q1 虽然将 lineitem 中的 L_RETURNFLAG 转化成数字了，但 lineitem.ctx 没有主键，不满足有序归并的要求。需要在 Q3 的 lineitem.ctx 基础上做枚举字段数字化转换。

数据转换代码：

	A
1	=file("lineitem.ctx").open()
2	=A1.cursor(L_RETURNFLAG).id(L_RETURNFLAG).sort()
3	=file("l_returnflag.btx").export@b(A2)
4	=A1.cursor().run(L_RETURNFLAG =A2.pos@b(L_RETURNFLAG))
5	=file("lineitem4.ctx").create(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS,L_SHIPDATE,L_COMMITDATE,L_RECEIPTDATE,L_SHIPINSTRUCT,L_SHIPMODE,L_COMMENT)
6	>A5.append(A4)

计算代码：

	A
1	=now()
2	1993-5-1
3	=days@o(elapse@m(A2,3))
4	=days@o(A2)
5	=file("nation_2.btx").import@b()
6	=file("l_returnflag.btx").import@b().(_1=="R")
7	=file("orders_5.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A4 && O_ORDERDATE<A3)
8	=file("lineitem4.ctx").open().news@r(A7,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;A6(L_RETURNFLAG))
9	=A8.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue)
10	=A9.top(-20;revenue)
11	=file("customer.ctx").open()
12	=A10.joinx@q(c_custkey,A11:C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_COMMENT)
13	=A12.new(c_custkey:C_CUSTKEY,C_NAME,revenue,C_ACCTBAL,A5(C_NATIONKEY).N_NAME,C_ADDRESS,C_PHONE,C_COMMENT)
14	=A13.sort@z(revenue)
15	=interval@ms(A1,now())

A3、A4 是日期数字化，A6 是对位序列。

A13 的 nation 表使用了外键字段序号化。

测试结果：

测试项目	执行时间（秒）
一般实现	17
大维表查找	7
数据变换	5

5. 列式计算

	A
1	=now()
2	1993-5-1
3	=days@o(elapse@m(A2,3))
4	=days@o(A2)
5	=file("nation_2.btx").import@b()
6	=file("l_returnflag.btx").import@b().(_1=="R")
7	=file("orders_5.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A4 && O_ORDERDATE<A3)
8	=file("lineitem4.ctx").open().news@r(A7,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;A6(L_RETURNFLAG))
9	=A8.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue)
10	=A9.top(-20;revenue)
11	=file("customer.ctx").open()
12	=A10.joinx@q(c_custkey,A11:C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_COMMENT)
13	=A12.new(c_custkey:C_CUSTKEY,C_NAME,revenue,C_ACCTBAL,A5(C_NATIONKEY).N_NAME,C_ADDRESS,C_PHONE,C_COMMENT)
14	=A13.sort@z(revenue)
15	=interval@ms(A1,now())

测试结果：

测试项目	执行时间（秒）
一般实现	17
大维表查找	7
数据变换	5
列式计算	3

集算器(535)