"select o_year, sum(case when nation = 'CHINA' then volume else 0 end) / sum(volume) as mkt_share .."

terminator 北京
陈虎。真心想用集算器这样的工具降低程序员们的劳动强度，真心想成为高性能复杂计算、大数据计算编程难题的终结者。
3 回帖 • 697 浏览 • 7 个月前

性能优化案例课程 TPCH-Q8

计算＆AI

集算器(524)

select
    o_year,
    sum(case
        when nation = 'CHINA' then volume
        else 0
        end) / sum(volume) as mkt_share
from
    (
        select
            extract(year from o_orderdate) as o_year,
            l_extendedprice * (1 - l_discount) as volume,
            n2.n_name as nation
        from
            part,
            supplier,
            lineitem,
            orders,
            customer,
            nation n1,
            nation n2,
            region
        where
            p_partkey = l_partkey
            and s_suppkey = l_suppkey
            and l_orderkey = o_orderkey
            and o_custkey = c_custkey
            and c_nationkey = n1.n_nationkey
            and n1.n_regionkey = r_regionkey
            and r_name = 'ASIA'
            and s_nationkey = n2.n_nationkey
            and o_orderdate between date '1995-01-01' and date '1996-12-31'
            and p_type = 'STANDARD POLISHED BRASS'
    ) all_nations
group by
    o_year
order by
    o_year;

这就是多表关联过滤后的分组聚合运算，其中包括主子表 orders 和 lineitem 关联。

1. 数据存储

大表 orders、lineitem 按照主键 orderkey 有序存储，连接的时候可以有序归并。其他表没有特殊的排序要求，就按照主键有序存储。

这样，就可以继续使用题目 Q3 中的 orders.ctx、lineitem.ctx、customer.ctx。

part、supplier 表直接使用题目 Q2 的 part.ctx、supplier.ctx。

region、nation 表直接使用题目 Q2 的 regin.btx，nation.btx。

将这些表复制到本题的主目录中。

2. 一般实现

	A	B
1	=now()
2	1995-01-01	1996-12-31
3	>nation="CHINA"	>name="ASIA"
4	>type="STANDARD POLISHED BRASS"
5	=file("region.btx").import@b().select@1(R_NAME==name).R_REGIONKEY
6	=file("nation.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY).keys@i(N_NATIONKEY)
7	=A6.select@1(N_NAME==nation).N_NATIONKEY
8	=A6.select(N_REGIONKEY==A5).derive@o().keys(N_NATIONKEY)
9	=file("supplier.ctx").open().cursor(S_SUPPKEY,S_NATIONKEY;S_NATIONKEY:A6).fetch().keys@im(S_SUPPKEY)
10	=file("part.ctx").open().cursor@m(P_PARTKEY;P_TYPE==type).fetch().keys@im(P_PARTKEY)
11	=file("customer.ctx").open().cursor@m(C_CUSTKEY;A8.find(C_NATIONKEY)).fetch().keys@im(C_CUSTKEY)
12	=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE;O_ORDERDATE>=A2 && O_ORDERDATE <=B2 && A11.find(O_CUSTKEY))
13	=file("lineitem.ctx").open().news(A12,L_ORDERKEY,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE;A10.find(L_PARTKEY),L_SUPPKEY:A9)
14	=A13.run(L_EXTENDEDPRICE*=(1-L_DISCOUNT))
15	=A14.groups(year(O_ORDERDATE):o_year;sum(if(L_SUPPKEY.S_NATIONKEY.N_NAME==nation,L_EXTENDEDPRICE,0)):s1,sum(L_EXTENDEDPRICE):s2)
16	=A15.new(o_year,s1/s2:mkt_share)
17	=interval@ms(A1,now())

这里使用了前面题目中说过的游标建立时过滤、主子表有序关联和将关联字段转换成外键表指针的技巧。

不过，A15 中的 L_SUPPKEY.S_NATIONKEY.N_NAME 并不是前面讲过，在 join 或者 where 中出现的维表过滤，而是出现在条件求和表达式中。这里的实现方式是 A9 中 supplier 表连接 nation 表，A13 中 lineitem 表连接 supplier 表，在 A15 中 sum(if) 的时候，外键字段作为条件出现。

测试项目	执行时间（秒）
一般实现	18

3. 维表过滤用在计算式中

上一节，A15 中的 L_SUPPKEY.S_NATIONKEY.N_NAME 关联的表比较多，而 sum(case) 的本质是要根据 l_suppkey 是否符合条件，对 volume 做有条件求和。

对于这个计算需求，如果先过滤 supplier，得到的 s_suppkey 集合作为 l_suppkey 的条件，实现有条件的求和。这样可以避免 supplier 参与后面的关联计算。

用 SQL 写出来是这样的：

select
    o_year,
    sum(case
        when l_suppkey in(
			        select s_suppkey
				from supplier,nation
				where 
                                   s_nationkey=n_nationkey
                                   and n_name='CHINA'
                          ) 
        then volume
        else 0
        end) / sum(volume) as mkt_share
from
    (
        select
            extract(year from o_orderdate) as o_year,
            l_extendedprice * (1 - l_discount) as volume,
            l_suppkey
        from
            part,
            lineitem,
            orders,
            customer,
            nation n1, 
            region
        where
            p_partkey = l_partkey
            and s_suppkey = l_suppkey
            and l_orderkey = o_orderkey
            and o_custkey = c_custkey
            and c_nationkey = n1.n_nationkey
            and n1.n_regionkey = r_regionkey
            and r_name = 'ASIA'
            and o_orderdate between date '1995-01-01' and date '1996-12-31'
            and p_type = 'STANDARD POLISHED BRASS'
    ) all_nations
group by
    o_year
order by
    o_year;

改写后，sum(case) 的条件变成了一个 in 子句，其中的子查询是个有 join 和 where 的语句，SPL 中可以再用维表过滤的办法优化。这样，后面关联表中就可以没有 supplier 及其维表 nation(n2) 了。

计算代码：

	A	B
1	=now()
2	1995-01-01	1996-12-31
3	>nation="CHINA"	>name="ASIA"
4	>type="STANDARD POLISHED BRASS"
5	=file("region.btx").import@b().select@1(R_NAME==name).R_REGIONKEY
6	=file("nation.btx").import@b()
7	=A6.select@1(N_NAME==nation).N_NATIONKEY
8	=A6.select(N_REGIONKEY==A5).derive@o().keys(N_NATIONKEY)
9	=file("supplier.ctx").open().cursor@m(S_SUPPKEY;S_NATIONKEY==A7).fetch().keys@im(S_SUPPKEY)
10	=file("part.ctx").open().cursor@m(P_PARTKEY;P_TYPE==type).fetch().keys@im(P_PARTKEY)
11	=file("customer.ctx").open().cursor@m(C_CUSTKEY;A8.find(C_NATIONKEY)).fetch().keys@im(C_CUSTKEY)
12	=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE;O_ORDERDATE>=A2 && O_ORDERDATE <=B2 && A11.find(O_CUSTKEY))
13	=file("lineitem.ctx").open().news(A12,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE;A10.find(L_PARTKEY))
14	=A13.run(L_EXTENDEDPRICE*=(1-L_DISCOUNT))
15	=A14.groups(year(O_ORDERDATE):o_year;sum(if(A9.find(L_SUPPKEY),L_EXTENDEDPRICE,0)):s1,sum(L_EXTENDEDPRICE):s2)
16	=A15.new(o_year,s1/s2:mkt_share)
17	=interval@ms(A1,now())

A15 采用了前面说的计算方法，用 L_SUPPKEY 查找 A9 的序表主键 S_SUPPKEY，作为求和的条件。这样，A9 中 supplier 不需要连接 nation，A13 中 lineitem 也不再关联 supplier 了。

测试结果：

测试项目	执行时间（秒）
一般实现	18
维表过滤用在计算式中	16

4. 数据变换

数据变换要采用前面题目提到的维表主键序号化、日期整数化手段。

region、nation、supplier、part 表，可以直接使用以前的题目 Q2 转换的 region_2.btx，nation_2.btx、supplier_2.ctx、part_2.ctx。

part 表中的枚举型字符串字段 type 要完成数字化，可以用 Q2 的 p_type.btx。

customer 使用 Q5 中的 customer_3.ctx。

orders 表使用 Q3 中的 orders_5.ctx。

将这些表复制到本题的主目录中。

数据转换代码：

	A
1	=file("part.ctx").open().import(P_PARTKEY).keys@im(P_PARTKEY)
2	=file("lineitem.ctx").open().cursor()
3	=A2.run(L_PARTKEY=A1.pfind(L_PARTKEY))
4	=file("lineitem_4.ctx").create@p(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT)
5	>A4.append(A3)

计算代码：

	A	B
1	=now()
2	=days@o(date("1995-01-01"))	=days@o(date("1996-12-31"))
3	>nationName="CHINA"	>name="ASIA"
4	=p_type=file("p_type.btx").import@b().(_1)
5	=nation= file("nation_2.btx").import@b()
6	>type=p_type.pos@b("STANDARD POLISHED BRASS")
7	=file("region_2.btx").import@b().pselect(R_NAME==name)
8	=nation.pselect(N_NAME==nationName)	=nation.(N_REGIONKEY==A7)
9	=file("supplier_2.ctx").open()	=A9.cursor@m().skip().(false)
10	=A9.cursor@m(S_SUPPKEY;S_NATIONKEY==A8).fetch().(B9(S_SUPPKEY)=true)
11	=file("part_2.ctx").open()	=A11.cursor@m().skip().(false)
12	=A11.cursor@m(P_PARTKEY;P_TYPE==type).fetch().(B11(P_PARTKEY)=true)
13	=file("customer_3.ctx").open()	=A13.cursor@m().skip().(false)
14	=A13.cursor@m(C_CUSTKEY;B8(C_NATIONKEY)).fetch().(B13(C_CUSTKEY)=true)
15	=file("orders_5.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE;O_ORDERDATE>=A2 && O_ORDERDATE <=B2 && B13(O_CUSTKEY))
16	=file("lineitem_4.ctx").open().news(A15,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE;B11(L_PARTKEY))
17	=A16.run(L_EXTENDEDPRICE*=(1-L_DISCOUNT))
18	=A17.groups(year(O_ORDERDATE):o_year;sum(if(B9(L_SUPPKEY),L_EXTENDEDPRICE,0)):s1,sum(L_EXTENDEDPRICE):s2)
19	=A18.new(o_year,s1/s2:mkt_share)
20	=interval@ms(A1,now())

B8、B9、B11、B13 是对位序列。

A18 中采用对位序列 B9 来判断 L_SUPPKEY 是否符合求和条件，性能比前面的主键查找更好。

测试结果：

测试项目	执行时间（秒）
一般实现	18
维表过滤用在计算式中	16
数据变换	10

5. 列式计算

	A	B
1	=now()
2	=days@o(date("1995-01-01"))	=days@o(date("1996-12-31"))
3	>nationName="CHINA"	>name="ASIA"
4	=p_type=file("p_type.btx").import@b().(_1)
5	=nation= file("nation_2.btx").import@b()
6	>type=p_type.pos@b("STANDARD POLISHED BRASS")
7	=file("region_2.btx").import@b().pselect(R_NAME==name)
8	=nation.pselect(N_NAME==nationName)	=nation.(N_REGIONKEY==A7)
9	=file("supplier_2.ctx").open()	=A9.cursor@m().skip().(false)
10	=A9.cursor@mv(S_SUPPKEY;S_NATIONKEY==A8).fetch().(B9(S_SUPPKEY)=true)
11	=file("part_2.ctx").open()	=A11.cursor@m().skip().(false)
12	=A11.cursor@mv(P_PARTKEY;P_TYPE==type).fetch().(B11(P_PARTKEY)=true)
13	=file("customer_3.ctx").open()	=A13.cursor@m().skip().(false)
14	=A13.cursor@mv(C_CUSTKEY;B8(C_NATIONKEY)).fetch().(B13(C_CUSTKEY)=true)
15	=file("orders_5.ctx").open().cursor@mv(O_ORDERKEY,O_ORDERDATE;O_ORDERDATE>=A2 && O_ORDERDATE<=B2 && B13(O_CUSTKEY))
	=file("lineitem_4.ctx").open().cursor@v(L_ORDERKEY,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT;B11(L_PARTKEY);A15)
16	=A16.joinx@im(L_ORDERKEY,A15:O_ORDERKEY,O_ORDERDATE)
17	=A17.derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):dp)
18	=A18.groups(year(O_ORDERDATE):o_year;sum(if(B9(L_SUPPKEY),dp,0)):s1,sum(dp):s2)
19	=A19.new(o_year,s1/s2:mkt_share)
20	=interval@ms(A1,now())
21	=now()

这里 orders 与 lineitem 主子关联时没有使用 news，而是同步游标后用 joinx@im 进行有序归并连接。当然继续使用 news 也是可以的，实测中用 joinx 要略快一点。

A16 中用 derive@o 新增一列 dp，避免在 A18 中会 2 次计算 dp。在列式计算中，要尽量避免赋值操作，A16 中使用 derive 定义新的计算列，而不再使用 run 给原有列赋值。

测试结果：

测试项目	执行时间（秒）
一般实现	18
维表过滤用在计算式中	16
数据变换	10
列式计算	4

集算器(524)

3 回帖

请输入回帖内容 ...

terminator • 5 个月前

英文版

JohnBian • 5 个月前 via Mac OS

陈虎老师，为啥很少看到用新关联计算函数 (fjoin，pjoin) 来写 SPL 语句, 这个系列基本上都是用 new 和 news 来解决。除了书写上语句的简洁之外，效率上也会有明显优势吗？比如此文第 2 部分中”一般实现“时的 A13 格子，目前用 news 这样写无疑是简洁高效的，但 A13 也可以改成与 A12 同步的组表游标，然后用 pjoin 进行关联，pjoin 在 A13、A12 都是组表游标的前提下也能跳块提速，就是多了一步生成组表游标，把 A13 拆成两步，其余步骤都一样，我测了一下结果是一样的，效率也可喜：😄

	A
13	=file(“lineitem.ctx”).open().cursor@m(L_ORDERKEY,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT;A10.find(L_PARTKEY),L_SUPPKEY:A9;A12)
14	=A13.pjoin@r(L_ORDERKEY,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT;A12,O_ORDERKEY,O_ORDERDATE)

1 回复

terminator • 5 个月前

这个是基础课程，不讲这些新方法，有兴趣的同学可以自己练。