"select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue from customer, orders, lineite .."

terminator 北京
陈虎。真心想用集算器这样的工具降低程序员们的劳动强度，真心想成为高性能复杂计算、大数据计算编程难题的终结者。
1 回帖 • 516 浏览 • 5 个月前

性能优化案例课程 TPCH-Q5

计算＆AI

集算器(507)

select
	n_name,
	sum(l_extendedprice * (1 - l_discount)) as revenue
from
	customer,
	orders,
	lineitem,
	supplier,
	nation,
	region
where
	c_custkey = o_custkey
	and l_orderkey = o_orderkey
	and l_suppkey = s_suppkey
	and c_nationkey = s_nationkey
	and s_nationkey = n_nationkey
	and n_regionkey = r_regionkey
	and r_name = 'ASIA'
	and o_orderdate >= date '1995-01-01'
	and o_orderdate < date '1995-01-01' + interval '1' year
group by
	n_name
order by
revenue desc;

这是多表关联过滤后的分组聚合运算，其中包括主子表 orders 和 lineitem 关联。

1. 数据存储

大表 orders、lineitem 按照主键 orderkey 有序存储，连接的时候可以有序归并。其他表没有特殊的排序要求，就按照主键有序存储。

这样，就可以继续使用题目 Q3 中的 orders.ctx、lineitem.ctx、customer.ctx。

region、nation、supplier 表直接使用题目 Q2 的 regin.btx，nation.btx 和 supplier.ctx。

将这些表复制到本题的主目录中。

2. 一般实现

计算代码：

	A	B
1	=now()
2	1995-1-1	>name="ASIA"
3	=elapse@y(A2,1)
4	=file("region.btx").import@b().select(R_NAME==name)
5	=file("nation.btx").import@b().switch@i(N_REGIONKEY,A4:R_REGIONKEY).keys@i(N_NATIONKEY)
6	=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NATIONKEY;A5.find(S_NATIONKEY)).fetch().keys@im(S_SUPPKEY)
7	=file("customer.ctx").open().cursor@m(C_CUSTKEY,C_NATIONKEY;A5.find(C_NATIONKEY)).fetch().keys@im(C_CUSTKEY)
8	=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A2 && O_ORDERDATE < A3,O_CUSTKEY:A7)
9	=file("lineitem.ctx").open().news(A8,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_CUSTKEY;L_SUPPKEY:A6)
10	=A9.select(O_CUSTKEY.C_NATIONKEY==L_SUPPKEY.S_NATIONKEY)
11	=A10.groups@u(L_SUPPKEY.S_NATIONKEY:N_NAME;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue)
12	=A11.sort@z(revenue).run(N_NAME=A5.find(N_NAME).N_NAME)

这里使用了前面题目中说过的游标建立时过滤、主子表有序关联和将关联字段转换成外键表指针的技巧。

与 Q3 不大相同的是，最后用于分组的字段不是有序的 L_ORDERKEY，所以不能使用 group@s。这里用 groups@u，结果不再按照分组字段有序，性能比不带 @u 会好一些。

A11 中用 S_NATIONKEY 代替 N_NAME 进行分组，因为整数比较比字符串比较效率高，分组结束后再把它替换成 N_NAME。

测试结果：

测试项目	执行时间（秒）
一般实现	18

3. 数据变换

数据变换要采用前面题目提到的维表主键序号化、枚举型字符串字段数字化、日期整数化手段。

region、nation、supplier 表，可以直接使用以前的题目 Q2 转换的 region_2.btx，nation_2.btx 和 supplier_2.ctx。

orders 表，可以使用 Q3 中的 orders_5.ctx。

将这些表复制到本题的主目录中。

customer 转换代码：

	A
1	=file("customer.ctx").open().import()
2	=file("nation.btx").import@b()
3	=A1.run(C_CUSTKEY=#,C_NATIONKEY=A2.pselect(N_NATIONKEY==C_NATIONKEY))
4	=file("customer_3.ctx").create(#C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL, C_MKTSEGMENT, C_COMMENT)
5	>A4.append(A3.cursor())

这里使用 Q3 的 customer.ctx 和 Q2 中的 nation.btx。

lineitem 转换代码：

	A
1	=file("supplier.ctx").open().import().keys@i(S_SUPPKEY)
2	=file("lineitem.ctx").open().cursor()
3	=A2.run(L_SUPPKEY=A1.pfind(L_SUPPKEY))
4	=file("lineitem_3.ctx").create@p(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT)
5	>A4.append(A3)

这里使用 Q2 中的 supplier.ctx 和 Q3 中的 lineitem.ctx。

计算代码：

	A	B
1	=now()
2	1995-1-1	>name="ASIA"
3	=days@o(elapse@y(A2,1))	=days@o(A2)
4	=file("region_2.btx").import@b(R_NAME).(R_NAME==name)
5	=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY)
6	=A5.(A4(N_REGIONKEY))
7	=file("supplier_2.ctx").open()	=A7.cursor@m().skip().(null)
8	=A7.cursor@m(S_SUPPKEY,S_NATIONKEY;A6(S_NATIONKEY)).fetch().(B7(S_SUPPKEY)=S_NATIONKEY)
9	=file("customer_3.ctx").open()	=A9.cursor@m().skip().(null)
10	=A9.cursor@m(C_CUSTKEY,C_NATIONKEY;A6(C_NATIONKEY)).fetch().(B9(C_CUSTKEY)=C_NATIONKEY)
11	=file("orders_5.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=B3 && O_ORDERDATE<A3,O_CUSTKEY:B9:#)
12	=file("lineitem_3.ctx").open().news(A11,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_CUSTKEY;L_SUPPKEY:B7:#)
13	=A12.select(O_CUSTKEY ==L_SUPPKEY)
14	=A13.groups@u(L_SUPPKEY:N_NAME;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue)
15	=A14.sort@z(revenue).run(N_NAME=A5(N_NAME).N_NAME)
16	=interval@ms(A1,now())

B7、B9 是对位序列，成员是 true 或者 false。

A11、A12 把 order 的 O_CUSTKEY 赋值为 C_NATIONKEY，把 lineitem 表的 L_SUPPKEY 复制为 S_NATIONKEY。A13 中的 O_CUSTKEY ==L_SUPPKEY，实际上是 C_NATIONKEY 和 S_NATIONKEY 的比较。

测试结果：

测试项目	执行时间（秒）
一般实现	18
数据变换	10

4. 列式计算

	A	B
1	=now()
2	1995-1-1	>name="ASIA"
3	=days@o(elapse@y(A2,1))	=days@o(A2)
4	=file("region_2.btx").import@b(R_NAME).(R_NAME==name)
5	=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY)
6	=A5.(A4(N_REGIONKEY))
7	=file("supplier_2.ctx").open()	=A7.cursor@m().skip().(null)
8	=A7.cursor@mv(S_SUPPKEY,S_NATIONKEY;A6(S_NATIONKEY)).fetch().(B7(S_SUPPKEY)=~)
9	=file("customer_3.ctx").open()	=A9.cursor@m().skip().(null)
10	=A9.cursor@mv(C_CUSTKEY,C_NATIONKEY;A6(C_NATIONKEY)).fetch().(B9(C_CUSTKEY)=~)
11	=file("orders_5.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=B3 && O_ORDERDATE<A3 && B9(O_CUSTKEY))
12	=file("lineitem_3.ctx").open().news(A11,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_CUSTKEY;B7(L_SUPPKEY))
13	=A12.derive@o(B7(L_SUPPKEY).S_NATIONKEY,B9(O_CUSTKEY).C_NATIONKEY).select@v(C_NATIONKEY==S_NATIONKEY)
14	=A13.groups@u(S_NATIONKEY;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue)
15	=A14.sort@z(revenue).new(A5(S_NATIONKEY).N_NAME,revenue)
16	=interval@ms(A1,now())

在列式计算中，要尽量避免使用 switch 操作。A11 中使用 B9(O_CUSTKEY) 仅仅过滤数据，而不再使用 O_CUSTKEY:B9:# 这种关联与过滤同时进行的方法。因为这里 O_CUSTKEY 是序号化的，可以直接在 B9 中通过行号获取 customer 记录而无需关联，可以更好地体现列式计算的优势。

A13 中先用 O_CUSTKEY 在 B9 中的 customer 数据找到对应的记录，然后再取得 S_NATIONKEY 的值。为了达到这个目的，在 B9 中计算出的对位序列就不再是 true 或者 false 了，满足条件的是 customer 记录，不满足条件的是 null。

对于 lineitem 和 supplier 的关联，也是和 customer 类似的处理。

A13 中在对列式游标执行 select 过滤时，加选项 @v 使过滤后的游标仍然是列式游标。

测试结果：

测试项目	执行时间（秒）
一般实现	18
数据变换	10
列式计算	4

集算器(507)