sjr, Beijing

SPL Computing Performance Test Series: Multi-Indicator Statistics

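I. Test task

In modern business intelligence, an indicator analysis page often presents many statistical indicators at once. Most of them are computed from the same data set (a wide table, for example) under the same filter conditions, and each is the result of summarizing (grouping and aggregating) the measure fields of interest by one of many different dimensions, possibly dozens.

Based on the wide table described in "SPL Computing Performance Test Series: Associated Tables and Wide Table", we test the performance of computing 1, 2, and 3 indicators at the same time. SPL has a traversal reuse mechanism that can compute multiple grouped aggregations in a single traversal, so computing N indicators usually takes much less than N times as long as computing one.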
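The idea behind traversal reuse can be illustrated outside SPL. The following is a minimal Python sketch, not SPL's actual implementation: the tiny `ROWS` list, the column positions, and both function names are assumptions made up for illustration. It contrasts scanning the data once per indicator with scanning it once for all indicators.

```python
from collections import defaultdict

# Hypothetical miniature of the wide table:
# (s_nationname, c_nationname, p_type, l_extendedprice, l_discount)
ROWS = [
    ("CHINA", "FRANCE", "STEEL", 100.0, 0.10),
    ("CHINA", "CHINA", "BRASS", 200.0, 0.00),
    ("FRANCE", "CHINA", "STEEL", 50.0, 0.20),
]


def multi_pass(rows, key_indexes):
    """One full scan per indicator: N indicators cost N traversals."""
    results = []
    for k in key_indexes:
        totals = defaultdict(float)
        for row in rows:  # the data set is re-read for every indicator
            totals[row[k]] += row[3] * (1 - row[4])
        results.append(dict(totals))
    return results


def single_pass(rows, key_indexes):
    """One shared scan: every row updates all N group tables."""
    results = [defaultdict(float) for _ in key_indexes]
    for row in rows:  # the data set is read only once
        volume = row[3] * (1 - row[4])  # computed once, reused by all indicators
        for totals, k in zip(results, key_indexes):
            totals[row[k]] += volume
    return [dict(t) for t in results]
```

Both functions return the same grouped sums. The difference is that `single_pass` reads each row once regardless of how many indicators are requested, which matters most when the scan itself (disk reads, decompression, filtering) dominates the cost.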

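II. Compared technologies

On the SPL side, only the Enterprise Edition (version 20230528) was tested. The following two products were chosen for comparison:

1. ClickHouse 23.3.1, reputed to be the fastest OLAP database in the world

2. StarRocks 3.0.0, which claims to be an even faster OLAP database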

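III. Test environment

A single physical server, configured with:

2 Intel3014 CPUs, clocked at 1.7 GHz, 12 cores in total

64 GB of memory

SSD (solid-state drive)

To test the external-storage computing ability of these products and their sensitivity to memory, we used virtual machines to limit the number of CPUs and the amount of memory. Following cloud VM specifications that are common in the industry, we designed two test environments:

VM1: 8 CPUs, 32 GB of memory

VM2: 4 CPUs, 16 GB of memory

StarRocks requires at least two nodes, a BE and an FE. The BE, which carries the computing load, was installed on the virtual machine, and the FE, the management node, was installed on the physical machine, so that the test results are not affected.

SPL and ClickHouse only need to be installed on the virtual machine.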

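IV. Test process

1. One indicator

This is the same operation as the one run on the wide table in "SPL Computing Performance Test Series: Associated Tables and Wide Table".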

select
    s_nationname,
    sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
    s_comment not like '%xxx%yyy%'
    and o_totalprice>5
    and length(p_type) > 2
    and c_nationname is not null
    and s_nationname is not null
    and c_phone is not null
group by s_nationname

SPL script:

	A
1	=now()
2	=file("widetable.ctx").open().cursor@mv(S_NATIONNAME,L_EXTENDEDPRICE,L_DISCOUNT;O_TOTALPRICE>5 && C_NATIONNAME!=null && C_PHONE!= null && S_NATIONNAME!=null && len(P_TYPE)>2 && !like(S_COMMENT,"*xxx*yyy*"))
3	=A2.groups(S_NATIONNAME;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):volume)
4	=interval@ms(A1,now())

2. Two indicators

select
    s_nationname,
    sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
    s_comment not like '%xxx%yyy%'
    and o_totalprice>5
    and length(p_type) > 2
    and c_nationname is not null
    and s_nationname is not null
    and c_phone is not null
group by s_nationname
union all
select
    c_nationname,
    sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
    s_comment not like '%xxx%yyy%'
    and o_totalprice>5
    and length(p_type) > 2
    and c_nationname is not null
    and s_nationname is not null
    and c_phone is not null
group by c_nationname

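Here the result sets of the indicators are simply combined with union all so that they can be returned in one go; a real application would return them separately. This adjustment does not affect the performance test.

SPL script: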

	A	B
1	=now()
2	=file("widetable.ctx").open().cursor@mv(S_NATIONNAME,C_NATIONNAME,L_EXTENDEDPRICE,L_DISCOUNT;O_TOTALPRICE>5 && C_NATIONNAME!=null && C_PHONE!= null && S_NATIONNAME!=null && len(P_TYPE)>2 && !like(S_COMMENT,"*xxx*yyy*")).derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):volume)
3	cursor A2	=A3.groups@u(S_NATIONNAME:gid;sum(volume):volume)
4	cursor	=A4.groups@u(C_NATIONNAME:gid;sum(volume):volume)
5	=A3|A4
6	=interval@ms(A1,now())

3. Three indicators

select
    s_nationname,
    sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
    s_comment not like '%xxx%yyy%'
    and o_totalprice>5
    and length(p_type) > 2
    and c_nationname is not null
    and s_nationname is not null
    and c_phone is not null
group by s_nationname
union all
select
    c_nationname,
    sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
    s_comment not like '%xxx%yyy%'
    and o_totalprice>5
    and length(p_type) > 2
    and c_nationname is not null
    and s_nationname is not null
    and c_phone is not null
group by c_nationname
union all
select
    p_type,
    sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
    s_comment not like '%xxx%yyy%'
    and o_totalprice>5
    and length(p_type) > 2
    and c_nationname is not null
    and s_nationname is not null
    and c_phone is not null
group by p_type

SPL script:

	A	B
1	=now()
2	=file("widetable.ctx").open().cursor@mv(S_NATIONNAME,C_NATIONNAME,P_TYPE,L_EXTENDEDPRICE,L_DISCOUNT;O_TOTALPRICE>5 && C_NATIONNAME!=null && C_PHONE!= null && S_NATIONNAME!=null && len(P_TYPE)>2 && !like(S_COMMENT,"*xxx*yyy*")).derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):volume)
3	cursor A2	=A3.groups@u(S_NATIONNAME:gid;sum(volume):volume)
4	cursor	=A4.groups@u(C_NATIONNAME:gid;sum(volume):volume)
5	cursor	=A5.groups@u(P_TYPE:gid;sum(volume):volume)
6	=A3|A4|A5
7	=interval@ms(A1,now())

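4. Three indicators after joins in SPL

The test in "SPL Computing Performance Test Series: Associated Tables and Wide Table" showed that SPL performs well on joins, so we ran one more test that joins the tables on the fly and then computes the three indicators.

SPL script: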

	A	B
1	=now()
2	=file("nation.btx").import@bv(N_NAME).(if(N_NAME,N_NAME,null))
3	=file("customer.ctx").open().import@mv(C_NATIONKEY,C_PHONE).(if(A2(C_NATIONKEY) && C_PHONE,C_NATIONKEY,null))
4	=file("supplier.ctx").open().import@mv(S_NATIONKEY,S_COMMENT).(if(A2(S_NATIONKEY) && !like(S_COMMENT,"*xxx*yyy*"),S_NATIONKEY,null))
5	=file("part.ctx").open().import@mv(P_TYPE).(if(len(P_TYPE)>2,P_TYPE,null))
6	=file("orders.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;A3(O_CUSTKEY) && O_TOTALPRICE>5)
7	=file("lineitem.ctx").open().news(A6,L_SUPPKEY,L_PARTKEY,O_CUSTKEY,L_EXTENDEDPRICE,L_DISCOUNT;A5(L_PARTKEY) && A4(L_SUPPKEY)).derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):volume)
8	cursor A7	=A8.groups@u(A2(A4(L_SUPPKEY)):gid;sum(volume):volume)
9	cursor	=A9.groups@u(A2(A3(O_CUSTKEY)):gid;sum(volume):volume)
10	cursor	=A10.groups@u(A5(L_PARTKEY):gid;sum(volume):volume)
11	=A8|A9|A10
12	=interval@ms(A1,now())
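The general shape of this script, filtering the small dimension tables first and then resolving keys while the fact table is traversed, can be sketched in Python. This is only an illustration under stated assumptions: the sample data, the dictionary-based lookups, and the function names are hypothetical and do not reproduce how SPL stores or looks up these tables.

```python
import re

# Hypothetical miniature dimension and fact data (not the benchmark's tables).
NATION = {1: "CHINA", 2: "FRANCE"}                      # nationkey -> name
SUPPLIER = {10: (1, "fast delivery"),                   # suppkey -> (nationkey, comment)
            11: (2, "xxx complaints yyy pending")}
LINEITEM = [(10, 100.0, 0.1), (11, 50.0, 0.0), (10, 20.0, 0.5)]  # (suppkey, price, discount)


def filtered_supplier_nations(suppliers, nations):
    """Apply the supplier filter once on the small table and keep only
    suppkey -> nation name for the suppliers that pass."""
    # Same meaning as SQL: s_comment like '%xxx%yyy%'
    bad_comment = re.compile(r"xxx.*yyy", re.DOTALL)
    return {
        suppkey: nations[nationkey]
        for suppkey, (nationkey, comment) in suppliers.items()
        if nationkey in nations and not bad_comment.search(comment)
    }


def volume_by_supplier_nation(lineitems, supp_nation):
    """Single pass over the fact rows: a failed lookup drops the row,
    a successful one yields the grouping key directly."""
    totals = {}
    for suppkey, price, discount in lineitems:
        nation = supp_nation.get(suppkey)
        if nation is None:
            continue  # supplier was filtered out, so the row is skipped
        totals[nation] = totals.get(nation, 0.0) + price * (1 - discount)
    return totals
```

For example, `volume_by_supplier_nation(LINEITEM, filtered_supplier_nations(SUPPLIER, NATION))` keeps only the rows of supplier 10. The same lookup serves as both the filter and the source of the grouping key, so the wide comment and name columns never have to be read for every fact row.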

V. Test results

Unit of time: seconds

	VM1			VM2
Number of indicators	1	2	3	1	2	3
SPL	57.7	61.6	64.6	114.2	119.5	124.1
StarRocks	62.2	104.6	156.2	135.7	253.6	402.6
ClickHouse	34.7	69.0	106.4	77.4	156.0	249.6
SPL with joins			49.5			100.5
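To see how each product scales with the number of indicators, the timings in the table can be divided by the corresponding 1-indicator time. The short Python sketch below does only that arithmetic; the `TIMES` dictionary and the `scaling` helper are names introduced here for illustration, and the numbers are copied from the table.

```python
# Timings in seconds for 1, 2 and 3 indicators, copied from the results table.
TIMES = {
    ("SPL", "VM1"): [57.7, 61.6, 64.6],
    ("SPL", "VM2"): [114.2, 119.5, 124.1],
    ("StarRocks", "VM1"): [62.2, 104.6, 156.2],
    ("StarRocks", "VM2"): [135.7, 253.6, 402.6],
    ("ClickHouse", "VM1"): [34.7, 69.0, 106.4],
    ("ClickHouse", "VM2"): [77.4, 156.0, 249.6],
}


def scaling(times):
    """Time for N indicators relative to the time for 1 indicator."""
    return [round(t / times[0], 2) for t in times]


if __name__ == "__main__":
    for (product, vm), times in TIMES.items():
        print(f"{product:<10} {vm}: {scaling(times)}")
```

On VM1 this gives 1.07 and 1.12 for SPL at 2 and 3 indicators, 1.68 and 2.51 for StarRocks, and 1.99 and 3.07 for ClickHouse.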

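VI. Comments on the results

1. SPL's traversal reuse is clearly effective: as the number of indicators grows, the computing time increases only slightly, because the data set is not traversed repeatedly.

2. Neither SQL product performs this optimization. Computing N indicators takes close to N times as long as computing one, which suggests that the data set is most likely traversed once per indicator.

3. With 1 indicator, SPL traverses the wide table more slowly than ClickHouse. With traversal reuse, however, it overtakes ClickHouse at 2 indicators and leads by a wider margin at 3. Real business scenarios often compute a dozen or even dozens of indicators together, where the gap would be much larger.

4. Memory sensitivity is basically consistent with the results in "SPL Computing Performance Test Series: Associated Tables and Wide Table". SPL is less sensitive to memory, while both SQL products are fairly sensitive to it: their performance drops by more than the reduction in CPU cores, which shows that they are also affected by the reduction in memory.

5. Test conclusion:

SPL, which provides a traversal reuse optimization, is better suited to multi-indicator statistics than SQL-based products, and its advantage becomes very large when the number of indicators is high.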
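The memory-sensitivity observation in point 4 can be checked with simple arithmetic: moving from VM1 to VM2 halves the CPU count (8 to 4), so a slowdown factor above 2 points to something other than CPU, such as the smaller memory. The Python sketch below computes that factor from the published timings; the variable and function names are assumptions introduced here.

```python
# (VM1 seconds, VM2 seconds) for 1, 2 and 3 indicators, from the results table.
VM_TIMES = {
    "SPL": [(57.7, 114.2), (61.6, 119.5), (64.6, 124.1)],
    "StarRocks": [(62.2, 135.7), (104.6, 253.6), (156.2, 402.6)],
    "ClickHouse": [(34.7, 77.4), (69.0, 156.0), (106.4, 249.6)],
}

CPU_REDUCTION = 8 / 4  # VM1 has 8 CPUs, VM2 has 4


def slowdown(pairs):
    """VM2 time divided by VM1 time for each indicator count."""
    return [round(vm2 / vm1, 2) for vm1, vm2 in pairs]


if __name__ == "__main__":
    for product, pairs in VM_TIMES.items():
        factors = slowdown(pairs)
        beyond_cpu = all(f > CPU_REDUCTION for f in factors)
        print(f"{product:<10} {factors} slower than the CPU cut alone: {beyond_cpu}")
```

SPL's factors come out at 1.98, 1.94 and 1.92, all below 2, while StarRocks (2.18, 2.42, 2.58) and ClickHouse (2.23, 2.26, 2.35) are all above 2.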
