"一、测试任务多维分析中常见的事实表和多个及多层维表的关联后按维度的汇总统计，以及宽表按维度的统计。基于TPCH 100G数据集，设计大事实表和多个维表关联的运算： 1. 一个事实表和一个 .."

sjr 北京
乾学院 32 号会员
2 回帖 • 1,056 浏览 • 1 年前

SPL 计算性能系列测试：关联表及宽表

计算＆AI

性能对比(29) 关联(4) 宽表(7)

一、测试任务

多维分析中常见的事实表和多个及多层维表的关联后按维度的汇总统计，以及宽表按维度的统计。

基于TPCH 100G数据集，设计大事实表和多个维表关联的运算：

1. 一个事实表和一个维表关联，共两表关联运算

2. 主子事实表和四个维表关联，其中一个维表被使用两次，共七表关联运算

3. 将2中的七表关联拼成宽表，基于宽表的统计运算。

二、对比技术

这里仅对SPL企业版（20230528版）进行了测试。同时选择了下面两款产品用作对比技术

1. Clickhouse 23.3.1, 传说中世界上最快的OLAP数据库

2. Starrocks 3.0.0, 宣称更快的OLAP数据库

三、测试环境

单台物理服务器，配置：

2颗Intel3014，主频1.7G，共12核CPU

64G内存

SSD固态硬盘

TPCH 100G中最大表也就约70G，简单压缩后就很可能小于机器的物理内存。为了能测试出这些产品的外存计算能力以及对内存的敏感性，我们使用了虚拟机来限制CPU数量和内存，根据业界较常见的云虚拟机规格，我们设计了两套测试环境：

VM1：8CPU，32G内存

VM2：4CPU，16G内存

Starrocks至少要安装两个节点BE和FE，将承担计算任务的BE安装在虚拟机上，管理节点FE安装在物理机上，这样不会影响测试效果。

SPL、Clickhouse都只要在虚拟机下安装就可以了。

测试数据的准备请参考SPL 计算性能系列测试—TPCH。

四、测试过程

1. 两表关联

查询的SQL语句如下：

select
    l_year,
    sum(volume) as revenue
from
    (
        select
            extract(year from l_shipdate) as l_year,
            ( l_extendedprice * (1 - l_discount) ) as volume
        from
            lineitem,
            part
        where
            p_partkey = l_partkey
            and length(p_type)>2
    ) shipping
group by
    l_year
order by
    l_year;

SPL脚本：

	A
1	=now()
2	=file("part.ctx").open().import@mv(P_TYPE).(len(P_TYPE)>2)
3	=file("lineitem.ctx").open().cursor@mv(L_SHIPDATE,L_EXTENDEDPRICE,L_DISCOUNT;A2(L_PARTKEY))
4	=A3.groups(year(L_SHIPDATE):l_year;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue)
5	=interval@ms(A1,now())

2. 七表关联

查询的SQL语句如下：

select
    s_nationname,
    sum(volume) as revenue
from
    (
        select
            n1.n_name as s_nationname,
            ( l_extendedprice * (1 - l_discount) ) as volume
        from
            supplier,
            lineitem,
            orders,
            customer,
            part,
            nation n1,
            nation n2
        where
            s_suppkey = l_suppkey
            and p_partkey = l_partkey
            and o_orderkey = l_orderkey
            and c_custkey = o_custkey
            and s_nationkey = n1.n_nationkey
            and c_nationkey = n2.n_nationkey
            and s_comment not like '%xxx%yyy%'
            and o_totalprice>5
            and length(p_type) > 2
            and n1.n_name is not null
            and n2.n_name is not null
            and c_phone is not null
    ) shipping
group by
    s_nationname
order by
    s_nationname;

SPL脚本：

	A
1	=now()
2	=file("nation.btx").import@bv(N_NAME).(if(N_NAME,N_NAME,null))
3	=file("customer.ctx").open().import@mv(C_NATIONKEY,C_PHONE).(A2(C_NATIONKEY) && C_PHONE)
4	=file("supplier.ctx").open().import@mv(S_NATIONKEY,S_COMMENT).(if(A2(S_NATIONKEY) && !like(S_COMMENT,"xxxyyy*"),S_NATIONKEY,null))
5	=file("part.ctx").open().import@mv(P_TYPE).(len(P_TYPE)>2)
6	=file("orders.ctx").open().cursor@mv(O_ORDERKEY;A3(O_CUSTKEY) && O_TOTALPRICE>5)
7	=file("lineitem.ctx").open().news(A6,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT;A5(L_PARTKEY) && A4(L_SUPPKEY))
8	=A7.groups(A2(A4(L_SUPPKEY)):s_nationname;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue)
9	=interval@ms(A1,now())

3. 宽表

用前节中7个原始表关联后的数据创建一张宽表，然后从新表中查询出同样的结果，与7表关联查询比较运行用时。

创建表的SQL语句如下：

create table widetable ( l_orderkey    int,
    l_linenumber  int,
    l_partkey     int,
    l_suppkey     int,
    l_shipdate    date ,
    l_extendedprice  decimal(15, 2) NOT NULL,
    l_discount    decimal(15, 2) NOT NULL,
    o_totalprice  decimal(15, 2),
    o_custkey     int,
    c_nationname  varchar(25),
    c_phone       varchar(15),
    s_nationname  varchar(25),
    s_comment     varchar(101),
    p_type        varchar(25) );

往表中插入数据的SQL语句如下：

select l_orderkey,l_linenumber,l_partkey,l_suppkey,l_shipdate,
    l_extendedprice,l_discount,o_totalprice,o_custkey,
    n2.n_name as c_nationname,c_phone,
    n1.n_name as s_nationname,s_comment,p_type
into widetable
from supplier,lineitem,orders,customer,part,nation n1,nation n2
where
    s_suppkey = l_suppkey
    and p_partkey = l_partkey
    and o_orderkey = l_orderkey
    and c_custkey = o_custkey
    and s_nationkey = n1.n_nationkey
    and c_nationkey = n2.n_nationkey

从宽表中执行查询的SQL语句如下：

select
    s_nationname,
    sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
    s_comment not like '%xxx%yyy%'
    and o_totalprice>5
    and length(p_type) > 2
    and c_nationname is not null
    and s_nationname is not null
    and c_phone is not null
group by
    s_nationname
order by
    s_nationname;

用与宽表相同的数据创建组表widetable.ctx，SPL脚本如下：

	A
1	=file("nation.btx").import@b(N_NATIONKEY,N_NAME).keys@i(N_NATIONKEY)
2	=file("customer.ctx").open().import(C_CUSTKEY,C_NATIONKEY,C_PHONE).switch(C_NATIONKEY,A1).keys@im(C_CUSTKEY)
3	=file("supplier.ctx").open().import(S_SUPPKEY,S_NATIONKEY,S_COMMENT).switch(S_NATIONKEY,A1).keys@im(S_SUPPKEY)
4	=file("part.ctx").open().import(P_PARTKEY,P_TYPE)
5	=file("orders.ctx").open().cursor@mv(O_ORDERKEY,O_TOTALPRICE,O_CUSTKEY).switch(O_CUSTKEY,A2)
6	=file("lineitem.ctx").open().news(A5,L_ORDERKEY,L_LINENUMBER,L_PARTKEY,L_SUPPKEY,L_SHIPDATE,L_EXTENDEDPRICE,L_DISCOUNT,O_TOTALPRICE,O_CUSTKEY)
7	=A6.switch(L_PARTKEY,A4:P_PARTKEY;L_SUPPKEY,A3:S_SUPPKEY)
8	=A7.new(L_ORDERKEY,L_LINENUMBER,L_PARTKEY.P_PARTKEY:L_PARTKEY,L_SUPPKEY.S_SUPPKEY:L_SUPPKEY,L_SHIPDATE,L_EXTENDEDPRICE,L_DISCOUNT,O_TOTALPRICE,O_CUSTKEY.C_CUSTKEY:O_CUSTKEY,O_CUSTKEY.C_NATIONKEY.N_NAME:C_NATIONNAME,O_CUSTKEY.C_PHONE,L_SUPPKEY.S_NATIONKEY.N_NAME:S_NATIONNAME,L_SUPPKEY.S_COMMENT,L_PARTKEY.P_TYPE)
9	=file("widetable.ctx").create(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY,L_SUPPKEY,L_SHIPDATE,L_EXTENDEDPRICE,L_DISCOUNT,O_TOTALPRICE,O_CUSTKEY,C_NATIONNAME,C_PHONE,S_NATIONNAME,S_COMMENT,P_TYPE)
10	=A9.append@i(A8)

执行查询的SPL脚本：

	A
1	=now()
2	=file("widetable.ctx").open().cursor@mv(S_NATIONNAME,L_EXTENDEDPRICE,L_DISCOUNT;O_TOTALPRICE>5 && C_NATIONNAME!=null && C_PHONE!= null && S_NATIONNAME!=null && len(P_TYPE)>2 && !like(S_COMMENT,"xxxyyy*"))
3	=A2.groups(S_NATIONNAME;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):volume)
4	=interval@ms(A1,now())

五、测试结果

时间单位：秒

	VM1			VM2
	两表关联	七表关联	宽表	两表关联	七表关联	宽表
SPL	11.5	30.6	57.7	21.5	55.6	114.2
Starrocks	35.1	73.3	62.1	78.8	152.5	129.9
Clickhouse	89.3	内存溢出	33.2	204.1	内存溢出	74.3

六、结果点评

1. 关联性能：SPL最好，明显优于其它两款产品；ClickHouse表现非常差，不仅性能远远低于其它两款产品，还会出现内存溢出的现象；

2. 宽表性能：Clikhouse最好，远优于SPL和Starrocks，SPL比Starrocks略强，但差距并不大；

3. 关联较多时，Starrocks和Clickhouse的性能都会下降明显，低于在宽表上的性能，所以这些产品经常会建议使用宽表。

4. SPL在关联运算上的性能超过自身的宽表运算性能，也能超过另两款产品的宽表运算性能，对于SPL来讲，没有必要建立宽表。

5. 观察从VM1到VM2上的性能下降幅度，Clickhouse和Starrocks都超过了2倍（CPU核数的差距），说明其性能还有较强的内存敏感性（VM2的内存也减半了），即运算会占用较多内存。而SPL的下降幅度低于2，说明SPL运算时占用内存更少。

6. 测试结论

SPL的关联性能很好，不需要建立宽表就能应用于多维分析，能大大减少数据准备时间并减少硬盘占用，提高分析的实时性。而且SPL占用内存较小，硬件成本更低。

Clickhouse的宽表性能较好，但完全无法胜任有关联的场景。应用于多维分析场景时需要建立宽表，会比较麻烦且占用更大的硬盘空间。

Starrocks则两头不靠，关联性能虽然超过了Clickhouse但仍然不够好，宽表性能表现也不佳，这两种多维分析的场景都不适合应用。

性能对比(29) 关联(4) 宽表(7)

SPL 计算性能系列测试：关联表及宽表

一、 测试任务

二、 对比技术

三、 测试环境

四、 测试过程