Python vs. SPL: A Performance Comparison of Data Reading and Computation

Test Environment

OS: CentOS 7

VM: VMware 15.5.1 build-15018445

Memory: 32GB

CPU: 4 cores

Data: TPC-H at 1GB scale

Data Reading

The data sources are of two kinds: text file and database. Note that the data used in this article is not especially large and fits entirely in memory.

Data reading uses the orders table, which contains 1.5 million rows.

Text data

The text file is 172MB, delimited by "|", and looks like this:

1|18451|O|193738.97|1996-01-02|5-LOW|Clerk#000000951|0|nstructions sleep furiously among |
2|39001|O|41419.39|1996-12-01|1-URGENT|Clerk#000000880|0| foxes. pending accounts at the pending, silent asymptot|
3|61657|F|208403.57|1993-10-14|5-LOW|Clerk#000000955|0|sly final accounts boost. carefully regular ideas cajole carefully. depos|
4|68389|O|30234.04|1995-10-11|5-LOW|Clerk#000000124|0|sits. slyly regular warthogs cajole. regular, regular theodolites acro|
5|22243|F|149065.30|1994-07-30|5-LOW|Clerk#000000925|0|quickly. bold deposits sleep slyly. packages use slyly|
...

Python reads the text file:

import pandas as pd

orders_file = "/svmiracle/tpchdata/orders.tbl"
data = pd.read_csv(orders_file, index_col=0, sep="|", header=None)

Elapsed time: 7.84 seconds.

SPL reads the text file:


	A
1	'/svmiracle/tpchdata/orders.tbl
2	=file(A1).import(;,"|")

Elapsed time: 6.21 seconds.

SPL has its own binary file format, which greatly improves reading efficiency.

The orders table saved as a btx binary file is 148.5MB. SPL reads it:


	A
1	'/svmiracle/tpchdata/orders.btx
2	=file(A1).import@b()

Elapsed time: 2.56 seconds.

Python has no universally adopted binary file format, so no comparison is made here.
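For reference only (an aside, not part of the benchmark): pandas does offer library-specific binary serialization, such as its pickle format, which skips text parsing on reload; it is just not a universal interchange format. A minimal sketch:

```python
import os
import tempfile

import pandas as pd

# A small sample frame; the real benchmark would use the full orders table.
df = pd.DataFrame({"O_ORDERKEY": [1, 2, 3],
                   "O_TOTALPRICE": [193738.97, 41419.39, 208403.57]})

# Round-trip through pandas' pickle-based binary serialization.
path = os.path.join(tempfile.gettempdir(), "orders_sample.pkl")
df.to_pickle(path)
restored = pd.read_pickle(path)
print(restored.equals(df))  # True if the round trip is lossless
```

Formats such as Feather or Parquet (via pyarrow) are similar library-backed options.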

SPL can also read the file in parallel:


	A
1	'/svmiracle/tpchdata/orders.tbl
2	=file(A1).import@m(;,"|")

Elapsed time: 1.68 seconds.

The @m option reads data in parallel, which speeds up reading, but the order of the result set may change, so use it with caution when order matters.

The number of threads used by the @m option is configured in SPL under Tool > Options.

btx files can also be read in parallel, again simply by adding @m:


	A
1	'/svmiracle/tpchdata/orders.btx
2	=file(A1).import@mb()

Elapsed time: 0.73 seconds.

Python's native parallelism is pseudo-parallelism (constrained by the GIL), and parallelism through third-party libraries is inconvenient and performs poorly, so it is not tested here.

Database

The database tests use Oracle as an example.

Python reads the database:

import cx_Oracle
import pandas as pd

pd.set_option('display.max_columns', None)
db = cx_Oracle.connect('kezhao', 'kezhao', '192.168.0.121:1521/ORCLCDB')
cs = db.cursor()
sql = "select * from orders"
orders_cs = cs.execute(sql)
orders_data = orders_cs.fetchall()
columns = ["O_ORDERKEY","O_CUSTKEY","O_ORDERSTATUS","O_TOTALPRICE","O_ORDERDATE","O_ORDERPRIORITY","O_CLERK","O_SHIPPRIORITY","O_COMMENT"]
orders = pd.DataFrame(orders_data, columns=columns)

Elapsed time: 18.15 seconds.

SPL reads the database:


	A
1	=connect("oracle19")
2	=A1.query@d("select * from orders")

Elapsed time: 47.41 seconds.

Similarly, SPL can read the database in parallel:


	A	B
1	>n=4	/number of threads
2	fork to(n)	=connect("oracle19")
3		=B2.query@dx("select * from orders where mod(O_ORDERKEY,?)=?",n,A2-1)
4	=A2.conj()


Elapsed time: 10.78 seconds.
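The fork script above partitions the table by the order key modulo the thread count, so each thread fetches a disjoint slice. The same partitioning SQL can be sketched in Python (illustrative only; table and column names follow the benchmark):

```python
n = 4  # number of threads, matching cell A1 of the SPL script

# One query per thread; mod(O_ORDERKEY, n) = i selects a disjoint slice of rows,
# and the union of the n slices is the whole table.
queries = [
    f"select * from orders where mod(O_ORDERKEY,{n})={i}"
    for i in range(n)
]
for q in queries:
    print(q)
```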

Summary

Data reading performance comparison (unit: seconds):

Data source	Python	SPL serial	SPL parallel
Text file	7.84	6.21	1.68
Binary file (btx)	n/a	2.56	0.73
Database	18.15	47.41	10.78

For text files, single-threaded SPL and Python read at similar speeds, but SPL speeds up substantially with parallel reading. Saving the text file as SPL's binary btx format also improves reading performance dramatically, far outperforming Python.

For database reading, Python has a clear advantage in single-threaded mode; SPL is held back by the low performance of Oracle's JDBC driver (this test uses Oracle's most common thin JDBC driver; there is also an oci JDBC driver that reportedly performs better, but it is cumbersome to configure and rarely used, so it was not tested). Fortunately, SPL can make up for this disadvantage with parallelism.

Basic Operations

This section first tests basic operations that can be described in SQL without subqueries, such as regular aggregation (sum), grouping (group), and join. The data comes from the text files.

Q1

The query SQL is:

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_shipdate <= date '1995-12-01' - interval '90' day(3)
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;

This is a regular group-and-aggregate query with a small result set.

Python code:

import time
import pandas as pd
dir = "/svmiracle/tpchtbl1g/"
pd.set_option('display.max_columns',None)
columns =["L_SHIPDATE",
"L_QUANTITY",
"L_EXTENDEDPRICE",
"L_DISCOUNT",
"L_TAX",
"L_RETURNFLAG",
"L_LINESTATUS"]
lineitem_all=pd.read_csv(dir+"lineitem.tbl",usecols=columns,sep="|",
parse_dates=['L_SHIPDATE'],
infer_datetime_format=True)
s = time.time()
date = pd.to_datetime("1995-12-01")-pd.to_timedelta(90,unit="D")
lineitem = lineitem_all[lineitem_all['L_SHIPDATE']<=date].copy()
lineitem['REVENUE'] = lineitem.L_EXTENDEDPRICE*(1-lineitem.L_DISCOUNT)
lineitem['CHARGE'] = lineitem.REVENUE*(lineitem.L_TAX+1)
res = lineitem.groupby(['L_RETURNFLAG', 'L_LINESTATUS']) \
.agg(sum_qty=pd.NamedAgg(column='L_QUANTITY',aggfunc='sum'),\
sum_base_price=pd.NamedAgg(column='L_EXTENDEDPRICE',aggfunc='sum'), \
sum_disc_price = pd.NamedAgg(column='REVENUE', aggfunc='sum'), \
sum_charge = pd.NamedAgg(column='CHARGE', aggfunc='sum'), \
avg_qty = pd.NamedAgg(column='L_QUANTITY', aggfunc='mean'), \
avg_price = pd.NamedAgg(column='L_EXTENDEDPRICE', aggfunc='mean'), \
avg_disc = pd.NamedAgg(column='L_DISCOUNT', aggfunc='mean'), \
count_order = pd.NamedAgg(column='L_SHIPDATE', aggfunc='count')) \
.reset_index()
e = time.time()
print(e-s)
print(res)

Elapsed time: 1.84 seconds.

SPL Community Edition:


	A
1	'/svmiracle/tpchtbl1g/
2	1995-12-01
3	=A2-90
4	=file(A1/"lineitem.tbl").import@t(L_SHIPDATE,L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS;,"|")
5	=now()
6	=A4.select(L_SHIPDATE<=A3)
7	=A6.groups(L_RETURNFLAG, L_LINESTATUS; sum(L_QUANTITY):sum_qty, sum(L_EXTENDEDPRICE):sum_base_price, sum(dp=L_EXTENDEDPRICE*(1-L_DISCOUNT)):sum_disc_price, sum(dp*L_TAX)+sum_disc_price:sum_charge, avg(L_QUANTITY):avg_qty, avg(L_EXTENDEDPRICE):avg_price, avg(L_DISCOUNT):avg_disc, count(1):count_order)
8	=interval@ms(A5,now())

Elapsed time: 2.45 seconds.
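Cell A7 avoids a second pass over a derived column by using the identity sum(p*(1-d)*(1+t)) = sum(p*(1-d)) + sum(p*(1-d)*t). A quick numeric check of this identity on toy values:

```python
# Toy rows: (extendedprice, discount, tax); made-up values, not benchmark data.
rows = [(100.0, 0.05, 0.08), (250.0, 0.10, 0.06), (80.0, 0.00, 0.04)]

# Direct form, as written in the SQL: sum(p * (1 - d) * (1 + t))
direct = sum(p * (1 - d) * (1 + t) for p, d, t in rows)

# Rewritten form used by the SPL script: sum(dp) + sum(dp * t), with dp = p * (1 - d)
dp = [p * (1 - d) for p, d, _ in rows]
rewritten = sum(dp) + sum(x * t for x, (_, _, t) in zip(dp, rows))

print(abs(direct - rewritten) < 1e-9)  # True: the two forms agree
```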

SPL Enterprise Edition provides vector-style columnar computation, which achieves better performance:


	A
1	'/svmiracle/tpchtbl1g/
2	1995-12-01
3	=A2-90
4	=file(A1/"lineitem.tbl").import@t(L_SHIPDATE,L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS;,"|").i()
5	=now()
6	=A4.cursor().select(L_SHIPDATE<=A3)
7	=A6.derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):dp)
8	=A7.groups(L_RETURNFLAG, L_LINESTATUS; sum(L_QUANTITY):sum_qty, sum(L_EXTENDEDPRICE):sum_base_price, sum(dp):sum_disc_price, sum(dp*L_TAX)+sum_disc_price:sum_charge, avg(L_QUANTITY):avg_qty, avg(L_EXTENDEDPRICE):avg_price, avg(L_DISCOUNT):avg_disc, count(1):count_order)
9	=interval@ms(A5,now())

Elapsed time: 1.30 seconds.

All SPL code can run in parallel with multiple threads.

SPL Community Edition, parallel:


	A
1	'/svmiracle/tpchtbl1g/
2	1995-12-01
3	=A2-90
4	=file(A1/"lineitem.tbl").import@mt(L_SHIPDATE,L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS;,"|")
5	=now()
6	=A4.select@m(L_SHIPDATE<=A3)
7	=A6.groups@m(L_RETURNFLAG, L_LINESTATUS; sum(L_QUANTITY):sum_qty, sum(L_EXTENDEDPRICE):sum_base_price, sum(dp=L_EXTENDEDPRICE*(1-L_DISCOUNT)):sum_disc_price, sum(dp*L_TAX)+sum_disc_price:sum_charge, avg(L_QUANTITY):avg_qty, avg(L_EXTENDEDPRICE):avg_price, avg(L_DISCOUNT):avg_disc, count(1):count_order)
8	=interval@ms(A5,now())

Elapsed time: 1.25 seconds.

SPL Enterprise Edition, columnar and parallel:


	A
1	'/svmiracle/tpchtbl1g/
2	1995-12-01
3	=A2-90
4	=file(A1/"lineitem.tbl").import@tm(L_SHIPDATE,L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS;,"|").i()
5	=now()
6	=A4.cursor@m().select@m(L_SHIPDATE<=A3)
7	=A6.derive@om(L_EXTENDEDPRICE*(1-L_DISCOUNT):dp)
8	=A7.groups@m(L_RETURNFLAG, L_LINESTATUS; sum(L_QUANTITY):sum_qty, sum(L_EXTENDEDPRICE):sum_base_price, sum(dp):sum_disc_price, sum(dp*L_TAX)+sum_disc_price:sum_charge, avg(L_QUANTITY):avg_qty, avg(L_EXTENDEDPRICE):avg_price, avg(L_DISCOUNT):avg_disc, count(1):count_order)
9	=interval@ms(A5,now())

Elapsed time: 0.59 seconds.

Q3

The query SQL is:

select * from (
    select
        l_orderkey,
        sum(l_extendedprice * (1 - l_discount)) as revenue,
        o_orderdate,
        o_shippriority
    from
        customer,
        orders,
        lineitem
    where
        c_mktsegment = 'BUILDING'
        and c_custkey = o_custkey
        and l_orderkey = o_orderkey
        and o_orderdate < date '1995-03-15'
        and l_shipdate > date '1995-03-15'
    group by
        l_orderkey,
        o_orderdate,
        o_shippriority
    order by
        revenue desc,
        o_orderdate
) where rownum<=10;

This is a typical primary-to-detail table join, followed by grouping and aggregation.

Python code:

import time
import pandas as pd
dir = "/svmiracle/tpchtbl1g/"
pd.set_option('display.max_columns',None)
cust_cols = ["C_CUSTKEY","C_MKTSEGMENT"]
cust_all = pd.read_csv(dir+"customer.tbl",usecols=cust_cols,sep="|")
orders_cols = ['O_ORDERKEY','O_CUSTKEY','O_ORDERDATE','O_SHIPPRIORITY']
orders_all = pd.read_csv(dir+"orders.tbl",usecols=orders_cols,sep="|",
parse_dates=['O_ORDERDATE'],
infer_datetime_format=True)
lineitem_cols =['L_ORDERKEY','L_EXTENDEDPRICE','L_DISCOUNT','L_SHIPDATE']
lineitem_all = pd.read_csv(dir+"lineitem.tbl",usecols=lineitem_cols,sep="|",
parse_dates=['L_SHIPDATE'],
infer_datetime_format=True)
s = time.time()
date = pd.to_datetime("1995-3-15")
segment = "BUILDING"
cust = cust_all[cust_all['C_MKTSEGMENT']==segment]
orders = orders_all[orders_all['O_ORDERDATE']<date]
lineitem = lineitem_all[lineitem_all['L_SHIPDATE']>date].copy()
orders_c = pd.merge(orders,cust,\
left_on='O_CUSTKEY',\
right_on='C_CUSTKEY',\
how='inner')
lineitem['REVENUE'] = lineitem.L_EXTENDEDPRICE*(1-lineitem.L_DISCOUNT)
lineitem_s = lineitem.groupby('L_ORDERKEY').REVENUE.sum()
lineitem_g = pd.DataFrame(lineitem_s,columns=['REVENUE']).reset_index()
orders_cl = pd.merge(orders_c,lineitem_g,\
left_on='O_ORDERKEY',\
right_on='L_ORDERKEY',\
how='inner')
orders_cl.sort_values(by=['REVENUE','O_ORDERDATE'],\
ascending=[False,True],\
ignore_index=True,\
inplace=True)
res = orders_cl[:10][['L_ORDERKEY','REVENUE','O_ORDERDATE','O_SHIPPRIORITY']]
e = time.time()
print(e-s)
print(res)

Elapsed time: 1.48 seconds.

SPL Community Edition:


	A
1	'/svmiracle/tpchtbl1g/
2	1995-3-15
3	BUILDING
4	=file(A1/"customer.tbl").import@t(C_CUSTKEY,C_MKTSEGMENT ;,"|")
5	=file(A1/"orders.tbl").import@t(O_ORDERKEY,O_CUSTKEY,O_ORDERDATE,O_SHIPPRIORITY;,"|")
6	=file(A1/"lineitem.tbl").import@t(L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE;,"|")
7	=now()
8	=A4.select(C_MKTSEGMENT==A3).derive@o().keys@i(C_CUSTKEY)
9	=A5.select(O_ORDERDATE<A2).switch@i(O_CUSTKEY,A8).derive@o().keys@i(O_ORDERKEY)
10	=A6.select(L_SHIPDATE>A2).switch@i(L_ORDERKEY,A9)
11	=A10.groups(L_ORDERKEY.O_ORDERKEY:L_ORDERKEY;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):REVENUE,L_ORDERKEY.O_ORDERDATE,L_ORDERKEY.O_SHIPPRIORITY)
12	=A11.top(10;-REVENUE,O_ORDERDATE)
13	=interval@ms(A7,now())

Elapsed time: 1.87 seconds.

SPL Enterprise Edition, columnar:


	A
1	'/svmiracle/tpchtbl1g/
2	1995-3-15
3	BUILDING
4	=file(A1/"customer.tbl").import@t(C_CUSTKEY,C_MKTSEGMENT ;,"|").i()
5	=file(A1/"orders.tbl").import@t(O_ORDERKEY,O_CUSTKEY,O_ORDERDATE,O_SHIPPRIORITY;,"|").i()
6	=file(A1/"lineitem.tbl").import@t(L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE;,"|").i()
7	=now()
8	=A4.select@v(C_MKTSEGMENT==A3).(C_CUSTKEY)
9	=A5.select@v(O_ORDERDATE<A2&&A8.contain(O_CUSTKEY)).keys@i(O_ORDERKEY)
10	=A6.select@v(L_SHIPDATE>A2).join@i(L_ORDERKEY,A9,O_ORDERDATE,O_SHIPPRIORITY)
11	=A10.groups(L_ORDERKEY;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):REVENUE,O_ORDERDATE,O_SHIPPRIORITY)
12	=A11.top(10;-REVENUE,O_ORDERDATE)
13	=interval@ms(A7,now())

Elapsed time: 0.71 seconds.

Now look at SPL's multithreaded parallel performance.

SPL Community Edition, parallel:


	A
1	'/svmiracle/tpchtbl1g/
2	1995-3-15
3	BUILDING
4	=file(A1/"customer.tbl").import@mt(C_CUSTKEY,C_MKTSEGMENT ;,"|")
5	=file(A1/"orders.tbl").import@mt(O_ORDERKEY,O_CUSTKEY,O_ORDERDATE,O_SHIPPRIORITY;,"|")
6	=file(A1/"lineitem.tbl").import@mt(L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE;,"|")
7	=now()
8	=A4.select@m(C_MKTSEGMENT==A3).derive@o().keys@i(C_CUSTKEY)
9	=A5.select@m(O_ORDERDATE<A2).switch@i(O_CUSTKEY,A8).derive@o().keys@i(O_ORDERKEY)
10	=A6.select@m(L_SHIPDATE>A2).switch@i(L_ORDERKEY,A9)
11	=A10.groups@m(L_ORDERKEY.O_ORDERKEY:L_ORDERKEY;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):REVENUE,L_ORDERKEY.O_ORDERDATE,L_ORDERKEY.O_SHIPPRIORITY)
12	=A11.top(10;-REVENUE,O_ORDERDATE)
13	=interval@ms(A7,now())

Elapsed time: 1.16 seconds.

SPL Enterprise Edition, columnar and parallel:


	A
1	'/svmiracle/tpchtbl1g/
2	1995-3-15
3	BUILDING
4	=file(A1/"customer.tbl").import@mt(C_CUSTKEY,C_MKTSEGMENT ;,"|").i()
5	=file(A1/"orders.tbl").import@mt(O_ORDERKEY,O_CUSTKEY,O_ORDERDATE,O_SHIPPRIORITY;,"|").i()
6	=file(A1/"lineitem.tbl").import@mt(L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE;,"|").i()
7	=now()
8	=A4.select@vm(C_MKTSEGMENT==A3).(C_CUSTKEY)
9	=A5.select@mv(O_ORDERDATE<A2&&A8.contain(O_CUSTKEY)).keys@i(O_ORDERKEY)
10	=A6.select@mv(L_SHIPDATE>A2).join@i(L_ORDERKEY,A9,O_ORDERDATE,O_SHIPPRIORITY)
11	=A10.groups@m(L_ORDERKEY;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):REVENUE,O_ORDERDATE,O_SHIPPRIORITY)
12	=A11.top(10;-REVENUE,O_ORDERDATE)
13	=interval@ms(A7,now())

Elapsed time: 0.51 seconds.

Q6

The SQL is:

select
    sum(l_extendedprice * l_discount) as revenue
from
    lineitem
where
    l_shipdate >= date '1995-01-01'
    and l_shipdate < date '1995-01-01' + interval '1' year
    and l_discount between 0.05 - 0.01 and 0.05 + 0.01
    and l_quantity < 24;

This is a simple single-table filter followed by aggregation.

Python code:

import time
import pandas as pd
dir = "/svmiracle/tpchtbl1g/"
pd.set_option('display.max_columns',None)
lineitem_cols =['L_SHIPDATE','L_QUANTITY','L_EXTENDEDPRICE','L_DISCOUNT']
lineitem_all = pd.read_csv(dir+"lineitem.tbl",usecols=lineitem_cols,sep="|",
parse_dates=['L_SHIPDATE'],
infer_datetime_format=True)
s = time.time()
date_s = pd.to_datetime("1995-01-01")
date_e = date_s+pd.DateOffset(years=1)
mi = 0.05 - 0.01
ma = 0.05 + 0.01
qt = 24
lineitem = lineitem_all[(lineitem_all['L_SHIPDATE'].between(date_s,date_e,"left"))& \
(lineitem_all['L_DISCOUNT'].between(mi,ma,"both"))& \
(lineitem_all['L_QUANTITY']<qt)]
res = (lineitem.L_EXTENDEDPRICE * lineitem.L_DISCOUNT).sum()
e = time.time()
print(e-s)
print(res)

Elapsed time: 0.19 seconds.

SPL Community Edition:


	A
1	'/svmiracle/tpchtbl1g/
2	1995-01-01
3	=elapse@y(A2,1)
4	=0.05 - 0.01
5	=0.05 + 0.01
6	24
7	=file(A1/"lineitem.tbl").import@t(L_SHIPDATE,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;,"|")
8	=now()
9	=A7.select(L_DISCOUNT>=A4&&L_DISCOUNT<=A5&&L_QUANTITY<A6&&L_SHIPDATE>=A2&&L_SHIPDATE<A3)
10	=A9.sum(L_EXTENDEDPRICE*L_DISCOUNT)
11	=interval@ms(A8,now())

Elapsed time: 1.54 seconds.

SPL Enterprise Edition, columnar:


	A
1	'/svmiracle/tpchtbl1g/
2	1995-01-01
3	=elapse@y(A2,1)
4	=0.05 - 0.01
5	=0.05 + 0.01
6	24
7	=file(A1/"lineitem.tbl").import@t(L_SHIPDATE,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;,"|").i()
8	=now()
9	=A7.sum(if(L_DISCOUNT>=A4&&L_DISCOUNT<=A5&&L_QUANTITY<A6&&L_SHIPDATE>=A2&&L_SHIPDATE<A3,L_EXTENDEDPRICE*L_DISCOUNT,0))
10	=interval@ms(A8,now())

SPL columnar elapsed time: 0.55 seconds.
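The enterprise script folds the filter into the aggregation with sum(if(cond, value, 0)), so the data is scanned once. A comparable single-pass conditional sum can be sketched in pandas/NumPy (toy stand-in data; only the column names follow the benchmark):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the lineitem columns used by Q6.
df = pd.DataFrame({
    "L_DISCOUNT": [0.05, 0.02, 0.06, 0.04],
    "L_QUANTITY": [10, 30, 5, 40],
    "L_EXTENDEDPRICE": [100.0, 200.0, 300.0, 400.0],
})

mask = df["L_DISCOUNT"].between(0.04, 0.06) & (df["L_QUANTITY"] < 24)
# Single pass: contribute price*discount where the condition holds, else 0.
revenue = np.where(mask, df["L_EXTENDEDPRICE"] * df["L_DISCOUNT"], 0.0).sum()
print(revenue)  # approximately 23.0 on this toy data
```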

Now for SPL's multithreaded parallelism.

SPL Community Edition, parallel:


	A
1	'/svmiracle/tpchtbl1g/
2	1995-01-01
3	=elapse@y(A2,1)
4	=0.05 - 0.01
5	=0.05 + 0.01
6	24
7	=file(A1/"lineitem.tbl").import@mt(L_SHIPDATE,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;,"|")
8	=now()
9	=A7.select@m(L_DISCOUNT>=A4&&L_DISCOUNT<=A5&&L_QUANTITY<A6&&L_SHIPDATE>=A2&&L_SHIPDATE<A3)
10	=A9.sum(L_EXTENDEDPRICE*L_DISCOUNT)
11	=interval@ms(A8,now())

Elapsed time: 0.55 seconds.

SPL Enterprise Edition, columnar and parallel:


	A
1	'/svmiracle/tpchtbl1g/
2	1995-01-01
3	=elapse@y(A2,1)
4	=0.05 - 0.01
5	=0.05 + 0.01
6	24
7	=file(A1/"lineitem.tbl").import@t(L_SHIPDATE,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;,"|").i()
8	=now()
9	=A7.groups@m(;sum(if(L_DISCOUNT>=A4&&L_DISCOUNT<=A5&&L_QUANTITY<A6&&L_SHIPDATE>=A2&&L_SHIPDATE<A3,L_EXTENDEDPRICE*L_DISCOUNT,0))).~
10	=interval@ms(A8,now())

SPL columnar multithreaded elapsed time: 0.22 seconds.

Summary

Performance comparison for basic SQL-style computation (unit: seconds):


	Q1	Q3	Q6
Python	1.84	1.48	0.19
SPL Community	2.45	1.87	1.54
SPL Enterprise	1.30	0.71	0.55
SPL Community parallel	1.25	1.16	0.55
SPL Enterprise parallel	0.59	0.51	0.22

For these basic operations, Python beats SPL Community Edition. To support generic data, SPL uses object sequences, which incur many type checks during computation, while every column of a Python DataFrame holds a single data type, so it computes faster. SPL Enterprise Edition offers pure table sequences whose columns likewise each hold a single data type, much like Python's structure, and its performance is then usually better than Python's. Parallelism improves both the Community and Enterprise Edition results. Q6 is the simplest filter-multiply-sum workload, exactly what Pandas' matrix-like data structure does best, so its performance there is outstanding and even beats SPL Enterprise Edition's optimized result.

Non-Basic Operations

Next we pick two operations that require subqueries to express in SQL, along with an operation that can exploit data characteristics to improve performance.

Q2

The query SQL is:

select * from (
    select
        s_acctbal,s_name,n_name,p_partkey,p_mfgr,s_address,s_phone,s_comment
    from
        part,supplier,partsupp,nation,region
    where
        p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and p_size = 25
        and p_type like '%COPPER'
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'ASIA'
        and ps_supplycost = (
            select
                min(ps_supplycost)
            from
                partsupp,
                supplier,
                nation,
                region
            where
                p_partkey = ps_partkey
                and s_suppkey = ps_suppkey
                and s_nationkey = n_nationkey
                and n_regionkey = r_regionkey
                and r_name = 'ASIA'
        )
    order by
        s_acctbal desc,n_name,s_name,p_partkey
)
where rownum <= 100;

Besides regular joins, aggregations, and other basic operations, this SQL also involves retrieving all records that hold the minimum value.

Python code:

import time
import pandas as pd
pd.set_option('display.max_columns',None)
dir = "/svmiracle/tpchtbl1g/"
size=25
type="COPPER"
name="ASIA"
def min_r(df):
    mi = df['PS_SUPPLYCOST'].min()
    minr = df[df['PS_SUPPLYCOST']==mi]
    return minr
region_all = pd.read_csv(dir+"region.tbl",
    usecols=['R_NAME','R_REGIONKEY'],sep='|')
nation_all = pd.read_csv(dir+"nation.tbl",
    usecols=['N_NATIONKEY','N_REGIONKEY','N_NAME'],sep='|')
part_all = pd.read_csv(dir+"part.tbl",
    usecols=['P_PARTKEY','P_MFGR','P_SIZE','P_TYPE'],sep='|')
supplier_all = pd.read_csv(dir+"supplier.tbl",
    usecols=['S_SUPPKEY','S_NAME','S_ADDRESS','S_NATIONKEY','S_PHONE','S_ACCTBAL','S_COMMENT'],sep='|')
partsupp_all = pd.read_csv(dir+"partsupp.tbl",
    usecols=['PS_PARTKEY','PS_SUPPKEY','PS_SUPPLYCOST'],sep='|')
s = time.time()
for i in range(len(region_all)):
    sr = region_all.iloc[i]
    if sr.R_NAME==name:
        rg = sr.R_REGIONKEY
        break
nation = nation_all[nation_all['N_REGIONKEY']==rg]
part = part_all[(part_all['P_SIZE']==size)& (part_all['P_TYPE'].str.contains(type))]
sup_nat = pd.merge(supplier_all,nation,
    left_on='S_NATIONKEY',right_on='N_NATIONKEY',how='inner')
ps_par = pd.merge(partsupp_all,part,
    left_on='PS_PARTKEY',right_on='P_PARTKEY',how='inner')
ps_par_sup_nat = pd.merge(ps_par,sup_nat,
    left_on='PS_SUPPKEY',right_on='S_SUPPKEY',how='inner')
ps_min = ps_par_sup_nat.groupby('PS_PARTKEY').apply(lambda x:min_r(x))
ps_min_100 = ps_min.sort_values(['S_ACCTBAL','N_NAME','S_NAME','P_PARTKEY'],ascending=[False,True,True,True]).iloc[:100]
res = ps_min_100[['S_ACCTBAL','S_NAME','N_NAME','P_PARTKEY','P_MFGR','S_ADDRESS','S_PHONE','S_COMMENT']]
e = time.time()
print(e-s)
print(res)

Elapsed time: 1.23 seconds, most of it spent finding the minimum-value records in each group.
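The groupby().apply() call above runs a Python function per group, which is slow. As an aside (not benchmarked here), the same "all rows tied for the group minimum" filter can usually be vectorized with transform:

```python
import pandas as pd

# Toy stand-in for the joined part/supplier table.
df = pd.DataFrame({
    "PS_PARTKEY": [1, 1, 1, 2, 2],
    "PS_SUPPLYCOST": [5.0, 3.0, 3.0, 7.0, 9.0],
})

# transform('min') broadcasts each group's minimum back to its rows,
# so one boolean mask keeps every row that ties for the minimum.
ps_min = df[df["PS_SUPPLYCOST"] == df.groupby("PS_PARTKEY")["PS_SUPPLYCOST"].transform("min")]
print(ps_min["PS_SUPPLYCOST"].tolist())  # [3.0, 3.0, 7.0]
```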

SPL Community Edition:


	A
1	'/svmiracle/tpchtbl1g/
2	>size=25
3	>type="COPPER"
4	>name="ASIA"
5	=file(A1/"region.tbl").import@t(R_NAME,R_REGIONKEY;,"|")
6	=file(A1/"nation.tbl").import@t(N_NATIONKEY,N_REGIONKEY,N_NAME;,"|")
7	=file(A1/"part.tbl").import@t(P_PARTKEY,P_MFGR,P_SIZE,P_TYPE;,"|")
8	=file(A1/"supplier.tbl").import@t(S_SUPPKEY,S_NAME,S_ADDRESS,S_NATIONKEY,S_PHONE,S_ACCTBAL,S_COMMENT;,"|")
9	=file(A1/"partsupp.tbl").import@t(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;,"|")
10	=now()
11	=A5.select@1(R_NAME==name).R_REGIONKEY
12	=A6.select(N_REGIONKEY==A11).derive@o().keys@i(N_NATIONKEY)
13	=A7.select(P_SIZE==size && pos@t(P_TYPE,type)).derive@o().keys@i(P_PARTKEY)
14	=A8.switch@i(S_NATIONKEY,A12).keys@i(S_SUPPKEY)
15	=A9.switch@i(PS_PARTKEY,A13;PS_SUPPKEY,A14)
16	=A15.groups(PS_PARTKEY;minp@a(PS_SUPPLYCOST)).conj(#2)
17	=A16.top(100;-PS_SUPPKEY.S_ACCTBAL,PS_SUPPKEY.S_NATIONKEY.N_NAME,PS_SUPPKEY.S_NAME,PS_PARTKEY.P_PARTKEY)
18	=A17.new(PS_SUPPKEY.S_ACCTBAL,PS_SUPPKEY.S_NAME,PS_SUPPKEY.S_NATIONKEY.N_NAME,PS_PARTKEY.P_PARTKEY,PS_PARTKEY.P_MFGR,PS_SUPPKEY.S_ADDRESS,PS_SUPPKEY.S_PHONE,PS_SUPPKEY.S_COMMENT)
19	=interval@ms(A10,now())

Elapsed time: 0.10 seconds.

SPL treats finding the minimum-value records as an aggregation like sum and count, so A16 computes the result in a single pass with groups(...;minp@a(...)), giving better performance.

SPL Enterprise Edition, columnar:


	A
1	'/svmiracle/tpchtbl1g/
2	>size=25
3	>type="COPPER"
4	>name="ASIA"
5	=file(A1/"region.tbl").import@t(R_NAME,R_REGIONKEY;,"|").i()
6	=file(A1/"nation.tbl").import@t(N_NATIONKEY,N_REGIONKEY,N_NAME;,"|").i()
7	=file(A1/"part.tbl").import@t(P_PARTKEY,P_MFGR,P_SIZE,P_TYPE;,"|").i()
8	=file(A1/"supplier.tbl").import@t(S_SUPPKEY,S_NAME,S_ADDRESS,S_NATIONKEY,S_PHONE,S_ACCTBAL,S_COMMENT;,"|").i()
9	=file(A1/"partsupp.tbl").import@t(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;,"|").i()
10	=now()
11	=A5.select@v1(R_NAME==name).R_REGIONKEY
12	=A6.select@v(N_REGIONKEY==A11).derive@o().keys@i(N_NATIONKEY)
13	=A7.select@v(P_SIZE==size && pos@t(P_TYPE,type)).derive@o().keys@i(P_PARTKEY)
14	=A8.switch@i(S_NATIONKEY,A12).keys@i(S_SUPPKEY)
15	=A9.switch@i(PS_PARTKEY,A13;PS_SUPPKEY,A14)
16	=A15.groups(PS_PARTKEY;minp@a(PS_SUPPLYCOST)).conj(#2)
17	=A16.top(100;-PS_SUPPKEY.S_ACCTBAL,PS_SUPPKEY.S_NATIONKEY.N_NAME,PS_SUPPKEY.S_NAME,PS_PARTKEY.P_PARTKEY)
18	=A17.new(PS_SUPPKEY.S_ACCTBAL,PS_SUPPKEY.S_NAME,PS_SUPPKEY.S_NATIONKEY.N_NAME,PS_PARTKEY.P_PARTKEY,PS_PARTKEY.P_MFGR,PS_SUPPKEY.S_ADDRESS,PS_SUPPKEY.S_PHONE,PS_SUPPKEY.S_COMMENT)
19	=interval@ms(A10,now())

Elapsed time: 0.05 seconds.

Q13

The query SQL is:

select
    c_count,
    count(*) as custdist
from (
    select
        c_custkey,
        count(o_orderkey) c_count
    from
        customer left outer join orders on
            c_custkey = o_custkey
            and o_comment not like '%special%accounts%'
    group by
        c_custkey
) c_orders
group by
    c_count
order by
    custdist desc,
    c_count desc;

At a glance, this query performs two rounds of regular grouping on orders: the first groups by custkey to count each customer's orders, and the second groups by that order count to find how many customers have each count. It also involves string pattern matching.

Python code:

import time
import pandas as pd
pd.set_option('display.max_columns',None)
dir = "/svmiracle/tpchtbl1g/"
filter=".*special.*accounts.*"
orders_all = pd.read_csv(dir+"orders.tbl",
    usecols=['O_CUSTKEY','O_COMMENT'],sep='|')
s = time.time()
orders = orders_all[~(orders_all['O_COMMENT'].str.match(filter))]
orders_cnt = orders.groupby('O_CUSTKEY').size()
len_orders_cnt = len(orders_cnt)
orders_cnt_cnt = orders_cnt.groupby(orders_cnt).size()
with open(dir+"customer.tbl",'rb') as f:
    n_row = -1
    for i in f:
        n_row+=1
num = n_row - len_orders_cnt
orders_cnt_cnt[0] = num
res_df = pd.DataFrame(orders_cnt_cnt).reset_index(drop=False)\
    .rename(columns={'index':'c_count',0:'custdist'})
res = res_df.sort_values(by=['custdist','c_count'],ascending=False).reset_index()
e = time.time()
print(e-s)
print(res)

Python elapsed time: 2.26 seconds, most of it spent on string matching.
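The SQL pattern '%special%accounts%' only requires that "special" appear and that "accounts" appear somewhere after it. As an aside (not benchmarked, and not necessarily faster than pandas' vectorized str.match on real data), the matching logic itself needs no regex engine:

```python
def like_special_accounts(comment: str) -> bool:
    # True if "special" occurs and "accounts" occurs after it,
    # mirroring the SQL pattern '%special%accounts%'.
    i = comment.find("special")
    return i >= 0 and comment.find("accounts", i + len("special")) >= 0

print(like_special_accounts("the special pending accounts"))  # True
print(like_special_accounts("accounts then special"))         # False
```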

SPL Community Edition:


	A
1	'/svmiracle/tpchtbl1g/
2	>filter="*special*accounts*"
3	=file(A1/"orders.tbl").import@t(O_CUSTKEY,O_COMMENT;,"|")
4	=file(A1/"customer.tbl").cursor@t()
5	=now()
6	=A3.select(!like(O_COMMENT,filter))
7	=A4.skip()
8	=A6.groups@u(O_CUSTKEY;count(1):c_count)
9	=A8.len()
10	=A8.groups@u(c_count;count(1):custdist)
11	=A10.insert(0,0,A7-A9)
12	=A10.sort@z(custdist,c_count)
13	=interval@ms(A5,now())

Elapsed time: 1.99 seconds.

SPL Enterprise Edition, columnar:


	A
1	'/svmiracle/tpchtbl1g/
2	>filter="*special*accounts*"
3	=file(A1/"orders.tbl").import@t(O_CUSTKEY,O_COMMENT;,"|").i()
4	=file(A1/"customer.tbl").cursor@t()
5	=now()
6	=A3.select@v(!like(O_COMMENT,filter))
7	=A4.skip()
8	=A6.groups@u(O_CUSTKEY;count(1):c_count)
9	=A8.len()
10	=A8.groups@u(c_count;count(1):custdist)
11	=new(0:c_count,A7-A9:custdist).i()
12	=A10|A11
13	=A12.sort@zv(custdist,c_count)
14	=interval@ms(A5,now())

Elapsed time: 0.58 seconds.

Q3 on ordered data

Python offers no way to exploit ordered data in computation, so no new Python code is given here.

SPL can use the fact that data is ordered to further improve speed. For this example, we assume customer.tbl is ordered by C_CUSTKEY, orders.tbl by O_ORDERKEY, and lineitem.tbl by L_ORDERKEY, and look at how SPL's ordered computation performs.

SPL ordered, Community Edition:


	A
1	'/svmiracle/tpchtbl1g/
2	1995-3-15
3	BUILDING
4	=file(A1/"customer_s.tbl").import@t(C_CUSTKEY,C_MKTSEGMENT ;,"|")
5	=file(A1/"orders_s.tbl").import@t(O_ORDERKEY,O_CUSTKEY,O_ORDERDATE,O_SHIPPRIORITY;,"|")
6	=file(A1/"lineitem_s.tbl").import@t(L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE;,"|")
7	=now()
8	=A4.select(C_MKTSEGMENT==A3).derive@o().keys@i(C_CUSTKEY)
9	=A5.select(O_ORDERDATE<A2).switch@i(O_CUSTKEY,A8).derive@o().keys@i(O_ORDERKEY)
10	=A6.select(L_SHIPDATE>A2).join@mi(L_ORDERKEY,A9,O_ORDERDATE,O_SHIPPRIORITY)
11	=A10.groups(L_ORDERKEY;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):REVENUE,O_ORDERDATE,O_SHIPPRIORITY)
12	=A11.top(10;-REVENUE,O_ORDERDATE)
13	=interval@ms(A7,now())

SPL elapsed time: 1.52 seconds.

SPL ordered, Enterprise Edition columnar:


	A
1	'/svmiracle/tpchtbl1g/
2	1995-3-15
3	BUILDING
4	=file(A1/"customer_s.tbl").import@t(C_CUSTKEY,C_MKTSEGMENT ;,"|").i()
5	=file(A1/"orders_s.tbl").import@t(O_ORDERKEY,O_CUSTKEY,O_ORDERDATE,O_SHIPPRIORITY;,"|").i()
6	=file(A1/"lineitem_s.tbl").import@t(L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE;,"|").i()
7	=now()
8	=A4.select@v(C_MKTSEGMENT==A3).(C_CUSTKEY)
9	=A5.select@v(O_ORDERDATE<A2&&A8.contain(O_CUSTKEY)).keys@i(O_ORDERKEY)
10	=A6.select@v(L_SHIPDATE>A2).join@mi(L_ORDERKEY,A9,O_ORDERDATE,O_SHIPPRIORITY)
11	=A10.groups(L_ORDERKEY;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):REVENUE,O_ORDERDATE,O_SHIPPRIORITY)
12	=A11.top(10;-REVENUE,O_ORDERDATE)
13	=interval@ms(A7,now())

SPL elapsed time: 0.43 seconds.

Because the data is ordered, the join can use the join@m option for a merge join, which lowers the join cost.
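The reason order helps: a merge join advances two pointers through the pre-sorted inputs in a single pass, instead of building a hash table or doing repeated lookups. A minimal Python sketch of the idea, assuming unique keys on both sides (an illustration only, not SPL's implementation):

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs, both sorted by key, in one pass."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # left key too small, advance left
        elif lk > rk:
            j += 1          # right key too small, advance right
        else:
            out.append((lk, left[i][1], right[j][1]))
            i += 1
            j += 1
    return out

orders = [(1, "1995-03-01"), (3, "1995-03-05"), (5, "1995-03-09")]
lineitems = [(1, 100.0), (2, 50.0), (3, 75.0)]
print(merge_join(orders, lineitems))  # [(1, '1995-03-01', 100.0), (3, '1995-03-05', 75.0)]
```

Each input is scanned exactly once, so the cost is linear in the total input size.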

Summary

Finding all minimum-value records after grouping in Q2, the string matching in Q13, and the ordered join in Q3 are all non-basic computations. Below is the Python vs. SPL performance comparison (unit: seconds).


	Q2	Q13	Ordered Q3
Python	1.23	2.26	1.48
SPL row-based	0.10	1.99	1.52
SPL columnar	0.05	0.58	0.43

Python cannot exploit ordered data, so its code is unchanged and its earlier elapsed time is reused.

For these non-basic operations, Python falls short: it lacks the corresponding computation methods and must take detours or fall back on general-purpose approaches, so its performance is considerably worse than SPL's.

Conclusion

1. Data reading

When reading text files, Python and SPL are not far apart, with SPL slightly ahead. SPL's own binary btx format reads much faster than Python can. When reading databases, SPL is hampered by JDBC and is at a clear disadvantage against Python.

SPL supports reading data in parallel. When row order does not matter, parallel reading brings a large speedup, widening SPL's lead on text files and offsetting its disadvantage on databases.

2. Basic operations

Python relies on Pandas for computation. Pandas' structured data object, the DataFrame, is essentially a matrix, which gives it an advantage in basic computations such as sum, group, and join. Single-threaded, Python beats SPL's row-based computation, while SPL's columnar computation beats Python; factor in SPL's parallelism, and SPL comes out ahead overall.

3. Non-basic operations

In non-basic computations, Pandas' matrix structure becomes a liability. Take the string matching in Q13: matrices are meant for numeric computation and are poor at string processing, so Q13 performs badly. Another weakness is that Pandas' set of operations is incomplete. To find all of each group's minimum-value records in Q2, Pandas must first compute the minimum and then select the records that match it, which costs an extra pass over the data. And Python has nothing for exploiting order, while SPL can use data order to boost performance: in the ordered Q3, SPL's join@m turns an expensive join into a merge join, with a far lower computational cost.