没有 RDB 也敢揽 SQL 活的开源金刚钻 SPL
SQL语法接近自然语言,上手学习门槛低,再加上先发优势的加成,很快就在数据库厂商和用户间流行开来。经过多年的发展,SQL已经成为应用最广、用户最多、最成熟的结构化数据计算语言。
但是,SQL必须基于RDB工作,而很多场景下并没有RDB,比如遇到csv\restful json\MongoDB等数据源,或进行这些数据源之间的混合计算,比如csv和xls之间。在这些场景下,很多人会选择用JAVA或C#等高级语言硬写算法,这要从头编写冗长的底层函数,执行效率也很难保证,很容易堆积出人人痛恨的“代码屎山”。也有人会把数据写入数据库,再用SQL进行计算,但入库的过程非常繁琐,实时性也很差,有时还要求助于ETL工具,架构重上加重,风险增了又增,遇到混合计算更是加倍的麻烦。
现在好了,集算器SPL来了,这些问题都可以轻松解决了。
SPL是开源的计算技术,完全覆盖了SQL的计算能力,支持种类繁多的数据源,没有RDB也可以用SQL进行结构化数据计算了。
完善的SQL计算能力
SPL提供了相当于SQL92标准的语法,可以进行足够丰富多样的数据计算,包括过滤、计算列、选择部分列、改名等等,可以直接把文本、xls等文件当成数据表来执行SQL。下面以csv文件作为数据源为例说明:
1. 过滤
基本的比较运算:
$select * from d:/Orders.csv where Amount>=100
like:
$select * from d:/Orders.csv where Client like '%bro%'
空值判断:
$select * from d:/Orders.csv where Client is null
与、或、非这样的逻辑运算符可以把比较运算组合起来,实现组合过滤:
$select * from d:/Orders.csv
where not Amount>=100 and Client like 'bro' or OrderDate is null
in:
$select * from d:/Orders.csv where Client in ('TAS','KBRO','PNS')
多层括号:
$select * from d:/Orders.csv
where (OrderDate<date('2020-01-01') and Amount<=100) or (OrderDate>=date('2020-12-31') and Amount>100)
2. 计算列
SPL有丰富的数学函数、字符串函数、日期函数:
$select round(Amount,2), price*quantity from d:/Orders.csv
$select left(Client,4) from d:/Orders.csv
$select year(OrderDate) from d:/Orders.csv
case when:
$select case year(OrderDate) when 2021 then 'this year' when 2020 then 'last year' else 'previous years' end
from d:/Orders.csv
coalesce反显空值:
$select coalesce(Client,'unknown') from d:/Orders.csv
3. SELECT
$select OrderId, Amount, OrderDate from d:/Orders.csv
4. ORDER BY
$select * from d:/Orders.csv order by Client, Amount desc
5. DISTINCT
$select distinct Client ,Sellerid from d:/Orders.csv
6. GROUP BY … HAVING
$select year(OrderDate),Client ,sum(Amount),count(1) from d:/Orders.csv
group by year(OrderDate),Client having sum(Amount)<=100
聚合函数包括sum、count、avg、max、min,不分组也可以直接汇总:
$select avg(Amount) from d:/Orders.csv
7. JOIN
左关联:
$select o.OrderId,o.Client,e.Name e.Dept,e.EId from d:/Orders.txt o
left join d:/Employees.txt e on o.SellerId=e.Eid
右关联:
$select o.OrderId,o.Client,e.Name e.Dept,e.EId from d:/Employees.txt e
right join d:/Orders.txt o on o.SellerId=e.Eid
全关联:
$select o.OrderId,o.Client,e.Name e.Dept,e.EId from d:/Employees.txt e
full join d:/Orders.txt o on o.SellerId=e.EId
内关联:
$select o.OrderId,o.Client,e.Name e.Dept from d:/Orders.csv o
inner join d:/Employees.csv e on o.SellerId=e.Eid
内关联还可以写成WHERE形式:
$select o.OrderId,o.Client,e.Name e.Dept from d:/Orders.csv o ,d:/Employees.csv e
where o.SellerId=e.Eid
8. 子查询
$select t.Client, t.s, ct.Name, ct.address from
(select Client ,sum(amount) s from d:/Orders.csv group by Client)
left join ClientTable ct on t.Client=ct.Client
with:
$with t as (select Client ,sum(amount) s from d:/Orders.csv group by Client)
select t.Client, t.s, ct.Name, ct.address from t
left join ClientTable ct on t.Client=ct.Client</p>
In中的子查询:
$select * from d:/Orders.txt o where o.sellerid in (select eid from d:/Employees.txt)
9. AS
使用as关键字,可对字段、计算列、物理表、子查询改名:
$select price*quantity as subtotal from d:/detail.csv
10. 集合运算
union、union all、intersect、minus都有,举一例:
$select * from Orders1.csv union all select * from Orders2.csv
11. into输出结果
查询结果可用into关键字写入文件:
$select dept,count(1) c,sum(salary) s into deptResult.xlsx
from employee.txt group by dept having >100000
丰富的数据源支持
SPL支持各种非数据库的数据源,包括各种非标准格式的文本,前面例子中已展示过csv。TAB分隔的txt也一样可以支持,SPL会根据扩展名自动处理:
$select * from d:/Orders.txt where Amount>=100 and Client like 'bro' or OrderDate is null
如果分隔符不是逗号和tab,就要用SPL扩展函数处理了,比如分隔符是冒号:
$select * from {file("d:/Orders.txt").import@t(;":")}
where Amount>=100 and Client like 'bro' or OrderDate is null
没有标题行的文件,可以用序号表示列名:
$select * from {file("d:/Orders.txt").import()} where _4>=100 and _2 like 'bro' or _5 is null
某些特殊格式的字符串也要用扩展函数解析,比如日期格式不是标准的yyyy-MM-dd:
$select year(OrderDate),sum(Amount) from
{file("d:/Orders.txt").import@t(orderid,client,sellerid,amount,orderdate📅"dd-MM-yyyy")}
group by year(OrderDate)
Excel文件上也可以执行SQL,对于格式规范的Excel,只需直接引用文件名:
$select * from d:/Orders.xlsx where Amount>=100 and Client like 'bro' or OrderDate is null
可以读取指定sheet:
$select * from {file("D:/Orders.xlsx").xlsimport@t(;"sheet3")}
where Amount>=100 and Client like 'bro' or OrderDate is null
从远程网站下载来的csv/xls文件:
$select * from { httpfile("http://127.0.0.1:6868/Orders.csv").import@tc() }
where Amount>=100 and Client like 'bro' or OrderDate is null
HTTP协议的特性很多,比如字符集、端口号、post参数、header参数、登录认证等等,SPL扩展函数都可以支持。扩展函数还可以抓取网页中的表格数据,也支持从FTP服务器下载文件,这里不再赘述。
Json文件要先读为字符串再解析:
$select * from {json(file("d:\\data.json").read())}
where Amount>=100 and Client like 'bro' or OrderDate is null
二维Json比较少,多层才是常态,SPL扩展函数可以把多层数据转为二维记录,再用SQL计算,这里就不展开了。
Restful json
$select * from {json(httpfile("http://127.0.0.1:6868/api/getData").read())}
where Amount>=100 and Client like 'bro' or OrderDate is null
扩展函数如果比较多比较长,可以写成分步形式:
A |
|
1 |
=httpfile("http://127.0.0.1:6868/api/getData") |
2 |
=A1.read() |
3 |
=json(A2) |
4 |
$select * from {A3} where Amount>=100 and Client like 'bro' or OrderDate is null |
类似csv/xls,SPL也可以读取HTTP网站上的json/xml文件。
XML
$select * from {xml(file("d:/data.xml").read(),"xml/row")}
where Amount>=100 and Client like 'bro' or OrderDate is null
Web Service
$select * from {ws_call(ws_client("http://....../entityWS.asmx?wsdl"),"entityWS ":" entityWSSoap":"getData")}
where Amount>=100 and Client like 'bro' or OrderDate is null
NoSQL也不在话下。
MongoDB
$select * from<br>{mongo_shell@x(mongo_open("mongodb://127.0.0.1:27017/mongo"),"main.find()")}
where Amount>=100 and Client like 'bro' or OrderDate is null
MongoDB经常是多层数据,包括前面的restful、web Service,用SPL扩展函数都可以转为二维。
Salesforce
$select * from { sf_query(sf_open(),"/services/data/v51.0/query","Select Id,CaseNumber,Subject From Case where Status='New'")}
where Amount>=100 and Client like 'bro' or OrderDate is null
Hadoop HDFS上的csv/xls/json/xml:
A |
|
1 |
=hdfs_open(;"hdfs://192.168.0.8:9000") |
2 |
=hdfs_file(A1,"/user/Orders.csv":"GBK") |
3 |
=A2.import@t() |
4 |
=hdfs_close(A1) |
5 |
$select Client,sum(Amount) from {A3} group by Client |
HBase也支持:
A |
|
1 |
=hbase_open("hdfs://192.168.0.8", "192.168.0.8") |
2 |
=hbase_scan(A1,"Orders") |
3 |
=hbase_close(A1) |
4 |
$select Client,sum(Amount) from {A2} group by Client |
Hbase还有filter、cmp等取数方式,SPL都可以支持。
Hive有公共的JDBC接口,但性能较差,SPL提供了高性能接口:
A |
|
1 |
=hive_client("hdfs://192.168.0.8:9000","thrift://192.168.0.8:9083","hive","asus") |
2 |
=hive_query(A1, "select * from table") |
3 |
=hive_close() |
4 |
$select Client,sum(Amount) from {A2} group by Client |
Spark
A |
|
1 |
=spark_client("hdfs://192.168.0.8:9000","thrift://192.168.0.8:9083","aa") |
2 |
=spark_query(A1,"select * from tablename") |
3 |
=spark_close(A1) |
4 |
$select Client,sum(Amount) from {A2} group by Client |
阿里云
A |
|
1 |
=ali_open("http://test.ots.aliyuncs.com","LTAIXZNG5zzSPHTQ","sa","test") |
2 |
=ali_query@x(A1,"test",["id1","id2"],[1,"10001"]:[10,"70001"], ["id1","id2","f1","f2"],f1>=2000.0) |
3 |
$select Client,sum(Amount) from {A2} group by Client |
Cassandra
A |
|
1 |
=stax_open("127.0.0.1":9042,"mycasdb","cassandra":"cassandra") |
2 |
=stax_query(A1,"select * from user where id=?",1) |
3 |
=stax_close(A1) |
4 |
$select Client,sum(Amount) from {A2} group by Client |
ElasticSearch
A |
|
1 |
=es_open("localhost:9200","user":"un1234") |
2 |
=es_get(A1,"/person/_mget","{\"ids\":[\"1\",\"2\",\"5\"]}") |
3 |
=es_close(A1) |
4 |
$select Client,sum(Amount) from {A2} group by Client |
Redis
A |
|
1 |
=redis_open() |
2 |
=redis_hscan(A1, "runoobkey", "v*", 3) |
3 |
=redis_close (A1) |
4 |
$select key,value from {A2} where value>=2000 and value<3000 |
SAP BW
A |
|
1 |
=sap_open("userName","passWord","192.168.0.188","00","000",”E") |
2 |
=sap_cursor(A1, "Z_TEST1","IT_ROOM").fetch() |
3 |
=sap_close(A1) |
4 |
$select * from {A2} where Vendor like '%software%' |
InfluxDB
A |
|
1 |
=influx_open("http://127.0.0.1:8086", "mydb", "autogen", "admin", "admin") |
2 |
=influx_query(A1, "SELECT * FROM Orders") |
3 |
=influx_close(A1) |
4 |
$select Client,sum(Amount) from {A2} group by Client |
Kafka
A |
|
1 |
=kafka_open("D://kafka.properties";"topic-test") |
2 |
=kafka_poll(A1) |
3 |
=kafka_close (A1) |
4 |
$select Client,sum(Amount) from {A2} group by Client |
MDX多维数据库
A |
|
1 |
=olap_open("http://192.168.0.178:8088/msmdpump.dll","CubeTest","Administrator","admin") |
2 |
=olap_query(A1,"with member [Measures].[AnnualInterestRate] as'[Measures].[SalesAmount]/[Measures].[StandardCost]-1'select {[Measures].[SalesAmount],[Measures].[StandardCost],[Measures].[ AnnualInterestRate]} on columns, {[Order Date].[Calendar Year].[Calendar Year]} on rows from [DataSourceMulti]") |
3 |
=olap_close(A1) |
4 |
$select * from {A2} where SalesAmount>10000 |
SPL除了支持种类繁多的数据源,也可以进行数据源之间的混合计算。比如csv和RDB之间:
$select o.OrderId,o.Client,e.Name e.Dept from d:/Orders.csv o
inner join d:/Employees.xls e on o.SellerId=e.Eid
MongoDB和数据库之间:
A |
B |
|
1 |
=mongo_open("mongodb://127.0.0.1:27017/mongo") |
|
2 |
=mongo_shell@x(A1,"detail.find()").fetch() |
=connect("orcl").query@x("select * from main") |
3 |
$select d.title,m.path,sum(d.amount) from {A2} as d left join {B2} as m on d.cat=m.cat group by d.title, m.path |
任意数据源之间都可以进行混合计算,而且SQL语法不受数据源的影响。
更深入的计算能力
其实,SPL的本意是Structure Process Language,是一种专门用于结构化数据处理的语言,在前面的例子中也已经展示了部分SPL本身的语法(那些扩展函数)。SQL只是SPL顺带提供的一种功能, SPL本身还拥有比SQL更强大便捷的计算能力。有些计算逻辑比较复杂,用SQL甚至存储过程都很难写,而用SPL则可以用简单的代码完成计算。
比如这个任务,计算某支股票最长的连续上涨天数,SQL要用多层嵌套的子查询和窗口函数,代码冗长难懂:
select max(continuousDays)-1
from (select count(*) continuousDays
from (select sum(changeSign) over(order by tradeDate) unRiseDays
from (select tradeDate,
case when price>lag(price) over(order by tradeDate) then 0 else 1 end changeSign
from AAPL) )
group by unRiseDays)
而SPL 只需两行:
A |
B |
|
1 |
=T("d:/AAPL.xlsx") |
读Excel文件,首行为列名 |
2 |
=a=0,A1.max(a=if(price>price[-1],a+1,0)) |
求最长连续上涨天数 |
对于简单的运算,使用基本的SQL是很方便的,但运算需求变复杂时,SQL就不适用了,即使提供更多的功能(比如窗口函数)也不能简化计算。这种情况下,我们推荐用户直接使用代码简洁的SPL,而不必再写多层嵌套的复杂SQL了。基于这个原因,SPL中的SQL也只支持到SQL92标准,没有提供包括窗口函数在内的更多语法。
SQL不提倡多步骤计算,习惯于把一个计算任务写在一个大语句中,这样会加大任务的难度。而SPL则天生支持多步骤计算,可以将复杂的大计算目标方便地分解为简单的小目标,这会大大降低写代码的难度。比如,找出销售额累计占到一半的前n个大客户,并按销售额从大到小排列:
A |
B |
|
1 |
= T("D:/data/sales.csv").sort(amount:-1) |
取数并逆序排序 |
2 |
=A1.cumulate(amount) |
计算累计序列 |
3 |
=A2.m(-1)/2 |
最后的累计值即是总和 |
4 |
=A2.pselect(~>=A3) |
超过一半的位置 |
5 |
=A1(to(A4)) |
按位置取值 |
灵活的应用结构
SPL该怎么用呢?
对于交互式的计算分析,SPL有专业的IDE,不仅具有完整的调试功能,还可以用表格直观观察每一步的中间结算结果:
SPL还支持命令行执行,任何主流操作系统都可以:
D:\raqsoft64\esProc\bin>esprocx.exe -R select Client,sum(Amount) from d:/Orders.csv group by Client
Log level:INFO
ARO 899.0
BDR 4278.8
BON 2564.4
BSF 14394.0
CHO 1174.0
CHOP 1420.0
DYD 1242.0
…
对于应用程序中的计算,SPL提供了标准的JDBC驱动,可以方便地集成进JAVA:
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
PrepareStatement st = conn.prepareStatement("$select * from employee.txt where SALARY >=? and SALARY<?");
st.setObject(1, 3000);<br>st.setObject(2, 5000);<br>ResultSet result=st.execute();
应用中会有些频繁修改或较为复杂的计算,SPL允许将代码外置于JAVA程序,可显著降低代码耦合性。比如上面的SPL代码可以先存为脚本文件,再在JAVA中以存储过程的形式调用:
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
Statement st = connection.();
CallableStatement st = conn.prepareCall("{call getQuery(?, ?)}");
st.setObject(1, 3000);
st.setObject(2, 5000);
ResultSet result=st.execute();
有了SPL这样的开源神器,没有RDB也能轻松使用SQL了。
英文版