1.2 结构化数据

 

1.1 简单集合


1.2.1 常数数据表

SPL

A B C D E
1 Rebecca Male 80 1.75 1974-11-20
2 Ashley Male 60 1.68 1980-07-19
3 Rachel Female 51 1.64 1970-12-17
4 Emily Female 49 1.6 1985-03-07
5 =create(name,gender,weight,height,birthday)
6 =A5.record([A1:E4])

SQL

CREATE TABLE emp (
    name VARCHAR2(50),
    gender VARCHAR2(10),
    weight NUMBER(5,2),
    height NUMBER(5,2),
    birthday DATE);
INSERT INTO emp (name, gender, weight, height, birthday) VALUES ('Rebecca', 'Male', 80, 1.75,
    TO_DATE('1974-11-20', 'YYYY-MM-DD')); 
INSERT INTO emp (name, gender, weight, height, birthday) VALUES ('Ashley', 'Male', 60, 1.68, 
    TO_DATE('1980-07-19', 'YYYY-MM-DD'));
INSERT INTO emp (name, gender, weight, height, birthday) VALUES ('Rachel', 'Female', 51, 1.64, 
    TO_DATE('1970-12-17', 'YYYY-MM-DD'));
INSERT INTO emp (name, gender, weight, height, birthday) VALUES ('Emily', 'Female', 49, 1.6, 
    TO_DATE('1985-03-07', 'YYYY-MM-DD'));

Python

info = [["Rebecca","Male",80,1.75,"1974-11-20"],
    ["Ashley","Male",60,1.68,"1980-07-19"],
    ["Rachel","Female",51,1.64,"1970-12-17"],
    ["Emily","Female",49,1.6,"1985-03-07"]]
data = pd.DataFrame(info,columns=['name','gender','weight','height','birthday'])

1.2.2 成员取出

SPL

A
1 =file(“EMPLOYEE.csv”).import@tc()
2 =A1(3)
3 =A1([2,6,5])
4 =A1.to(2,4)
5 =A1.step(2,2)
6 =A1.m(-1)
7 =A1.m([1,3],5:7,-2)

SQL

1. 取出第 3 个成员

SELECT * FROM (
 SELECT EMPLOYEE.*, ROWNUM AS rnum
 FROM EMPLOYEE)
WHERE rnum = 3;

2. 取出第 2,6,5 个成员

SELECT * FROM (
 SELECT EMPLOYEE.*, ROWNUM AS rnum
 FROM EMPLOYEE)
WHERE rnum IN (2,5,6) 
ORDER BY CASE rnum WHEN 2 THEN 1 WHEN 6 THEN 2 WHEN 5 THEN 3 END;

3. 取出第 2 到 4 个的成员

SELECT * FROM (
 SELECT EMPLOYEE.*, ROWNUM AS rnum
 FROM EMPLOYEE)
WHERE rnum BETWEEN 2 AND 4;

4. 取出偶数位置成员

SELECT * FROM (
 SELECT EMPLOYEE.*, ROWNUM AS rnum
 FROM EMPLOYEE)
WHERE MOD(rnum,2)=0;

5. 取出最后一个成员

SELECT * FROM (
 SELECT EMPLOYEE.*, ROWNUM AS rnum
 FROM EMPLOYEE)
WHERE rnum=( SELECT COUNT(*)
 FROM EMPLOYEE);

6. 取出第 1,3 个, 第 5 到 7 个, 倒数第 2 个成员

SELECT * FROM (
 SELECT EMPLOYEE.*, ROWNUM AS rnum
 FROM EMPLOYEE)
WHERE (rnum IN (1,3)) OR (rnum BETWEEN 5 AND 7) OR (rnum=(
    SELECT COUNT(*)-1
    FROM EMPLOYEE));

Python

emp = pd.read_csv('EMPLOYEE.csv')
result1 = emp.iloc[2]
result2 = emp.iloc[[1, 5, 4]]
result3 = emp.iloc[1:4]
result4 = emp.iloc[1::2]
result5 = emp.iloc[-1]
result6 = emp.iloc[[0, 2, *range(4, 7), -2]]

1.2.3 字段引用

1. 记录取出字段

2. 取出多个字段

3. 动态字段名取出字段

4. 按位置取出字段

5. 取出全部字段

6. 记录集合取出字段

SPL

A
1 =file(“EMPLOYEE.csv”).import@tc()
2 =A1(1)
3 =A2.NAME
4 =A2.([NAME,GENDER])
5 =A2.field(“NAME”)
6 =A2.([#2,#3])
7 =A2.array()
8 =A1.(NAME)

SQL

SQL 虽有记录这个概念,但记录并不能独立存在,只能以只有一条记录的新表存在。所以 SQL 不能根据记录取出字段。

Python

emp = pd.read_csv('EMPLOYEE.csv')
emp1 = emp.iloc[0]
result1 = emp1.NAME
result2 = emp1[['NAME', 'GENDER']].values
result3 = emp1['NAME']
result4 = emp1.iloc[[1,2]].values
result5 = emp1.values
result6 = emp.NAME

1.2.4 记录比较

1. 不同数据表记录比较指针

2. 不同数据表记录比较值

SPL

A B
1 =file(“EMPLOYEE.csv”).import@tc(EID,NAME,SALARY)
2 =file(“EMPLOYEE.csv”).import@tc(EID,NAME,SALARY)
3 =file(“EMPLOYEE.csv”).import@tc(EID,NAME)
4 =A1(1)==A2(1) /false
5 =cmp(A1(1),A2(1)) /0
6 =cmp(A1(1),A2(2)) /-1
7 =cmp(A1(1),A3(1)) /1

A4: 不同序表记录比较指针

A5-A7:不同序表记录比较值

SQL

SQL 不支持记录直接比较。

Python

Pandas 不支持记录直接比较,需要转成 list 比较。

1.2.5 集合运算

1. 交集

2. 差集

3. 并集

4. 和集

SPL

A B
1 =file(“EMPLOYEE.csv”).import@tc()
2 =A1.select(GENDER==“F”)
3 =A1.select(DEPT==“Sales”)
4 =A2^A3 =[A2,A3].isect()
5 =A2\A3 =[A2,A3].diff()
6 =A2&A3 =[A2,A3].union()
7 =A2|A3 =[A2,A3].conj()

SQL

1. 交集

SELECT * FROM EMPLOYEE
WHERE GENDER = 'F'
INTERSECT
SELECT * FROM EMPLOYEE
WHERE DEPT = 'Sales';

2. 差集

SELECT * FROM EMPLOYEE
WHERE GENDER = 'F'
MINUS
SELECT * FROM EMPLOYEE
WHERE DEPT = 'Sales';

3. 并集

SELECT * FROM EMPLOYEE
WHERE GENDER = 'F'
UNION
SELECT * FROM EMPLOYEE
WHERE DEPT = 'Sales';

4. 和集

SELECT * FROM EMPLOYEE
WHERE GENDER = 'F'
UNION ALL
SELECT * FROM EMPLOYEE
WHERE DEPT = 'Sales';

Python

Pandas 没有记录构成的集合数据类型。

1.2.6 泛型数据表

1) 字段值是集合

SPL

A B C
1 1 91993.67 [82]
2 2 96754.54 [88,12]
3 3 28409.55 [73,71]
4 4 32972.12 [29,82]
5 5 51869.75 [60,1]
6 =create(order_id,order_price,part_no)
7 =A6.record([A1:C5])
8 =A6(2).part_no(1) /88

可以把上表看作订单表的一部分,包括订单号,订单价格和该订单下的零件编号。

A7 结果:

SQL

CREATE OR REPLACE TYPE NUMBER_LIST_TYPE AS VARRAY(5) OF NUMBER;
--之前创建过这个类型了,这里可以不创建。
CREATE TABLE order_gen (ORDER_ID NUMBER,ORDER_PRICE NUMBER,PART_NO NUMBER_LIST_TYPE);
INSERT ALL
    INTO order_gen (ORDER_ID,ORDER_PRICE,PART_NO) VALUES (1,91993.67,number_list_type (82))
    INTO order_gen (ORDER_ID,ORDER_PRICE,PART_NO) VALUES (2,96754.54,number_list_type (88,12))
    INTO order_gen (ORDER_ID,ORDER_PRICE,PART_NO) VALUES (3,28409.55,number_list_type (73,71))
    INTO order_gen (ORDER_ID,ORDER_PRICE,PART_NO) VALUES (4,32972.12,number_list_type (29,82))
    INTO order_gen (ORDER_ID,ORDER_PRICE,PART_NO) VALUES (5,51869.75,number_list_type (60,1))
SELECT * FROM dual;
SELECT COLUMN_VALUE FROM TABLE(
    SELECT PART_NO FROM (
        SELECT PART_NO, ROWNUM AS rnum FROM order_gen) 
    WHERE rnum = 2) 
WHERE ROWNUM = 1;

表 order_gen 结果如下:

ORDER_ID    ORDER_PRICE     PART_NO
-----------------------------------------------------------------------
 1          91993.67        NUMBER_LIST_TYPE(82)
 2          96754.54        NUMBER_LIST_TYPE(88, 12)
 3          28409.55        NUMBER_LIST_TYPE(73, 71)
 4          32972.12        NUMBER_LIST_TYPE(29, 82)
 5          51869.75        NUMBER_LIST_TYPE(60, 1)

Python

info = [[1,91993.67,[82]],
    [2,96754.54,[88,12]],
    [3,28409.55,[73,71]],
    [4,32972.12,[29,82]],
    [5,51869.75,[60,1]]]
data = pd.DataFrame(info,columns=['order_id','order_price','part_no'])
data_list1 = data.iloc[1].part_no[0]        #88

data 结果:

2) 字段值是记录

SPL

A
1 [{“order_id”:1,“part_no”:[47,10],“cust_info”:{“cust_id”:10001,“gender”:“M”,“age”:31}},{“order_id”:2,“part_no”:[46,30,23],“cust_info”:{“cust_id”:10005,“gender”:“F”,“age”:35}},{“order_id”:3,“part_no”:[94,91,91],“cust_info”:{“cust_id”:10005,“gender”:“F”,“age”:35}},{“order_id”:4,“part_no”:[19,62],“cust_info”:{“cust_id”:10002,“gender”:“F”,“age”:28}},{“order_id”:5,“part_no”:[9,68],“cust_info”:{“cust_id”:10005,“gender”:“F”,“age”:35}},{“order_id”:6,“part_no”:[67,20,12],“cust_info”:{“cust_id”:10005,“gender”:“F”,“age”:35}},{“order_id”:7,“part_no”:[74],“cust_info”:{“cust_id”:10001,“gender”:“M”,“age”:31}},{“order_id”:8,“part_no”:[34],“cust_info”:{“cust_id”:10002,“gender”:“F”,“age”:28}},{“order_id”:9,“part_no”:[49],“cust_info”:{“cust_id”:10003,“gender”:“M”,“age”:30}},{“order_id”:10,“part_no”:[99,43,63],“cust_info”:{“cust_id”:10005,“gender”:“F”,“age”:35}}]
2 =A1(2).cust_info
3 =A2.gender

SPL 支持把 json 格式的数据识别成序表,A1 结果如下:

A2 是条记录:

A3 是取 cust_info 的 gender 字段:

SQL

SQL 很难直接根据 json 串生成带记录的表。

Python

data=[
    {"order_id":1,"part_no":[47,10],"cust_info":{"cust_id":10001,"gender":"M","age":31}},
    {"order_id":2,"part_no":[46,30,23],"cust_info":{"cust_id":10005,"gender":"F","age":35}},
    {"order_id":3,"part_no":[94,91,91],"cust_info":{"cust_id":10005,"gender":"F","age":35}},
    {"order_id":4,"part_no":[19,62],"cust_info":{"cust_id":10002,"gender":"F","age":28}},
    {"order_id":5,"part_no":[9,68],"cust_info":{"cust_id":10005,"gender":"F","age":35}},
    {"order_id":6,"part_no":[67,20,12],"cust_info":{"cust_id":10005,"gender":"F","age":35}},
    {"order_id":7,"part_no":[74],"cust_info":{"cust_id":10001,"gender":"M","age":31}},
    {"order_id":8,"part_no":[34],"cust_info":{"cust_id":10002,"gender":"F","age":28}},
    {"order_id":9,"part_no":[49],"cust_info":{"cust_id":10003,"gender":"M","age":30}},
    {"order_id":10,"part_no":[99,43,63],"cust_info":{"cust_id":10005,"gender":"F","age":35}}
]
df=pd.DataFrame(data)
cust_info_rec=df.iloc[1].cust_info      #{'cust_id':10005,'gender':'F','age':35}
rec_gender=cust_info_rec['gender']      #F

Python 把 cust_info 作为字典来生成 DataFrame,df 结果:

DataFrame 中每一行其实是一个 Series,而这里的 cust_info 的数据格式只是一个字典,不再是 Series,所以只能用字典的规则来获取 gender,即 rec_gender=cust_info_rec[‘gender’],而不能用 Series 的方式取数,比如 cust_info_rec.gender。

3) 字段值是序表

现有 json 文件如下图,请将文件读成表:

SPL

A
1 =file(“table_data.txt”).read()
2 =json(A1)
3 =A2(3).detail
4 =A3(2)
5 =A4.cust_id

A1 读取 json 文件;

A2 将 json 串转换成序表,结果如下:

A3:第 3 条记录的 detail 信息:

A4:第三条记录 detail 信息中第 2 条记录

A5:第三条记录 detail 信息中第 2 条记录的 cust_id 字段。

SQL

SQL 很难直接用 json 文件生成带子数据表的表。

Python

df=pd.read_json('table_data.txt')
detail_info_table=df.loc[2].detail      #第3条记录的detail信息
detail_2=detail_info_table[1]           #第三条记录detail信息中第2条
detail_2_cust_id=detail_2['cust_id']    #第三条记录detail信息中第2条的cust_id字段

Python 把 detail 读成字典的列表,df 结果如下:

detail_info_table:第 3 条记录 detail 的信息:

[{'detail_id': 1001, 'cust_id': 14388, 'sale_id': 109}, 
{'detail_id': 1002, 'cust_id': 14768, 'sale_id': 187}, 
{'detail_id': 1003, 'cust_id': 14391, 'sale_id': 186}]

detail_2:detail_info_table 中的第 2 条:

{'detail_id': 1002, 'cust_id': 14768, 'sale_id': 187}

detail_2_cust_id 是 detail_2 中的 cust_id:

14768

和上例类似,detail_info_table 是一个字典的列表,并不是 DataFrame 这一数据结构。所以在获取 detail_info_table 中的第 2 条信息时,只能用 list 的取数方式 detail_info_table[1] 得到,而不能用 detail_info_table.loc[1] 得到,得到的结果 detail_2 是一个字典,而不是 Seires,所以在获取 detail_2 中的 cust_id 时,只能用字典的取数方式 detail_2[‘cust_id’] 得到,而不能用 detail_2.info 得到。

1.2.7 泛型记录序列

员工表和家属表结构不同,但都有 GENDER 字段,请统计员工和家属中,有多少女性。

SPL

A
1 =file(“EMPLOYEE.csv”).import@tc()
2 =file(“FAMILY.csv”).import@tc()
3 =A1|A2
4 =A3.count(left(GENDER,1)==“F”)

SPL 允许来自不同序表的记录构成新集合,比如 A3,还可以对新集合进行计算。

A3 的结果是记录的集合,集合的每个成员都是一条记录:

SQL

SQL 不允许两张不同结构的表直接合并。

Python

emp = pd.read_csv("EMPLOYEE.csv")
fam = pd.read_csv("FAMILY.csv")
df = pd.concat([emp,fam])
cnt = len(df[df['GENDER'].str[0]=='F'])

Python 不支持不同结构的 DataFrame 简单合并,合并时要把两个表的数据结构调整到相同。

df 结果:


2.1 当前值引用
SPL SQL Python 代码示例对比