3.1 等值分组

 

2.5 循环产生和执行


1. 单列分组单聚合——各部门员工人数

2. 分组唯一计数——各部门员工来自几个州

3. 单列分组多聚合——各部门员工人数、平均工资和最大年龄

4. 多列分组单聚合——各部门各州员工的平均工资

5. 多列分组多聚合——各部门各州员工的平均工资、最高工资、平均年龄

6. 分组不聚合——找出生日是同一天的员工们(不含年)

7. 分组复杂聚合——各部门年龄最大的员工和年龄最小员工的工资差

SPL

A B
1 =file(“EMPLOYEE.csv”).import@tc()
2 =A1.groups(DEPT;count(1):emp_no) / 单列分组单聚合
3 =A1.groups(DEPT;icount(STATE)) / 分组唯一计数
4 =A1.groups(DEPT;count(1):emp_no,avg(SALARY):avg_salary,max(age(BIRTHDAY)):max_age) / 单列分组多聚合
5 =A1.groups(DEPT,STATE;avg(SALARY):avg_salary) / 多列分组单聚合
6 =A1.groups(DEPT,STATE;avg(SALARY):avg_salary,max(SALARY):max_salary,avg(age(BIRTHDAY)):avg_age) / 多列分组多聚合
7 =A1.group(month(BIRTHDAY),day(BIRTHDAY)).select(~.len()>1).conj() / 分组不聚合
8 =A1.group(DEPT;(ma=~.minp(BIRTHDAY),mi=~.maxp(BIRTHDAY), ma.SALARY-mi.SALARY):SALARY_DIF) / 分组复杂聚合

SQL

1. 单列分组单聚合

SELECT DEPT, COUNT(*) AS NUM_EMPLOYEES
FROM EMPLOYEE
GROUP BY DEPT;

2. 分组唯一计数

SELECT DEPT, COUNT(DISTINCT STATE) AS NUM_STATES
FROM EMPLOYEE
GROUP BY DEPT;

3. 单列分组多聚合

SELECT DEPT, COUNT(*) AS NUM_EMPLOYEES, AVG(SALARY) AS AVG_SALARY, 
    MAX(TRUNC(MONTHS_BETWEEN(SYSDATE, BIRTHDAY)/12)) AS MAX_AGE
FROM EMPLOYEE
GROUP BY DEPT;

4. 多列分组单聚合

SELECT DEPT, STATE, AVG(SALARY) AS AVG_SALARY
FROM EMPLOYEE
GROUP BY DEPT, STATE;

5. 多列分组多聚合

SELECT DEPT, STATE, AVG(SALARY) AS AVG_SALARY, MAX(SALARY) AS MAX_SALARY, 
    AVG(TRUNC(MONTHS_BETWEEN(SYSDATE, BIRTHDAY)/12)) AS AVG_AGE
FROM EMPLOYEE
GROUP BY DEPT, STATE;

6. 分组不聚合

SQL 没有分组不聚合,所以 SQL 只能通过其他方法绕路来做。

7. 分组复杂聚合

WITH MAX_AGE_EMPLOYEE AS (
SELECT DEPT, BIRTHDAY, SALARY
FROM (
    SELECT DEPT,BIRTHDAY,SALARY, 
        ROW_NUMBER() OVER (PARTITION BY DEPT ORDER BY BIRTHDAY ASC) AS RN
    FROM EMPLOYEE)
WHERE RN = 1), 
MIN_AGE_EMPLOYEE AS (
SELECT DEPT, BIRTHDAY, SALARY
FROM (
    SELECT DEPT,BIRTHDAY,SALARY, 
        ROW_NUMBER() OVER (PARTITION BY DEPT ORDER BY BIRTHDAY DESC) AS RN
    FROM EMPLOYEE)
WHERE RN = 1)
SELECT MAX_AGE_EMPLOYEE.DEPT, 
    MAX_AGE_EMPLOYEE. SALARY - MIN_AGE_EMPLOYEE.SALARY AS SALARY_DIFF
FROM MAX_AGE_EMPLOYEE
JOIN MIN_AGE_EMPLOYEE ON MAX_AGE_EMPLOYEE.DEPT = MIN_AGE_EMPLOYEE.DEPT;

Python

emp = pd.read_csv('../EMPLOYEE.csv')
#单列分组单聚合
dept_counts = emp['DEPT'].value_counts()
#分组唯一计数
dept_state_counts = emp.groupby('DEPT')['STATE'].nunique()
#单列分组多聚合
dept_stats = emp.groupby('DEPT').agg(
    count=('DEPT', 'count'),
    avg_salary=('SALARY', 'mean'),
    max_age=('BIRTHDAY', lambda x: 
        (pd.Timestamp('now') - pd.to_datetime(x)).astype('<m8[Y]').max()))
#多列分组单聚合
dept_state_salary = emp.groupby(['DEPT', 'STATE'])['SALARY'].mean()
#多列分组多聚合
dept_state_stats = emp.groupby(['DEPT', 'STATE']).agg(
    avg_salary=('SALARY', 'mean'),
    max_salary=('SALARY', 'max'),
    avg_age=('BIRTHDAY', lambda x: 
        (pd.Timestamp('now') - pd.to_datetime(x)).astype('<m8[Y]').mean()))
birthday_md=pd.to_datetime(emp['BIRTHDAY']).dt.strftime('%m-%d').sort_values()
#分组不聚合
emp_same_birthday=emp.loc[birthday_md.index]
    [emp.groupby(birthday_md).BIRTHDAY.transform('count')>1]
#分组复杂聚合
def salary_diff(g):
    max_age = g['BIRTHDAY'].idxmin()
    min_age = g['BIRTHDAY'].idxmax()
    diff = g.loc[max_age]['SALARY']-g.loc[min_age]['SALARY']
    return diff
emp['BIRTHDAY']=pd.to_datetime(emp['BIRTHDAY'])
salary_diff=emp.groupby('DEPT').apply(salary_diff) 

3.2 非等值分组
SPL SQL Python 代码示例对比