Continuous Data Exploration

 

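Continuous data is quantitative data. Quantitative data is usually analyzed along four dimensions: measures of central tendency, measures of dispersion, measures of relative position, and measures of symmetry.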


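For example, the Titanic dataset contains a continuous variable, Age, which records each passenger's age. The following script explores it: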


|    | A | B |
|----|---|---|
| 1  | =file("D://titanic.csv").import@qtc() | |
| 2  | =A1.(Age) | |
| 3  | =A2.max() | |
| 4  | =A2.min() | |
| 5  | =A2.avg() | |
| 6  | =A2.mode() | |
| 7  | =A2.median() | |
| 8  | =A2.median(1:4) | |
| 9  | =A2.median(3:4) | |
| 10 | =var@s(A2) | |
| 11 | =sqrt(A10) | |
| 12 | =A2.skew() | |
| 13 | =A2.se() | |
| 14 | 8 | |
| 15 | =(A2.max()-A2.min())/A14 | |
| 16 | =A14.([(~-1)*A15+A2.min(),~*A15+A2.min()]) | |
| 17 | =A16.new(~:group,(~(1)+~(2))/2:group_median, if(#==A16.len(),count(A2.(~>=group(1)&&~<=group(2))),count(A2.(~>=group(1)&&~<group(2)))):count) | |
| 18 | =canvas() | |
| 19 | =A18.plot("EnumAxis","name":"x") | |
| 20 | =A18.plot("NumericAxis","name":"y","location":2) | |
| 21 | =A18.plot("Column","text":A17.(count),"axis1":"x","data1":A17.(string(group_median)),"axis2":"y","data2":A17.(count)) | |
| 22 | =A18.draw@p(800,450) | |
| 23 | | |
| 24 | =A1.impute("Age") | [0.25,0.5,0.75] |
| 25 | =A24(1).sort() | =A25(1) |
| 26 | =A25.(#/A25.len()) | =A25.m(-1) |
| 27 | =canvas() | |
| 28 | =A27.plot("NumericAxis","name":"x","autoCalcValueRange":false,"maxValue":1,"scaleNum":10,"allowRegions":false) | |
| 29 | =A27.plot("NumericAxis","name":"y","location":2,"autoCalcValueRange":false,"autoRangeFromZero":false,"maxValue":A25.m(-1),"minValue":A25(1)) | |
| 30 | =A27.plot("Line","lineColor":-16776961,"markerWeight":1,"axis1":"x","data1":A26,"axis2":"y","data2":A25) | |
| 31 | for B24 | =A27.plot("Line","lineStyle":2,"lineColor":-65281,"markerWeight":-1,"axis1":"x","data1":[A31,A31],"axis2":"y","data2":[B25,B26]) |
| 32 | =A27.draw@p(800,400) | |

A2 extracts the Age column. A3-A9 compute its basic statistics: maximum, minimum, mean, mode, median, and the first and third quartiles.
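As a cross-check outside the grid script, the statistics in A3-A9 can be sketched in Python with the standard library. This is a minimal sketch, not the script's own method: the `ages` list is made-up sample data, `summarize` is a hypothetical helper, and the quartiles use the `inclusive` interpolation rule of `statistics.quantiles`, which may differ slightly from the rule behind `median(1:4)` and `median(3:4)`.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics, skipping missing (None) entries."""
    data = [v for v in values if v is not None]
    # Quartiles via linear interpolation between data points ("inclusive" rule).
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "max": max(data),
        "min": min(data),
        "mean": statistics.mean(data),
        "mode": statistics.mode(data),
        "median": statistics.median(data),
        "q1": q1,
        "q3": q3,
    }


# Hypothetical ages (not taken from titanic.csv), with one missing value.
ages = [22, 38, 26, 35, None, 35, 54, 2, 27]
summary = summarize(ages)
print(summary)
```

Comparing a summary like this with the values in A3-A9 is a quick way to notice when a quartile convention or the handling of missing values differs between tools.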

A10-A13 compute the variance, standard deviation, skewness, and standard error of the variable.
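For reference, the quantities in A10-A13 are commonly defined as below. This is a sketch under assumptions: $x_1,\dots,x_n$ are the non-missing Age values, $\bar{x}$ is their mean, and `var@s` is taken to be the sample ($n-1$) variance. The source does not state which skewness estimator `skew()` uses, so the moment-based form shown for $g_1$ is only one common choice.

```latex
\begin{aligned}
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2 \\
s &= \sqrt{s^2} \\
\mathrm{SE} &= \frac{s}{\sqrt{n}} \\
g_1 &= \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^3}
            {\left(\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2\right)^{3/2}}
\end{aligned}
```

A positive $g_1$ indicates a longer right tail and a negative $g_1$ a longer left tail.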

Continuous variables can also be examined visually; the most common tool is the histogram.

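A14-A22 draw the histogram. Before plotting, decide on the number of bars, split the variable into equal-width groups, and count the samples that fall into each interval (bar).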

A14 sets the number of bars to 8.

A15 computes the width of each bar.

A16 splits Age into 8 equal-width groups and returns the interval of each group; each group spans roughly 10 years.


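A17 computes the midpoint of each interval (the group_median field) and the number of passengers that fall into it. For example, the first group, ages 0 to 10, contains 64 passengers.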
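The grouping in A15-A17 can be sketched in Python as follows. This is an illustrative sketch rather than the script itself: `equal_width_histogram` and the sample values are hypothetical. Like A17, it treats every interval as closed on the left and open on the right, except the last one, which also includes its right endpoint so that the maximum is counted.

```python
def equal_width_histogram(values, bins):
    """Split values into `bins` equal-width intervals and count each one."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # same role as A15
    rows = []
    for i in range(1, bins + 1):
        left = lo + (i - 1) * width
        # Pin the last right edge to the maximum to avoid floating-point
        # drift (a safeguard added in this sketch, not present in the script).
        right = hi if i == bins else lo + i * width
        if i == bins:
            count = sum(left <= v <= right for v in values)  # closed last bin
        else:
            count = sum(left <= v < right for v in values)   # half-open bin
        rows.append({
            "group": (left, right),
            "midpoint": (left + right) / 2,  # group_median in A17
            "count": count,
        })
    return rows


# Hypothetical values, chosen so the bin edges are easy to follow.
rows = equal_width_histogram([0, 1, 2, 3, 4, 5, 6, 7, 8], bins=4)
for row in rows:
    print(row)
```

Every value lands in exactly one interval, so the counts always add up to the number of input values, which is a useful sanity check before plotting.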


A18-A22 plot the data from A17, showing how many passengers fall into each age range.


A continuous variable can sometimes also be shown as a quantile plot.

A24-A32 draw the quantile plot of the variable Age.
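The points of such a plot can be sketched in Python. Following A25 and A26, the sorted values go on the y-axis and each value's rank divided by the number of values goes on the x-axis. `quantile_plot_points` and the sample data are hypothetical, and this sketch simply drops missing values, whereas the script preprocesses Age with `impute` in A24.

```python
def quantile_plot_points(values):
    """Return (fraction, value) pairs for a quantile plot."""
    ordered = sorted(v for v in values if v is not None)  # like A25
    n = len(ordered)
    # The i-th smallest value is paired with the fraction i/n, like A26.
    return [(i / n, v) for i, v in enumerate(ordered, start=1)]


# Hypothetical data with one missing entry.
points = quantile_plot_points([30, 10, None, 20, 40])
for fraction, value in points:
    print(fraction, value)
```

Reading the curve at x = 0.25, 0.5 and 0.75, the positions listed in B24 and marked by the dashed lines drawn in the A31 loop, gives an approximate view of the quartiles.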
