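Exploring Continuous Data
Continuous data is quantitative data. Quantitative data can usually be analyzed along four dimensions: measures of central tendency, measures of dispersion, measures of relative position, and measures of symmetry.
For example, the Titanic dataset has a continuous variable "Age", which records each passenger's age. The following code explores it: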
 | A | B |
1 | =file("D://titanic.csv").import@qtc() | |
2 | =A1.(Age) | |
3 | =A2.max() | |
4 | =A2.min() | |
5 | =A2.avg() | |
6 | =A2.mode() | |
7 | =A2.median() | |
8 | =A2.median(1:4) | |
9 | =A2.median(3:4) | |
10 | =var@s(A2) | |
11 | =sqrt(A10) | |
12 | =A2.skew() | |
13 | =A2.se() | |
14 | 8 | |
15 | =(A2.max()-A2.min())/A14 | |
16 | =A14.([(~-1)*A15+A2.min(),~*A15+A2.min()]) | |
17 | =A16.new(~:group,(~(1)+~(2))/2:group_median, if(#==A16.len(),count(A2.(~>=group(1)&&~<=group(2))),count(A2.(~>=group(1)&&~<group(2)))):count) | |
18 | =canvas() | |
19 | =A18.plot("EnumAxis","name":"x") | |
20 | =A18.plot("NumericAxis","name":"y","location":2) | |
21 | =A18.plot("Column","text":A17.(count),"axis1":"x","data1":A17.(string(group_median)),"axis2":"y","data2":A17.(count)) | |
22 | =A18.draw@p(800,450) | |
23 | ||
24 | =A1.impute("Age") | [0.25,0.5,0.75] |
25 | =A24(1).sort() | =A25(1) |
26 | =A25.(#/A25.len()) | =A25.m(-1) |
27 | =canvas() | |
28 | =A27.plot("NumericAxis","name":"x","autoCalcValueRange":false,"maxValue":1,"scaleNum":10,"allowRegions":false) | |
29 | =A27.plot("NumericAxis","name":"y","location":2,"autoCalcValueRange":false,"autoRangeFromZero":false,"maxValue":A25.m(-1),"minValue":A25(1)) | |
30 | =A27.plot("Line","lineColor":-16776961,"markerWeight":1,"axis1":"x","data1":A26,"axis2":"y","data2":A25) | |
31 | for B24 | =A27.plot("Line","lineStyle":2,"lineColor":-65281,"markerWeight":-1,"axis1":"x","data1":[A31,A31],"axis2":"y","data2":[B25,B26]) |
32 | =A27.draw@p(800,400) | |
A2 extracts the Age column. A3-A9 compute its basic statistics: maximum, minimum, mean, mode, median, and the first and third quartiles.
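As a cross-check outside SPL, the same basic statistics can be sketched with Python's standard library. The `ages` list is made-up illustration data, not values from titanic.csv, and the quartile rule chosen here (`method="inclusive"`) is an assumption: SPL's `median(k:n)` may interpolate differently, so small differences in the quartiles are possible.

```python
import statistics

# Hypothetical ages for illustration only (not taken from titanic.csv)
ages = [22, 38, 26, 35, 35, 54, 2, 27, 14, 4]

summary = {
    "max": max(ages),
    "min": min(ages),
    "mean": statistics.mean(ages),
    "mode": statistics.mode(ages),      # most frequent value
    "median": statistics.median(ages),  # average of the two middle values here
}

# quantiles(n=4) returns the three cut points Q1, Q2, Q3.
# method="inclusive" interpolates between the observed minimum and maximum.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")
summary["q1"] = q1
summary["q3"] = q3

print(summary)
```

The second quartile always coincides with the median, which is a quick sanity check on any quartile implementation.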
A10-A13 compute the variable's variance, standard deviation, skewness, and standard error.
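The relationship between these four measures can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of SPL's internals: the variance is taken as the sample variance (divisor n-1), the standard error as the standard deviation divided by the square root of n, and the skewness as the moment-based coefficient m3 / m2^1.5. The helper names and the `ages` list are hypothetical.

```python
import math
import statistics


def standard_error(values):
    """Sample standard deviation divided by sqrt(n) (assumed definition)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (one of several common definitions)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical ages for illustration only (not taken from titanic.csv)
ages = [22, 38, 26, 35, 35, 54, 2, 27, 14, 4]

sample_var = statistics.variance(ages)  # divisor n-1
sample_std = math.sqrt(sample_var)      # standard deviation is the root of the variance
print(sample_var, sample_std, skewness(ages), standard_error(ages))
```

A positive skewness indicates a longer right tail, a negative one a longer left tail, and a value near zero a roughly symmetric distribution.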
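Continuous variables can also be examined visually; the histogram is the most common choice.
A14-A22 draw the histogram. Before plotting, decide how many bars to draw, split the variable into equal-width groups, and count the samples that fall into each interval (bar).
A14 sets the number of bars to 8.
A15 computes the width of each bar.
A16 divides Age into 8 equal-width groups and returns the range of each group; each group spans roughly 10 years.
A17 computes the midpoint of each group (the group_median field) and the number of passengers that fall into it. Every interval is left-closed and right-open except the last, which is closed on both ends so that the maximum value is counted. For example, the first group, ages 0 to 10, contains 64 passengers.
A18-A22 plot the data from A17, showing how the passengers are distributed across age ranges.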
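The binning logic of A14-A17 can be sketched in Python as below. The function name `equal_width_histogram` and the test values are hypothetical. One deliberate difference from the grid code: the right edge of the last bin is pinned to the observed maximum, which guards against floating-point rounding leaving the maximum just outside the final interval.

```python
def equal_width_histogram(values, bins):
    """Split values into equal-width bins; return (left, right, midpoint, count) rows."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    rows = []
    for i in range(bins):
        last = i == bins - 1
        left = lo + i * width
        # Pin the last right edge to the maximum to avoid rounding drift
        right = hi if last else lo + (i + 1) * width
        if last:
            # Last bin is closed on both ends so the maximum is counted
            count = sum(1 for v in values if left <= v <= right)
        else:
            # All other bins are left-closed, right-open
            count = sum(1 for v in values if left <= v < right)
        rows.append((left, right, (left + right) / 2, count))
    return rows


# Hypothetical values for illustration only
for row in equal_width_histogram([0, 1, 2, 3, 4, 5, 6, 7, 8], 4):
    print(row)
```

Because every value lands in exactly one bin, the counts should always sum to the number of input values.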
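A continuous variable can sometimes also be shown with a quantile plot.
A24-A32 draw the quantile plot of Age. A24 calls impute on Age, A25 sorts the resulting values, and A26 assigns each sorted value its cumulative fraction (its position divided by the number of values). The loop in A31 draws a vertical reference line at each of the fractions 0.25, 0.5, and 0.75 listed in B24, running from the minimum (B25) to the maximum (B26).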
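The points behind the quantile plot can be sketched in Python: sort the values, then pair each one with its cumulative fraction, mirroring what A25 and A26 compute. The function name `quantile_plot_points` is hypothetical, and the sketch assumes the input has no missing values (the grid code handles those in A24 before sorting).

```python
def quantile_plot_points(values):
    """Return (fraction, value) pairs: the i-th smallest value is paired with i/n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i + 1) / n, v) for i, v in enumerate(ordered)]


# Hypothetical values for illustration only
for fraction, value in quantile_plot_points([30, 10, 20, 40]):
    print(fraction, value)
```

Reading the plot at x = 0.5 gives an approximate median, and at x = 0.25 and x = 0.75 the approximate first and third quartiles, which is why reference lines at those fractions are useful.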