Last time, I talked about visualizing the Uniform, Normal, Exponential, and Poisson Distributions. However, there are more useful methods than just plotting the density and distribution functions.
Of course, you can always simply ask R to output summary statistics:
> n <- 10000
> d <- data.frame(unif=runif(n=n), norm=rnorm(n=n), exp=rexp(n=n),
+ pois=rpois(n, lambda=1))
>
> summary(d)
> summary(d)
unif norm exp pois
Min. :2.449e-05 Min. :-3.603527 Min. :0.0001143 Min. :0.000
1st Qu.:2.489e-01 1st Qu.:-0.657498 1st Qu.:0.2963731 1st Qu.:0.000
Median :4.991e-01 Median :-0.003342 Median :0.6927739 Median :1.000
Mean :5.001e-01 Mean : 0.008527 Mean :0.9958194 Mean :1.006
3rd Qu.:7.498e-01 3rd Qu.: 0.662084 3rd Qu.:1.3718103 3rd Qu.:2.000
Max. :1.000e+00 Max. : 3.512077 Max. :9.4726291 Max. :6.000
>
Or perhaps, we could be even more detailed and ask for more finely detailed order statistics. The quantile function will calculate arbitrary order statistics:
> quantile(d$unif, probs=seq(0,1,by=0.05))
0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55%
2.448517e-05 4.870828e-02 9.949762e-02 1.450521e-01 1.953495e-01 2.488873e-01 3.011948e-01 3.541679e-01 4.020943e-01 4.508098e-01 4.991203e-01 5.527021e-01
60% 65% 70% 75% 80% 85% 90% 95% 100%
6.038723e-01 6.540685e-01 6.990784e-01 7.498041e-01 7.989884e-01 8.480684e-01 8.982745e-01 9.496064e-01 9.999980e-01
> t(t( quantile(d$unif, probs=seq(0,1,by=0.05)) ))
[,1]
0% 2.448517e-05
5% 4.870828e-02
10% 9.949762e-02
15% 1.450521e-01
20% 1.953495e-01
25% 2.488873e-01
30% 3.011948e-01
35% 3.541679e-01
40% 4.020943e-01
45% 4.508098e-01
50% 4.991203e-01
55% 5.527021e-01
60% 6.038723e-01
65% 6.540685e-01
70% 6.990784e-01
75% 7.498041e-01
80% 7.989884e-01
85% 8.480684e-01
90% 8.982745e-01
95% 9.496064e-01
100% 9.999980e-01
>
> s <- seq(0,1,by=0.02)
> e <- sapply(d, FUN=function(x) quantile(x, probs=s))
> e <- as.data.frame(e)
> e[1:10,]
unif norm exp pois
0% 2.448517e-05 -3.6035270 0.0001142514 0
2% 1.900289e-02 -1.9884255 0.0226651895 0
4% 3.844486e-02 -1.7161347 0.0440634106 0
6% 6.012384e-02 -1.5041045 0.0634013971 0
8% 8.089568e-02 -1.3566198 0.0899735826 0
10% 9.949762e-02 -1.2407524 0.1103189153 0
12% 1.191770e-01 -1.1392377 0.1342348062 0
14% 1.369209e-01 -1.0483029 0.1590086026 0
16% 1.555910e-01 -0.9618975 0.1829207834 0
18% 1.748496e-01 -0.8799168 0.2056880019 0
>
Plotting the quantile values is worth a shot, just to see what we get:
>
> plot(x=s, y=e$unif, ylim=c(min(e), max(e)), type='b', pch='.', cex=3, lwd=2, col='black',
+ xlab='quantiles [0,100]', ylab='dist values', main='Quantile Plots, Four Distributions')
> for(j in 2:4){
+ points(x=s, y=e[,j], type='b', pch='.', cex=3, lwd=2, col=c('', 'red', 'blue', 'green')[j])
+ }
> legend(x=0, y=8, legend=colnames(e), col=c('black', 'red', 'blue', 'green'), lwd=3)
>
Quantile quantile plots are typically used when you want to see if a sample is a particular distribution. You plot the quantiles of the sample versus the assumed distribution and compare; if they are the same distribution you should get a straight line at roughly a forty five degree angle.
> par(mfrow=c(1,2), oma=c(0,0,2,0))
>
> qqplot(x=d$exp, y=d$pois,
+ xlab='exponential', ylab='poisson', main = 'qq plot: different distributions')
> qqplot(x=d$norm, y=rnorm(n=n),
+ xlab='our normal sample in d', ylab='new normal sample', main = 'qq plot: same distribution, resampled')
>
> mtext('QQ plots -- different vs same distributions', side=3, outer=T)
Box and whisker plots
> dev.set(which=1)
> boxplot(x=d, xlab='distributions', main='Distributions Visualization - boxplot')
Finally, for univariate distributions, I prefer box-percentile plots. These are similar to boxplots, but the width of the distribution graphs are proportional to the percent of observations more extreme in that direction. They are also marked at the 25th, 50th, and 75th percentiles.
> library(Hmisc)
>
> dev.set(which=1)
> bpplot(d, xlab='distributions', main='Distributions Visualization - box percentile plot')
>
And if the distributions are genuinely different, let’s examine 5 univariate Normal Distributions: our already sampled N(0,1), a resampled N(0,1), N(1,1), N(0,3), N(-1,0.5). Side by side, the box percentile plot really draws out the differences in the distributions.
>
> d2 = data.frame(norm=d$norm, resample=rnorm(n), mean1=rnorm(n=n, mean=1), sd3=rnorm(n=n, sd=3), sdhalf=rnorm(n=n, sd=0.5, mean=-2))
>
> bpplot(d2, xlab='distributions', main='Multiple Normal Distribution Visualization - box percentile plot')
>




