Mean
The mean of an observation variable is a numerical measure of the central location of the data values. It is the sum of its data values divided by data count.
Hence, for a data sample of size n, its sample mean is defined as follows:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Similarly, for a data population of size N, the population mean is:
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$
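As a quick check of the definition, the sketch below computes the sample mean by hand and compares it with the built-in mean function. The name manual_mean is ours, introduced only for illustration.

```r
# Eruption durations from the built-in faithful data set
duration = faithful$eruptions

# Sample mean by definition: sum of the data values divided by the data count
manual_mean = sum(duration) / length(duration)

# Compare with the built-in function, allowing for floating-point error
print(all.equal(manual_mean, mean(duration)))
```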
Problem
Find the mean eruption duration in the data set faithful.
Solution
We apply the mean function to compute the mean value of eruptions.
> duration = faithful$eruptions # the eruption durations
> mean(duration) # apply the mean function
[1] 3.4878
Answer
The mean eruption duration is 3.4878 minutes.
Exercise
Find the mean eruption waiting periods in faithful.
Median
The median of an observation variable is the value at the middle when the data is sorted in ascending order. It is an ordinal measure of the central location of the data values.
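To make the definition concrete, here is a minimal sketch that sorts the data and picks the middle value by hand. It assumes the usual convention, which R's median function follows, of averaging the two middle values when the data count is even. The names sorted and manual_median are illustrative only.

```r
# Eruption durations from the built-in faithful data set
duration = faithful$eruptions

# Sort in ascending order and count the values
sorted = sort(duration)
n = length(sorted)

if (n %% 2 == 1) {
    # Odd count: a single middle value exists
    manual_median = sorted[(n + 1) / 2]
} else {
    # Even count: average the two middle values
    manual_median = mean(sorted[c(n / 2, n / 2 + 1)])
}

# Compare with the built-in function
print(all.equal(manual_median, median(duration)))
```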
Problem
Find the median of the eruption duration in the data set faithful.
Solution
We apply the median function to compute the median value of eruptions.
> median(duration) # apply the median function
[1] 4
Answer
The median of the eruption duration is 4 minutes.
Exercise
Find the median of the eruption waiting periods in faithful.
Quartile
There are several quartiles of an observation variable. The first quartile, or lower quartile, is the value that cuts off the first 25% of the data when it is sorted in ascending order. The second quartile, or median, is the value that cuts off the first 50%. The third quartile, or upper quartile, is the value that cuts off the first 75%.
Problem
Find the quartiles of the eruption durations in the data set faithful.
Solution
We apply the quantile function to compute the quartiles of eruptions.
> quantile(duration) # apply the quantile function
    0%    25%    50%    75%   100%
1.6000 2.1627 4.0000 4.4543 5.1000
Answer
The first, second and third quartiles of the eruption duration are 2.1627, 4.0000 and 4.4543 minutes respectively.
Exercise
Find the quartiles of the eruption waiting periods in faithful.
Note
There are several algorithms for the computation of quartiles. Details can be found in the R documentation via help(quantile).
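As a small illustration of the note above, the quantile function accepts a type argument that selects the algorithm; type 7 is the default. The sketch below compares the default with type 6, chosen here arbitrarily as an example. The inner quartiles may differ slightly between the two, while the 0% and 100% entries remain the minimum and maximum.

```r
# Eruption durations from the built-in faithful data set
duration = faithful$eruptions

# Quartiles with the default algorithm (type = 7)
q7 = quantile(duration)

# Quartiles with an alternative algorithm (type = 6)
q6 = quantile(duration, type = 6)

# Show both results side by side for comparison
print(rbind(type7 = q7, type6 = q6))
```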
Percentile
The nth percentile of an observation variable is the value that cuts off the first n percent of the data values when it is sorted in ascending order.
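One way to read this definition is to check what fraction of the data lies at or below a given percentile. The sketch below does this for the 32nd percentile; the names p32 and frac_below are ours. Because the data set is finite and quantile interpolates between observations, the fraction is only approximately 32%.

```r
# Eruption durations from the built-in faithful data set
duration = faithful$eruptions

# The 32nd percentile of the eruption durations
p32 = quantile(duration, 0.32)

# Fraction of observations at or below that value
frac_below = mean(duration <= p32)
print(frac_below)
```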
Problem
Find the 32nd, 57th and 98th percentiles of the eruption durations in the data set faithful.
Solution
We apply the quantile function to compute the percentiles of eruptions with the desired percentage ratios.
> quantile(duration, c(.32, .57, .98))
   32%    57%    98%
2.3952 4.1330 4.9330
Answer
The 32nd, 57th and 98th percentiles of the eruption duration are 2.3952, 4.1330 and 4.9330 minutes respectively.
Exercise
Find the 17th, 43rd, 67th and 85th percentiles of the eruption waiting periods in faithful.
Note
There are several algorithms for the computation of percentiles. Details can be found in the R documentation via help(quantile).
Range
The range of an observation variable is the difference of its largest and smallest data values. It is a measure of how far apart the entire data spreads in value.
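Note that the base R function range returns the smallest and largest values as a pair, not their difference. A minimal sketch of getting from that pair to the range as defined here, with illustrative names r and spread:

```r
# Eruption durations from the built-in faithful data set
duration = faithful$eruptions

# range() returns a vector holding the minimum and the maximum
r = range(duration)

# The range as defined above is the difference of the two
spread = diff(r)
print(spread)
```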
Problem
Find the range of the eruption duration in the data set faithful.
Solution
We apply the max and min functions to compute the largest and smallest values of eruptions, then take the difference.
> max(duration) - min(duration) # apply the max and min functions
[1] 3.5
Answer
The range of the eruption duration is 3.5 minutes.
Exercise
Find the range of the eruption waiting periods in faithful.
Interquartile Range
The interquartile range of an observation variable is the difference of its upper and lower quartiles. It is a measure of how far apart the middle portion of data spreads in value.
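The definition can be checked directly by subtracting the lower quartile from the upper quartile. The sketch below assumes the default quantile algorithm; the names q and manual_iqr are illustrative.

```r
# Eruption durations from the built-in faithful data set
duration = faithful$eruptions

# Lower (25%) and upper (75%) quartiles
q = quantile(duration, c(0.25, 0.75))

# Interquartile range by definition; unname() drops the percentage labels
manual_iqr = unname(q[2] - q[1])

# Compare with the built-in IQR function
print(all.equal(manual_iqr, IQR(duration)))
```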
Problem
Find the interquartile range of eruption duration in the data set faithful.
Solution
We apply the IQR function to compute the interquartile range of eruptions.
> IQR(duration) # apply the IQR function
[1] 2.2915
Answer
The interquartile range of eruption duration is 2.2915 minutes.
Exercise
Find the interquartile range of eruption waiting periods in faithful.
Box Plot
The box plot of an observation variable is a graphical representation based on its quartiles, as well as its smallest and largest values. It attempts to provide a visual shape of the data distribution.
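The numbers behind such a plot can be inspected with the base R function fivenum, which returns Tukey's five-number summary: minimum, lower hinge, median, upper hinge and maximum. This is only a sketch of the underlying values; the hinges are close to the quartiles returned by quantile but are not always identical to them.

```r
# Eruption durations from the built-in faithful data set
duration = faithful$eruptions

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five = fivenum(duration)
print(five)
```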
Problem
Find the box plot of the eruption duration in the data set faithful.
Solution
We apply the boxplot function to produce the box plot of eruptions.
> boxplot(duration, horizontal=TRUE) # horizontal box plot
Answer
The box plot of the eruption duration is:
[Figure: horizontal box plot of the eruption durations]
Exercise
Find the box plot of the eruption waiting periods in faithful.
Variance
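The variance is a numerical measure of how the data values are dispersed around the mean. In particular, the sample variance is defined in terms of the sample mean and sample size n as:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Similarly, the population variance is defined in terms of the population mean μ and population size N:
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$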
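The two definitions differ only in the divisor. The sketch below computes both by hand and checks the sample version against the built-in var function, which uses the n - 1 divisor. The population version here simply treats the data as if it were the whole population, an assumption made purely for illustration.

```r
# Eruption durations from the built-in faithful data set
duration = faithful$eruptions
n = length(duration)

# Squared deviations from the mean
dev2 = (duration - mean(duration))^2

# Sample variance: divide by n - 1
sample_var = sum(dev2) / (n - 1)

# Population-style variance: divide by n (data treated as a full population)
pop_var = sum(dev2) / n

# The built-in var function matches the sample definition
print(all.equal(sample_var, var(duration)))
print(pop_var)
```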
Problem
Find the variance of the eruption duration in the data set faithful.
Solution
We apply the var function to compute the variance of eruptions.
> var(duration) # apply the var function
[1] 1.3027
Answer
The variance of the eruption duration is 1.3027 squared minutes.
Exercise
Find the variance of the eruption waiting periods in faithful.
Standard Deviation
The standard deviation of an observation variable is the square root of its variance.
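A minimal sketch confirming this relationship in R, with the illustrative name manual_sd:

```r
# Eruption durations from the built-in faithful data set
duration = faithful$eruptions

# Standard deviation as the square root of the sample variance
manual_sd = sqrt(var(duration))

# Compare with the built-in sd function
print(all.equal(manual_sd, sd(duration)))
```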
Problem
Find the standard deviation of the eruption duration in the data set faithful.
Solution
We apply the sd function to compute the standard deviation of eruptions.
> sd(duration) # apply the sd function
[1] 1.1414
Answer
The standard deviation of the eruption duration is 1.1414 minutes.
Exercise
Find the standard deviation of the eruption waiting periods in faithful.
Covariance
The covariance of two variables x and y in a data set measures how the two are linearly related. A positive covariance would indicate a positive linear relationship between the variables, and a negative covariance would indicate the opposite.
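The sample covariance is defined in terms of the sample means as:
$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
Similarly, the population covariance is defined in terms of the population means μx and μy as:
$$\sigma_{xy} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)$$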
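The sample definition can be reproduced by hand and compared with the built-in cov function, as in the sketch below. The name manual_cov is ours.

```r
# The two variables from the built-in faithful data set
duration = faithful$eruptions
waiting = faithful$waiting
n = length(duration)

# Sample covariance: sum of products of paired deviations, divided by n - 1
manual_cov = sum((duration - mean(duration)) * (waiting - mean(waiting))) / (n - 1)

# Compare with the built-in cov function
print(all.equal(manual_cov, cov(duration, waiting)))
```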
Problem
Find the covariance of eruption duration and waiting time in the data set faithful. Observe if there is any linear relationship between the two variables.
Solution
We apply the cov function to compute the covariance of eruptions and waiting.
> waiting = faithful$waiting # the waiting period
> cov(duration, waiting) # apply the cov function
[1] 13.978
Answer
The covariance of eruption duration and waiting time is about 14. It indicates a positive linear relationship between the two variables.
Correlation Coefficient
The correlation coefficient of two variables in a data set equals their covariance divided by the product of their individual standard deviations. It is a normalized measurement of how the two are linearly related.
Formally, the sample correlation coefficient is defined by the following formula, where sx and sy are the sample standard deviations, and sxy is the sample covariance.
$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$
Similarly, the population correlation coefficient is defined as follows, where σx and σy are the population standard deviations, and σxy is the population covariance.
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
If the correlation coefficient is close to 1, the variables are positively linearly related and the scatter plot falls almost along a straight line with positive slope. If it is close to -1, the variables are negatively linearly related and the scatter plot falls almost along a straight line with negative slope. If it is close to zero, the linear relationship between the variables is weak.
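The sample definition can be verified numerically by dividing the covariance by the product of the standard deviations. A minimal sketch, with the illustrative name manual_cor:

```r
# The two variables from the built-in faithful data set
duration = faithful$eruptions
waiting = faithful$waiting

# Correlation coefficient: covariance over the product of standard deviations
manual_cor = cov(duration, waiting) / (sd(duration) * sd(waiting))

# Compare with the built-in cor function
print(all.equal(manual_cor, cor(duration, waiting)))
```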
Problem
Find the correlation coefficient of eruption duration and waiting time in the data set faithful. Observe if there is any linear relationship between the variables.
Solution
We apply the cor function to compute the correlation coefficient of eruptions and waiting.
> waiting = faithful$waiting # the waiting period
> cor(duration, waiting) # apply the cor function
[1] 0.90081
Answer
The correlation coefficient of eruption duration and waiting time is 0.90081. Since it is rather close to 1, we can conclude that the variables are positively linearly related.
Central Moment
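The kth central moment (or moment about the mean) of a data population is:
$$\mu_k = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^k$$
Similarly, the kth central moment of a data sample is:
$$m_k = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^k$$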
In particular, the second central moment of a population is its variance.
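A central moment can also be computed in base R without any extra package. The helper central_moment below is a hypothetical function written for this sketch, not part of R or e1071. It divides by the data count, so its second moment equals the sample variance rescaled by (n - 1)/n.

```r
# Eruption durations from the built-in faithful data set
duration = faithful$eruptions
n = length(duration)

# Hypothetical helper: kth central moment, dividing by the data count
central_moment = function(x, k) mean((x - mean(x))^k)

m2 = central_moment(duration, 2)
m3 = central_moment(duration, 3)

# The second central moment is the variance with divisor n instead of n - 1
print(all.equal(m2, var(duration) * (n - 1) / n))
print(m3)
```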
Problem
Find the third central moment of eruption duration in the data set faithful.
Solution
We apply the function moment from the e1071 package. As it is not in the core R library, the package has to be installed and loaded into the R workspace.
> library(e1071) # load e1071
> duration = faithful$eruptions # eruption durations
> moment(duration, order=3, center=TRUE)
[1] -0.6149
Answer
The third central moment of eruption duration is -0.6149.
Exercise
Find the third central moment of eruption waiting period in faithful.
Skewness
The skewness of a data population is defined by the following formula, where μ2 and μ3 are the second and third central moments.
$$\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}}$$
Intuitively, the skewness is a measure of asymmetry. As a rule of thumb, negative skewness indicates that the mean of the data values is less than the median, and the data distribution is left-skewed. Positive skewness indicates that the mean of the data values is larger than the median, and the data distribution is right-skewed.
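The formula can be sketched in base R using central moments computed with a divisor equal to the data count. The helper central_moment is hypothetical and written only for this illustration. The value may differ slightly from what a package function reports, since packages can apply a different normalization by default (we believe e1071 does, but treat that as an assumption).

```r
# Eruption durations from the built-in faithful data set
duration = faithful$eruptions

# Hypothetical helper: kth central moment, dividing by the data count
central_moment = function(x, k) mean((x - mean(x))^k)

# Skewness: third central moment over the second central moment to the power 3/2
skew = central_moment(duration, 3) / central_moment(duration, 2)^(3/2)
print(skew)

# Rule-of-thumb check: for negative skewness the mean lies below the median
print(mean(duration) < median(duration))
```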
Problem
Find the skewness of eruption duration in the data set faithful.
Solution
We apply the function skewness from the e1071 package to compute the skewness coefficient of eruptions. As the package is not in the core R library, it has to be installed and loaded into the R workspace.
> library(e1071) # load e1071
> duration = faithful$eruptions # eruption durations
> skewness(duration) # apply the skewness function
[1] -0.41355
Answer
The skewness of eruption duration is -0.41355. It indicates that the eruption duration distribution is skewed towards the left.
Exercise
Find the skewness of eruption waiting period in faithful.