Thursday 30 September 2021

Session 3: Numerical Measures


Mean

The mean of an observation variable is a numerical measure of the central location of the data values. It is the sum of the data values divided by the data count.

Hence, for a data sample of size n, its sample mean is defined as follows:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Similarly, for a data population of size N, the population mean is:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

Problem

Find the mean eruption duration in the data set faithful.

Solution

We apply the mean function to compute the mean value of eruptions.

> duration = faithful$eruptions     # the eruption durations 
> mean(duration)                    # apply the mean function 
[1] 3.4878

Answer

The mean eruption duration is 3.4878 minutes.
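
The definition can also be checked by hand. The following is a minimal sketch that computes the sample mean directly from the formula and compares it with the built-in function; the names n and manual_mean are illustrative and not part of the original solution.

```r
# Sample mean from the definition: sum of the values divided by the count.
duration = faithful$eruptions
n = length(duration)                    # sample size
manual_mean = sum(duration) / n         # (1/n) * sum of x_i

# Compare with the built-in function; the two should agree.
all.equal(manual_mean, mean(duration))
```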

Exercise

Find the mean of the eruption waiting periods in faithful.


Median

The median of an observation variable is the value at the middle when the data is sorted in ascending order. It is an ordinal measure of the central location of the data values.

Problem

Find the median of the eruption duration in the data set faithful.

Solution

We apply the median function to compute the median value of eruptions.

> duration = faithful$eruptions     # the eruption durations 
> median(duration)                  # apply the median function 
[1] 4

Answer

The median of the eruption duration is 4 minutes.
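
To see what "the value at the middle" means in practice, here is a small sketch that sorts the data and picks the middle value by hand. It assumes the usual convention of averaging the two middle values when the count is even; the names sorted and manual_median are illustrative.

```r
# Median by hand: sort, then take the middle value
# (or the average of the two middle values when n is even).
duration = faithful$eruptions
sorted = sort(duration)
n = length(sorted)

if (n %% 2 == 1) {
  manual_median = sorted[(n + 1) / 2]
} else {
  manual_median = (sorted[n / 2] + sorted[n / 2 + 1]) / 2
}

# Compare with the built-in function.
all.equal(manual_median, median(duration))
```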

Exercise

Find the median of the eruption waiting periods in faithful.


Quartile

There are several quartiles of an observation variable. The first quartile, or lower quartile, is the value that cuts off the first 25% of the data when it is sorted in ascending order. The second quartile, or median, is the value that cuts off the first 50%. The third quartile, or upper quartile, is the value that cuts off the first 75%.

Problem

Find the quartiles of the eruption durations in the data set faithful.

Solution

We apply the quantile function to compute the quartiles of eruptions.

> duration = faithful$eruptions     # the eruption durations 
> quantile(duration)                # apply the quantile function 
    0%    25%    50%    75%   100% 
1.6000 2.1627 4.0000 4.4543 5.1000

Answer

The first, second and third quartiles of the eruption duration are 2.1627, 4.0000 and 4.4543 minutes respectively.

Exercise

Find the quartiles of the eruption waiting periods in faithful.

Note

There are several algorithms for the computation of quartiles. Details can be found in the R documentation via help(quantile).
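
As an illustration of that note, the sketch below selects the algorithm explicitly through the type argument of quantile. Type 7 is the documented default; type 6 is picked here only as an example of an alternative, and the variable names are illustrative.

```r
# Quartiles under two of the algorithms offered by quantile().
duration = faithful$eruptions

q_default = quantile(duration)            # default algorithm
q_type7 = quantile(duration, type = 7)    # the default, spelled out
q_type6 = quantile(duration, type = 6)    # an alternative definition

identical(q_default, q_type7)             # the default is type 7

# Side-by-side view; the inner quartiles may differ slightly.
rbind(type7 = q_type7, type6 = q_type6)
```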


Percentile

The nth percentile of an observation variable is the value that cuts off the first n percent of the data values when it is sorted in ascending order.

Problem

Find the 32nd, 57th and 98th percentiles of the eruption durations in the data set faithful.

Solution

We apply the quantile function to compute the percentiles of eruptions with the desired percentage ratios.

> duration = faithful$eruptions     # the eruption durations 
> quantile(duration, c(.32, .57, .98)) 
   32%    57%    98% 
2.3952 4.1330 4.9330

Answer

The 32nd, 57th and 98th percentiles of the eruption duration are 2.3952, 4.1330 and 4.9330 minutes respectively.
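
As a rough sanity check of the definition, the sketch below counts the fraction of observations at or below each computed percentile. The fractions should land close to the requested ratios, though not exactly on them, because quantile interpolates between data values and the data contain ties. The names probs, q and frac_below are illustrative.

```r
# Fraction of observations at or below each percentile value.
duration = faithful$eruptions
probs = c(.32, .57, .98)
q = quantile(duration, probs)

frac_below = sapply(q, function(v) mean(duration <= v))
round(frac_below, 3)                      # should be close to probs
```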

Exercise

Find the 17th, 43rd, 67th and 85th percentiles of the eruption waiting periods in faithful.

Note

There are several algorithms for the computation of percentiles. Details can be found in the R documentation via help(quantile).


Range

The range of an observation variable is the difference of its largest and smallest data values. It is a measure of how far apart the entire data spreads in value.

Range = Largest Value − Smallest Value

Problem

Find the range of the eruption duration in the data set faithful.

Solution

We apply the max and min functions to compute the largest and smallest values of eruptions, then take the difference.

> duration = faithful$eruptions     # the eruption durations 
> max(duration) - min(duration)     # apply the max and min functions 
[1] 3.5

Answer

The range of the eruption duration is 3.5 minutes.
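
An equivalent route, sketched below, uses the base R function range, which returns the smallest and largest values as a pair; diff then takes their difference. The variable name r is illustrative.

```r
# range() returns c(min, max); diff() subtracts the first from the second.
duration = faithful$eruptions
r = range(duration)
r                                         # smallest and largest values
diff(r)                                   # largest minus smallest
```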

Exercise

Find the range of the eruption waiting periods in faithful.


Interquartile Range

The interquartile range of an observation variable is the difference of its upper and lower quartiles. It is a measure of how far apart the middle portion of data spreads in value.

Interquartile Range = Upper Quartile − Lower Quartile

Problem

Find the interquartile range of eruption duration in the data set faithful.

Solution

We apply the IQR function to compute the interquartile range of eruptions.

> duration = faithful$eruptions     # the eruption durations 
> IQR(duration)                     # apply the IQR function 
[1] 2.2915

Answer

The interquartile range of eruption duration is 2.2915 minutes.
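
The same number can be recovered from the definition. The sketch below takes the upper and lower quartiles from quantile and subtracts them; it assumes both functions are left at their default quartile algorithm, and the names q and manual_iqr are illustrative.

```r
# Interquartile range from the definition: upper quartile minus lower quartile.
duration = faithful$eruptions
q = quantile(duration, c(.25, .75))
manual_iqr = q[["75%"]] - q[["25%"]]

# Compare with the built-in function.
all.equal(manual_iqr, IQR(duration))
```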

Exercise

Find the interquartile range of eruption waiting periods in faithful.


Box Plot

The box plot of an observation variable is a graphical representation based on its quartiles, as well as its smallest and largest values. It attempts to provide a visual shape of the data distribution.

Problem

Find the box plot of the eruption duration in the data set faithful.

Solution

We apply the boxplot function to produce the box plot of eruptions.

> duration = faithful$eruptions       # the eruption durations 
> boxplot(duration, horizontal=TRUE)  # horizontal box plot

Answer

The box plot of the eruption duration is:

[Figure: horizontal box plot of the eruption durations]
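
The numbers behind the plot can be inspected without drawing it. The sketch below uses boxplot.stats, which returns the five values the box plot is built from. Note that its "hinges" are computed a little differently from the quartiles returned by quantile, so small discrepancies are possible. The variable name bp is illustrative.

```r
# The five numbers a box plot is drawn from, plus any outliers.
duration = faithful$eruptions
bp = boxplot.stats(duration)

bp$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out     # points that would be drawn individually as outliers, if any
```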

Exercise

Find the box plot of the eruption waiting periods in faithful.


Variance

The variance is a numerical measure of how the data values are dispersed around the mean. In particular, the sample variance is defined as:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Similarly, the population variance is defined in terms of the population mean μ and population size N:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

Problem

Find the variance of the eruption duration in the data set faithful.

Solution

We apply the var function to compute the variance of eruptions.

> duration = faithful$eruptions    # the eruption durations 
> var(duration)                    # apply the var function 
[1] 1.3027

Answer

The variance of the eruption duration is 1.3027 (in squared minutes).
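
The two formulas differ only in the divisor. The sketch below computes both by hand and confirms that var follows the sample formula with n - 1. Treating the 272 observations as a complete population for the second value is purely an assumption for illustration; the names dev2, sample_var and pop_var are illustrative.

```r
# Sample and population variance from their definitions.
duration = faithful$eruptions
n = length(duration)
dev2 = (duration - mean(duration))^2      # squared deviations from the mean

sample_var = sum(dev2) / (n - 1)          # divisor n - 1
pop_var = sum(dev2) / n                   # divisor N, data treated as a population

# var() implements the sample formula.
all.equal(sample_var, var(duration))
c(sample = sample_var, population = pop_var)
```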

Exercise

Find the variance of the eruption waiting periods in faithful.


Standard Deviation

The standard deviation of an observation variable is the square root of its variance.

Problem

Find the standard deviation of the eruption duration in the data set faithful.

Solution

We apply the sd function to compute the standard deviation of eruptions.

> duration = faithful$eruptions    # the eruption durations 
> sd(duration)                     # apply the sd function 
[1] 1.1414

Answer

The standard deviation of the eruption duration is 1.1414 minutes.
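
A one-line check of the definition: the sketch below takes the square root of the sample variance and compares it with sd. The name manual_sd is illustrative.

```r
# Standard deviation as the square root of the variance.
duration = faithful$eruptions
manual_sd = sqrt(var(duration))

# Compare with the built-in function.
all.equal(manual_sd, sd(duration))
```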

Exercise

Find the standard deviation of the eruption waiting periods in faithful.


Covariance

The covariance of two variables x and y in a data set measures how the two are linearly related. A positive covariance would indicate a positive linear relationship between the variables, and a negative covariance would indicate the opposite.

The sample covariance is defined in terms of the sample means as:

$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Similarly, the population covariance is defined in terms of the population means μx and μy as:

$$\sigma_{xy} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)$$

Problem

Find the covariance of eruption duration and waiting time in the data set faithful. Observe if there is any linear relationship between the two variables.

Solution

We apply the cov function to compute the covariance of eruptions and waiting.

> duration = faithful$eruptions   # eruption durations 
> waiting = faithful$waiting      # the waiting period 
> cov(duration, waiting)          # apply the cov function 
[1] 13.978

Answer

The covariance of eruption duration and waiting time is about 14. It indicates a positive linear relationship between the two variables.
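
The sample formula can be evaluated directly as a cross-check. In the sketch below, the products of the paired deviations are summed and divided by n - 1; the names n and manual_cov are illustrative.

```r
# Sample covariance from the definition.
duration = faithful$eruptions
waiting = faithful$waiting
n = length(duration)

manual_cov = sum((duration - mean(duration)) * (waiting - mean(waiting))) / (n - 1)

# Compare with the built-in function.
all.equal(manual_cov, cov(duration, waiting))
```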


Correlation Coefficient

The correlation coefficient of two variables in a data set equals their covariance divided by the product of their individual standard deviations. It is a normalized measurement of how the two are linearly related.

Formally, the sample correlation coefficient is defined by the following formula, where sx and sy are the sample standard deviations, and sxy is the sample covariance.

$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$

Similarly, the population correlation coefficient is defined as follows, where σx and σy are the population standard deviations, and σxy is the population covariance.

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

If the correlation coefficient is close to 1, the variables are positively linearly related and the scatter plot falls almost along a straight line with positive slope. If it is close to -1, the variables are negatively linearly related and the scatter plot falls almost along a straight line with negative slope. If it is close to zero, the linear relationship between the variables is weak.

Problem

Find the correlation coefficient of eruption duration and waiting time in the data set faithful. Observe if there is any linear relationship between the variables.

Solution

We apply the cor function to compute the correlation coefficient of eruptions and waiting.

> duration = faithful$eruptions   # eruption durations 
> waiting = faithful$waiting      # the waiting period 
> cor(duration, waiting)          # apply the cor function 
[1] 0.90081

Answer

The correlation coefficient of eruption duration and waiting time is 0.90081. Since it is rather close to 1, we can conclude that the variables are positively linearly related.
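
The defining ratio can be verified with the functions introduced earlier. The sketch below divides the sample covariance by the product of the two sample standard deviations; the name manual_cor is illustrative.

```r
# Correlation coefficient as covariance over the product of standard deviations.
duration = faithful$eruptions
waiting = faithful$waiting

manual_cor = cov(duration, waiting) / (sd(duration) * sd(waiting))

# Compare with the built-in function.
all.equal(manual_cor, cor(duration, waiting))
```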

Central Moment

The kth central moment (or moment about the mean) of a data population is:

$$\mu_k = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^k$$

Similarly, the kth central moment of a data sample is:

$$m_k = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^k$$

In particular, the second central moment of a population is its variance.

Problem

Find the third central moment of eruption duration in the data set faithful.

Solution

We apply the function moment from the e1071 package. As it is not in the core R library, the package has to be installed and loaded into the R workspace.

> library(e1071)                    # load e1071 
> duration = faithful$eruptions     # eruption durations 
> moment(duration, order=3, center=TRUE) 
[1] -0.6149

Answer

The third central moment of eruption duration is -0.6149.
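
If installing e1071 is not an option, the sample formula is short enough to write in base R. The helper central_moment below is a hypothetical name introduced only for this sketch, not a function from any package; its result for k = 3 should match the value reported above.

```r
# k-th central moment of a sample, written from the definition.
# central_moment is an illustrative helper, not a library function.
central_moment = function(x, k) mean((x - mean(x))^k)

duration = faithful$eruptions
m2 = central_moment(duration, 2)          # divisor n, so not equal to var()
m3 = central_moment(duration, 3)
c(m2 = m2, m3 = m3)
```

Note that m2 divides by n while var divides by n - 1, so the two differ by the factor (n - 1)/n.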

Exercise

Find the third central moment of eruption waiting period in faithful.


Skewness

The skewness of a data population is defined by the following formula, where μ2 and μ3 are the second and third central moments.

$$\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}}$$

Intuitively, the skewness is a measure of asymmetry. As a rule of thumb, negative skewness indicates that the mean of the data values is less than the median and the data distribution is left-skewed. Positive skewness indicates that the mean is larger than the median and the data distribution is right-skewed.

Problem

Find the skewness of eruption duration in the data set faithful.

Solution

We apply the function skewness from the e1071 package to compute the skewness coefficient of eruptions. As the package is not in the core R library, it has to be installed and loaded into the R workspace.

> library(e1071)                    # load e1071 
> duration = faithful$eruptions     # eruption durations 
> skewness(duration)                # apply the skewness function 
[1] -0.41355

Answer

The skewness of eruption duration is -0.41355. It indicates that the eruption duration distribution is skewed towards the left.
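
The formula can also be applied by hand in base R. The sketch below computes g1 from the sample central moments. As far as I understand the e1071 documentation, skewness defaults to a variant (type = 3) that divides by the cube of the sample standard deviation instead, which works out to g1 scaled by ((n - 1)/n)^(3/2); treat that as an assumption and confirm it with help(skewness). The names g1 and b1 are illustrative.

```r
# Skewness from the central moments: m3 / m2^(3/2).
duration = faithful$eruptions
n = length(duration)
m2 = mean((duration - mean(duration))^2)  # second central moment
m3 = mean((duration - mean(duration))^3)  # third central moment

g1 = m3 / m2^(3/2)                        # formula from the text
b1 = g1 * ((n - 1) / n)^(3/2)             # assumed e1071 default variant
c(g1 = g1, b1 = b1)
```

Both values are negative, so either convention leads to the same conclusion about the direction of the skew.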

Exercise

Find the skewness of eruption waiting period in faithful.

