Qualitative Data

A data sample is called qualitative, or categorical, if its values belong to a collection of known, well-defined, non-overlapping classes. Common examples include student letter grades (A, B, C, D or F), commercial bond ratings (AAA, AAB, ...) and consumer shoe sizes (1, 2, 3, ...).
The tutorials in this section are based on an R built-in data frame named painters. It compiles assessments of 54 classical painters on four technical characteristics, made by an eighteenth-century art critic. The data set belongs to the MASS package, which has to be loaded into the R workspace before use.
> library(MASS)      # load the MASS package
> painters
              Composition Drawing Colour Expression School
Da Udine               10       8     16          3      A
Da Vinci               15      16      4         14      A
Del Piombo              8      13     16          7      A
Del Sarto              12      16      9          8      A
Fr. Penni               0      15      8          0      A
Guilio Romano          15      16      4         14      A
                    .................
The last column, School, gives the school classification of each painter. The schools are labeled A, B, ..., H, so the School variable is qualitative.
> painters$School
[1] A A A A A A A A A A B B B B B B C C C C C C D D D D
[27] D D D D D D E E E E E E E F F F F G G G G G G G H H
[53] H H
Levels: A B C D E F G H

For further details of the painters data set, please consult the R documentation.
> help(painters)
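As a quick check that School really is stored as a categorical variable, the following sketch inspects it with a few base R functions. It assumes only that the MASS package is installed; the calls shown are one possible way to look at the variable, not part of the original tutorial.

```r
library(MASS)                  # provides the painters data frame

school = painters$School       # the painter schools
is.factor(school)              # TRUE: R stores categorical data as a factor
levels(school)                 # the distinct classes, "A" through "H"
nlevels(school)                # number of classes
length(school)                 # number of painters in the sample
```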

Frequency Distribution of Qualitative Data

The frequency distribution of a data variable is a summary of the data occurrence in a collection of non-overlapping categories.

Example

In the data set painters, the frequency distribution of the School variable is a summary of the number of painters in each school.

Problem

Find the frequency distribution of the painter schools in the data set painters.

Solution

We apply the table function to compute the frequency distribution of the School variable.
> library(MASS)                 # load the MASS package
> school = painters$School      # the painter schools
> school.freq = table(school)   # apply the table function

Answer

The frequency distribution of the schools is:
> school.freq
school
 A  B  C  D  E  F  G  H
10  6  6 10  7  4  7  4

Enhanced Solution

We apply the cbind function to print the result in column format.
> cbind(school.freq)
  school.freq
A          10
B           6
C           6
D          10
E           7
F           4
G           7
H           4
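When the counts are needed for further processing rather than display, one option is to convert the table into a data frame or to sort it. The sketch below is an alternative presentation under that assumption; the names school.df and school.sorted are introduced here for illustration only.

```r
library(MASS)
school = painters$School
school.freq = table(school)

school.df = as.data.frame(school.freq)   # columns: school, Freq
school.df

school.sorted = sort(school.freq, decreasing=TRUE)   # largest schools first
school.sorted
```

The data frame form is convenient for merging or plotting, while the sorted table makes the largest classes easy to spot.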

Relative Frequency Distribution of Qualitative Data

The relative frequency distribution of a data variable is a summary of the frequency proportion in a collection of non-overlapping categories.
The relationship of frequency and relative frequency is:
$\text{Relative Frequency} = \frac{\text{Frequency}}{\text{Sample Size}}$
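As a worked instance using numbers from the painters data, school A has 10 painters in a sample of 54, so its relative frequency is:
$\frac{10}{54} \approx 0.185$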

Example

In the data set painters, the relative frequency distribution of the School variable is a summary of the proportion of painters in each school.

Problem

Find the relative frequency distribution of the painter schools in the data set painters.

Solution

We first apply the table function to compute the frequency distribution of the School variable.
> library(MASS)                 # load the MASS package
> school = painters$School      # the painter schools
> school.freq = table(school)   # apply the table function
Then we find the sample size of painters with the nrow function, and divide the frequency distribution by it. The relative frequency distribution is therefore:
> school.relfreq = school.freq / nrow(painters)

Answer

The relative frequency distribution of the schools is:
> school.relfreq
school
       A        B        C        D        E        F
0.185185 0.111111 0.111111 0.185185 0.129630 0.074074
       G        H
0.129630 0.074074

Enhanced Solution

We can print with fewer digits and make it more readable by setting the digits option.
> old = options(digits=1)
> school.relfreq
school
   A    B    C    D    E    F    G    H
0.19 0.11 0.11 0.19 0.13 0.07 0.13 0.07
> options(old)
In addition, we can apply the cbind function to print the result in column format.
> old = options(digits=1)
> cbind(school.relfreq)
  school.relfreq
A           0.19
B           0.11
C           0.11
D           0.19
E           0.13
F           0.07
G           0.13
H           0.07
> options(old)    # restore the old option
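Base R also provides prop.table, which divides a table by its total in one step, and round, which trims the digits of a single result without touching the global options. The sketch below is offered as an alternative, not as the tutorial's own method; school.prop is a name introduced here.

```r
library(MASS)
school = painters$School
school.freq = table(school)

school.prop = prop.table(school.freq)   # same as school.freq / sum(school.freq)
round(school.prop, 2)                   # two decimals, global options unchanged
sum(school.prop)                        # proportions add up to 1
```

Because nothing is changed globally, there is no old option to restore afterwards.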

Quantitative Data

Quantitative data, also known as continuous data, consists of numeric values that support arithmetic operations. This contrasts with qualitative data, whose values belong to pre-defined classes on which no arithmetic is allowed. The examples below show how to apply some of the R tools for quantitative data analysis.
The tutorials in this section are based on a built-in data frame named faithful. It consists of a collection of observations of the Old Faithful geyser in Yellowstone National Park, USA. The following is a preview via the head function.
> head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
There are two observation variables in the data set, both measured in minutes. The first one, called eruptions, is the duration of each geyser eruption. The second one, called waiting, is the waiting time until the next eruption.
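Beyond head, a few base R calls give a fuller picture of the data frame before any binning is done. This is a small exploratory sketch; faithful is available in a default R session without loading any package.

```r
dim(faithful)                 # number of observations and variables
str(faithful)                 # both columns are numeric
summary(faithful$eruptions)   # quartiles, mean and extremes, in minutes
```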

Frequency Distribution of Quantitative Data

The frequency distribution of a data variable is a summary of the data occurrence in a collection of non-overlapping categories.

Example

In the data set faithful, the frequency distribution of the eruptions variable is the summary of eruptions according to some classification of the eruption durations.

Problem

Find the frequency distribution of the eruption durations in faithful.

Solution

The solution consists of the following steps:
We first find the range of eruption durations with the range function. It shows that the observed eruptions are between 1.6 and 5.1 minutes in duration.
> duration = faithful$eruptions
> range(duration)
[1] 1.6 5.1
Break the range into non-overlapping sub-intervals by defining a sequence of equally spaced break points. If we round the endpoints of the interval [1.6, 5.1] outward to half-integers (the minimum down, the maximum up), we obtain the interval [1.5, 5.5]. Hence we set the break points to be the half-integer sequence { 1.5, 2.0, 2.5, ... }.
> breaks = seq(1.5, 5.5, by=0.5) # half-integer sequence
> breaks
[1] 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
Classify the eruption durations according to the half-unit-length sub-intervals with cut. As the intervals are to be closed on the left, and open on the right, we set the right argument as FALSE.
> duration.cut = cut(duration, breaks, right=FALSE)
Compute the frequency of eruptions in each sub-interval with the table function.
> duration.freq = table(duration.cut)
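Two details of the steps above can be illustrated with a short sketch: the break points can be derived from the data rather than typed in, and a value lying exactly on a break point shows what the right argument of cut does. The variable names lower and upper are introduced here for illustration.

```r
duration = faithful$eruptions

# Derive the break points from the data:
# round the minimum down and the maximum up to the nearest half-integer.
lower = floor(min(duration) * 2) / 2
upper = ceiling(max(duration) * 2) / 2
breaks = seq(lower, upper, by=0.5)
breaks

# A value that sits exactly on a break point shows the effect of right:
cut(2, breaks, right=FALSE)   # falls in [2,2.5)
cut(2, breaks, right=TRUE)    # falls in (1.5,2]
```

With right=FALSE an eruption of exactly 2 minutes is counted in the interval that starts at 2, which matches the left-closed intervals used in this tutorial.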

Answer

The frequency distribution of the eruption duration is:
> duration.freq
duration.cut
[1.5,2) [2,2.5) [2.5,3) [3,3.5) [3.5,4) [4,4.5) [4.5,5)
     51      41       5       7      30      73      61
[5,5.5)
      4

Enhanced Solution

We apply the cbind function to print the result in column format.
> cbind(duration.freq)
        duration.freq
[1.5,2)            51
[2,2.5)            41
[2.5,3)             5
[3,3.5)             7
[3.5,4)            30
[4,4.5)            73
[4.5,5)            61
[5,5.5)             4
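As a cross-check on the cut and table approach, the same binning can be obtained from hist with plotting switched off. This is an alternative sketch rather than the tutorial's method; duration.hist is a name introduced here.

```r
duration = faithful$eruptions
breaks = seq(1.5, 5.5, by=0.5)
duration.freq = table(cut(duration, breaks, right=FALSE))

# hist can do the binning too; plot=FALSE returns the counts without drawing
duration.hist = hist(duration, breaks=breaks, right=FALSE, plot=FALSE)
duration.hist$counts

# the two approaches agree bin by bin
all(duration.hist$counts == as.vector(duration.freq))
```

One difference to keep in mind: by default hist also includes a value equal to the last break point in the final bin, which does not matter here because no eruption lasts 5.5 minutes.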

Relative Frequency Distribution of Quantitative Data

The relative frequency distribution of a data variable is a summary of the frequency proportion in a collection of non-overlapping categories.
The relationship of frequency and relative frequency is:
$\text{Relative Frequency} = \frac{\text{Frequency}}{\text{Sample Size}}$

Example

In the data set faithful, the relative frequency distribution of the eruptions variable shows the frequency proportion of the eruptions according to a duration classification.

Problem

Find the relative frequency distribution of the eruption durations in faithful.

Solution

We first find the frequency distribution of the eruption durations as follows. Further details can be found in the Frequency Distribution tutorial.
> duration = faithful$eruptions
> breaks = seq(1.5, 5.5, by=0.5)
> duration.cut = cut(duration, breaks, right=FALSE)
> duration.freq = table(duration.cut)
Then we find the sample size of faithful with the nrow function, and divide the frequency distribution by it. As a result, the relative frequency distribution is:
> duration.relfreq = duration.freq / nrow(faithful)
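Dividing by nrow(faithful) gives proportions that sum to 1 only if every observation was assigned to an interval. cut returns NA for values outside the break points, and table drops NA values by default, so a quick sanity check along the following lines can be worthwhile. It is a sketch, not part of the original solution.

```r
duration = faithful$eruptions
breaks = seq(1.5, 5.5, by=0.5)
duration.cut = cut(duration, breaks, right=FALSE)
duration.relfreq = table(duration.cut) / nrow(faithful)

anyNA(duration.cut)      # FALSE when every duration falls inside the breaks
sum(duration.relfreq)    # should therefore be 1
```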

Answer

The relative frequency distribution of the eruptions variable is:
> duration.relfreq
duration.cut
 [1.5,2)  [2,2.5)  [2.5,3)  [3,3.5)  [3.5,4)  [4,4.5)
0.187500 0.150735 0.018382 0.025735 0.110294 0.268382
 [4.5,5)  [5,5.5)
0.224265 0.014706

Enhanced Solution

We can print with fewer digits and make it more readable by setting the digits option.
> old = options(digits=1)
> duration.relfreq
duration.cut
[1.5,2) [2,2.5) [2.5,3) [3,3.5) [3.5,4) [4,4.5) [4.5,5)
   0.19    0.15    0.02    0.03    0.11    0.27    0.22
[5,5.5)
   0.01
> options(old)    # restore the old option
We then apply the cbind function to print both the frequency distribution and relative frequency distribution in parallel columns.
> old = options(digits=1)
> cbind(duration.freq, duration.relfreq)
        duration.freq duration.relfreq
[1.5,2)            51             0.19
[2,2.5)            41             0.15
[2.5,3)             5             0.02
[3,3.5)             7             0.03
[3.5,4)            30             0.11
[4,4.5)            73             0.27
[4.5,5)            61             0.22
[5,5.5)             4             0.01
> options(old)    # restore the old option

Cumulative Frequency Distribution

The cumulative frequency distribution of a quantitative variable is a summary of data frequency below a given level.

Example

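In the data set faithful, the cumulative frequency distribution of the eruptions variable shows the total number of eruptions whose durations are less than each of a set of chosen levels. Because the intervals below are closed on the left and open on the right, the comparison is a strict one.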

Problem

Find the cumulative frequency distribution of the eruption durations in faithful.

Solution

We first find the frequency distribution of the eruption durations as follows. Further details can be found in the Frequency Distribution tutorial.
> duration = faithful$eruptions
> breaks = seq(1.5, 5.5, by=0.5)
> duration.cut = cut(duration, breaks, right=FALSE)
> duration.freq = table(duration.cut)
We then apply the cumsum function to compute the cumulative frequency distribution.
> duration.cumfreq = cumsum(duration.freq)

Answer

The cumulative frequency distribution of the eruption duration is:
> duration.cumfreq
[1.5,2) [2,2.5) [2.5,3) [3,3.5) [3.5,4) [4,4.5) [4.5,5)
     51      92      97     104     134     207     268
[5,5.5)
    272

Enhanced Solution

We apply the cbind function to print the result in column format.
> cbind(duration.cumfreq)
        duration.cumfreq
[1.5,2)               51
[2,2.5)               92
[2.5,3)               97
[3,3.5)              104
[3.5,4)              134
[4,4.5)              207
[4.5,5)              268
[5,5.5)              272
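Since the intervals are closed on the left and open on the right, each cumulative count is the number of eruptions strictly shorter than the upper break point of its interval. The sketch below verifies this by counting directly; upper and direct are names introduced here.

```r
duration = faithful$eruptions
breaks = seq(1.5, 5.5, by=0.5)
duration.cumfreq = cumsum(table(cut(duration, breaks, right=FALSE)))

# count directly how many durations lie below each upper break point
upper = breaks[-1]
direct = sapply(upper, function(b) sum(duration < b))
direct

all(direct == as.vector(duration.cumfreq))   # matches the cumsum result
```

The last cumulative count equals the sample size, which is a handy check that no observation was left out.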

Cumulative Frequency Graph

A cumulative frequency graph or ogive of a quantitative variable is a curve graphically showing the cumulative frequency distribution.

Example

In the data set faithful, a point in the cumulative frequency graph of the eruptions variable shows the total number of eruptions whose durations are less than a given level.

Problem

Find the cumulative frequency graph of the eruption durations in faithful.

Solution

We first find the frequency distribution of the eruption durations as follows. Check the previous tutorial on Frequency Distribution for details.
> duration = faithful$eruptions
> breaks = seq(1.5, 5.5, by=0.5)
> duration.cut = cut(duration, breaks, right=FALSE)
> duration.freq = table(duration.cut)
We then compute its cumulative frequency with cumsum, add a starting zero element, and plot the graph.
> cumfreq0 = c(0, cumsum(duration.freq))
> plot(breaks, cumfreq0,            # plot the data
+   main="Old Faithful Eruptions",  # main title
+   xlab="Duration minutes",        # x−axis label
+   ylab="Cumulative eruptions")   # y−axis label
> lines(breaks, cumfreq0)           # join the points
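The separate call to lines can be avoided by asking plot for points and lines together through its type argument. The following is a sketch of that variant, using the same titles and labels as the solution.

```r
duration = faithful$eruptions
breaks = seq(1.5, 5.5, by=0.5)
duration.freq = table(cut(duration, breaks, right=FALSE))
cumfreq0 = c(0, cumsum(duration.freq))   # starting zero at the first break

plot(breaks, cumfreq0, type="o",         # "o" draws points and lines together
  main="Old Faithful Eruptions",
  xlab="Duration minutes",
  ylab="Cumulative eruptions")
```

The leading zero is what makes cumfreq0 the same length as breaks, so each cumulative count is plotted at the upper end of its interval.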

Cumulative Relative Frequency Distribution

The cumulative relative frequency distribution of a quantitative variable is a summary of frequency proportion below a given level.
The relationship between cumulative frequency and cumulative relative frequency is:
$\text{Cumulative Relative Frequency} = \frac{\text{Cumulative Frequency}}{\text{Sample Size}}$

Example

In the data set faithful, the cumulative relative frequency distribution of the eruptions variable shows the proportion of eruptions whose durations are less than each of a set of chosen levels.

Problem

Find the cumulative relative frequency distribution of the eruption durations in faithful.

Solution

We first find the frequency distribution of the eruption durations as follows. Further details can be found in the Frequency Distribution tutorial.
> duration = faithful$eruptions
> breaks = seq(1.5, 5.5, by=0.5)
> duration.cut = cut(duration, breaks, right=FALSE)
> duration.freq = table(duration.cut)
We then apply the cumsum function to compute the cumulative frequency distribution.
> duration.cumfreq = cumsum(duration.freq)
Then we find the sample size of faithful with the nrow function, and divide the cumulative frequency distribution by it. As a result, the cumulative relative frequency distribution is:
> duration.cumrelfreq = duration.cumfreq / nrow(faithful)

Answer

The cumulative relative frequency distribution of the eruption variable is:
> duration.cumrelfreq
[1.5,2) [2,2.5) [2.5,3) [3,3.5) [3.5,4) [4,4.5) [4.5,5)
0.18750 0.33824 0.35662 0.38235 0.49265 0.76103 0.98529
[5,5.5)
1.00000

Enhanced Solution

We can print with fewer digits and make it more readable by setting the digits option.
> old = options(digits=2)
> duration.cumrelfreq
[1.5,2) [2,2.5) [2.5,3) [3,3.5) [3.5,4) [4,4.5) [4.5,5)
   0.19    0.34    0.36    0.38    0.49    0.76    0.99
[5,5.5)
   1.00
> options(old)    # restore the old option
We then apply the cbind function to print both the cumulative frequency distribution and the cumulative relative frequency distribution in parallel columns.
> old = options(digits=2)
> cbind(duration.cumfreq, duration.cumrelfreq)
        duration.cumfreq duration.cumrelfreq
[1.5,2)               51                0.19
[2,2.5)               92                0.34
[2.5,3)               97                0.36
[3,3.5)              104                0.38
[3.5,4)              134                0.49
[4,4.5)              207                0.76
[4.5,5)              268                0.99
[5,5.5)              272                1.00
> options(old)
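The order of the two operations does not matter: accumulating the counts and then dividing gives the same result as dividing first and then accumulating the proportions. The sketch below checks this; by.cumfreq and by.relfreq are names introduced here.

```r
duration = faithful$eruptions
breaks = seq(1.5, 5.5, by=0.5)
duration.freq = table(cut(duration, breaks, right=FALSE))

by.cumfreq = cumsum(duration.freq) / nrow(faithful)   # as in the solution
by.relfreq = cumsum(duration.freq / nrow(faithful))   # accumulate proportions

all.equal(by.cumfreq, by.relfreq)   # TRUE up to floating-point rounding
```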

R - Line Graphs

A line chart is a graph that connects a series of points by drawing line segments between them. The points are ordered by one of their coordinates (usually the x-coordinate). Line charts are typically used to identify trends in data.

The plot() function in R is used to create the line graph.

Syntax

The basic syntax to create a line chart in R is:

plot(v, type, col, xlab, ylab, main)

The parameters are described below:

v is a vector containing the numeric values.
type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw both points and lines.
xlab is the label for the x axis.
ylab is the label for the y axis.
main is the title of the chart.
col is used to give colors to both the points and lines.

Example

A simple line chart is created using the input vector and the type parameter set to "o". The below script will create and save a line chart in the current R working directory.
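A minimal sketch of such a script might look as follows. The input vector v, the axis labels, the title and the file name line_chart.pdf are all made up for illustration, and the pdf device is used here only because it is part of base R's grDevices package.

```r
# Hypothetical input values; any numeric vector works here.
v = c(7, 12, 28, 3, 41)

# Open a PDF device so the chart is written to the working directory.
pdf(file="line_chart.pdf")

plot(v, type="o", col="red",    # points joined by line segments
  xlab="Index", ylab="Value",
  main="Simple line chart")

dev.off()                       # close the device and finish the file
```

Note that type is case-sensitive, so the lowercase "o" is required.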

Probability and Statistics

Thursday, 23 September 2021

Meeting 2: Descriptive Statistics in R (charts)
