Data types and structures

Some parts of this section have been copied from, or are based on, material in the R programming wikibook.

Overview

Most programming systems and languages have some basic data types which allow data to be represented, organised and accessed efficiently by programs written in the language. This section introduces some of the most common and useful data types available in the base R system. There are many other data structures that could be useful for statistical programming and data manipulation which are not part of the base distribution of R. However, many of these are provided through freely available packages, for example the datastructures package.

Introduction

Vectors are the simplest R objects: an ordered collection of primitive values of a single type (e.g. real numbers, strings, logicals). Vectors are indexed by integers starting at 1. Factors are similar to vectors, but each element is categorical, i.e. one of a fixed number of possibilities (or levels).

A matrix is like a vector with an additional instruction for its layout, so that the elements are indexed by two integers, each starting at 1. Arrays are similar to matrices but can have more than 2 dimensions.

A list is similar to a vector, but the elements need not all be of the same type. The elements of a list can be indexed either by integers or by names, i.e. an R list can be used to implement what is known in other languages as an “associative array”, “hash table”, “map” or “dictionary”, although not very efficiently. A dataframe is like a matrix but does not assume that all columns have the same type: it is a list of variables/vectors of the same length.

Classes define what objects of a certain kind look like. A class is attached to an object as an attribute. Every R object has a class and a type, and matrices, arrays and dataframes also have a dimension; for a plain vector dim returns NULL. The class, type, and dimension of an object can be determined using the class, typeof, and dim functions.

x<-c(1,2,3,4)
Y<-matrix(c(1,2,3,4),2,2)
class(x)
typeof(x)
dim(x)
'numeric'
'double'
NULL
class(Y)
typeof(Y)
dim(Y)
  1. 'matrix'
  2. 'array'
'double'
  1. 2
  2. 2
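
Factors are described above but not demonstrated. The following is a minimal sketch of how class and typeof behave for a factor; the variable sizes and its values are invented purely for illustration.

```r
# Hypothetical data: 'sizes' and its values are invented for this sketch.
sizes <- factor(c("small", "large", "small", "medium"),
                levels = c("small", "medium", "large"))

size_levels <- levels(sizes)      # the fixed set of categories
size_codes  <- as.integer(sizes)  # integer codes pointing into the levels

print(size_levels)
print(size_codes)
print(class(sizes))   # the class attribute
print(typeof(sizes))  # the underlying storage type
```

A factor is stored as integer codes together with a levels attribute, which is why its type differs from its class.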

Vectors

You can create a vector using the c() function, which concatenates its arguments. You can create a sequence using the : operator or the seq() function. For instance, 1:5 gives all the integers from 1 to 5. The seq() function lets you specify the interval between successive numbers. You can also repeat a pattern using the rep() function. Finally, you can create a vector of a given length filled with a default value: numeric() gives zeros, character() gives empty strings and logical() gives FALSE.

Exercise 1

See if you can predict the output from each of these ways of creating vectors.

c(1,2,3,4,5)
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
c("a","b","c","d","e")
  1. 'a'
  2. 'b'
  3. 'c'
  4. 'd'
  5. 'e'
c(T,F,T,F)
  1. TRUE
  2. FALSE
  3. TRUE
  4. FALSE
1:5
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
5:1
  1. 5
  2. 4
  3. 3
  4. 2
  5. 1
seq(1,5)
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
seq(1,5,by=.5)
  1. 1
  2. 1.5
  3. 2
  4. 2.5
  5. 3
  6. 3.5
  7. 4
  8. 4.5
  9. 5
rep(1,5)
  1. 1
  2. 1
  3. 1
  4. 1
  5. 1
rep(1:2,5)
  1. 1
  2. 2
  3. 1
  4. 2
  5. 1
  6. 2
  7. 1
  8. 2
  9. 1
  10. 2
numeric(5)
  1. 0
  2. 0
  3. 0
  4. 0
  5. 0
logical(5)
  1. FALSE
  2. FALSE
  3. FALSE
  4. FALSE
  5. FALSE
character(5)
  1. ''
  2. ''
  3. ''
  4. ''
  5. ''
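
The seq() and rep() functions accept a few further arguments that are often handy. The sketch below shows some of them, and also one way to build a vector of genuinely missing values; all variable names are illustrative only.

```r
# All names below are illustrative only.
evenly_spaced <- seq(0, 1, length.out = 5)   # five equally spaced values from 0 to 1
each_repeated <- rep(1:2, each = 3)          # repeat each element in turn
uneven_repeat <- rep(1:2, times = c(3, 1))   # a separate repeat count per element
all_missing   <- rep(NA_real_, 3)            # numeric vector of missing values

print(evenly_spaced)
print(each_repeated)
print(uneven_repeat)
print(all_missing)
```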

Vectors can be assigned to variables, and their elements accessed using the [] brackets.

Height <- c(168, 177, 177, 177, 178, 172, 165, 171, 178, 170) # store a vector
Height[2] # Print the second component
Height[2:5] # Print the second, the 3rd, the 4th and 5th component
obs <- 1:10
Weight <- c(88, 72, 85, 52, 71, 69, 61, 61, 51, 75) 
BMI <- Weight/((Height/100)^2)   # Performs a simple calculation using vectors
BMI
index<-c(1,4,6)
BMI[index] # use a vector to index another vector
177
  1. 177
  2. 177
  3. 177
  4. 178
  1. 31.1791383219955
  2. 22.98190175237
  3. 27.1314117909924
  4. 16.5980401544894
  5. 22.4087867693473
  6. 23.3234180638183
  7. 22.4058769513315
  8. 20.8611196607503
  9. 16.0964524681227
  10. 25.9515570934256
  1. 31.1791383219955
  2. 16.5980401544894
  3. 23.3234180638183

Note how # can be used to place comments in your code.

Negative indices can also be used to “drop” values from a vector.

x<-c(1,2,3,4)
y<-x[-2]
print(y)
z<-x[-length(x)]
print(z)
[1] 1 3 4
[1] 1 2 3
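
Besides positive and negative integers, a logical vector can be used as an index. The following sketch reuses the Height values from above; the variable tall is a name chosen here for illustration.

```r
Height <- c(168, 177, 177, 177, 178, 172, 165, 171, 178, 170)

tall <- Height > 175          # logical vector, one TRUE/FALSE per element
tall_heights   <- Height[tall]   # keep only the elements where tall is TRUE
tall_positions <- which(tall)    # integer positions of the TRUE values

print(tall_heights)
print(tall_positions)
```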

Matrices

If you want to create a new matrix, one way is to use the matrix function. You supply a vector of data, the number of rows and/or columns, and optionally whether R should fill the matrix by row or by column (the default). Here are two examples. Note that in the second example the 15 data values are recycled to fill all 25 cells.

matrix(data = NA, nrow = 5, ncol = 5, byrow = T)
matrix(data = 1:15, nrow = 5, ncol = 5, byrow = T)
A matrix: 5 × 5 of type lgl
NA NA NA NA NA
NA NA NA NA NA
NA NA NA NA NA
NA NA NA NA NA
NA NA NA NA NA
A matrix: 5 × 5 of type int
 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
 1  2  3  4  5
 6  7  8  9 10
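
The effect of the byrow argument is easiest to see on a small matrix. This is a minimal sketch; m_col and m_row are names chosen for illustration.

```r
# Same data, two fill orders (names are illustrative).
m_col <- matrix(1:6, nrow = 2, ncol = 3)                # default: fill column by column
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # fill row by row

print(m_col)
print(m_row)
```

With the default, the first column is filled with 1 and 2 before moving on; with byrow = TRUE the first row is filled with 1, 2 and 3.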

The functions cbind and rbind combine vectors into matrices in a column by column or row by row mode.

v1 <- 1:5
v2 <- 5:1
cbind(v1,v2)
A matrix: 5 × 2 of type int
v1 v2
 1  5
 2  4
 3  3
 4  2
 5  1
rbind(v1,v2)
A matrix: 2 × 5 of type int
v1 1 2 3 4 5
v2 5 4 3 2 1

The dimensions of a matrix can be obtained using the dim function. Alternatively, nrow and ncol return the number of rows and columns in a matrix.

X <- matrix(data = 1:15, nrow = 5, ncol = 5, byrow = T)
dim(X)
nrow(X)
ncol(X)
  1. 5
  2. 5
5
5

Exercise 2

How would you access the value of an element of a matrix using [] ?

What is the value of \(X_{4,3}\) ?

X[4,3]
3
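
Leaving one of the two indices empty selects a whole row or column. The sketch below uses a small matrix A, a name introduced here only for illustration.

```r
A <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # illustrative 2 x 3 matrix

second_row    <- A[2, ]                # whole row, returned as a plain vector
third_column  <- A[, 3]                # whole column, returned as a plain vector
row_as_matrix <- A[2, , drop = FALSE]  # drop = FALSE keeps the result a 1 x 3 matrix

print(second_row)
print(third_column)
print(dim(row_as_matrix))
```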

The function t forms the transpose of a matrix.

t(X)
A matrix: 5 × 5 of type int
1  6 11 1  6
2  7 12 2  7
3  8 13 3  8
4  9 14 4  9
5 10 15 5 10

Matrices are not just arrays (i.e. a way of organising data in a grid); they also have an algebra.

Exercise 3

What do you think the output of the following code examples might be ?

X*X
A matrix: 5 × 5 of type int
  1   4   9  16  25
 36  49  64  81 100
121 144 169 196 225
  1   4   9  16  25
 36  49  64  81 100
X%*%X
print(X)
A matrix: 5 × 5 of type dbl
 80  95 110 125 140
205 245 285 325 365
330 395 460 525 590
 80  95 110 125 140
205 245 285 325 365
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]    1    2    3    4    5
[5,]    6    7    8    9   10
print(X%*%X[,1:2])
     [,1] [,2]
[1,]   80   95
[2,]  205  245
[3,]  330  395
[4,]   80   95
[5,]  205  245
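
The difference between the two operators in this exercise can be checked by hand on a 2 by 2 example. This is a small sketch; the matrix B is made up for illustration.

```r
B <- matrix(c(1, 2, 3, 4), nrow = 2)  # illustrative matrix with rows (1, 3) and (2, 4)

elementwise    <- B * B           # multiplies matching entries
matrix_product <- B %*% B         # rows of B times columns of B
unchanged      <- B %*% diag(2)   # multiplying by the identity leaves B as it was

print(elementwise)
print(matrix_product)
print(unchanged)
```

For example, the top-left entry of the matrix product is 1*1 + 3*2 = 7, whereas the top-left entry of the elementwise product is simply 1*1 = 1.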

A matrix can be visualised using the plot function.

M <- cbind(obs,Height,Weight,BMI) # Create a matrix
plot(M)
[Figure: plot of the matrix M]
plot(Height,Weight,ylab="Weight",xlab="Height",main="Corpulence")
[Figure: scatter plot of Weight against Height, titled “Corpulence”]

Arrays

An array has n dimensions and holds R objects of a single type. An array with one dimension and one element may be constructed as follows.

x <- array(c(T,F),dim=c(1))
print(x)
[1] TRUE

The array x was created with a single dimension of length 1 (dim=c(1)), so only the first value of the vector c(T,F) was used. A similar array, y, can be created with a single dimension and two values.

y <- array(c(T,F),dim=c(2))
print(y)
[1]  TRUE FALSE

A three dimensional array - 3 by 3 by 3 - may be created as follows.

z <- array(1:27,dim=c(3,3,3))
dim(z)
print(z)
  1. 3
  2. 3
  3. 3
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

, , 3

     [,1] [,2] [,3]
[1,]   19   22   25
[2,]   20   23   26
[3,]   21   24   27
is.matrix(z[,,1])
TRUE

Exercise 4

How would you access all of the elements in the third 3 by 3 slice of z, i.e. those whose third index is 3 ?

z[,,3]
A matrix: 3 × 3 of type int
19 22 25
20 23 26
21 24 27

Exercise 5

What would you expect the output of the following code to be ?

print(z[,c(2,3),c(2,3)])
, , 1

     [,1] [,2]
[1,]   13   16
[2,]   14   17
[3,]   15   18

, , 2

     [,1] [,2]
[1,]   22   25
[2,]   23   26
[3,]   24   27
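
Indexing an array works the same way along every dimension. The sketch below picks out a single element, fixes the second index instead of the third, and uses apply to summarise each slice; the variable names other than z are illustrative.

```r
z <- array(1:27, dim = c(3, 3, 3))

one_element  <- z[2, 3, 1]       # row 2, column 3 of the first slice
second_cols  <- z[, 2, ]         # every element whose second index is 2
slice_sums   <- apply(z, 3, sum) # sum over each of the three slices

print(one_element)
print(second_cols)
print(slice_sums)
```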

The dimensions of an array need not all have the same length. The following code creates an array made up of a pair of 3 by 3 matrices.

w <- array(1:18,dim=c(3,3,2))
print(w)
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

The elements of an array must all be of the same type, but they need not be numbers. Here the two values c(T,F) are recycled to fill all 18 cells.

u <- array(c(T,F),dim=c(3,3,2))
print(u)
, , 1

      [,1]  [,2]  [,3]
[1,]  TRUE FALSE  TRUE
[2,] FALSE  TRUE FALSE
[3,]  TRUE FALSE  TRUE

, , 2

      [,1]  [,2]  [,3]
[1,] FALSE  TRUE FALSE
[2,]  TRUE FALSE  TRUE
[3,] FALSE  TRUE FALSE

Exercise 6

Try evaluating the following code in your own notebook. What would you expect the output to be ? Were you correct ?

z <- array(1:27,dim=c(3,3,3))
is.matrix(z)
is.matrix(z[,,1])
FALSE
TRUE

Lists

A list is a collection of R objects. The list function creates a list and the unlist function transforms a list into a vector. The objects in a list do not have to be of the same type or length.

x <- c(1:4)
y <- FALSE
z <- matrix(c(1:4),nrow=2,ncol=2)
myList <- list(x,y,z)
print(myList)
[[1]]
[1] 1 2 3 4

[[2]]
[1] FALSE

[[3]]
     [,1] [,2]
[1,]    1    3
[2,]    2    4
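
The unlist function mentioned above is not shown in the example, so here is a minimal sketch. The list person and its contents are invented for illustration.

```r
# Hypothetical list with named elements of different types and lengths.
person <- list(name = "Ada", scores = c(90, 85), passed = TRUE)

n_items   <- length(person)           # number of top-level elements
flat_nums <- unlist(list(1, 2:3, 4))  # flattens to a single numeric vector
flat_all  <- unlist(person)           # mixed types are coerced to a common type

print(n_items)
print(flat_nums)
print(flat_all)
```

Because a vector can only hold one type, unlisting a list of mixed types coerces everything to the most general type present, here character.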

Lists have very flexible methods for reference

  • by index number :

Notice the use of [[ ]]

a <- list()
a[[1]] <- "A"
print(a)
a[[2]] <- "B"
print(a)
[[1]]
[1] "A"
[[1]]
[1] "A"

[[2]]
[1] "B"
  • by name

a$fruit <- "Apple"
a$color <- "green"
print(a)
[[1]]
[1] "A"

[[2]]
[1] "B"

$fruit
[1] "Apple"

$color
[1] "green"
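
Access by name is what allows a list to act as a simple dictionary. The following is a sketch of that usage under made-up data; the list ages and its keys are hypothetical.

```r
# Hypothetical dictionary of ages keyed by name.
ages <- list()
ages[["alice"]] <- 31
ages[["bob"]]   <- 27

alice_age   <- ages$alice               # look up by name with $
has_bob     <- "bob" %in% names(ages)   # test whether a key is present
carol_entry <- ages[["carol"]]          # a missing key gives NULL for a list

ages[["bob"]] <- NULL                   # assigning NULL removes an entry
remaining   <- names(ages)

print(alice_age)
print(has_bob)
print(is.null(carol_entry))
print(remaining)
```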

Lists are recursive data structures !!

a <- list()
a[[1]] <- "hello"
a[[2]] <- list(c(1,2,3))
a[[2]]$message <- "hi !!"
print(a)
[[1]]
[1] "hello"

[[2]]
[[2]][[1]]
[1] 1 2 3

[[2]]$message
[1] "hi !!"

Exercise 7

What would you expect the output of the following to be ?

print(a[[2]][[2]])
[1] "hi !!"
print(a[[2]][2])
$message
[1] "hi !!"
print(a[[2]][[1]][2])
[1] 2
print(a[2])
[[1]]
[[1]][[1]]
[1] 1 2 3

[[1]]$message
[1] "hi !!"
print(a[[2]])
[[1]]
[1] 1 2 3

$message
[1] "hi !!"

Confused ? That is not surprising. Lists are recursive and inhomogeneous - a tricky combination !! The key thing here is that [[ ]] returns a single item from a list, whereas [ ] returns a list containing the elements indicated in the [ ]. For example

x <- list(1:3, "a", 4:6, list(4,5,6)) # x is a list with four items
x[1:2] # this prints a LIST containing the first 2 items of list x
x[[2]] # this is the second ITEM contained in x - in this case, this is NOT a list
x[[4]] # this returns the fourth ITEM in the list x, which happens to be a list
x[4] # this returns a LIST containing the fourth item of x - which is itself a LIST !! i.e. x[4] is a list containing a list
    1. 1
    2. 2
    3. 3
  1. 'a'
'a'
  1. 4
  2. 5
  3. 6
    1. 4
    2. 5
    3. 6
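
One way to check which form of bracket returned what is to inspect the class and length of each result. This is a small sketch using the same list x as above; the other variable names are illustrative.

```r
x <- list(1:3, "a", 4:6, list(4, 5, 6))

single_class <- class(x[1])     # [ ] always gives back a list
double_class <- class(x[[1]])   # [[ ]] gives back the item itself
outer_length <- length(x[4])    # a list holding one item (the inner list)
inner_length <- length(x[[4]])  # the inner list itself, with three items
nested_item  <- x[[4]][[2]]     # second item of the inner list

print(single_class)
print(double_class)
print(outer_length)
print(inner_length)
print(nested_item)
```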

If you are still confused, take a look at the relevant section of Advanced R by Hadley Wickham.

Exercise 8

What is a recursive data structure ? Do you know of any recursive data structures other than a list ?

Data frames

A dataframe has been referred to as “a list of variables/vectors of the same length”. In the following example, a dataframe is created from two vectors, each of five elements. The first vector, v1, is the sequence of integers 1 through 5. The second vector, v2, consists of five logical values, each either T or F. The dataframe is then created from these two vectors. The columns of the dataframe can be accessed using integer subscripts, or using the column name and the $ symbol.

v1 <- 1:5
v2 <- c(T,T,F,F,T)
df <- data.frame(v1,v2)
print(df)
df[,1]
df$v2
  v1    v2
1  1  TRUE
2  2  TRUE
3  3 FALSE
4  4 FALSE
5  5  TRUE
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  1. TRUE
  2. TRUE
  3. FALSE
  4. FALSE
  5. TRUE
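
Rows can be selected in a similar way, and new columns can be added with $. The following is a minimal sketch based on the dataframe above; the column v3 and the other result names are introduced here for illustration.

```r
v1 <- 1:5
v2 <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
df <- data.frame(v1, v2)

true_rows  <- df[df$v2, ]   # keep only the rows where v2 is TRUE
second_row <- df[2, ]       # a single row, still a dataframe
df$v3 <- df$v1 * 2          # add a new (illustrative) column
col_classes <- sapply(df, class)  # class of each column

print(true_rows)
print(second_row)
print(col_classes)
```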

A dataframe may also be created directly, naming each column as part of the argument list.

df <- data.frame(foo=1:5,bar=c(T,T,F,F,T))
print(df)
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE
5   5  TRUE

Note - the rows of a data frame can be inhomogeneous, but each column is homogeneous.
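
This can be checked directly, since a dataframe is a list of columns. The sketch below reuses the foo/bar dataframe from above; the result names are illustrative.

```r
df <- data.frame(foo = 1:5, bar = c(TRUE, TRUE, FALSE, FALSE, TRUE))

is_a_list    <- is.list(df)         # a dataframe is a list of columns
column_types <- sapply(df, typeof)  # exactly one type per column
first_row    <- df[1, ]             # a row mixes types, so it stays a dataframe
first_col    <- df[["foo"]]         # a column is a plain vector

print(is_a_list)
print(column_types)
print(class(first_row))
print(first_col)
```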