2 Data Structures

2.1 R Data Types

R has 6 basic data types (logical, integer, double, character, complex, and raw). These data types can be combined to form Data Structures (vector, list, matrix, dataframe, factor).
- Vectors are the simplest type of data structure in R. A vector is a sequence of data elements of the same basic type.
- Members of a vector are called ‘elements’.
- Atomic vectors are homogeneous i.e. each component has the same datatype.
- A vector type can be checked with the typeof() function.
- list is a vector but not an ‘atomic vector’.
Create a vector or a list by c()
- In R, a literal character or number is just a vector of length 1.
- So c() ‘combines’ a series of length-1 vectors into one vector; it neither ‘creates’ nor ‘concatenates’ vectors. It combines lists into a list and vectors into a vector.
- All attributes (e.g. dim) except names are removed.
- All arguments are coerced to a common type
- The output type is determined from the highest type of the components in the hierarchy NULL < raw < logical < integer < double < complex < character < list < expression.

# Integer: append the suffix 'L' (not 'l') to declare an integer
str(c(1L, 2L, NA, 4L, 5L))
##  int [1:5] 1 2 NA 4 5

# Double (& Default)
str(c(1, 2, NA, 4, 5))
##  num [1:5] 1 2 NA 4 5

# Character
str(c('a', 'b', NA, 'd', 'e'))
##  chr [1:5] "a" "b" NA "d" "e"

# Logical
str(c(TRUE, FALSE, NA, FALSE, TRUE))
##  logi [1:5] TRUE FALSE NA FALSE TRUE

Examination of R Data Types

# Inspect an object, here a named vector (pi and letters are predefined)
aa <- setNames(c(1, 2, NA, pi, 4), nm = letters[1:5])

typeof(aa)              # Type
## [1] "double"
class(aa)               # Class
## [1] "numeric"
str(aa)                 # Structure
##  Named num [1:5] 1 2 NA 3.14 4
##  - attr(*, "names")= chr [1:5] "a" "b" "c" "d" ...
length(aa)              # Length
## [1] 5
dim(aa)                 # Dimensions
## NULL
is(aa)[1:6]             # Inheritance
## [1] "numeric"   "vector"    "index"     "replValue" "numLike"   "number"
names(attributes(aa))   # Attributes
## [1] "names"
names(aa)               # Names
## [1] "a" "b" "c" "d" "e"

2.2 R Matrices

Matrices and arrays are vectors with the attribute dim attached to them
- The data elements must be of the same basic type.
- A matrix is a two-dimensional rectangular data set.
- ‘Arrays’ are multi-dimensional data structures. A higher-dimensional array can be viewed as a stack of matrices, each with rows and columns.

# Create Matrix
aa <- matrix(1:6, nrow=3, ncol=2, byrow=TRUE, dimnames=list(NULL, c('x', 'y')))
bb <- matrix(1:6, nrow=3, ncol=2, byrow=FALSE, dimnames=list(NULL, c('x', 'y')))

print(aa)
##      x y
## [1,] 1 2
## [2,] 3 4
## [3,] 5 6
print(bb)
##      x y
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6

# About
str(aa)
##  int [1:3, 1:2] 1 3 5 2 4 6
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:2] "x" "y"
dim(aa)
## [1] 3 2
length(aa)
## [1] 6

# Matrices have 'dimnames' attribute instead of usual 'names'
names(attributes(aa))
## [1] "dim"      "dimnames"
names(aa)
## NULL
dimnames(aa) 
## [[1]]
## NULL
## 
## [[2]]
## [1] "x" "y"

2.3 R DataFrames

A data frame is a list of vectors, factors, and/or matrices all having the same length (number of rows in the case of matrices).
- A data frame can contain a list that is the same length as the other components.

# Create DataFrame (letters is predefined vector of 26 elements)
aa <- data.frame(x = 4:6, y = letters[4:6])
print(aa)
##   x y
## 1 4 d
## 2 5 e
## 3 6 f

# About
str(aa)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  4 5 6
##  $ y: chr  "d" "e" "f"
dim(aa)                 #Dimensions Row x Column
## [1] 3 2
stopifnot(all(identical(nrow(aa), dim(aa)[1]),
              identical(ncol(aa), dim(aa)[2])))

names(attributes(aa))   # Attributes
## [1] "names"     "class"     "row.names"
names(aa)               # Names of column headers
## [1] "x" "y"

is.list(aa)
## [1] TRUE
is.vector(aa)
## [1] FALSE
is.atomic(aa)
## [1] FALSE

2.4 R Factors

Factors are used to describe items that can have a finite number of values (gender, social class, etc.). A factor has a levels attribute and class factor.
- A factor may be purely nominal or may have ordered categories.

# Create Factors Unordered
aa <- factor(c('female', 'male', 'male', 'female', 'male'), ordered = FALSE)
# Create Factors Ordered
bb <- factor(c('female', 'male', 'male', 'female', 'male'), ordered = TRUE)

print(aa)
## [1] female male   male   female male  
## Levels: female male
print(bb)
## [1] female male   male   female male  
## Levels: female < male

# About
str(aa)
##  Factor w/ 2 levels "female","male": 1 2 2 1 2
str(bb)
##  Ord.factor w/ 2 levels "female"<"male": 1 2 2 1 2

nlevels(aa)             # Count of Levels
## [1] 2
levels(aa)              # Vector of Levels
## [1] "female" "male"
names(attributes(aa))   # Attributes
## [1] "levels" "class"

2.5 R Membership

anyNA() is TRUE if there is an NA present, FALSE otherwise
is.atomic() is TRUE for all atomic vectors, factors, matrices but is FALSE for lists and dataframes
is.vector() is TRUE for atomic vectors and lists that carry no attributes other than names, but is FALSE for factors, matrices, Date & POSIXct
- It returns FALSE if the vector has attributes (except names), e.g. Date, POSIXct, data frames (even though a data frame is a list and a list is a vector)
is.numeric() is TRUE for both integer and double
is.integer(), is.double(), is.character(), is.logical() are TRUE for their respective datatypes only
is.factor() is TRUE for any factor, ordered or not; is.ordered() is TRUE only for ordered factors

# Create Objects
aa_num <- setNames(c(1, 2, NA, pi, 4), nm = letters[1:5])
bb_mat <- matrix(1:6, nrow=3, ncol=2, byrow=TRUE)
dd_dft <- data.frame(x = 4:6, y = letters[4:6])
ee_lst <- list(x = 4:6, y = letters[4:8])
ff_fct <- factor(c('female', 'male', 'male', 'female', 'male'), ordered = FALSE)

# List of Objects
gg <- list(Vector = aa_num, Matrix = bb_mat, DataFrame = dd_dft, 
            List = ee_lst, Factor = ff_fct)

# Apply a membership function on all of the objects inside the list
names(which(sapply(gg, is.atomic)))
## [1] "Vector" "Matrix" "Factor"
names(which(sapply(gg, is.vector)))
## [1] "Vector" "List"
names(which(sapply(gg, is.matrix)))
## [1] "Matrix"
names(which(sapply(gg, is.list)))
## [1] "DataFrame" "List"
names(which(sapply(gg, is.data.frame)))
## [1] "DataFrame"
names(which(sapply(gg, is.factor)))
## [1] "Factor"

2.6 Python Types

(doc) Built-in Types
General
- The principal built-in types are numerics, sequences, mappings, classes, instances and exceptions.
- Some collection classes are mutable. The methods that add, subtract, or rearrange their members in place, and do not return a specific item, never return the collection instance itself but None.
- Practically all objects can be compared for equality, tested for truth value, and converted to a string.
Truth Value Testing
- constants defined to be false: None and False.
- zero of any numeric type: 0, 0.0, 0j, Decimal(0), Fraction(0, 1)
- empty sequences and collections: '', (), [], {}, set(), range(0)
Boolean Operations in ascending order of priority: or, and, not
There are eight comparison operations in Python.
Numeric Types: int, float, complex
- Booleans are a subtype of integers
- Python integers have unlimited precision, whereas R integers are limited to a maximum of $2^{31} - 1 = 2147483647$
Iterator Types
- Sequences always support the iteration methods.
Sequence Types: list, tuple, range
- A negative index is relative to the end of the sequence in Python, whereas in R a negative index excludes the elements at those positions
- Concatenating immutable sequences always results in a new object.
- The range type represents an immutable sequence of numbers and is commonly used for looping a specific number of times in for loops.
Text Sequence Type: str
- Strings are immutable sequences of Unicode code points.
Set Types: set (mutable), frozenset (immutable)
Mapping Types: dict
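
A few of the points above can be checked directly. This is a minimal sketch using only the standard library; the variable names (falsy, items) are illustrative and not from the original notes.

```python
from decimal import Decimal
from fractions import Fraction

# Objects that test as false: None, False, numeric zeros, empty collections
falsy = [None, False, 0, 0.0, 0j, Decimal(0), Fraction(0, 1),
         '', (), [], {}, set(), range(0)]
assert not any(bool(x) for x in falsy)

# Priority: 'or' is lowest, then 'and', then 'not'
assert (True or False and False) is True    # parsed as True or (False and False)
assert (not False and False) is False       # parsed as (not False) and False

# In-place methods return None, not the collection itself
items = [3, 1, 2]
assert items.sort() is None
assert items == [1, 2, 3]

# bool is a subtype of int; Python ints are not capped at 2^31 - 1
assert issubclass(bool, int) and True + True == 2
assert 2**31 - 1 == 2147483647
assert 2**100 == 1267650600228229401496703205376
```

A common slip is writing items = items.sort(), which rebinds items to None.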

2.7 Similarities

# R results are same as Python
stopifnot(0  == 1 %/% 2 )
stopifnot(-1 == (-1) %/% 2 )
stopifnot(-1 == 1 %/% (-2) )
stopifnot(0  == (-1) %/% (-2) )

# 1^y and y^0 are 1, ALWAYS in both R and Python
stopifnot(all(sapply(list(0**0, NaN**0, NA**0, Inf**0, 1**NA, 1**NaN), 
                     identical, 1)))

stopifnot(is.nan(NaN * 0))
stopifnot(is.na(NA * 0) & !is.nan(NA * 0))        #NA

stopifnot(is.nan(Inf * 0))

Python

import math
import numpy as np

# Python results are the same as R
assert(0  == 1//2 )
assert(-1 == (-1) // 2 )
assert(-1 == 1 // (-2) )
assert(0  == (-1) // (-2) )

# 1^y and y^0 are 1, ALWAYS in both R and Python
assert(1 == 0**0 == np.nan**0 == 1**np.nan)

assert(np.isnan(np.nan * 0))
assert(np.isnan(math.inf * 0))

2.8 Differences

# R allows division by zero whereas Python throws ZeroDivisionError
stopifnot(is.na(0L %/% 0L) & !is.nan(0L %/% 0L))  #NA
stopifnot(is.nan(0 %/% 0 ))                       #NaN
stopifnot(is.infinite(1 %/% 0))                   #Inf

# R index starts from 1, whereas Python index starts from 0
# R range includes the maximum value, whereas Python excludes it
seq.int(length.out = 5)
## [1] 1 2 3 4 5
seq.int(from = 0, length.out = 5)
## [1] 0 1 2 3 4
seq.int(to = 10, length.out = 5)
## [1]  6  7  8  9 10
seq.int(by = -2, length.out = 5)
## [1]  1 -1 -3 -5 -7

# Colon ':' acts as sequence in R, Python uses colon for dictionary key:value
# Avoid the colon in R when the limits may change: 1:n counts down when n < 1 (seq_len(n) is safer)
1:5
## [1] 1 2 3 4 5
1:0
## [1] 1 0

Python

# Python index starts from 0, whereas R index starts from 1
# Python range excludes the maximum value, whereas R includes it
list(range(5))
## [0, 1, 2, 3, 4]
list(range(1, 6))
## [1, 2, 3, 4, 5]
list(range(10, 1, -2))
## [10, 8, 6, 4, 2]
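
The Python side of the division-by-zero and indexing differences can be sketched as below. safe_floordiv is a hypothetical helper written for this illustration only; the check covers built-in int and float operands.

```python
def safe_floordiv(a, b):
    """Return a // b, or None when b is zero (hypothetical helper)."""
    try:
        return a // b
    except ZeroDivisionError:
        # Built-in int and float both raise here; R returns NA, NaN or Inf
        return None

assert safe_floordiv(1, 2) == 0
assert safe_floordiv(1, 0) is None          # int // 0
assert safe_floordiv(0.0, 0.0) is None      # float // 0.0

# Indexing starts at 0; a negative index counts from the end
seq = list(range(1, 6))                     # [1, 2, 3, 4, 5]
assert seq[0] == 1
assert seq[-1] == 5
assert seq[-2] == 4

# Slices, like range(), exclude the stop position
assert seq[0:2] == [1, 2]
```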

2.9 Python Collections

Python Collections: List, Tuple, Set, Dictionary
- tuple: Literal (), Ordered, Immutable, Allows Duplicates
- list : Literal [], Ordered, Mutable, Allows Duplicates
- dict : Literal {}, Ordered (insertion order), Mutable, No Duplicate Keys
- set : Literal {...} (an empty set is set(), because {} is an empty dict), Unordered, Mutable, No Duplicates
  - Effectively, ‘sets’ are ‘dictionaries’ without values, i.e. keys only
  - Indexing and slicing do not work on a set because it is unordered
  - Immutable objects include numbers, strings and tuples.
  - A set is converted to an R environment. However, it is tricky to handle, so it is on hold for now. Note that Python set operations like ‘set difference’ can be applied from R through the $ operator, i.e. py$set_x$difference(set_y)

Python

pp = 11                                           # int
print(f'{type(pp) = }   |     ...     | {pp = }')
## type(pp) = <class 'int'>   |     ...     | pp = 11

pp = 11,                                          # Implicit Tuple
print(f'{type(pp) = } | {len(pp) = } | {pp = }')
## type(pp) = <class 'tuple'> | len(pp) = 1 | pp = (11,)

pp = (11, )                                       # Length 1 Tuple needs comma
print(f'{type(pp) = } | {len(pp) = } | {pp = }')
## type(pp) = <class 'tuple'> | len(pp) = 1 | pp = (11,)

pp = (11, 22, 33)                                 # Tuple
print(f'{type(pp) = } | {len(pp) = } | {pp = }')
## type(pp) = <class 'tuple'> | len(pp) = 3 | pp = (11, 22, 33)

pp = [11, 22, 33]                                 # List
print(f'{type(pp) = }  | {len(pp) = } | {pp = }')
## type(pp) = <class 'list'>  | len(pp) = 3 | pp = [11, 22, 33]

pp = {11, 22, 33}                                 # Set (unordered)
print(f'{type(pp) = }   | {len(pp) = } | {pp = }')
## type(pp) = <class 'set'>   | len(pp) = 3 | pp = {33, 11, 22}

pp = {'a': 11, 'b': 22, 'c': 33}                  # Dictionary
print(f'{type(pp) = }  | {len(pp) = } | {pp = }')
## type(pp) = <class 'dict'>  | len(pp) = 3 | pp = {'a': 11, 'b': 22, 'c': 33}
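
The properties listed for each collection (duplicates, mutability, indexing) can be verified with a short sketch. The names tup, lst, st and dct are illustrative only.

```python
tup = (11, 22, 22)
lst = [11, 22, 22]
st = {11, 22, 22}
dct = {'a': 11, 'a': 22}                    # repeated key on purpose

assert len(tup) == 3 and len(lst) == 3      # duplicates are kept
assert len(st) == 2                         # duplicates are dropped
assert dct == {'a': 22}                     # last value for a repeated key wins

lst[0] = 99                                 # lists are mutable

tuple_is_immutable = False
try:
    tup[0] = 99                             # item assignment is not supported
except TypeError:
    tuple_is_immutable = True

set_is_subscriptable = True
try:
    st[0]                                   # sets cannot be indexed
except TypeError:
    set_is_subscriptable = False

# {} is an empty dict; an empty set needs set()
assert type({}) is dict and type(set()) is set

# Set difference, the same operation as st.difference({11})
assert st - {11} == {22}
```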

2.10 NumPy Array

Python

import numpy as np

pp = np.arange(12).reshape((3,4))
print(type(pp))
## <class 'numpy.ndarray'>
print(pp)
## [[ 0  1  2  3]
##  [ 4  5  6  7]
##  [ 8  9 10 11]]
assert(np.array_equal(pp[0, :], pp[0, ]))         #Verify same shape & values
print(pp[0, ])                                    #Subset First Row
## [0 1 2 3]
