2 Data Structures
2.1 R Data Types
- R has 6 basic data types (logical, integer, double, character, complex, and raw). These data types can be combined to form data structures (vector, list, matrix, dataframe, factor).
  - Vectors are the simplest type of data structure in R. A vector is a sequence of data elements of the same basic type.
  - Members of a vector are called ‘elements’.
  - Atomic vectors are homogeneous, i.e. each component has the same data type.
  - A vector type can be checked with the typeof() function.
  - A list is a vector but not an ‘atomic vector’.
- Create a vector or a list with c()
  - In R, a literal character or number is just a vector of length 1.
  - So, c() ‘combines’ them together in a series of 1-length vectors. It neither ‘creates’ nor ‘concatenates’ the vectors. It combines lists into a list and vectors into a vector.
  - All attributes (e.g. dim) except names are removed.
  - All arguments are coerced to a common type.
  - The output type is determined from the highest type of the components in the hierarchy NULL < raw < logical < integer < double < complex < character < list < expression.
R
# Integer: To declare as integer 'L' (not 'l') is added as Suffix
str(c(1L, 2L, NA, 4L, 5L))
## int [1:5] 1 2 NA 4 5
# Double (& Default)
str(c(1, 2, NA, 4, 5))
## num [1:5] 1 2 NA 4 5
# Character
str(c('a', 'b', NA, 'd', 'e'))
## chr [1:5] "a" "b" NA "d" "e"
# Logical
str(c(TRUE, FALSE, NA, FALSE, TRUE))
## logi [1:5] TRUE FALSE NA FALSE TRUE
- Examination of R Data Types
R
# To know about an Object Named Vector (pi, letters are predefined)
aa <- setNames(c(1, 2, NA, pi, 4), nm = letters[1:5])
typeof(aa) # Type
## [1] "double"
class(aa) # Class
## [1] "numeric"
str(aa) # Structure
## Named num [1:5] 1 2 NA 3.14 4
## - attr(*, "names")= chr [1:5] "a" "b" "c" "d" ...
length(aa) # Length
## [1] 5
dim(aa) # Dimensions
## NULL
is(aa)[1:6] # Inheritance
## [1] "numeric" "vector" "index" "replValue" "numLike" "number"
names(attributes(aa)) # Attributes
## [1] "names"
names(aa) # Names
## [1] "a" "b" "c" "d" "e"
2.2 R Matrices
- Matrices and arrays are vectors with the attribute dim attached to them.
  - The data elements must be of the same basic type.
  - A matrix is a two-dimensional rectangular data set.
  - ‘Arrays’ are multi-dimensional data structures; a matrix is the two-dimensional case, with data laid out in rows and columns.
R
# Create Matrix
aa <- matrix(1:6, nrow=3, ncol=2, byrow=TRUE, dimnames=list(NULL, c('x', 'y')))
bb <- matrix(1:6, nrow=3, ncol=2, byrow=FALSE, dimnames=list(NULL, c('x', 'y')))
print(aa)
## x y
## [1,] 1 2
## [2,] 3 4
## [3,] 5 6
print(bb)
## x y
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
# About
str(aa)
## int [1:3, 1:2] 1 3 5 2 4 6
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:2] "x" "y"
dim(aa)
## [1] 3 2
length(aa)
## [1] 6
# Matrices have 'dimnames' attribute instead of usual 'names'
names(attributes(aa))
## [1] "dim" "dimnames"
names(aa)
## NULL
dimnames(aa)
## [[1]]
## NULL
##
## [[2]]
## [1] "x" "y"
2.3 R DataFrames
- A data frame is a list of vectors, factors, and/or matrices all having the same length (number of rows in the case of matrices).
  - A data frame can contain a list that is the same length as the other components.
R
# Create DataFrame (letters is predefined vector of 26 elements)
aa <- data.frame(x = 4:6, y = letters[4:6])
print(aa)
## x y
## 1 4 d
## 2 5 e
## 3 6 f
# About
str(aa)
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 4 5 6
## $ y: chr "d" "e" "f"
dim(aa) #Dimensions Row x Column
## [1] 3 2
stopifnot(all(identical(nrow(aa), dim(aa)[1]),
identical(ncol(aa), dim(aa)[2])))
names(attributes(aa)) # Attributes
## [1] "names" "class" "row.names"
names(aa) # Names of column headers
## [1] "x" "y"
is.list(aa)
## [1] TRUE
is.vector(aa)
## [1] FALSE
is.atomic(aa)
## [1] FALSE
2.4 R Factors
- Factors are used to describe items that can have a finite number of values (gender, social class, etc.). A factor has a levels attribute and class factor.
  - A factor may be purely nominal or may have ordered categories.
R
# Create Factors Unordered
aa <- factor(c('female', 'male', 'male', 'female', 'male'), ordered = FALSE)
# Create Factors Ordered
bb <- factor(c('female', 'male', 'male', 'female', 'male'), ordered = TRUE)
print(aa)
## [1] female male male female male
## Levels: female male
print(bb)
## [1] female male male female male
## Levels: female < male
# About
str(aa)
## Factor w/ 2 levels "female","male": 1 2 2 1 2
str(bb)
## Ord.factor w/ 2 levels "female"<"male": 1 2 2 1 2
nlevels(aa) # Count of Levels
## [1] 2
levels(aa) # Vector of Levels
## [1] "female" "male"
names(attributes(aa)) # Attributes
## [1] "levels" "class"
2.5 R Membership
- anyNA() is TRUE if there is an NA present, FALSE otherwise
- is.atomic() is TRUE for all atomic vectors, factors, and matrices but is FALSE for lists and dataframes
- is.vector() is TRUE for all atomic vectors and lists but is FALSE for factors, matrices, Date & POSIXct
  - It returns FALSE if the vector has attributes (except names), e.g. Date, POSIXct, DataFrames (even though a DataFrame is a list and a list is a vector)
- is.numeric() is TRUE for both integer and double
- is.integer(), is.double(), is.character(), is.logical() are TRUE for their respective data types only
- is.factor() is TRUE for any factor, ordered or not; is.ordered() is TRUE only for ordered factors
R
# Create Objects
aa_num <- setNames(c(1, 2, NA, pi, 4), nm = letters[1:5])
bb_mat <- matrix(1:6, nrow=3, ncol=2, byrow=TRUE)
dd_dft <- data.frame(x = 4:6, y = letters[4:6])
ee_lst <- list(x = 4:6, y = letters[4:8])
ff_fct <- factor(c('female', 'male', 'male', 'female', 'male'), ordered = FALSE)
# List of Objects
gg <- list(Vector = aa_num, Matrix = bb_mat, DataFrame = dd_dft,
List = ee_lst, Factor = ff_fct)
# Apply a membership function on all of the objects inside the list
names(which(sapply(gg, is.atomic)))
## [1] "Vector" "Matrix" "Factor"
names(which(sapply(gg, is.vector)))
## [1] "Vector" "List"
names(which(sapply(gg, is.matrix)))
## [1] "Matrix"
names(which(sapply(gg, is.list)))
## [1] "DataFrame" "List"
names(which(sapply(gg, is.data.frame)))
## [1] "DataFrame"
names(which(sapply(gg, is.factor)))
## [1] "Factor"
2.6 Python Types
- General
  - The principal built-in types are numerics, sequences, mappings, classes, instances and exceptions.
  - Some collection classes are mutable. The methods that add, subtract, or rearrange their members in place, and do not return a specific item, never return the collection instance itself but None.
  - Practically all objects can be compared for equality, tested for truth value, and converted to a string.
- Truth Value Testing
  - constants defined to be false: None and False.
  - zero of any numeric type: 0, 0.0, 0j, Decimal(0), Fraction(0, 1)
  - empty sequences and collections: '', (), [], {}, set(), range(0)
- Boolean Operations in ascending order of priority: or, and, not
- There are eight comparison operations in Python: <, <=, >, >=, ==, !=, is, is not.
- Numeric Types: int, float, complex
  - Booleans are a subtype of integers.
  - Python integers have unlimited precision, whereas R integers are limited to \(2^{31} - 1 = 2147483647\).
- Iterator Types
  - Sequences always support the iteration methods.
- Sequence Types: list, tuple, range
  - A negative index is relative to the end of the sequence in Python, whereas in R a negative index excludes the elements at those positions.
  - Concatenating immutable sequences always results in a new object.
  - The range type represents an immutable sequence of numbers and is commonly used for looping a specific number of times in for loops.
- Text Sequence Type: str
  - Strings are immutable sequences of Unicode code points.
- Set Types: set (mutable), frozenset (immutable)
- Mapping Types: dict
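Several of the points above can be checked directly. The following is a minimal sketch rather than a complete survey; the variable names are illustrative and not part of the original notes.

```python
from decimal import Decimal
from fractions import Fraction

# Every value listed above as "false" evaluates to False under bool().
falsy_values = [None, False, 0, 0.0, 0j, Decimal(0), Fraction(0, 1),
                '', (), [], {}, set(), range(0)]
all_falsy = not any(bool(v) for v in falsy_values)

# In-place methods of mutable collections return None, not the collection.
items = [3, 1, 2]
sort_result = items.sort()      # sorts 'items' in place

# Booleans are a subtype of integers, so they take part in arithmetic.
bool_is_int = issubclass(bool, int)
true_plus_true = True + True

# Python integers are not capped at 2**31 - 1.
beyond_r_limit = (2**31 - 1) + 1

# A negative index counts from the end of the sequence.
last_item = items[-1]

# 'not' binds tighter than 'and', which binds tighter than 'or'.
# Parsed as: True or (False and (not True))
priority_demo = True or False and not True

print(all_falsy, sort_result, items, bool_is_int, true_plus_true,
      beyond_r_limit, last_item, priority_demo)
```

Because items.sort() returns None, chaining such as items.sort().reverse() fails; sorted(items) is the usual choice when a new list is wanted.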
2.7 Similarities
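The R assertions in this section can be mirrored in Python. This is a minimal sketch with illustrative variable names; Python has no direct counterpart of R's NA, so only the NaN and infinity cases are covered.

```python
import math

nan = float('nan')
inf = float('inf')

# Floor division rounds toward negative infinity, as R's %/% does.
floor_results = [1 // 2, -1 // 2, 1 // -2, -1 // -2]

# x**0 and 1**y evaluate to 1 even for NaN and infinity.
power_results = [0**0, nan**0, inf**0, 1**nan]

# Multiplying NaN or infinity by zero gives NaN.
nan_products = [math.isnan(nan * 0), math.isnan(inf * 0)]

print(floor_results)
print(power_results)
print(nan_products)
```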
R
# R results are same as Python
stopifnot(0 == 1 %/% 2 )
stopifnot(-1 == (-1) %/% 2 )
stopifnot(-1 == 1 %/% (-2) )
stopifnot(0 == (-1) %/% (-2) )
# 1^y and y^0 are 1, ALWAYS in both R and Python
stopifnot(all(sapply(list(0**0, NaN**0, NA**0, Inf**0, 1**NA, 1**NaN),
identical, 1)))
stopifnot(is.nan(NaN * 0))
stopifnot(is.na(NA * 0) & !is.nan(NA * 0)) #NA
stopifnot(is.nan(Inf * 0))
2.8 Differences
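For comparison with the R output in this section, here is a minimal Python sketch of the same sequences. The variable names are illustrative; range() is used as the nearest Python counterpart of seq.int() and the colon operator, which is an assumption about intent rather than an exact equivalence.

```python
# Python indexing starts at 0 and range() excludes its stop value.
zero_to_four = list(range(5))             # like seq.int(from = 0, length.out = 5)
one_to_five = list(range(1, 6))           # like R's 1:5
stepping_down = list(range(1, -8, -2))    # like seq.int(by = -2, length.out = 5)

# An empty range stays empty, whereas R's 1:0 counts down to 1 0.
empty = list(range(1, 0))

# The colon appears in slices and in dict key: value pairs, not as a sequence operator.
letters = ['a', 'b', 'c', 'd', 'e']
first_two = letters[0:2]                  # stop index 2 is excluded
mapping = {'a': 1}

print(zero_to_four, one_to_five, stepping_down, empty, first_two, mapping)
```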
R
# R index starts from 1, whereas Python index starts from 0
# R range includes the maximum value, whereas Python excludes it
seq.int(length.out = 5)
## [1] 1 2 3 4 5
seq.int(from = 0, length.out = 5)
## [1] 0 1 2 3 4
seq.int(to = 10, length.out = 5)
## [1] 6 7 8 9 10
seq.int(by = -2, length.out = 5)
## [1] 1 -1 -3 -5 -7
# Colon ':' acts as sequence in R, Python uses colon for dictionary key:value
# Colon should be avoided in R if the range limits may change
1:5
## [1] 1 2 3 4 5
1:0
## [1] 1 0
2.9 Python Collections
- Python Collections: List, Tuple, Set, Dictionary
  - tuple: Literal (), Ordered, Immutable, Allows Duplicates
  - list: Literal [], Ordered, Mutable, Allows Duplicates
  - dict: Literal {}, Ordered (insertion order), Mutable, No Duplicate Keys
  - set: Literal {} with elements (an empty {} is a dict, so an empty set is written set()), Unordered, Mutable, No Duplicates
    - Effectively, ‘sets’ are ‘dictionaries’ that hold keys only, without values
    - Indexing and slicing do not work on a set because it is unordered
  - Immutable objects include numbers, strings and tuples.
  - A Python set is converted to an R environment. However, it is tricky to handle, so it is being put on hold for now. Note that Python set operations like ‘set difference’ can be applied using scope resolution, i.e. py$set_x$difference(set_y)
Python
pp = 11 # int
print(f'{type(pp) = } | ... | {pp = }')
## type(pp) = <class 'int'> | ... | pp = 11
pp = 11, # Implicit Tuple
print(f'{type(pp) = } | {len(pp) = } | {pp = }')
## type(pp) = <class 'tuple'> | len(pp) = 1 | pp = (11,)
pp = (11, ) # Length 1 Tuple needs comma
print(f'{type(pp) = } | {len(pp) = } | {pp = }')
## type(pp) = <class 'tuple'> | len(pp) = 1 | pp = (11,)
pp = (11, 22, 33) # Tuple
print(f'{type(pp) = } | {len(pp) = } | {pp = }')
## type(pp) = <class 'tuple'> | len(pp) = 3 | pp = (11, 22, 33)
pp = [11, 22, 33] # List
print(f'{type(pp) = } | {len(pp) = } | {pp = }')
## type(pp) = <class 'list'> | len(pp) = 3 | pp = [11, 22, 33]
pp = {11, 22, 33} # Set (unordered)
print(f'{type(pp) = } | {len(pp) = } | {pp = }')
## type(pp) = <class 'set'> | len(pp) = 3 | pp = {33, 11, 22}
pp = {'a': 11, 'b': 22, 'c': 33} # Dictionary
print(f'{type(pp) = } | {len(pp) = } | {pp = }')
## type(pp) = <class 'dict'> | len(pp) = 3 | pp = {'a': 11, 'b': 22, 'c': 33}
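The properties listed for each collection (mutability, duplicates, indexing) can be probed with a short sketch like the one below. The variable names are illustrative and not part of the original notes.

```python
# Tuples are immutable: item assignment raises TypeError.
point = (11, 22, 33)
try:
    point[0] = 99
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False

# Lists are mutable and keep duplicates.
values = [11, 22, 22]
values.append(33)

# Sets drop duplicates and do not support indexing.
unique = {11, 22, 22, 33}
try:
    unique[0]
    set_is_indexable = True
except TypeError:
    set_is_indexable = False

# Set operations such as difference return a new set.
only_left = {11, 22, 33}.difference({22, 44})

# An empty pair of braces is a dict; an empty set needs set().
empty_braces_type = type({})
empty_set_type = type(set())

# Dicts keep insertion order and hold one value per key (the last one wins).
lookup = {'a': 11, 'b': 22, 'a': 99}

print(tuple_is_mutable, values, len(unique), set_is_indexable,
      only_left, empty_braces_type, empty_set_type, lookup)
```

The difference() call here is the same method that the note above reaches from R through py$set_x$difference(set_y).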