2 Data Structures
2.1 R Data Types
- R has 6 basic data types (logical, integer, double, character, complex, and raw). These data types can be combined to form Data Structures (vector, list, matrix, dataframe, factor).
  - Vectors are the simplest type of data structure in R. A vector is a sequence of data elements of the same basic type.
  - Members of a vector are called ‘elements’.
  - Atomic vectors are homogeneous i.e. each component has the same datatype.
  - A vector type can be checked with the typeof() function.
  - A list is a vector but not an ‘atomic vector’.
- Create a vector or a list by c()
  - In R, a literal character or number is just a vector of length 1.
  - So, c() ‘combines’ them together in a series of 1-length vectors. It neither ‘creates’ nor ‘concatenates’ the vectors. It combines lists into a list and vectors into a vector.
  - All attributes (e.g. dim) except names are removed.
  - All arguments are coerced to a common type.
  - The output type is determined from the highest type of the components in the hierarchy NULL < raw < logical < integer < double < complex < character < list < expression.
R
# Integer: To declare as integer 'L' (not 'l') is added as Suffix
str(c(1L, 2L, NA, 4L, 5L))
## int [1:5] 1 2 NA 4 5
# Double (& Default)
str(c(1, 2, NA, 4, 5))
## num [1:5] 1 2 NA 4 5
# Character
str(c('a', 'b', NA, 'd', 'e'))
## chr [1:5] "a" "b" NA "d" "e"
# Logical
str(c(TRUE, FALSE, NA, FALSE, TRUE))
## logi [1:5] TRUE FALSE NA FALSE TRUE
- Examination of R Data Types
R
# To know about an Object Named Vector (pi, letters are predefined)
aa <- setNames(c(1, 2, NA, pi, 4), nm = letters[1:5])
typeof(aa) # Type
## [1] "double"
class(aa) # Class
## [1] "numeric"
str(aa) # Structure
## Named num [1:5] 1 2 NA 3.14 4
## - attr(*, "names")= chr [1:5] "a" "b" "c" "d" ...
length(aa) # Length
## [1] 5
dim(aa) # Dimensions
## NULL
is(aa)[1:6] # Inheritance
## [1] "numeric" "vector" "index" "replValue" "numLike" "number"
names(attributes(aa)) # Attributes
## [1] "names"
names(aa) # Names
## [1] "a" "b" "c" "d" "e"
2.2 R Matrices
- Matrices and arrays are vectors with the attribute dim attached to them.
  - The data elements must be of the same basic type.
  - A matrix is a two-dimensional rectangular data set.
  - ‘Arrays’ are multi-dimensional data structures; a matrix is the two-dimensional case, with data arranged in rows and columns.
R
# Create Matrix
aa <- matrix(1:6, nrow=3, ncol=2, byrow=TRUE, dimnames=list(NULL, c('x', 'y')))
bb <- matrix(1:6, nrow=3, ncol=2, byrow=FALSE, dimnames=list(NULL, c('x', 'y')))
print(aa)
## x y
## [1,] 1 2
## [2,] 3 4
## [3,] 5 6
print(bb)
## x y
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
# About
str(aa)
## int [1:3, 1:2] 1 3 5 2 4 6
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:2] "x" "y"
dim(aa)
## [1] 3 2
length(aa)
## [1] 6
# Matrices have 'dimnames' attribute instead of usual 'names'
names(attributes(aa))
## [1] "dim" "dimnames"
names(aa)
## NULL
dimnames(aa)
## [[1]]
## NULL
##
## [[2]]
## [1] "x" "y"
2.3 R DataFrames
- A data frame is a list of vectors, factors, and/or matrices all having the same length (number of rows in the case of matrices).
  - A data frame can contain a list that is the same length as the other components.
R
# Create DataFrame (letters is predefined vector of 26 elements)
aa <- data.frame(x = 4:6, y = letters[4:6])
print(aa)
## x y
## 1 4 d
## 2 5 e
## 3 6 f
# About
str(aa)
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 4 5 6
## $ y: chr "d" "e" "f"
dim(aa) #Dimensions Row x Column
## [1] 3 2
stopifnot(all(identical(nrow(aa), dim(aa)[1]),
identical(ncol(aa), dim(aa)[2])))
names(attributes(aa)) # Attributes
## [1] "names" "class" "row.names"
names(aa) # Names of column headers
## [1] "x" "y"
is.list(aa)
## [1] TRUE
is.vector(aa)
## [1] FALSE
is.atomic(aa)
## [1] FALSE
2.4 R Factors
- Factors are used to describe items that can have a finite number of values (gender, social class, etc.). A factor has a levels attribute and class factor.
  - A factor may be purely nominal or may have ordered categories.
R
# Create Factors Unordered
aa <- factor(c('female', 'male', 'male', 'female', 'male'), ordered = FALSE)
# Create Factors Ordered
bb <- factor(c('female', 'male', 'male', 'female', 'male'), ordered = TRUE)
print(aa)
## [1] female male male female male
## Levels: female male
print(bb)
## [1] female male male female male
## Levels: female < male
# About
str(aa)
## Factor w/ 2 levels "female","male": 1 2 2 1 2
str(bb)
## Ord.factor w/ 2 levels "female"<"male": 1 2 2 1 2
nlevels(aa) # Count of Levels
## [1] 2
levels(aa) # Vector of Levels
## [1] "female" "male"
names(attributes(aa)) # Attributes
## [1] "levels" "class"
2.5 R Membership
- anyNA() is TRUE if there is an NA present, FALSE otherwise
- is.atomic() is TRUE for all atomic vectors, factors, matrices but is FALSE for lists and dataframes
- is.vector() is TRUE for all atomic vectors, lists but is FALSE for factors, matrices, Date & POSIXct
  - It returns FALSE if the vector has attributes (except names) ex: Date, POSIXct, DataFrames (even though a Dataframe is a list and a list is a vector)
- is.numeric() is TRUE for both integer and double
- is.integer(), is.double(), is.character(), is.logical() are TRUE for their respective datatypes only
- is.factor() is TRUE for any factor, ordered or not, whereas is.ordered() is TRUE only for ordered factors
R
# Create Objects
aa_num <- setNames(c(1, 2, NA, pi, 4), nm = letters[1:5])
bb_mat <- matrix(1:6, nrow=3, ncol=2, byrow=TRUE)
dd_dft <- data.frame(x = 4:6, y = letters[4:6])
ee_lst <- list(x = 4:6, y = letters[4:8])
ff_fct <- factor(c('female', 'male', 'male', 'female', 'male'), ordered = FALSE)
# List of Objects
gg <- list(Vector = aa_num, Matrix = bb_mat, DataFrame = dd_dft,
List = ee_lst, Factor = ff_fct)
# Apply a membership function on all of the objects inside the list
names(which(sapply(gg, is.atomic)))
## [1] "Vector" "Matrix" "Factor"
names(which(sapply(gg, is.vector)))
## [1] "Vector" "List"
names(which(sapply(gg, is.matrix)))
## [1] "Matrix"
names(which(sapply(gg, is.list)))
## [1] "DataFrame" "List"
names(which(sapply(gg, is.data.frame)))
## [1] "DataFrame"
names(which(sapply(gg, is.factor)))
## [1] "Factor"
2.6 Python Types
- General
  - The principal built-in types are numerics, sequences, mappings, classes, instances and exceptions.
  - Some collection classes are mutable. The methods that add, subtract, or rearrange their members in place, and do not return a specific item, never return the collection instance itself but None.
  - Practically all objects can be compared for equality, tested for truth value, and converted to a string.
- Truth Value Testing
  - constants defined to be false: None and False.
  - zero of any numeric type: 0, 0.0, 0j, Decimal(0), Fraction(0, 1)
  - empty sequences and collections: '', (), [], {}, set(), range(0)
- Boolean Operations in ascending order of priority: or, and, not
- There are eight comparison operations in Python: <, <=, >, >=, ==, !=, is, is not
- Numeric Types: int, float, complex
  - Booleans are a subtype of integers
  - Python Integers have unlimited precision. Whereas, R integers are limited to \((2^{31} - 1 = 2147483647)\)
- Iterator Types
  - Sequences always support the iteration methods.
- Sequence Types: list, tuple, range
  - A negative index is relative to the end of the sequence in Python. Whereas, in R it acts to exclude those indices.
  - Concatenating immutable sequences always results in a new object.
  - The range type represents an immutable sequence of numbers and is commonly used for looping a specific number of times in for loops.
- Text Sequence Type: str
  - Strings are immutable sequences of Unicode code points.
- Set Types: set (mutable), frozenset (immutable)
- Mapping Types: dict
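The points listed in this section can be checked directly in Python. The following is a minimal sketch, not an exhaustive test; the variable names (falsy, nums, t1, and so on) are illustrative only and do not come from the notes.

```python
from decimal import Decimal
from fractions import Fraction

# Objects that are false in a boolean context
falsy = [None, False, 0, 0.0, 0j, Decimal(0), Fraction(0, 1),
         '', (), [], {}, set(), range(0)]
all_falsy = not any(bool(x) for x in falsy)

# 'not' binds tighter than 'and', which binds tighter than 'or',
# so this parses as: True or (False and (not True))
precedence = True or False and not True

# In-place methods on mutable collections return None
nums = [3, 1, 2]
sort_result = nums.sort()      # nums itself is now sorted

# Booleans are a subtype of integers
bool_is_int = issubclass(bool, int)
true_plus_true = True + True

# Integers have unlimited precision (R's integer maximum is 2**31 - 1)
big = 2 ** 31                  # one past R's integer limit, still an int

# A negative index counts from the end of the sequence
last = nums[-1]

# Concatenating immutable sequences builds a new object
t1 = (1, 2)
t2 = t1 + (3,)
```

If the operators were grouped left to right, precedence would be False; it is True because 'and' is evaluated before 'or'.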
2.7 Similarities
R
# Integer division %/% floors toward -Inf in R, same as // in Python
stopifnot(0 == 1 %/% 2 )
stopifnot(-1 == (-1) %/% 2 )
stopifnot(-1 == 1 %/% (-2) )
stopifnot(0 == (-1) %/% (-2) )
# 1^y and y^0 are 1, ALWAYS in both R and Python
stopifnot(all(sapply(list(0**0, NaN**0, NA**0, Inf**0, 1**NA, 1**NaN),
identical, 1)))
stopifnot(is.nan(NaN * 0))
stopifnot(is.na(NA * 0) & !is.nan(NA * 0)) #NA
stopifnot(is.nan(Inf * 0))
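For comparison, the Python side of the same checks might look like the sketch below. Python has no built-in NA, so the NA line in the R code has no counterpart here; the names nan, inf, floor_div, powers and nan_products are illustrative.

```python
import math

nan = float('nan')
inf = float('inf')

# Floor division rounds toward negative infinity, like %/% in R
floor_div = [1 // 2, -1 // 2, 1 // -2, -1 // -2]

# x**0 and 1**y evaluate to 1, even for nan and inf
powers = [0 ** 0, nan ** 0, inf ** 0, 1 ** nan]

# nan and inf multiplied by zero give nan
nan_products = [math.isnan(nan * 0), math.isnan(inf * 0)]
```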
2.8 Differences
R
# R index starts from 1, whereas Python index starts from 0
# R range includes the maximum value, whereas Python excludes it
seq.int(length.out = 5)
## [1] 1 2 3 4 5
seq.int(from = 0, length.out = 5)
## [1] 0 1 2 3 4
seq.int(to = 10, length.out = 5)
## [1] 6 7 8 9 10
seq.int(by = -2, length.out = 5)
## [1] 1 -1 -3 -5 -7
# Colon ':' builds a sequence in R; Python uses colon for slices and dictionary key:value pairs
# Avoid 1:n in R when n may be 0, because it counts down instead of giving an empty vector
1:5
## [1] 1 2 3 4 5
1:0
## [1] 1 0
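The Python side of these differences can be sketched as follows. The list letters and the other variable names are made up for illustration; only range(), list() and slicing are used.

```python
# Python indexing starts at 0
letters = ['a', 'b', 'c', 'd', 'e']
first = letters[0]

# range() excludes its stop value
zero_based = list(range(5))          # five values starting at 0
one_based = list(range(1, 6))        # same values as 1:5 in R

# A range whose start is not below its stop is empty, unlike 1:0 in R
empty = list(range(1, 0))

# Counting down needs an explicit negative step
descending = list(range(1, -9, -2))  # same values as seq.int(by = -2, length.out = 5)

# Slices follow the same start-inclusive, stop-exclusive rule
middle = letters[1:3]
```

Because range(1, 0) is empty, a loop over it simply does not run, which is the behaviour that 1:0 in R fails to give.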
2.9 Python Collections
- Python Collections: List, Tuple, Set, Dictionary
  - tuple: Literal (), Ordered, Immutable, Allows Duplicates
  - list: Literal [], Ordered, Mutable, Allows Duplicates
  - dict: Literal {}, Ordered (by insertion), Mutable, No Duplicate keys
  - set: Literal {} with at least one element (an empty {} is a dict), Unordered, Mutable, No Duplicates
    - Effectively, ‘sets’ are ‘dictionaries’ without values
    - Indexing and slicing do not work on set because it is unordered
- Immutable objects include numbers, strings and tuples.
- A set is converted to an R environment. However, it is tricky to handle, so it is being put on hold for now. Note that Python set operations like ‘set difference’ can be applied using scope resolution i.e. py$set_x$difference(set_y)
Python
pp = 11 # int
print(f'{type(pp) = } | ... | {pp = }')
## type(pp) = <class 'int'> | ... | pp = 11
pp = 11, # Implicit Tuple
print(f'{type(pp) = } | {len(pp) = } | {pp = }')
## type(pp) = <class 'tuple'> | len(pp) = 1 | pp = (11,)
pp = (11, ) # Length 1 Tuple needs comma
print(f'{type(pp) = } | {len(pp) = } | {pp = }')
## type(pp) = <class 'tuple'> | len(pp) = 1 | pp = (11,)
pp = (11, 22, 33) # Tuple
print(f'{type(pp) = } | {len(pp) = } | {pp = }')
## type(pp) = <class 'tuple'> | len(pp) = 3 | pp = (11, 22, 33)
pp = [11, 22, 33] # List
print(f'{type(pp) = } | {len(pp) = } | {pp = }')
## type(pp) = <class 'list'> | len(pp) = 3 | pp = [11, 22, 33]
pp = {11, 22, 33} # Set (unordered)
print(f'{type(pp) = } | {len(pp) = } | {pp = }')
## type(pp) = <class 'set'> | len(pp) = 3 | pp = {33, 11, 22}
pp = {'a': 11, 'b': 22, 'c': 33} # Dictionary
print(f'{type(pp) = } | {len(pp) = } | {pp = }')
## type(pp) = <class 'dict'> | len(pp) = 3 | pp = {'a': 11, 'b': 22, 'c': 33}
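The properties of the collections listed in this section can be probed with a few short checks. This is a sketch under assumed values: set_x and set_y reuse the names from the set difference note, but their contents are invented here, as are the other variable names.

```python
set_x = {11, 22, 33}
set_y = {22, 44}

# Set difference: elements of set_x that are not in set_y
only_x = set_x.difference(set_y)

# Sets do not support indexing
try:
    set_x[0]
    set_indexable = True
except TypeError:
    set_indexable = False

# Tuples are immutable: item assignment raises TypeError
tup = (11, 22, 33)
try:
    tup[0] = 99
    tuple_mutable = True
except TypeError:
    tuple_mutable = False

# Duplicates are kept in a list and a tuple but collapse in a set
dup_counts = (len([11, 11, 22]), len((11, 11, 22)), len({11, 11, 22}))

# {} is an empty dict; an empty set needs set()
empty_types = (type({}).__name__, type(set()).__name__)

# A repeated dict key keeps only the last value
dd = {'a': 11, 'b': 22, 'a': 33}
```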