7 Window Functions
7.1 Basics
A window function is a variation on an aggregation function. Where an aggregation function, like sum()
and mean()
, takes n inputs and return a single value, a window function returns n values. The output of a window function depends on all its input values, so window functions do not include functions that work element-wise, like round()
. Window functions include variations on aggregate functions, like cumsum()
and dplyr::cummean()
, functions for ranking and ordering, like rank()
, and functions for taking offsets, like dplyr::lead()
and dplyr::lag()
. (Refer)
7.2 Ranks
- Python Pandas rank() Function & R dplyr package
- NOTE: rank() can return cumulative distribution equivalent to
dplyr::cume_dist()
but currently it cannot provide relative rank likedplyr::percent_rank()
- NOTE: rank() can return cumulative distribution equivalent to
-
dplyr::row_number()
attributes a unique value to each row -
dplyr::min_rank()
attributes the same row number to the same value, leaving ‘holes’ -
dplyr::dense_rank()
attributes the same row number to the same value, leaving no ‘holes’- Ex: My salary might be the 2nd-highest (Dense Rank) the company pays. But, I might be the 4th-highest (Rank) paid employee if the same top salary is paid to 3 other people.
-
dplyr::percent_rank()
Rescaling min_rank() to [0, 1] -
dplyr::cume_dist()
Proportion of all values less than or equal to the current rank. -
ntile()
a rough rank, which breaks the input vector into n buckets
R
aa <- c(12, 15, 11, 13, NA, 12)
bb_row <- dplyr::row_number(aa)
bb_row_r <- rank(aa, ties.method = 'first', na.last = 'keep')
bb_row
## [1] 2 5 1 4 NA 3
dd_min <- dplyr::min_rank(aa) #Two 12, so skips once for 13
dd_min_r <- rank(aa, ties.method = 'min', na.last = 'keep')
dd_min
## [1] 2 5 1 4 NA 2
dplyr::dense_rank(aa) #Consecutive Rank (Dense Rank)
## [1] 2 4 1 3 NA 2
dplyr::percent_rank(aa) #Pandas do not have this
## [1] 0.25 1.00 0.00 0.75 NA 0.25
dplyr::cume_dist(aa) #Matches Pandas Percent Rank
## [1] 0.6 1.0 0.2 0.8 NA 0.6
dplyr::ntile(aa, n = 2)
## [1] 1 2 1 2 NA 1
stopifnot(all(identical(bb_row, bb_row_r), identical(dd_min, dd_min_r)))
Python
pp = pd.Series([12, 15, 11, 13, None, 12])
print(*pp) #Print Horizontal by using unpacking operator (*)
## 12.0 15.0 11.0 13.0 nan 12.0
print(*pp.rank(method = 'first')) #Row Number
## 2.0 5.0 1.0 4.0 nan 3.0
print(*pp.rank(method = 'min')) #Two 12, so skips once for 13
## 2.0 5.0 1.0 4.0 nan 2.0
print(*pp.rank(method = 'dense')) #Consecutive Rank (Dense Rank)
## 2.0 4.0 1.0 3.0 nan 2.0
qq = (pp.rank(method = 'min') - 1) / (pp.count() - 1) #Percent Rank
print(*qq)
## 0.25 1.0 0.0 0.75 nan 0.25
print(*pp.rank(method = 'max', pct = True)) #Cumulative Distance
## 0.6 1.0 0.2 0.8 nan 0.6
7.3 Data: Lahman
R
# DF | Tibble | Subset | Sort | Semi Join |
batting <- Lahman::Batting |>
as_tibble() |>
select(playerID, yearID, teamID, G, AB, R, H) |>
arrange(playerID, yearID, teamID) |>
semi_join(Lahman::AwardsPlayers, by = "playerID")
# Grouped Tibble
players <- batting |> group_by(playerID)
R
# Grouped Players
players |> head()
## # A tibble: 6 × 7
## # Groups: playerID [1]
## playerID yearID teamID G AB R H
## <chr> <int> <fct> <int> <int> <int> <int>
## 1 aaronha01 1954 ML1 122 468 58 131
## 2 aaronha01 1955 ML1 153 602 105 189
## 3 aaronha01 1956 ML1 153 609 106 200
## 4 aaronha01 1957 ML1 151 615 118 198
## 5 aaronha01 1958 ML1 153 601 109 196
## 6 aaronha01 1959 ML1 154 629 116 223
7.4 Test
R
# For each player, find the two years with most hits
dplyr::filter(players, min_rank(desc(H)) <= 2 & H > 0)
## # A tibble: 2,920 × 7
## # Groups: playerID [1,385]
## playerID yearID teamID G AB R H
## <chr> <int> <fct> <int> <int> <int> <int>
## 1 aaronha01 1959 ML1 154 629 116 223
## 2 aaronha01 1963 ML1 161 631 121 201
## 3 abbotji01 1999 MIL 20 21 0 2
## 4 abernte02 1955 WS1 40 26 1 4
## 5 abernte02 1957 WS1 26 24 3 4
## 6 abreubo01 1999 PHI 152 546 118 183
## 7 abreubo01 2000 PHI 154 576 103 182
## 8 abreujo02 2016 CHA 159 624 67 183
## 9 abreujo02 2017 CHA 156 621 95 189
## 10 abreujo02 2022 CHA 157 601 85 183
## # ℹ 2,910 more rows
# Within each player, rank each year by the number of games played
dplyr::mutate(players, G_rank = min_rank(G))
## # A tibble: 20,874 × 8
## # Groups: playerID [1,436]
## playerID yearID teamID G AB R H G_rank
## <chr> <int> <fct> <int> <int> <int> <int> <int>
## 1 aaronha01 1954 ML1 122 468 58 131 4
## 2 aaronha01 1955 ML1 153 602 105 189 13
## 3 aaronha01 1956 ML1 153 609 106 200 13
## 4 aaronha01 1957 ML1 151 615 118 198 12
## 5 aaronha01 1958 ML1 153 601 109 196 13
## 6 aaronha01 1959 ML1 154 629 116 223 17
## 7 aaronha01 1960 ML1 153 590 102 172 13
## 8 aaronha01 1961 ML1 155 603 115 197 18
## 9 aaronha01 1962 ML1 156 592 127 191 20
## 10 aaronha01 1963 ML1 161 631 121 201 23
## # ℹ 20,864 more rows