7 Window Functions

7.1 Basics

A window function is a variation on an aggregation function. Where an aggregation function, like sum() and mean(), takes n inputs and return a single value, a window function returns n values. The output of a window function depends on all its input values, so window functions do not include functions that work element-wise, like round(). Window functions include variations on aggregate functions, like cumsum() and dplyr::cummean(), functions for ranking and ordering, like rank(), and functions for taking offsets, like dplyr::lead() and dplyr::lag(). (Refer)

7.2 Ranks

  • Python Pandas rank() Function & R dplyr package
  • dplyr::row_number() attributes a unique value to each row
  • dplyr::min_rank() attributes the same row number to the same value, leaving ‘holes’
  • dplyr::dense_rank() attributes the same row number to the same value, leaving no ‘holes’
    • Ex: My salary might be the 2nd-highest (Dense Rank) the company pays. But, I might be the 4th-highest (Rank) paid employee if the same top salary is paid to 3 other people.
  • dplyr::percent_rank() Rescaling min_rank() to [0, 1]
  • dplyr::cume_dist() Proportion of all values less than or equal to the current rank.
  • ntile() a rough rank, which breaks the input vector into n buckets

R

aa <- c(12, 15, 11, 13, NA, 12)
bb_row <- dplyr::row_number(aa)
bb_row_r <- rank(aa, ties.method = 'first', na.last = 'keep')
bb_row
## [1]  2  5  1  4 NA  3

dd_min <- dplyr::min_rank(aa)                     #Two 12, so skips once for 13
dd_min_r <- rank(aa, ties.method = 'min', na.last = 'keep')
dd_min
## [1]  2  5  1  4 NA  2

dplyr::dense_rank(aa)                             #Consecutive Rank (Dense Rank)
## [1]  2  4  1  3 NA  2

dplyr::percent_rank(aa)                           #Pandas do not have this
## [1] 0.25 1.00 0.00 0.75   NA 0.25

dplyr::cume_dist(aa)                              #Matches Pandas Percent Rank
## [1] 0.6 1.0 0.2 0.8  NA 0.6

dplyr::ntile(aa, n = 2)
## [1]  1  2  1  2 NA  1

stopifnot(all(identical(bb_row, bb_row_r), identical(dd_min, dd_min_r)))

Python

pp = pd.Series([12, 15, 11, 13, None, 12])
print(*pp)                  #Print Horizontal by using unpacking operator (*)
## 12.0 15.0 11.0 13.0 nan 12.0

print(*pp.rank(method = 'first'))                 #Row Number
## 2.0 5.0 1.0 4.0 nan 3.0


print(*pp.rank(method = 'min'))                   #Two 12, so skips once for 13
## 2.0 5.0 1.0 4.0 nan 2.0


print(*pp.rank(method = 'dense'))                 #Consecutive Rank (Dense Rank)
## 2.0 4.0 1.0 3.0 nan 2.0


qq = (pp.rank(method = 'min') - 1) / (pp.count() - 1)       #Percent Rank
print(*qq)
## 0.25 1.0 0.0 0.75 nan 0.25


print(*pp.rank(method = 'max', pct = True))       #Cumulative Distance
## 0.6 1.0 0.2 0.8 nan 0.6

7.3 Data: Lahman

R

# DF | Tibble | Subset | Sort | Semi Join |
batting <- Lahman::Batting |> 
    as_tibble() |> 
    select(playerID, yearID, teamID, G, AB, R, H) |> 
    arrange(playerID, yearID, teamID) |> 
    semi_join(Lahman::AwardsPlayers, by = "playerID")

# Grouped Tibble
players <- batting |> group_by(playerID)

R

# Grouped Players
players |> head()
## # A tibble: 6 × 7
## # Groups:   playerID [1]
##   playerID  yearID teamID     G    AB     R     H
##   <chr>      <int> <fct>  <int> <int> <int> <int>
## 1 aaronha01   1954 ML1      122   468    58   131
## 2 aaronha01   1955 ML1      153   602   105   189
## 3 aaronha01   1956 ML1      153   609   106   200
## 4 aaronha01   1957 ML1      151   615   118   198
## 5 aaronha01   1958 ML1      153   601   109   196
## 6 aaronha01   1959 ML1      154   629   116   223

7.4 Test

R

# For each player, find the two years with most hits
dplyr::filter(players, min_rank(desc(H)) <= 2 & H > 0)
## # A tibble: 2,920 × 7
## # Groups:   playerID [1,385]
##    playerID  yearID teamID     G    AB     R     H
##    <chr>      <int> <fct>  <int> <int> <int> <int>
##  1 aaronha01   1959 ML1      154   629   116   223
##  2 aaronha01   1963 ML1      161   631   121   201
##  3 abbotji01   1999 MIL       20    21     0     2
##  4 abernte02   1955 WS1       40    26     1     4
##  5 abernte02   1957 WS1       26    24     3     4
##  6 abreubo01   1999 PHI      152   546   118   183
##  7 abreubo01   2000 PHI      154   576   103   182
##  8 abreujo02   2016 CHA      159   624    67   183
##  9 abreujo02   2017 CHA      156   621    95   189
## 10 abreujo02   2022 CHA      157   601    85   183
## # ℹ 2,910 more rows


# Within each player, rank each year by the number of games played
dplyr::mutate(players, G_rank = min_rank(G))
## # A tibble: 20,874 × 8
## # Groups:   playerID [1,436]
##    playerID  yearID teamID     G    AB     R     H G_rank
##    <chr>      <int> <fct>  <int> <int> <int> <int>  <int>
##  1 aaronha01   1954 ML1      122   468    58   131      4
##  2 aaronha01   1955 ML1      153   602   105   189     13
##  3 aaronha01   1956 ML1      153   609   106   200     13
##  4 aaronha01   1957 ML1      151   615   118   198     12
##  5 aaronha01   1958 ML1      153   601   109   196     13
##  6 aaronha01   1959 ML1      154   629   116   223     17
##  7 aaronha01   1960 ML1      153   590   102   172     13
##  8 aaronha01   1961 ML1      155   603   115   197     18
##  9 aaronha01   1962 ML1      156   592   127   191     20
## 10 aaronha01   1963 ML1      161   631   121   201     23
## # ℹ 20,864 more rows