Window Functions in SQL and Pandas

A practical guide to window functions—ranking, aggregation, and offset operations computed over a sliding frame—demonstrated side-by-side in SQL and Python's pandas library.

Window function = data analysis step

After the database server has completed all of the steps necessary to evaluate a query, including joining, filtering, grouping, and sorting, the result set is complete and ready to be returned to the caller. Imagine if you could pause the query execution at this point and take a walk through the result set while it is still held in memory; what types of analysis might you want to do? If your result set contains sales data, perhaps you might want to generate rankings for salespeople or regions, or calculate percentage differences between one time period and another. If you are generating results for a financial report, perhaps you would like to calculate subtotals for each report section, and a grand total for the final section. Using analytic functions, you can do all of these things and more.

Window functions let you compute aggregations, rankings, or offsets across a set of rows that are related to the current row—without collapsing the result into a single row the way GROUP BY does. They are one of the most powerful tools for data analysis in both SQL and pandas.
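
The contrast with GROUP BY can be shown in a few lines (toy data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"region": ["East", "East", "West"],
                   "amount": [500, 700, 300]})

# GROUP BY collapses the table: one row per region
collapsed = df.groupby("region")["amount"].sum()

# A window function keeps every row and attaches the group value to it
df["region_total"] = df.groupby("region")["amount"].transform("sum")
```

transform broadcasts the per-group sum back onto every row, which is what SUM() OVER (PARTITION BY region) does in SQL.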

Summary Cheat Sheet

| Operation | SQL | pandas |
|---|---|---|
| Dense rank | DENSE_RANK() OVER (...) | rank(method="dense") |
| Rank | RANK() OVER (...) | rank(method="min") |
| Group total | SUM() OVER (PARTITION BY ...) | groupby().transform("sum") |
| Row number | ROW_NUMBER() OVER (...) | rank(method="first") |
| Running total | SUM() OVER (ORDER BY ... ROWS ...) | groupby().cumsum() |
| Moving avg | AVG() OVER (ROWS BETWEEN n PRECEDING ...) | rolling(window=n).mean() |
| Previous row | LAG(col, 1) OVER (...) | groupby().shift(1) |
| Next row | LEAD(col, 1) OVER (...) | groupby().shift(-1) |
| First in group | FIRST_VALUE(col) OVER (...) | groupby().transform("first") |
| Last in group | LAST_VALUE(col) OVER (...) | groupby().transform("last") |
| Buckets | NTILE(n) OVER (...) | pd.qcut(..., q=n) |
| Percentile rank | PERCENT_RANK() OVER (...) | manual formula with rank() |
| Cumulative dist | CUME_DIST() OVER (...) | rank(pct=True, method="max") |

Ranking

Ranking functions assign an ordinal position to each row within its partition.

| Function | Behavior |
|---|---|
| DENSE_RANK() | Tied rows get the same rank; the next rank does not skip |
| ROW_NUMBER() | Unique sequential integer, no ties |
| RANK() | Tied rows get the same rank; the next rank skips |
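
The three behaviors are easy to see on a small series with a tie (toy data):

```python
import pandas as pd

s = pd.Series([100, 100, 90, 80])

# Descending order, as in "top salary" style problems
row_number = s.rank(method="first", ascending=False)  # 1, 2, 3, 4
rnk        = s.rank(method="min",   ascending=False)  # 1, 1, 3, 4
dense_rnk  = s.rank(method="dense", ascending=False)  # 1, 1, 2, 3
```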

.rank() function

` DataFrame.rank(axis=0, method='average', numeric_only=False, na_option='keep', ascending=True, pct=False) `

Key options

- method: how ties are ranked: 'average' (default), 'min', 'max', 'first', or 'dense'
- ascending: rank from smallest value (True, default) or largest (False)
- pct: return ranks as a fraction of the number of rows instead of integers
- na_option: 'keep' leaves NaN unranked; 'top' / 'bottom' rank NaNs first / last

Dense ranking

Leetcode 178: Rank Scores

This is almost a tutorial for the rank function.

def order_scores(scores: pd.DataFrame) -> pd.DataFrame:
    scores['rank'] = scores['score'].rank(method = 'dense', ascending = False)
    return scores.sort_values(by = 'score', ascending = False)[['score', 'rank']]

Leetcode 184: Department Highest Salary

A min (competition) rank is used here; since only the top row per department is kept, min and dense rank select the same rows.

def department_highest_salary(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:
    df = employee.merge(department, left_on='departmentId', right_on='id', how='left', suffixes=('_emp', '_dept'))
    df['rank'] = df.groupby('departmentId')['salary'].rank(method = 'min', ascending = False)
    return df[df['rank']==1][['name_dept', 'name_emp', 'salary']].rename(
        columns = {
            'name_dept' : 'Department',
            'name_emp' : 'Employee',
            'salary' : 'Salary'
        }
    )

Leetcode 185: Department Top Three Salaries

Dense rank is used here.

def top_three_salaries(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:
    df = employee.merge(department, how = 'left', left_on = 'departmentId', right_on = 'id', suffixes = ('_emp', '_dept'))
    df['salrank'] = df.groupby('departmentId')['salary'].rank(method = 'dense', ascending = False)
    return df[df['salrank']<=3][['name_dept', 'name_emp', 'salary']].rename(
        columns = {
            'name_dept' : 'Department',
            'name_emp' : 'Employee',
            'salary' : 'Salary'
        }
    )

Leetcode 1875: Group Employees of the Same Salary
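
The notes leave this one without code. A sketch from my recollection of the problem (a salary forms a "team" only if at least two employees share it, and team_id is the dense rank of the salary); verify the ordering and column names against the actual problem statement:

```python
import pandas as pd

def employees_with_same_salary(employees: pd.DataFrame) -> pd.DataFrame:
    # keep only salaries shared by at least two employees
    df = employees[employees.groupby('salary')['salary'].transform('size') > 1].copy()
    # team_id = dense rank of the salary (smallest shared salary -> team 1)
    df['team_id'] = df['salary'].rank(method='dense').astype(int)
    return df.sort_values(['team_id', 'employee_id'])
```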

Ranking with ties

Leetcode 512: Game Play Analysis II

def game_analysis(activity: pd.DataFrame) -> pd.DataFrame:
    activity['event_date_rank'] = activity.groupby('player_id')['event_date'].rank(method='min')
    return activity[activity['event_date_rank'] == 1][['player_id', 'device_id']]

Leetcode 586: Customer Placing the Largest Number of Orders

def largest_orders(orders: pd.DataFrame) -> pd.DataFrame:
    order_count = orders.groupby('customer_number').size().reset_index(name='order_count')  # result is two columns
    order_count['count_rank'] = order_count['order_count'].rank(method='min', ascending=False)
    return order_count[order_count['count_rank'] == 1][['customer_number']]

Leetcode 1070: Product Sales Analysis III

def sales_analysis(sales: pd.DataFrame) -> pd.DataFrame:
    sales['year_rank'] = sales.groupby('product_id')['year'].rank(method='min')
    return sales[sales['year_rank'] == 1][['product_id', 'year', 'quantity', 'price']].rename(
        columns={'year': 'first_year'}
    )

Leetcode 1082: Sales Analysis I

def sales_analysis(sales: pd.DataFrame, product: pd.DataFrame) -> pd.DataFrame:
    sales_record = sales.merge(product, on = 'product_id', how = 'left')
    revenue = sales_record.groupby('seller_id')['price'].sum().reset_index(name = 'revenue') 
    revenue['revenue_rank'] = revenue['revenue'].rank(method = 'min', ascending = False)
    return revenue[revenue['revenue_rank'] == 1][['seller_id']]

Leetcode 1112: Highest Grade For Each Student

def highest_grade(enrollments: pd.DataFrame) -> pd.DataFrame:
    sorted_df = enrollments.sort_values(
        by = ['student_id', 'grade', 'course_id'],
        ascending = [True, False, True]
    )
    
    top_grades = sorted_df.drop_duplicates(subset = ['student_id'], keep ='first')
    return top_grades[['student_id', 'course_id', 'grade']]

Row number: unique integers

Leetcode 601. Human Traffic of Stadium
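
The notes leave this one without code. A hedged sketch, with column names from my recollection of the problem (id, visit_date, people; a row qualifies when it sits in a run of three or more consecutive ids, each with people >= 100):

```python
import pandas as pd

def human_traffic(stadium: pd.DataFrame) -> pd.DataFrame:
    df = stadium.sort_values('id')
    big = df[df['people'] >= 100].copy()
    # consecutive ids share the same (id - row_number) block key
    big['block'] = big['id'] - big['id'].rank(method='first')
    big['run_len'] = big.groupby('block')['id'].transform('size')
    return big[big['run_len'] >= 3][['id', 'visit_date', 'people']].sort_values('visit_date')
```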

AVG, MAX, MIN: groupby + transform('mean', 'max', 'min')

Leetcode 615. Average Salary: Departments VS Company

import pandas as pd
import numpy as np
def average_salary(salary: pd.DataFrame, employee: pd.DataFrame) -> pd.DataFrame:
    # 1. Merge tables
    df = salary.merge(employee, on='employee_id')

    # 2. Format the date as a string 'YYYY-MM' (safer than to_period)
    df['pay_month'] = df['pay_date'].dt.strftime('%Y-%m')

    # 3. Calculate the averages using your transform logic
    df['salary_avg'] = df.groupby('pay_month')['amount'].transform('mean')
    df['salary_avg_dept'] = df.groupby(['pay_month', 'department_id'])['amount'].transform('mean')
    

    # 4. Drop duplicates to get unique month-department combinations
    df = df.drop_duplicates(
        subset=['pay_month', 'department_id'], keep='first'
    ).sort_values(['pay_month', 'department_id']).copy()
    
    # 5. Create the exact 'higher', 'lower', 'same' comparison
    conditions = [
        df['salary_avg_dept'] > df['salary_avg'],
        df['salary_avg_dept'] < df['salary_avg']
    ]
    choices = ['higher', 'lower']
    df['comparison'] = np.select(conditions, choices, default='same')
    
    # 6. Return only the requested columns
    return df[['pay_month', 'department_id', 'comparison']]

Leetcode 1084. Sales Analysis III

import pandas as pd

def sales_analysis(product: pd.DataFrame, sales: pd.DataFrame) -> pd.DataFrame:
    df = sales.merge(product, on= 'product_id')
    df['is_first_quarter'] = df['sale_date'].between('2019-01-01', '2019-03-31')
    df['is_only_first_quarter'] = df.groupby('product_id')['is_first_quarter'].transform('min')
    return df[df['is_only_first_quarter'] == True].drop_duplicates(subset = ['product_id'])[['product_id', 'product_name']]

Group cumulative sum: groupby + cumsum

Leetcode 534. Game Play Analysis III

The Activity table has columns player_id (int), device_id (int), event_date (date), and games_played (int); (player_id, event_date) is the primary key. Each row records a player logging in and playing a number of games (possibly 0) before logging out.

Write a solution to report, for each player and date, how many games the player has played so far, that is, the running total of games_played up to and including that date.

def game_play_analysis(activity: pd.DataFrame) -> pd.DataFrame:
    activity = activity.sort_values(by = ['player_id', 'event_date'])
    activity['games_played_so_far'] = activity.groupby('player_id')['games_played'].cumsum()
    return activity[['player_id', 'event_date', 'games_played_so_far']]

Leetcode 579. Find Cumulative Salary of an Employee
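
The notes leave 579 without code. From my recollection, the task is a per-employee 3-month running salary sum that excludes each employee's most recent month; this sketch also assumes months are contiguous (real data may have gaps, which would need reindexing first):

```python
import pandas as pd

def cumulative_salary(employee: pd.DataFrame) -> pd.DataFrame:
    df = employee.sort_values(['id', 'month']).copy()
    # 3-month running sum per employee (assumes contiguous months)
    df['Salary'] = (df.groupby('id')['salary']
                      .rolling(window=3, min_periods=1).sum()
                      .reset_index(level=0, drop=True))
    # drop each employee's most recent month
    most_recent = df.groupby('id')['month'].transform('max')
    df = df[df['month'] < most_recent]
    return df[['id', 'month', 'Salary']].sort_values(['id', 'month'], ascending=[True, False])
```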

Leetcode 1308. Running Total for Different Genders

def running_total_for_different_genders(scores: pd.DataFrame) -> pd.DataFrame:
    scores = scores.sort_values(by = ['gender', 'day'])
    scores['total'] = scores.groupby('gender')['score_points'].cumsum()
    return scores[['gender', 'day', 'total']]

Leetcode 180. Consecutive Numbers

def consecutive_numbers(logs: pd.DataFrame) -> pd.DataFrame:
    # 1. Sort by id so rows are in sequential order
    logs = logs.sort_values('id')
    # 2. A new block starts whenever num changes; cumsum labels each block
    logs['block_indicator'] = (logs['num'] != logs['num'].shift()).cumsum()
    # 3. Size of each block = length of the run of equal values
    logs['consec_counts'] = logs.groupby('block_indicator')['num'].transform('size')
    return logs[logs['consec_counts'] >= 3].drop_duplicates(
        subset=['num'], keep='first'
    )[['num']].rename(
        columns={'num': 'ConsecutiveNums'}
    )

COUNT = .groupby.transform('size')

Leetcode 1303: Find the Team Size

def team_size(employee: pd.DataFrame) -> pd.DataFrame:
    employee['team_size'] = employee.groupby('team_id')['team_id'].transform('size')
    return employee[['employee_id', 'team_size']]

Offset Functions

Offset functions let you look at the value of another row relative to the current row—no self-join required.

| Function | Description |
|---|---|
| LAG(col, n) | Value of col from n rows before the current row |
| LEAD(col, n) | Value of col from n rows after the current row |
| FIRST_VALUE(col) | Value of col from the first row in the partition/frame |
| LAST_VALUE(col) | Value of col from the last row in the partition/frame |

5.1 LAG and LEAD (SQL)

SELECT
    salesperson,
    sale_date,
    amount,
    LAG(amount,  1, 0) OVER (PARTITION BY salesperson ORDER BY sale_date) AS prev_amount,
    LEAD(amount, 1, 0) OVER (PARTITION BY salesperson ORDER BY sale_date) AS next_amount,
    amount - LAG(amount, 1, 0)
             OVER (PARTITION BY salesperson ORDER BY sale_date)           AS change
FROM sales;

5.2 LAG and LEAD (pandas)

df_sorted = df.sort_values(["salesperson", "sale_date"])

df_sorted["prev_amount"] = (
    df_sorted.groupby("salesperson")["amount"].shift(1).fillna(0)
)
df_sorted["next_amount"] = (
    df_sorted.groupby("salesperson")["amount"].shift(-1).fillna(0)
)
df_sorted["change"] = df_sorted["amount"] - df_sorted["prev_amount"]

shift(1) is LAG(1), shift(-1) is LEAD(1).

5.3 FIRST_VALUE and LAST_VALUE (SQL)

SELECT
    region,
    salesperson,
    sale_date,
    amount,
    FIRST_VALUE(amount) OVER (
        PARTITION BY region ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS first_sale,
    LAST_VALUE(amount) OVER (
        PARTITION BY region ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS last_sale
FROM sales;

5.4 FIRST_VALUE and LAST_VALUE (pandas)

df_sorted = df.sort_values(["region", "sale_date"])

df_sorted["first_sale"] = df_sorted.groupby("region")["amount"].transform("first")
df_sorted["last_sale"]  = df_sorted.groupby("region")["amount"].transform("last")

1. What Is a Window Function?

A window is a logically ordered subset of rows defined by:

- a partition (PARTITION BY): which rows belong together
- an ordering (ORDER BY): the sequence of rows within the partition
- a frame (ROWS / RANGE): which neighboring rows the function can see

The function is evaluated for every row over its window, and the original row is preserved in the output.

┌──────────────────────────────────────────────┐
│  Full table                                  │
│  ┌────────────┐  ┌────────────┐              │
│  │ Partition A│  │ Partition B│  ...         │
│  │  row 1     │  │  row 1     │              │
│  │  row 2 ◄───┼──── current row              │
│  │  row 3     │  │  row 3     │              │
│  └────────────┘  └────────────┘              │
└──────────────────────────────────────────────┘

2. The Sample Dataset

We will use a simple sales table throughout:

| order_id | salesperson | region | sale_date | amount |
|---|---|---|---|---|
| 1 | Alice | East | 2024-01-05 | 500 |
| 2 | Bob | West | 2024-01-07 | 300 |
| 3 | Alice | East | 2024-01-12 | 700 |
| 4 | Carol | East | 2024-01-15 | 450 |
| 5 | Bob | West | 2024-01-20 | 600 |
| 6 | Carol | West | 2024-01-22 | 200 |

SQL — create & populate:

CREATE TABLE sales (
    order_id    INT,
    salesperson VARCHAR(50),
    region      VARCHAR(50),
    sale_date   DATE,
    amount      INT
);

INSERT INTO sales VALUES
  (1, 'Alice', 'East', '2024-01-05', 500),
  (2, 'Bob',   'West', '2024-01-07', 300),
  (3, 'Alice', 'East', '2024-01-12', 700),
  (4, 'Carol', 'East', '2024-01-15', 450),
  (5, 'Bob',   'West', '2024-01-20', 600),
  (6, 'Carol', 'West', '2024-01-22', 200);

Python — create DataFrame:

import pandas as pd

data = {
    "order_id":    [1, 2, 3, 4, 5, 6],
    "salesperson": ["Alice", "Bob", "Alice", "Carol", "Bob", "Carol"],
    "region":      ["East", "West", "East", "East", "West", "West"],
    "sale_date":   pd.to_datetime([
        "2024-01-05", "2024-01-07", "2024-01-12",
        "2024-01-15", "2024-01-20", "2024-01-22"
    ]),
    "amount":      [500, 300, 700, 450, 600, 200],
}
df = pd.DataFrame(data)

3.1 SQL

SELECT
    salesperson,
    region,
    amount,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS row_num,
    RANK()       OVER (PARTITION BY region ORDER BY amount DESC) AS rnk,
    DENSE_RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS dense_rnk
FROM sales;

3.2 pandas

df["row_num"] = (
    df.groupby("region")["amount"]
      .rank(method="first", ascending=False)
      .astype(int)
)

df["rnk"] = (
    df.groupby("region")["amount"]
      .rank(method="min", ascending=False)
      .astype(int)
)

df["dense_rnk"] = (
    df.groupby("region")["amount"]
      .rank(method="dense", ascending=False)
      .astype(int)
)

print(df[["salesperson", "region", "amount", "row_num", "rnk", "dense_rnk"]])

Key difference: rank(method="first") ↔ ROW_NUMBER, rank(method="min") ↔ RANK, rank(method="dense") ↔ DENSE_RANK.


4. Aggregate Window Functions

Aggregates like SUM, AVG, MIN, and MAX can be computed over a window so that you see both the individual row value and a group-level summary in the same result.

4.1 Running total and group total (SQL)

SELECT
    salesperson,
    region,
    sale_date,
    amount,
    -- total per region (no ORDER BY → whole partition)
    SUM(amount) OVER (PARTITION BY region) AS region_total,
    -- running total per region ordered by date
    SUM(amount) OVER (
        PARTITION BY region
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM sales;

4.2 Running total and group total (pandas)

# Group total
df["region_total"] = df.groupby("region")["amount"].transform("sum")

# Running total — sort first, then cumsum within group
df_sorted = df.sort_values(["region", "sale_date"])
df_sorted["running_total"] = df_sorted.groupby("region")["amount"].cumsum()

4.3 Moving average over the last 3 rows (SQL)

SELECT
    salesperson,
    sale_date,
    amount,
    AVG(amount) OVER (
        ORDER BY sale_date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS moving_avg_3
FROM sales;

4.4 Moving average over the last 3 rows (pandas)

df_sorted = df.sort_values("sale_date").reset_index(drop=True)
df_sorted["moving_avg_3"] = (
    df_sorted["amount"]
      .rolling(window=3, min_periods=1)
      .mean()
)

rolling(window=3) corresponds to ROWS BETWEEN 2 PRECEDING AND CURRENT ROW. Use min_periods=1 to avoid NaN at the start.



6. NTILE — Bucketing Rows

NTILE(n) divides the ordered partition into n roughly equal buckets.

6.1 SQL

SELECT
    salesperson,
    amount,
    NTILE(3) OVER (ORDER BY amount DESC) AS bucket
FROM sales;

6.2 pandas

df["bucket"] = pd.qcut(df["amount"], q=3, labels=[3, 2, 1]).astype(int)

pd.qcut creates quantile-based buckets. Label 1 is assigned to the highest bucket to mimic SQL’s descending order convention.
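
pd.qcut splits on quantiles, so with ties or uneven group sizes it can disagree with SQL's NTILE, which deals rows out by count. If you need the exact row-count semantics, a small helper (the name ntile and its signature are mine) reproduces them:

```python
import pandas as pd

def ntile(s: pd.Series, n: int, ascending: bool = True) -> pd.Series:
    """Exact NTILE(n): bucket sizes differ by at most one row,
    and earlier buckets get the extra rows, as in SQL."""
    r = s.rank(method="first", ascending=ascending).astype(int)  # 1..N, no ties
    return (r - 1) * n // len(s) + 1

# Equivalent of NTILE(3) OVER (ORDER BY amount DESC):
# df["bucket"] = ntile(df["amount"], 3, ascending=False)
```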


7. PERCENT_RANK and CUME_DIST

| Function | Formula |
|---|---|
| PERCENT_RANK() | (rank - 1) / (N - 1) |
| CUME_DIST() | (number of rows ≤ current) / N |

7.1 SQL

SELECT
    salesperson,
    amount,
    PERCENT_RANK() OVER (ORDER BY amount) AS pct_rank,
    CUME_DIST()    OVER (ORDER BY amount) AS cume_dist
FROM sales;

7.2 pandas

df["pct_rank"]  = (df["amount"].rank(method="min") - 1) / (len(df) - 1)

df["cume_dist"] = df["amount"].rank(pct=True, method="max")
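
A quick check of both formulas on the sample amounts (all six values distinct, so N = 6):

```python
import pandas as pd

amounts = pd.Series([500, 300, 700, 450, 600, 200])

r = amounts.rank(method="min")                      # 1-based rank
pct_rank = (r - 1) / (len(amounts) - 1)             # PERCENT_RANK
cume_dist = amounts.rank(pct=True, method="max")    # CUME_DIST

# 200 is the minimum: pct_rank 0.0, cume_dist 1/6
# 700 is the maximum: pct_rank 1.0, cume_dist 1.0
```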
