A practical guide to window functions—ranking, aggregation, and offset operations computed over a sliding frame—demonstrated side-by-side in SQL and Python's pandas library.
After the database server has completed all of the steps necessary to evaluate a query, including joining, filtering, grouping, and sorting, the result set is complete and ready to be returned to the caller. Imagine if you could pause the query execution at this point and take a walk through the result set while it is still held in memory; what types of analysis might you want to do? If your result set contains sales data, perhaps you might want to generate rankings for salespeople or regions, or calculate percentage differences between one time period and another. If you are generating results for a financial report, perhaps you would like to calculate subtotals for each report section, and a grand total for the final section. Using analytic functions, you can do all of these things and more.
Window functions let you compute aggregations, rankings, or offsets across a set of rows that are related to the current row—without collapsing the result into a single row the way GROUP BY does. They are one of the most powerful tools for data analysis in both SQL and pandas.
| Operation | SQL | pandas |
|---|---|---|
| Dense rank | DENSE_RANK() OVER (...) | rank(method="dense") |
| Rank | RANK() OVER (...) | rank(method="min") |
| Group total | SUM() OVER (PARTITION BY ...) | groupby().transform("sum") |
| Row number | ROW_NUMBER() OVER (...) | rank(method="first") |
| Running total | SUM() OVER (ORDER BY ... ROWS ...) | groupby().cumsum() |
| Moving avg | AVG() OVER (ROWS BETWEEN n PRECEDING ...) | rolling(window=n).mean() |
| Previous row | LAG(col, 1) OVER (...) | groupby().shift(1) |
| Next row | LEAD(col, 1) OVER (...) | groupby().shift(-1) |
| First in group | FIRST_VALUE(col) OVER (...) | groupby().transform("first") |
| Last in group | LAST_VALUE(col) OVER (...) | groupby().transform("last") |
| Buckets | NTILE(n) OVER (...) | pd.qcut(..., q=n) |
| Percentile rank | PERCENT_RANK() OVER (...) | manual formula with rank() |
| Cumulative dist | CUME_DIST() OVER (...) | rank(pct=True, method="max") |
Ranking functions assign an ordinal position to each row within its partition.
| Function | Behavior |
|---|---|
| DENSE_RANK() | Tied rows get the same rank; next rank does not skip |
| ROW_NUMBER() | Unique sequential integer, no ties |
| RANK() | Tied rows get the same rank; next rank skips |
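The difference is easiest to see on tied data. A minimal sketch on a hypothetical score series:

```python
import pandas as pd

s = pd.Series([100, 90, 90, 80])  # one tie at 90

# ROW_NUMBER: unique sequential integers; ties broken by order of appearance
row_number = s.rank(method="first", ascending=False).astype(int)
# RANK: tied rows share a rank, and the next rank skips
rank = s.rank(method="min", ascending=False).astype(int)
# DENSE_RANK: tied rows share a rank, with no gap afterwards
dense_rank = s.rank(method="dense", ascending=False).astype(int)

print(row_number.tolist())  # [1, 2, 3, 4]
print(rank.tolist())        # [1, 2, 2, 4]
print(dense_rank.tolist())  # [1, 2, 2, 3]
```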
` DataFrame.rank(axis=0, method='average', numeric_only=False, na_option='keep', ascending=True, pct=False) `

The examples that follow amount to a short tutorial on the rank function.
def order_scores(scores: pd.DataFrame) -> pd.DataFrame:
    scores['rank'] = scores['score'].rank(method='dense', ascending=False)
    return scores.sort_values(by='score', ascending=False)[['score', 'rank']]
Min rank (the analogue of SQL RANK()) is used here; for selecting rank 1 it behaves the same as dense rank.
def department_highest_salary(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:
    df = employee.merge(department, left_on='departmentId', right_on='id', how='left', suffixes=('_emp', '_dept'))
    df['rank'] = df.groupby('departmentId')['salary'].rank(method='min', ascending=False)
    return df[df['rank'] == 1][['name_dept', 'name_emp', 'salary']].rename(
        columns={
            'name_dept': 'Department',
            'name_emp': 'Employee',
            'salary': 'Salary'
        }
    )
Dense rank is used here.
def top_three_salaries(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:
    df = employee.merge(department, how='left', left_on='departmentId', right_on='id', suffixes=('_emp', '_dept'))
    df['salrank'] = df.groupby('departmentId')['salary'].rank(method='dense', ascending=False)
    return df[df['salrank'] <= 3][['name_dept', 'name_emp', 'salary']].rename(
        columns={
            'name_dept': 'Department',
            'name_emp': 'Employee',
            'salary': 'Salary'
        }
    )
def employees_with_same_salary(employees: pd.DataFrame) -> pd.DataFrame:  # signature reconstructed; function name assumed
    # Count how many employees share each salary
    employees['salary_count'] = employees.groupby('salary')['salary'].transform('size')
    # Copy, since we assign new columns after filtering
    filtered_employees = employees[employees['salary_count'] > 1].copy()
    filtered_employees['team_id'] = filtered_employees['salary'].rank(method='dense').astype(int)
    return filtered_employees[['employee_id', 'name', 'salary', 'team_id']].sort_values(['team_id', 'employee_id'])
def game_analysis(activity: pd.DataFrame) -> pd.DataFrame:
    activity['event_date_rank'] = activity.groupby('player_id')['event_date'].rank(method='min')
    return activity[activity['event_date_rank'] == 1][['player_id', 'device_id']]
Follow-up: if more than one customer has the largest number of orders, can you find all such customer_number values?
def largest_orders(orders: pd.DataFrame) -> pd.DataFrame:
    # Result has two columns: customer_number and order_count
    order_count = orders.groupby('customer_number').size().reset_index(name='order_count')
    order_count['count_rank'] = order_count['order_count'].rank(method='min', ascending=False)
    return order_count[order_count['count_rank'] == 1][['customer_number']]
def sales_analysis(sales: pd.DataFrame) -> pd.DataFrame:
    sales['year_rank'] = sales.groupby('product_id')['year'].rank(method='min')
    return sales[sales['year_rank'] == 1][['product_id', 'year', 'quantity', 'price']].rename(
        columns={'year': 'first_year'}
    )
def sales_analysis(sales: pd.DataFrame, product: pd.DataFrame) -> pd.DataFrame:
    sales_record = sales.merge(product, on='product_id', how='left')
    revenue = sales_record.groupby('seller_id')['price'].sum().reset_index(name='revenue')
    revenue['revenue_rank'] = revenue['revenue'].rank(method='min', ascending=False)
    return revenue[revenue['revenue_rank'] == 1][['seller_id']]
def highest_grade(enrollments: pd.DataFrame) -> pd.DataFrame:
    sorted_df = enrollments.sort_values(
        by=['student_id', 'grade', 'course_id'],
        ascending=[True, False, True]
    )
    top_grades = sorted_df.drop_duplicates(subset=['student_id'], keep='first')
    return top_grades[['student_id', 'course_id', 'grade']]
import pandas as pd
import numpy as np
def average_salary(salary: pd.DataFrame, employee: pd.DataFrame) -> pd.DataFrame:
    # 1. Merge tables
    df = salary.merge(employee, on='employee_id')
    # 2. Format the date as a string 'YYYY-MM' (safer than to_period)
    df['pay_month'] = df['pay_date'].dt.strftime('%Y-%m')
    # 3. Calculate the averages with transform
    df['salary_avg'] = df.groupby('pay_month')['amount'].transform('mean')
    df['salary_avg_dept'] = df.groupby(['pay_month', 'department_id'])['amount'].transform('mean')
    # 4. Drop duplicates to get unique month-department combinations
    df = df.drop_duplicates(
        subset=['pay_month', 'department_id'], keep='first'
    ).sort_values(['pay_month', 'department_id']).copy()
    # 5. Create the 'higher' / 'lower' / 'same' comparison
    conditions = [
        df['salary_avg_dept'] > df['salary_avg'],
        df['salary_avg_dept'] < df['salary_avg']
    ]
    choices = ['higher', 'lower']
    df['comparison'] = np.select(conditions, choices, default='same')
    # 6. Return only the requested columns
    return df[['pay_month', 'department_id', 'comparison']]
import pandas as pd
def sales_analysis(product: pd.DataFrame, sales: pd.DataFrame) -> pd.DataFrame:
    df = sales.merge(product, on='product_id')
    df['is_first_quarter'] = df['sale_date'].between('2019-01-01', '2019-03-31')
    # min over a boolean column is True only if every sale fell in Q1 2019
    df['is_only_first_quarter'] = df.groupby('product_id')['is_first_quarter'].transform('min')
    return df[df['is_only_first_quarter']].drop_duplicates(subset=['product_id'])[['product_id', 'product_name']]
| Column Name | Type |
|---|---|
| player_id | int |
| device_id | int |
| event_date | date |
| games_played | int |

(player_id, event_date) is the primary key (combination of columns with unique values) of this table. This table shows the activity of players of some games. Each row is a record of a player who logged in and played a number of games (possibly 0) before logging out on some day using some device.
Write a solution to report the device that is first logged in for each player.
Return the result table in any order.
The result format is in the following example.
Example 1:
Input:
Activity table:
| player_id | device_id | event_date | games_played |
|---|---|---|---|
| 1 | 2 | 2016-03-01 | 5 |
| 1 | 2 | 2016-05-02 | 6 |
| 2 | 3 | 2017-06-25 | 1 |
| 3 | 1 | 2016-03-02 | 0 |
| 3 | 4 | 2018-07-03 | 5 |

Output:
| player_id | device_id |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 3 | 1 |
def game_play_analysis(activity: pd.DataFrame) -> pd.DataFrame:
    activity = activity.sort_values(by=['player_id', 'event_date'])
    activity['games_played_so_far'] = activity.groupby('player_id')['games_played'].cumsum()
    return activity[['player_id', 'event_date', 'games_played_so_far']]
def running_total_for_different_genders(scores: pd.DataFrame) -> pd.DataFrame:
    scores = scores.sort_values(by=['gender', 'day'])
    scores['total'] = scores.groupby('gender')['score_points'].cumsum()
    return scores[['gender', 'day', 'total']]
def consecutive_numbers(logs: pd.DataFrame) -> pd.DataFrame:
    # 1. Sort by id to ensure we are looking at the rows in sequential order
    logs = logs.sort_values('id')
    # 2. A new block starts wherever num differs from the previous row;
    #    the cumulative sum of those start flags labels each block
    logs['block_indicator'] = (logs['num'] != logs['num'].shift()).cumsum()
    # 3. Size of the block each row belongs to
    logs['consec_counts'] = logs.groupby('block_indicator')['num'].transform('size')
    return logs[logs['consec_counts'] >= 3].drop_duplicates(
        subset=['num'], keep='first'
    )[['num']].rename(
        columns={'num': 'ConsecutiveNums'}
    )
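The block-indicator trick generalizes to any gaps-and-islands problem: comparing each value to its predecessor flags the start of every run, and a cumulative sum over those flags turns them into run labels. A minimal sketch on made-up data:

```python
import pandas as pd

num = pd.Series([1, 1, 1, 2, 1, 2, 2])

# True where a new run of equal values begins (the first row compares against NaN)
is_run_start = num != num.shift()
# Cumulative sum of the start flags gives a distinct label per consecutive run
run_id = is_run_start.cumsum()
# Broadcast each run's length back to its member rows
run_len = num.groupby(run_id).transform("size")

print(run_id.tolist())   # [1, 1, 1, 2, 3, 4, 4]
print(run_len.tolist())  # [3, 3, 3, 1, 1, 2, 2]
```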
def team_size(employee: pd.DataFrame) -> pd.DataFrame:
    employee['team_size'] = employee.groupby('team_id')['team_id'].transform('size')
    return employee[['employee_id', 'team_size']]
Offset functions let you look at the value of another row relative to the current row—no self-join required.
| Function | Description |
|---|---|
| LAG(col, n) | Value of col from n rows before the current row |
| LEAD(col, n) | Value of col from n rows after the current row |
| FIRST_VALUE(col) | Value of col from the first row in the partition/frame |
| LAST_VALUE(col) | Value of col from the last row in the partition/frame |
SELECT
salesperson,
sale_date,
amount,
LAG(amount, 1, 0) OVER (PARTITION BY salesperson ORDER BY sale_date) AS prev_amount,
LEAD(amount, 1, 0) OVER (PARTITION BY salesperson ORDER BY sale_date) AS next_amount,
amount - LAG(amount, 1, 0)
OVER (PARTITION BY salesperson ORDER BY sale_date) AS change
FROM sales;
df_sorted = df.sort_values(["salesperson", "sale_date"])
df_sorted["prev_amount"] = (
df_sorted.groupby("salesperson")["amount"].shift(1).fillna(0)
)
df_sorted["next_amount"] = (
df_sorted.groupby("salesperson")["amount"].shift(-1).fillna(0)
)
df_sorted["change"] = df_sorted["amount"] - df_sorted["prev_amount"]
`shift(1)` is `LAG(1)`; `shift(-1)` is `LEAD(1)`.
SELECT
region,
salesperson,
sale_date,
amount,
FIRST_VALUE(amount) OVER (
PARTITION BY region ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS first_sale,
LAST_VALUE(amount) OVER (
PARTITION BY region ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_sale
FROM sales;
df_sorted = df.sort_values(["region", "sale_date"])
df_sorted["first_sale"] = df_sorted.groupby("region")["amount"].transform("first")
df_sorted["last_sale"] = df_sorted.groupby("region")["amount"].transform("last")
A window is a logically ordered subset of rows defined by:
- a partition (similar to GROUP BY, but rows are kept separate),
- an ordering within the partition, and
- a frame relative to the current row (e.g. ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).

The function is evaluated for every row over its window, and the original row is preserved in the output.
┌────────────────────────────────────────┐
│ Full table                             │
│  ┌────────────┐    ┌────────────┐      │
│  │ Partition A│    │ Partition B│  ... │
│  │ row 1      │    │ row 1      │      │
│  │ row 2 ←──────── current row  │      │
│  │ row 3      │    │ row 3      │      │
│  └────────────┘    └────────────┘      │
└────────────────────────────────────────┘
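To see the row-preserving behavior concretely, compare a plain aggregation with its windowed counterpart; `transform` keeps one output value per input row. A sketch with hypothetical data:

```python
import pandas as pd

# Tiny two-group frame to contrast collapsing vs. windowed aggregation
df = pd.DataFrame({"grp": ["A", "A", "B"], "x": [1, 2, 10]})

# GROUP BY collapses: one row per group
collapsed = df.groupby("grp")["x"].sum()

# Window-style aggregation preserves rows: the group total is broadcast back
df["grp_total"] = df.groupby("grp")["x"].transform("sum")

print(collapsed.tolist())        # [3, 10]
print(df["grp_total"].tolist())  # [3, 3, 10]
```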
We will use a simple sales table throughout:
| order_id | salesperson | region | sale_date | amount |
|---|---|---|---|---|
| 1 | Alice | East | 2024-01-05 | 500 |
| 2 | Bob | West | 2024-01-07 | 300 |
| 3 | Alice | East | 2024-01-12 | 700 |
| 4 | Carol | East | 2024-01-15 | 450 |
| 5 | Bob | West | 2024-01-20 | 600 |
| 6 | Carol | West | 2024-01-22 | 200 |
SQL — create & populate:
CREATE TABLE sales (
order_id INT,
salesperson VARCHAR(50),
region VARCHAR(50),
sale_date DATE,
amount INT
);
INSERT INTO sales VALUES
(1, 'Alice', 'East', '2024-01-05', 500),
(2, 'Bob', 'West', '2024-01-07', 300),
(3, 'Alice', 'East', '2024-01-12', 700),
(4, 'Carol', 'East', '2024-01-15', 450),
(5, 'Bob', 'West', '2024-01-20', 600),
(6, 'Carol', 'West', '2024-01-22', 200);
Python — create DataFrame:
import pandas as pd
data = {
"order_id": [1, 2, 3, 4, 5, 6],
"salesperson": ["Alice", "Bob", "Alice", "Carol", "Bob", "Carol"],
"region": ["East", "West", "East", "East", "West", "West"],
"sale_date": pd.to_datetime([
"2024-01-05", "2024-01-07", "2024-01-12",
"2024-01-15", "2024-01-20", "2024-01-22"
]),
"amount": [500, 300, 700, 450, 600, 200],
}
df = pd.DataFrame(data)
SELECT
salesperson,
region,
amount,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS row_num,
RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk,
DENSE_RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS dense_rnk
FROM sales;
df["row_num"] = (
df.groupby("region")["amount"]
.rank(method="first", ascending=False)
.astype(int)
)
df["rnk"] = (
df.groupby("region")["amount"]
.rank(method="min", ascending=False)
.astype(int)
)
df["dense_rnk"] = (
df.groupby("region")["amount"]
.rank(method="dense", ascending=False)
.astype(int)
)
print(df[["salesperson", "region", "amount", "row_num", "rnk", "dense_rnk"]])
Key difference: `rank(method="first")` → `ROW_NUMBER`, `rank(method="min")` → `RANK`, `rank(method="dense")` → `DENSE_RANK`.
Aggregates like SUM, AVG, MIN, and MAX can be computed over a window so that you see both the individual row value and a group-level summary in the same result.
SELECT
salesperson,
region,
sale_date,
amount,
-- total per region (no ORDER BY → whole partition)
SUM(amount) OVER (PARTITION BY region) AS region_total,
-- running total per region ordered by date
SUM(amount) OVER (
PARTITION BY region
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM sales;
# Group total
df["region_total"] = df.groupby("region")["amount"].transform("sum")
# Running total — sort first, then cumsum within group
df_sorted = df.sort_values(["region", "sale_date"])
df_sorted["running_total"] = df_sorted.groupby("region")["amount"].cumsum()
SELECT
salesperson,
sale_date,
amount,
AVG(amount) OVER (
ORDER BY sale_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg_3
FROM sales;
df_sorted = df.sort_values("sale_date").reset_index(drop=True)
df_sorted["moving_avg_3"] = (
df_sorted["amount"]
.rolling(window=3, min_periods=1)
.mean()
)
`rolling(window=3)` corresponds to `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW`. Use `min_periods=1` to avoid `NaN` at the start.
NTILE(n) divides the ordered partition into n roughly equal buckets.
SELECT
salesperson,
amount,
NTILE(3) OVER (ORDER BY amount DESC) AS bucket
FROM sales;
df["bucket"] = pd.qcut(df["amount"], q=3, labels=[3, 2, 1]).astype(int)
`pd.qcut` creates quantile-based buckets. Label `1` is assigned to the highest bucket to mimic SQL's descending order convention.
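Note that `pd.qcut` buckets by quantile, which can diverge from `NTILE`'s row-count splitting when values tie or the row count does not divide evenly. If exact `NTILE` semantics are needed, the bucket can be derived from the row number instead — a sketch using the standard NTILE formula, reusing the sample amounts:

```python
import pandas as pd

amounts = pd.Series([500, 300, 700, 450, 600, 200])
n = 3  # number of buckets
N = len(amounts)

# Row number in descending order of amount (ties broken by position)
rn = amounts.rank(method="first", ascending=False).astype(int)

# NTILE(n): bucket = floor(n * (row_number - 1) / N) + 1,
# which fills the earlier buckets with the extra rows when N % n != 0
bucket = (n * (rn - 1) // N) + 1
print(bucket.tolist())  # [2, 3, 1, 2, 1, 3]
```

With six rows and three buckets the result matches `pd.qcut` exactly; the two approaches only differ on ties or uneven splits.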
| Function | Formula |
|---|---|
| PERCENT_RANK() | $(rank - 1) / (N - 1)$ |
| CUME_DIST() | $(\text{rows} \leq \text{current}) / N$ |
SELECT
salesperson,
amount,
PERCENT_RANK() OVER (ORDER BY amount) AS pct_rank,
CUME_DIST() OVER (ORDER BY amount) AS cume_dist
FROM sales;
# PERCENT_RANK: (rank - 1) / (N - 1), using min rank
df["pct_rank"] = df["amount"].rank(method="min")
df["pct_rank"] = (df["pct_rank"] - 1) / (len(df) - 1)
# CUME_DIST: fraction of rows <= current, i.e. max rank as a percentage
df["cume_dist"] = df["amount"].rank(pct=True, method="max")
Further reading:
- DataFrame.groupby User Guide
- DataFrame.rolling Documentation