pandas Basics

A practical guide to window functions—ranking, aggregation, and offset operations computed over a sliding frame—demonstrated side-by-side in SQL and Python's pandas library.

This is a summary, with pandas versions of the solutions, of a LeetCode post by farlowdw.

Understanding query execution order is critical. How can you accurately determine your result set if you do not know how it is being formed?
Consistent formatting matters. This is not just a cosmetic issue: your style and formatting can impact how you think.
Window functions make difficult problems easy. There’s a reason why window functions show up in most Medium/Hard problems. They’re immensely powerful when used with care. I provide a listing of different window functions and some of the problems in which I have found them to be helpful.

The following are for experts:
Common table expressions (CTEs) are agents of clarity and utility. Have you ever gotten tired of making a mess by using sizable subqueries? CTEs provide the ability to use “named subqueries” that are especially useful when dealing with hierarchical data. You can also use recursive CTEs (more on this later) which add a whole new dimension to crafting powerful queries to solve complex problems. I provide problem solutions to illustrate how CTEs can be agents of clarity and utility.
Recursive CTEs. Tricky at first, but they open up all sorts of possibilities, especially when it comes to dealing with hierarchical data. I provide some references, examples, and, just as with window functions, a listing of different LeetCode problems where using WITH RECURSIVE is a viable option.

Order of Execution of a Query

Trying to write a SQL SELECT statement without understanding its logical processing order is like trying to evaluate a complicated mathematical expression with no parentheses. It’s unnecessarily difficult, and it can lead to uncertainty in what the result set for a query should be.

The mental model

The logical processing order of the SELECT statement is generally as follows:

FROM/JOIN (and all associated ON conditions)
WHERE
GROUP BY
HAVING
SELECT (including window functions)
DISTINCT
ORDER BY
LIMIT/OFFSET

In SQL, this logical order differs from the order in which the clauses are written. In pandas, however, we can write the method chain in exactly this order.
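
As a rough sketch of that idea, the chain below applies each step in the logical order listed above. The `orders` frame and its column names are invented for this illustration and do not come from any LeetCode problem.

```python
import pandas as pd

# Hypothetical data, made up for this illustration only.
orders = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "c"],
    "amount": [10, 20, 5, 0, 40],
})

result = (
    orders                                      # FROM
    .loc[lambda d: d["amount"] > 0]             # WHERE
    .groupby("customer", as_index=False)        # GROUP BY
    .agg(total=("amount", "sum"))               # aggregate each group
    .loc[lambda d: d["total"] >= 30]            # HAVING
    .loc[:, ["customer", "total"]]              # SELECT
    .sort_values("total", ascending=False)      # ORDER BY
    .head(1)                                    # LIMIT
)
print(result)
```

Using `.loc` with a lambda keeps every step inside one chain, so the code reads top to bottom in the same order the database evaluates the clauses.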

Easy problems for syntax learning

2879. Display the first three rows

.head(3)
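
A minimal sketch of `.head(3)` on a made-up frame (the `employees` name and its column are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame used only for illustration.
employees = pd.DataFrame({"employee_id": [1, 2, 3, 4, 5]})

# .head(n) returns the first n rows; the original frame is left unchanged.
first_three = employees.head(3)
print(first_three)
```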

2886. Change Data Type

.astype() for a single column
.astype({dictionary}) for multiple columns
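
The two forms might look like this in practice. The `students` frame and its columns are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical frame used only for illustration.
students = pd.DataFrame({
    "student_id": [1, 2],
    "grade": [73.0, 87.0],
    "age": ["20", "21"],
})

# Single column: convert the Series and assign it back.
students["grade"] = students["grade"].astype(int)

# Several columns at once: pass a {column: dtype} dictionary.
students = students.astype({"student_id": "int64", "age": "int64"})
print(students.dtypes)
```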

2885. Rename Columns

.rename(columns = {old_name: new_name}): the dictionary keys are the old names and the values are the new names
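
A short sketch of the rename mapping, using a made-up `students` frame (the column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame used only for illustration.
students = pd.DataFrame({"id": [1], "first": ["Ann"]})

# Keys are the existing column names, values are the replacements.
renamed = students.rename(columns={"id": "student_id", "first": "first_name"})
print(list(renamed.columns))
```

`.rename` returns a new frame by default, so `students` keeps its original column names.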

2878. Get the Size of a DataFrame

.shape, not .shape()
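
A minimal illustration, assuming a small made-up `players` frame: `.shape` is an attribute holding a `(rows, columns)` tuple, so it is read without parentheses.

```python
import pandas as pd

# Hypothetical frame used only for illustration.
players = pd.DataFrame({"player_id": [1, 2, 3], "name": ["a", "b", "c"]})

# .shape is a plain tuple, so it can be unpacked directly.
n_rows, n_cols = players.shape
print(n_rows, n_cols)
```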

Filtering

620. Not Boring Movies

filtering uses & and |, not and and or.
quotient and remainder are // and %.

def not_boring_movies(cinema: pd.DataFrame) -> pd.DataFrame:
    return cinema[
        (cinema['id']%2 == 1) & (cinema['description'] != 'boring') # NOT and! & is bitwise.
        ].sort_values('rating', ascending = False)
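
To see why `and` does not work here, the sketch below compares the element-wise operators with Python's `and` on a small made-up Series. The variable names are illustrative only.

```python
import pandas as pd

ids = pd.Series([1, 2, 3, 4])
is_odd = ids % 2 == 1
is_big = ids > 2

both = is_odd & is_big      # element-wise AND
either = is_odd | is_big    # element-wise OR

try:
    is_odd and is_big       # `and` needs one truth value for the whole Series
    raised = False
except ValueError:
    raised = True           # pandas refuses: the truth value is ambiguous

# Integer quotient and remainder.
quotient, remainder = 7 // 2, 7 % 2
```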

183. Customers Who Never Order

Do not use merge; it is overkill here.
.isin()

def find_customers(customers: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    return customers[~customers['id'].isin(orders['customerId'])][['name']].rename(
        columns = {'name': 'Customers'}
    )
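
A small usage sketch of the `.isin()` filter, with made-up rows in the shape the problem describes (the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical rows used only for illustration.
customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Joe", "Henry", "Sam"]})
orders = pd.DataFrame({"id": [10, 11], "customerId": [3, 1]})

# True where the customer id appears at least once in orders.
has_order = customers["id"].isin(orders["customerId"])

# ~ negates the boolean Series, keeping customers with no order.
never_ordered = customers[~has_order]
print(never_ordered)
```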

1757. Recyclable and Low Fat Products

wrap each boolean condition in ()

def find_products(products: pd.DataFrame) -> pd.DataFrame:
    # Parentheses are required because & binds tighter than ==.
    return products[
        (products['low_fats'] == 'Y') & (products['recyclable'] == 'Y')
    ][['product_id']]

1581. Customer Who Visited but Did Not Make Any Transactions

.isin()

def find_customers(visits: pd.DataFrame, transactions: pd.DataFrame) -> pd.DataFrame:
    return visits[
        ~visits['visit_id'].isin(transactions['visit_id'].drop_duplicates())
    ].groupby('customer_id').agg(
        count_no_trans = ('visit_id', 'size')
    ).reset_index()

Handling missing values

2887. Fill Missing Data

.fillna({column:value})

def fill_missing_values(products: pd.DataFrame) -> pd.DataFrame:
    return products.fillna({'quantity': 0})
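
A brief sketch of what the dictionary form does, on made-up rows: only the listed column is filled, and missing values elsewhere are left alone.

```python
import pandas as pd

# Hypothetical rows used only for illustration.
products = pd.DataFrame({
    "name": ["Wristwatch", None],
    "quantity": [None, 32],
})

# Only 'quantity' is named in the dictionary, so only it is filled.
filled = products.fillna({"quantity": 0})
print(filled)
```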

627. Swap Sex of Employees

.replace({'m': 'f', 'f': 'm'}), assigned back to the column
The SQL version of the problem asks for a single UPDATE statement with no SELECT.

def swap_salary(salary: pd.DataFrame) -> pd.DataFrame:
    # Assign the result back: inplace=True on a selected column
    # may not update the frame under copy-on-write.
    salary['sex'] = salary['sex'].replace({'m': 'f', 'f': 'm'})
    return salary

Handling duplicates

2882. Drop Duplicate Rows is the basic syntax learning problem.
.drop_duplicates(subset, keep, inplace)
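
The effect of the `keep` argument can be sketched on a made-up `customers` frame (names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical rows used only for illustration.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email": ["a@x.com", "b@x.com", "a@x.com"],
})

# subset limits the comparison to the listed column(s).
first = customers.drop_duplicates(subset="email", keep="first")  # keep earliest copy
last = customers.drop_duplicates(subset="email", keep="last")    # keep latest copy
none = customers.drop_duplicates(subset="email", keep=False)     # drop every copy
```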

1789. Primary Department for Each Employee

sort_values + drop_duplicates

import pandas as pd

def find_primary_department(employee: pd.DataFrame) -> pd.DataFrame:
    return employee.sort_values(
        'primary_flag', ascending = False
        ).drop_duplicates(
            subset = 'employee_id', keep = 'first'
            )[['employee_id', 'department_id']]

182. Duplicate Emails

Not dropping duplicates; here, we report them.
.duplicated(columnname) outputs a boolean vector

import pandas as pd

def duplicate_emails(person: pd.DataFrame) -> pd.DataFrame:
    return person[person.duplicated('email')].drop_duplicates('email').rename(columns = {'email':'Email'})[['Email']]
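
To make the boolean vector concrete, here is a sketch on made-up rows. It also shows why the solution still calls `.drop_duplicates`: an email that appears three times is flagged twice.

```python
import pandas as pd

# Hypothetical rows used only for illustration.
person = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "email": ["a@b.com", "c@d.com", "a@b.com", "a@b.com"],
})

# Default keep='first': the first occurrence is not flagged.
mask = person.duplicated("email")

# keep=False flags every row whose email occurs more than once.
all_copies = person.duplicated("email", keep=False)
```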

Strings

1683. Invalid Tweets

.str.len()

import pandas as pd

def invalid_tweets(tweets: pd.DataFrame) -> pd.DataFrame:
    return tweets[tweets['content'].str.len()>15][['tweet_id']]

1667. Fix Names in a Table

.str.capitalize()

import pandas as pd

def fix_names(users: pd.DataFrame) -> pd.DataFrame:
    users['name'] = users['name'].str.capitalize()
    return users.sort_values('user_id')
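
A small sketch of what `.str.capitalize()` does to each value, using made-up names: the first character is upper-cased and everything after it is lower-cased.

```python
import pandas as pd

# Hypothetical values used only for illustration.
names = pd.Series(["aLice", "bOB", "mary ann"])

# Only the first character of the whole string is upper-cased.
fixed = names.str.capitalize()
print(fixed.tolist())
```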

1873. Calculate Special Bonus

.str.startswith('M')

import pandas as pd

def calculate_special_bonus(employees: pd.DataFrame) -> pd.DataFrame:
    # The boolean mask multiplies salary by 1 (eligible) or 0 (not eligible).
    employees['bonus'] = (
        ~employees['name'].str.startswith('M')
        & (employees['employee_id'] % 2 == 1)
    ) * employees['salary']
    return employees[
        ['employee_id', 'bonus']
    ].sort_values('employee_id')

1527. Patients With a Condition

.str.contains()
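The statement is easy to misread: DIAB1 must be the prefix of a condition code, so it has to appear at the start of the string or right after a space. That is probably why the acceptance rate is only about 30%.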

import pandas as pd

def find_patients(patients: pd.DataFrame) -> pd.DataFrame:
    return patients[
        (patients.conditions.str.contains(' DIAB1')) |
        (patients.conditions.str.startswith('DIAB1'))
    ]
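
As an alternative sketch, the two checks can be folded into one regular expression. This is not the form used above, and the sample rows are invented for illustration.

```python
import pandas as pd

# Hypothetical rows used only for illustration.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "conditions": ["DIAB100 MYOP", "ACNE DIAB100", "SADIAB100", "YFEV COUGH"],
})

# (?:^| ) matches the start of the string or a space. The group is
# non-capturing so that pandas does not warn about match groups.
pattern = r"(?:^| )DIAB1"
matched = patients[patients["conditions"].str.contains(pattern, regex=True)]
print(matched)
```

The third row is excluded because `DIAB1` sits in the middle of a code there rather than at the start of one.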

1484. Group Sold Products By The Date

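lambda x: ','.join(sorted(x.unique()))
one column, two aggregations: use .agg with a list of (name, function) tuples
use a lambda for complicated aggregations
Series.unique() outputs an array (a NumPy array or a pandas extension array), not a Python list
sorted() accepts that array and returns a sorted Python list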

import pandas as pd

def categorize_products(activities: pd.DataFrame) -> pd.DataFrame:
    return activities.groupby('sell_date')['product'].agg(
        [
            ('num_sold', 'nunique'),
            ('products', lambda x: ','.join(sorted(x.unique())))
            # Series.unique() returns an array of the distinct values;
            # sorted() turns it into a sorted Python list
        ]
    ).reset_index()
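
The pieces of that lambda can be tried one at a time. The sketch below uses a made-up Series of product names to show what each step returns.

```python
import pandas as pd

# Hypothetical values used only for illustration.
products = pd.Series(["Mask", "Basketball", "Mask"])

# Distinct values in order of first appearance (an array, not a list).
distinct = products.unique()

# sorted() accepts any iterable and returns a Python list.
ordered = sorted(distinct)

# Join the sorted names into one comma-separated string.
joined = ",".join(ordered)
print(joined)
```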

References

Colt Steele, [SQL Window Functions in 10 Minutes](https://youtu.be/y1KCM8vbYe4?si=TOH9VCXTSoxq8xNo)
PostgreSQL Window Function Documentation
pandas DataFrame.groupby User Guide
pandas DataFrame.rolling Documentation