pandas Basic

A practical guide to window functions—ranking, aggregation, and offset operations computed over a sliding frame—demonstrated side-by-side in SQL and Python's pandas library.

This is a summary and pandas version of Leetcode post by farlowdw

  1. Understanding query execution order is critical. How can you accurately determine your result set if you do not know how it is being formed?
  2. Consistent formatting matters. This is not just a cosmetic issue–your style and formatting can impact how you think.
  3. Window functions make difficult problems easy. There’s a reason why window functions show up in most Medium/Hard problems. They’re immensely powerful when used with care. I provide a listing of different window functions and some of the problems in which I have found them to be helpful.

    The followings are for experts:

  4. Common table expressions (CTEs) are agents of clarity and utility. Have you ever gotten tired of making a mess by using sizable subqueries? CTEs provide the ability to use “named subqueries” that are especially useful when dealing with hierarchical data. You can also use recursive CTEs (more on this later) which add a whole new dimension to crafting powerful queries to solve complex problems. I provide problem solutions to illustrate how CTEs can be agents of clarity and utility.
  5. Recursive CTEs. Tricky at first but opens up all sorts of possibilities, especially when it comes to dealing with hierarchical data. I provide some references, examples, and, just as with window functions, I provide a listing of different Leetcode problems where using WITH RECURSIVE is a viable option.

Order of Execution of a Query

Trying to write a SQL SELECT statement without understanding its logical processing order is like trying to evaluate a complicated mathematical expression with no parentheses. It’s unnecessarily difficult, and it can lead to uncertainty in what the result set for a query should be.

The mental model

The logical processing order of the SELECT statement is generally as follows:

  1. FROM/JOIN (and all associated ON conditions)
  2. WHERE
  3. GROUP BY
  4. HAVING
  5. SELECT (including window functions)
  6. DISTINCT
  7. ORDER BY
  8. LIMIT/OFFSET

In SQL, this is different from how code is written. in pandas, however, we can write the code in this order.

Easy problems for syntax learning

2879. Display the first three rows

2886. Change Data Type

2885. Rename Columns

2878. Get the Size of a DataFrame

Filtering

620. Not Boring Movies

def not_boring_movies(cinema: pd.DataFrame) -> pd.DataFrame:
    return cinema[
        (cinema['id']%2 == 1) & (cinema['description'] != 'boring') # NOT and! & is bitwise.
        ].sort_values('rating', ascending = False)

183. Customers Who Never Order

def find_customers(customers: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    return customers[~customers['id'].isin(orders['customerId'])][['name']].rename(
        columns = {'name': 'Customers'}
    )

1757. Recyclable and Low Fat Products

def find_customers(customers: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    return customers[~customers['id'].isin(orders['customerId'])][['name']].rename(
        columns = {'name': 'Customers'}
    )

1581. Customer Who Visited but Did Not Make Any Transactions

def find_customers(visits: pd.DataFrame, transactions: pd.DataFrame) -> pd.DataFrame:
    return visits[
        ~visits['visit_id'].isin(transactions['visit_id'].drop_duplicates())
    ].groupby('customer_id').agg(
        count_no_trans = ('visit_id', 'size')
    ).reset_index()

Handling missing values

2887. Fill Missing Data

def fill_missing_values(products: pd.DataFrame) -> pd.DataFrame:
    return products.fillna({'quantity': 0})

627. Swap Sex of Employees

def swap_salary(salary: pd.DataFrame) -> pd.DataFrame:
    salary['sex'].replace( {'m':'f', 'f':'m'}, inplace = True)
    return salary

Handling duplicates

1789. Primary Department for Each Employee

import pandas as pd

def find_primary_department(employee: pd.DataFrame) -> pd.DataFrame:
    return employee.sort_values(
        'primary_flag', ascending = False
        ).drop_duplicates(
            subset = 'employee_id', keep = 'first'
            )[['employee_id', 'department_id']]

182. Duplicate Emails

import pandas as pd

def duplicate_emails(person: pd.DataFrame) -> pd.DataFrame:
    return person[person.duplicated('email')].drop_duplicates('email').rename(columns = {'email':'Email'})[['Email']]

string

1683. Invalid Tweets

import pandas as pd

def invalid_tweets(tweets: pd.DataFrame) -> pd.DataFrame:
    return tweets[tweets['content'].str.len()>15][['tweet_id']]

1667. Fix Names in a Table

import pandas as pd

def fix_names(users: pd.DataFrame) -> pd.DataFrame:
    users['name'] = users['name'].str.capitalize()
    return users.sort_values('user_id')

1873. Calculate Special Bonus

import pandas as pd

def calculate_special_bonus(employees: pd.DataFrame) -> pd.DataFrame:
    employees['bonus'] = (
    (~employees['name'].str.startswith('M') & 
    (employees['employee_id']%2==1)
    )*employees['salary'])
    return employees[
        ['employee_id', 'bonus']
    ].sort_values('employee_id')

1527. Patients With No Condition

import pandas as pd

def find_patients(patients: pd.DataFrame) -> pd.DataFrame:
    return patients[
        (patients.conditions.str.contains(' DIAB1')) |
        (patients.conditions.str.startswith('DIAB1'))
    ]

1484. Group Sold Products By The Date

import pandas as pd

def categorize_products(activities: pd.DataFrame) -> pd.DataFrame:
    return activities.groupby('sell_date')['product'].agg(
        [
            ('num_sold', 'nunique'),
            ('products', lambda x: ','.join(sorted(x.unique())))
            # to Series.unique() outputs a python list
            # sorted(list)
        ]
    ).reset_index()

References