Homework 3: Synthetic Data Generation and Classical Machine Learning

HW3 for USC ISE-547, Spring 2026

This post covers Homework 3 for the Spring 2026 iteration of USC’s ISE-547 course.

Objective

We use a Large Language Model (LLM) for synthetic data generation and apply classic machine learning models. We will work with the “UCI Heart Disease Data” from Kaggle, which combines data from multiple sources and contains missing values.

In all tasks, we use ChatGPT for prompt generation and Gemini 3 Flash for Python code generation.

Pre-processing Steps

(To be completed before starting Question 1)

  1. Load the Data: Load the heart_disease_uci.csv file from https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data.
  2. Create Binary Target: The num column indicates the severity of heart disease (0 = no disease, 1-4 = varying levels of disease). Create a new binary column named target where target is 0 if num is 0, and target is 1 if num is 1, 2, 3, or 4.
  3. Handle Missing Values: This dataset has many missing values. For your initial “cleaned” dataset, apply a simple imputation strategy:
    • Fill missing values in numerical columns (trestbps, chol, thalch, oldpeak) with the median of their respective columns.
    • Fill missing values in categorical columns (fbs, restecg, exang, slope, ca, thal) with the mode of their respective columns.
  4. Drop Unused Columns: You can drop the original id, dataset, and num columns.
  5. Create Training and Test Sets: Split your final, pre-processed dataset into a training set (80%) and a test set (20%). Use a fixed random_state for reproducibility. For all subsequent questions, you will train models on the training data and evaluate their performance on the untouched test data.
  6. All questions below refer to the X_train, y_train, X_test, and y_test sets derived from the “original dataset”.

Prompt generated by ChatGPT

Prompt: Heart Disease UCI Data Preparation Pipeline

You are tasked with preparing a machine learning dataset using the Heart Disease UCI dataset. Follow the steps carefully and write clean, well-documented Python code using common data science libraries (pandas, numpy, sklearn).

Step 1: Load the Data
Download and load the dataset heart_disease_uci.csv from:
https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data
Assume the file is available locally in the working directory.
Load it into a pandas DataFrame.
Step 2: Create Binary Target Variable
The column num represents heart disease severity:
0 → no disease
1–4 → presence of disease
Create a new column named target:
0 if num == 0
1 if num is in [1, 2, 3, 4]
Step 3: Handle Missing Values

Apply simple imputation as follows:

Numerical columns:

trestbps, chol, thalch, oldpeak
Fill missing values with the median of each column

Categorical columns:

fbs, restecg, exang, slope, ca, thal
Fill missing values with the mode (most frequent value) of each column
Step 4: Drop Unused Columns
Remove the following columns:
id
dataset
num
Step 5: Train-Test Split
Separate features (X) and target (y)
Use target as the label
Split the data into:
Training set: 80%
Test set: 20%
Use train_test_split from sklearn.model_selection
Set a fixed random_state (e.g., 42) for reproducibility
Step 6: Final Output

Ensure the following variables are created and ready for downstream tasks:

X_train
X_test
y_train
y_test
Additional Requirements
Include clear comments explaining each step
Validate that no missing values remain after imputation
Print:
Dataset shape before and after cleaning
Train/test split sizes
Keep the code modular and readable

Python code generated by Gemini 3 Flash:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Step 1: Load the Data
# Assuming 'heart_disease_uci.csv' is in the current directory
try:
    df = pd.read_csv('heart_disease_uci.csv')
    print(f"Original dataset shape: {df.shape}")
except FileNotFoundError:
    # Stop here: every later step needs df to exist
    raise SystemExit("Error: heart_disease_uci.csv not found. Please ensure the file is in the working directory.")

# Step 2: Create Binary Target Variable
# num: 0 (no disease), 1-4 (presence)
df['target'] = (df['num'] > 0).astype(int)

print("Target distribution:")
print(df['target'].value_counts())

# Step 3: Handle Missing Values
num_cols = ['trestbps', 'chol', 'thalch', 'oldpeak']
cat_cols = ['fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']

# Numerical: Median imputation
for col in num_cols:
    df[col] = df[col].fillna(df[col].median())

# Categorical: Mode imputation
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Validate no missing values remain
missing_count = df[num_cols + cat_cols].isnull().sum().sum()
print(f"Remaining missing values in processed columns: {missing_count}")

# Step 4: Drop Unused Columns
columns_to_drop = ['id', 'dataset', 'num']
df_cleaned = df.drop(columns=columns_to_drop)

print(f"Dataset shape after cleaning: {df_cleaned.shape}")

# Step 5: Train-Test Split
X = df_cleaned.drop('target', axis=1)
y = df_cleaned['target']

# Split: 80% Train, 20% Test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 6: Final Output and Validation
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train size: {len(y_train)}")
print(f"y_test size: {len(y_test)}")

display(X_train.head())

Question 1: Augmenting a Dataset to Address Class Imbalance

The newly created target variable may be imbalanced in your training set. Your task is to use an LLM to generate new data for the minority class to create a more balanced training dataset.

LLM Task

  1. In your training data, identify the minority class (patients with or without heart disease).
  2. Write a prompt for an LLM to generate 50 new, synthetic patient profiles that belong to this minority class. Your prompt should ask for the data in a comma-separated format that matches the original columns: age, sex, cp, trestbps, chol, fbs, restecg, thalch, exang, oldpeak, slope, ca, thal, target. Ensure the target value is set for the minority class.
  3. Create a new “augmented training dataset” by combining your original training data with the 50 synthetic samples.

Prompt generated by ChatGPT:

You are working with a preprocessed training dataset: X_train and y_train, derived from the Heart Disease UCI dataset. Your goal is to address class imbalance by generating synthetic data for the minority class.

Step 1: Identify the Minority Class
Analyze y_train to determine which class is underrepresented:
0 → No heart disease
1 → Heart disease present
Programmatically identify the minority class (do not assume which one it is).

Step 2: Generate Synthetic Samples
Generate 50 new synthetic patient records belonging to the identified minority class.

Each record must include the following columns in this exact order:

age, sex, cp, trestbps, chol, fbs, restecg, thalch, exang, oldpeak, slope, ca, thal, target
Requirements for generated data:
Output must be comma-separated values (CSV format) with no extra commentary
Ensure:
target is set to the minority class label
All values are realistic and medically plausible
Data types match expectations:
Numerical: age, trestbps, chol, thalch, oldpeak
Categorical/discrete: sex, cp, fbs, restecg, exang, slope, ca, thal, target
Maintain reasonable ranges (e.g., age between ~20–90, cholesterol within realistic bounds, etc.)
Avoid duplicate or identical rows
Store the generated data as a pandas DataFrame with matching column names.
Explain how you generated plausible synthetic data.

Step 3: Create Augmented Training Dataset
Combine:
Original training data (X_train + y_train)
Synthetic dataset (50 rows)
Ensure:
Column alignment is correct
Data types remain consistent
Produce:
X_train_augmented
y_train_augmented
Step 4: Validation
Print:
Class distribution before augmentation
Class distribution after augmentation
Confirm improved balance between classes
Additional Notes
Write clean, modular Python code
Use pandas for data manipulation

Python code generated by Gemini 3 Flash:

### Step 1: Identify the Minority Class

class_distribution = y_train.value_counts()
print("Original y_train class distribution:\n", class_distribution)

minority_class = class_distribution.idxmin()
minority_count = class_distribution.min()
majority_class = class_distribution.idxmax()
majority_count = class_distribution.max()

print(f"\nMinority class: {minority_class} (count: {minority_count})")
print(f"Majority class: {majority_class} (count: {majority_count})")

### Step 2: Generate Synthetic Samples for the Minority Class

#I will generate 50 synthetic records for the minority class. To ensure plausibility, I will base the values on the characteristics of the existing minority class samples in `X_train`.

#- For numerical features, I'll generate random values within the observed range (min to max) of that feature for the minority class.
#- For categorical features, I'll randomly sample from the unique values present for that feature in the minority class.
#- The `target` column will be set to the minority class label.

#This approach helps create realistic synthetic data while avoiding exact duplicates and maintaining medical plausibility within the observed data's constraints.

import numpy as np

# Filter X_train for the minority class to get statistical insights
X_train_minority = X_train[y_train == minority_class]

def generate_synthetic_record(X_train_minority, minority_class):
    synthetic_data = {}
    # Define column types and generation strategy
    numerical_cols = ['age', 'trestbps', 'chol', 'thalch', 'oldpeak']
    categorical_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
    
    for col in numerical_cols:
        min_val = X_train_minority[col].min()
        max_val = X_train_minority[col].max()
        # Integer-valued columns (age, trestbps, chol, thalch): draw a float
        # in [min, max] and truncate to int; oldpeak is kept as a float
        if col in ['age', 'trestbps', 'chol', 'thalch']:
            synthetic_data[col] = int(np.random.uniform(min_val, max_val))
        else: # oldpeak can be float
            synthetic_data[col] = np.random.uniform(min_val, max_val)

    for col in categorical_cols:
        unique_vals = X_train_minority[col].unique()
        synthetic_data[col] = np.random.choice(unique_vals)
    
    synthetic_data['target'] = minority_class
    return synthetic_data

# Generate 50 synthetic samples
synthetic_records = []
for _ in range(50):
    record = generate_synthetic_record(X_train_minority, minority_class)
    synthetic_records.append(record)

# Create a DataFrame from synthetic records
# Ensure the exact column order as specified in the prompt
ordered_cols = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalch', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
synthetic_df = pd.DataFrame(synthetic_records)[ordered_cols + ['target']]

print(f"Generated {len(synthetic_df)} synthetic samples.\n")
display(synthetic_df.head())

# Separate synthetic features and target
X_synthetic = synthetic_df.drop(columns=['target'])
y_synthetic = synthetic_df['target']

### Step 3: Create Augmented Training Dataset
# Combine original and synthetic data
X_train_augmented = pd.concat([X_train, X_synthetic], ignore_index=True)
y_train_augmented = pd.concat([y_train, y_synthetic], ignore_index=True)

print(f"X_train_augmented shape: {X_train_augmented.shape}")
print(f"y_train_augmented size: {len(y_train_augmented)}")

### Step 4: Validation
print("Class distribution BEFORE augmentation:\n", y_train.value_counts())
print("\nClass distribution AFTER augmentation:\n", y_train_augmented.value_counts())

Machine Learning Task

  1. Train a Logistic Regression classifier on the original, unbalanced training dataset. Evaluate its performance on the test set using precision, recall, and F1-score for the minority class.
  2. Train a second Logistic Regression classifier on the new, augmented training dataset. Evaluate its performance on the same test set.
  3. Compare the performance metrics from both models. Did augmenting the training data with synthetic samples improve the model’s ability to identify the minority class on the unseen test data? Explain why or why not.
Python code generated by Gemini 3 Flash:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score

# 1. Preprocess: One-Hot Encoding for categorical variables
cat_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']

# We apply get_dummies and ensure both train and test have the same columns
X_train_enc = pd.get_dummies(X_train, columns=cat_features, drop_first=True)
X_test_enc = pd.get_dummies(X_test, columns=cat_features, drop_first=True)

# Align columns between train and test
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)

# Train Logistic Regression on Original Data
model_orig = LogisticRegression(max_iter=1000, random_state=42)
model_orig.fit(X_train_enc, y_train)

# Evaluate on Test Set
y_pred_orig = model_orig.predict(X_test_enc)
print('--- Performance on Original Dataset ---')
print(classification_report(y_test, y_pred_orig))

# Store metrics for minority class (0)
prec_orig = precision_score(y_test, y_pred_orig, pos_label=0)
rec_orig = recall_score(y_test, y_pred_orig, pos_label=0)
f1_orig = f1_score(y_test, y_pred_orig, pos_label=0)


# 2. Prepare Augmented Data
X_train_aug_enc = pd.get_dummies(X_train_augmented, columns=cat_features, drop_first=True)
# Ensure columns match the original encoded training set
X_train_aug_enc = X_train_aug_enc.reindex(columns=X_train_enc.columns, fill_value=0)

# Train Logistic Regression on Augmented Data
model_aug = LogisticRegression(max_iter=1000, random_state=42)
model_aug.fit(X_train_aug_enc, y_train_augmented)

# Evaluate on same Test Set
y_pred_aug = model_aug.predict(X_test_enc)
print('--- Performance on Augmented Dataset ---')
print(classification_report(y_test, y_pred_aug))

# Store metrics for minority class (0)
prec_aug = precision_score(y_test, y_pred_aug, pos_label=0)
rec_aug = recall_score(y_test, y_pred_aug, pos_label=0)
f1_aug = f1_score(y_test, y_pred_aug, pos_label=0)

# 3. Comparison Table
comparison_df = pd.DataFrame({
    'Metric (Minority Class 0)': ['Precision', 'Recall', 'F1-Score'],
    'Original Model': [prec_orig, rec_orig, f1_orig],
    'Augmented Model': [prec_aug, rec_aug, f1_aug]
})

display(comparison_df)

if rec_aug > rec_orig:
    print('\nAnalysis: The augmented model improved recall for the minority class, meaning it successfully learned to identify more instances of the underrepresented class.')
else:
    print('\nAnalysis: The augmentation had minimal or negative impact. This can happen if synthetic samples are too similar to noise or if the original imbalance was not severe enough.')

Answer to the final question

The augmentation improved the model’s ability to identify the minority class only marginally. This is expected: the initial class imbalance was mild, and the 50 synthetic samples were drawn from within the observed minority-class ranges, so they add little information the model had not already seen.
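One way to back up the claim that the imbalance was mild is to quantify it from the class proportions. A minimal sketch (using an illustrative label series with roughly the distribution of the combined UCI data, not the actual y_train):

```python
import pandas as pd

# Illustrative labels; in the homework this would be y_train
y_example = pd.Series([0] * 329 + [1] * 407)

proportions = y_example.value_counts(normalize=True)
imbalance_ratio = proportions.max() / proportions.min()

print(proportions.round(3))
print(f"Majority/minority ratio: {imbalance_ratio:.2f}")
# A ratio close to 1 means a mild imbalance, so adding 50 synthetic
# minority rows shifts the decision boundary only slightly.
```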

Question 2: Generating and Validating Data with Specific Correlations

You hypothesize a negative relationship between a patient’s cholesterol level (chol) and their maximum heart rate achieved (thalch). You will use an LLM to create data that models this relationship and then validate it.

LLM Task

  1. Write a prompt for an LLM to generate a synthetic dataset of 40 new patients.
  2. In your prompt, instruct the LLM to create profiles where there is a clear negative correlation: as chol increases, thalch should generally decrease. The other features should be medically plausible.
## Prompt

You are tasked with generating and validating a synthetic cardiovascular dataset based on a specific hypothesis.

### Hypothesis
There is a negative relationship between a patient’s cholesterol level (`chol`) and their maximum heart rate achieved (`thalch`). As `chol` increases, `thalch` should generally decrease.

---

### Dataset Schema (Fixed)

**Numerical columns:**
- `trestbps` (resting blood pressure, mm Hg)
- `chol` (serum cholesterol, mg/dL)
- `thalch` (maximum heart rate achieved)
- `oldpeak` (ST depression induced by exercise)

**Categorical columns:**
- `fbs` (fasting blood sugar > 120 mg/dL; 0 = false, 1 = true)
- `restecg` (resting ECG results; values: 0, 1, 2)
- `exang` (exercise-induced angina; 0 = no, 1 = yes)
- `slope` (slope of peak exercise ST segment; values: 0, 1, 2)
- `ca` (number of major vessels; values: 0–3)
- `thal` (thalassemia; values: 1 = normal, 2 = fixed defect, 3 = reversible defect)

---

### Tasks

#### 1. Generate Synthetic Data
- Create a dataset of **40 patients**.
- Use exactly the columns specified above.
- Ensure all values fall within **medically realistic ranges**:
  - `trestbps`: 90–180
  - `chol`: 150–350
  - `thalch`: 90–200
  - `oldpeak`: 0.0–6.0
- Ensure categorical values are valid and within allowed codes.

#### 2. Model the Relationship
- Ensure a **clear negative correlation** between `chol` and `thalch`.
- The relationship should include some randomness/noise but maintain a clear downward trend.

#### 3. Medical Plausibility
- Maintain realistic feature interactions (e.g., higher `oldpeak` more likely when `exang = 1`).
- Avoid implausible or contradictory combinations.

#### 4. Validation
- Compute the **Pearson correlation coefficient** between `chol` and `thalch`.
- Explicitly confirm that the correlation is negative.
- Optionally generate a **scatter plot** to visualize the relationship.

#### 5. Output
- Return the dataset as a **pandas DataFrame**.
- Display the **first 5 rows**.
- Provide **summary statistics**.
- Report the **correlation coefficient**.
- Include visualization if generated.

---

### Constraints
- Use Python with `numpy`, `pandas`, and optionally `matplotlib` or `seaborn`.
- Set a **random seed** for reproducibility.
- Write clean, modular, and well-documented code.

Python code generated by Gemini 3 Flash:

#I'll generate the synthetic dataset of 40 patients now. I'll use a linear model to enforce the negative correlation between cholesterol and maximum heart rate, then I'll provide the summary statistics, the correlation coefficient, and a scatter plot for validation.

# The synthetic dataset has been successfully generated and validated.

# Key Findings:
# First 5 Rows: The data follows the prescribed schema with realistic values (e.g., resting BP between 90-180 and cholesterol between 150-350).
# Summary Statistics: All numerical features fall within the required medical ranges.
# Hypothesis Validation: The Pearson correlation coefficient between chol and thalch is -0.9338, confirming a strong negative relationship. As cholesterol levels increase, the maximum heart rate achieved decreases, as predicted.
# Medical Plausibility: I included logic to ensure that features like oldpeak (ST depression) are higher when exercise-induced angina (exang) is present, mirroring real-world clinical observations.
# The regression plot further visualizes this clear downward trend.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr

# Set seed for reproducibility
np.random.seed(42)

def generate_synthetic_cardio_data(n_patients=40):
    # 1. Generate Cholesterol (Independent variable)
    chol = np.random.uniform(150, 350, n_patients)
    
    # 2. Model the Negative Relationship for thalch
    # Formula: thalch = baseline - (factor * chol) + noise
    # We want thalch between 90-200
    noise = np.random.normal(0, 10, n_patients)
    thalch = 250 - (0.4 * chol) + noise
    thalch = np.clip(thalch, 90, 200)

    # 3. Generate other numerical columns within realistic ranges
    trestbps = np.random.uniform(90, 180, n_patients)
    # Logical interaction: higher oldpeak more likely if exang is 1
    exang = np.random.choice([0, 1], n_patients, p=[0.6, 0.4])
    oldpeak = np.zeros(n_patients)
    for i in range(n_patients):
        base_peak = np.random.uniform(0, 3)
        oldpeak[i] = base_peak + (2.0 if exang[i] == 1 else 0)
    oldpeak = np.clip(oldpeak, 0.0, 6.0)

    # 4. Generate Categorical columns
    fbs = np.random.choice([0, 1], n_patients, p=[0.8, 0.2])
    restecg = np.random.choice([0, 1, 2], n_patients)
    slope = np.random.choice([0, 1, 2], n_patients)
    ca = np.random.choice([0, 1, 2, 3], n_patients)
    thal = np.random.choice([1, 2, 3], n_patients)

    # Construct DataFrame
    df_synthetic = pd.DataFrame({
        'trestbps': trestbps,
        'chol': chol,
        'thalch': thalch,
        'oldpeak': oldpeak,
        'fbs': fbs,
        'restecg': restecg,
        'exang': exang,
        'slope': slope,
        'ca': ca,
        'thal': thal
    })
    
    return df_synthetic

# Create the dataset
data = generate_synthetic_cardio_data(40)

# Validation
corr, _ = pearsonr(data['chol'], data['thalch'])

# Outputs
print(f"--- First 5 Rows ---")
display(data.head())

print(f"\n--- Summary Statistics ---")
display(data.describe())

print(f"\n--- Correlation Analysis ---")
print(f"Pearson correlation between chol and thalch: {corr:.4f}")
if corr < 0:
    print("Validation Success: The relationship is negatively correlated as hypothesized.")
else:
    print("Validation Failed: The relationship is not negatively correlated.")

# Visualization
plt.figure(figsize=(8, 5))
sns.regplot(x='chol', y='thalch', data=data, line_kws={'color':'red'})
plt.title('Relationship between Cholesterol and Max Heart Rate')
plt.xlabel('Cholesterol (mg/dL)')
plt.ylabel('Max Heart Rate (thalch)')
plt.grid(True, alpha=0.3)
plt.show()

Machine Learning Task

  1. Using only the 40-sample synthetic dataset you just generated, train a Simple Linear Regression model to predict thalch using chol as the only predictor.
  2. What is the coefficient for chol? Is it negative, as you requested? What is the R-squared value? Does this analysis confirm that the LLM successfully generated data with the specified relationship?

Python code:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Prepare features (X) and target (y)
X_lr = data[['chol']]
y_lr = data['thalch']

# Initialize and fit the model
lr_model = LinearRegression()
lr_model.fit(X_lr, y_lr)

# Extract metrics
coefficient = lr_model.coef_[0]
intercept = lr_model.intercept_
r_squared = lr_model.score(X_lr, y_lr)

print(f"--- Linear Regression Results ---")
print(f"Coefficient for 'chol': {coefficient:.4f}")
print(f"Intercept: {intercept:.4f}")
print(f"R-squared value: {r_squared:.4f}")

# Check if coefficient is negative
if coefficient < 0:
    print("\nConfirmation: The coefficient is negative, matching the hypothesized downward trend.")
else:
    print("\nWarning: The coefficient is not negative.")

Answer to the final question

The coefficient for chol is -0.3933 (negative, as requested) and the R-squared value is 0.8720. This confirms that the LLM successfully generated data with the specified negative relationship.
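These two numbers are consistent with each other: in simple linear regression with a single predictor, R-squared equals the square of the Pearson correlation, and (-0.9338)^2 ≈ 0.872 matches the validation output from the generation step. A minimal sketch of this identity, using the same generative form as the Question 2 code (a toy re-creation, not the exact dataset):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
chol = rng.uniform(150, 350, 40)
thalch = 250 - 0.4 * chol + rng.normal(0, 10, 40)  # same linear-plus-noise form

# Pearson correlation vs. R-squared from a one-predictor fit
r, _ = pearsonr(chol, thalch)
X = chol.reshape(-1, 1)
r2 = LinearRegression().fit(X, thalch).score(X, thalch)

print(f"Pearson r: {r:.4f}, r^2: {r**2:.4f}, R^2: {r2:.4f}")
# For a single predictor, R^2 == r^2 (up to floating-point error)
```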

Question 3: Using an LLM for Intelligent Data Imputation

The ca column (number of major vessels) has over 60% missing values. Instead of simple imputation, you will use an LLM to predict these values based on other patient data.

LLM Task

  1. Identify 5 rows from the original data (before imputation) where the ca value was missing.
  2. For each of these 5 rows, write a prompt for an LLM. Provide the LLM with all the other available features for that patient (age, sex, chol, target, etc.) and ask it to predict the most likely value for ca (a number from 0-3).

Prompt generated by ChatGPT

## Prompt

You are working with the **UCI Heart Disease dataset**, where some values in the `ca` column (number of major vessels, range 0–3) are missing. Instead of using traditional imputation methods, you will use reasoning to predict plausible values for missing entries.

---

### Dataset Context

**Numerical columns:**
- `trestbps`, `chol`, `thalch`, `oldpeak`

**Categorical columns:**
- `fbs`, `restecg`, `exang`, `slope`, `ca`, `thal`

**Other relevant columns in the dataset:**
- `age`, `sex`, `cp`, `target`, etc.

The `ca` column has **over 60% missing values** and represents the number of major vessels colored by fluoroscopy (0–3).

---

### Tasks

#### 1. Identify Missing Rows
- From the **original dataset (before any imputation)**, identify **5 rows where `ca` is missing**.
- Display these rows clearly (as a table or DataFrame preview).

#### 2. Predict Missing `ca` Values Using Reasoning
For each of the 5 selected rows:

- Use **all other available patient features** (e.g., `age`, `sex`, `chol`, `thalch`, `exang`, `oldpeak`, `target`, etc.).
- Predict the **most likely value of `ca`** (must be an integer: 0, 1, 2, or 3).

#### 3. Justify Each Prediction
- Provide a **brief explanation** for each predicted value.
- Base your reasoning on medically plausible patterns, for example:
  - Higher risk profiles (e.g., high `chol`, `exang = 1`, high `oldpeak`, `target = 1`) may correlate with higher `ca`.
  - Lower-risk profiles may correspond to lower `ca`.

#### 4. Output Format
- Return results in a structured format (e.g., a table or list) with:
  - Row index
  - Original feature values (excluding `ca`)
  - Predicted `ca`
  - Explanation for the prediction

---

### Constraints
- Do **not** use simple statistical imputation (mean, median, mode).
- Do **not** train a machine learning model.
- Use **reasoning based on feature patterns and medical plausibility**.
- Ensure predictions are within the valid range (0–3).
- Keep explanations concise but meaningful.

Python code generated by Gemini 3 Flash:

import pandas as pd

# Reload the original data to get genuine missing values for 'ca'
df_orig = pd.read_csv('heart_disease_uci.csv')

# Identify rows where 'ca' is missing
missing_ca_df = df_orig[df_orig['ca'].isnull()]

# Select 5 representative rows for the task
selected_rows = missing_ca_df.head(5).copy()

print(f"Found {len(missing_ca_df)} rows with missing 'ca' values.")
print("Selected 5 rows for reasoning-based prediction:")
display(selected_rows[['id', 'age', 'sex', 'cp', 'trestbps', 'chol', 'thalch', 'exang', 'oldpeak', 'num']])

# Predictions and justifications based on medical-plausibility patterns.
# Pattern: higher age, asymptomatic CP, high oldpeak, and positive 'num' (target) suggest higher ca.
predictions = [
    {'Row Index': 303, 'Predicted ca': 0,
     'Patient Profile': '56yo Male, Typical Angina, Low Oldpeak (0.0), num=0',
     'Justification': 'Patient presents with typical angina but has no ST depression (oldpeak=0) and a target of 0 (no disease), suggesting clear vessels.'},
    {'Row Index': 304, 'Predicted ca': 2,
     'Patient Profile': '62yo Male, Asymptomatic, high trestbps (160), num=3',
     'Justification': 'Advanced age, hypertension, and a high disease severity (num=3) strongly correlate with multi-vessel involvement.'},
    {'Row Index': 305, 'Predicted ca': 1,
     'Patient Profile': '43yo Female, Asymptomatic, exang=1, num=2',
     'Justification': 'Presence of exercise-induced angina and a confirmed disease state (num=2) suggest at least one major vessel is likely obstructed.'},
    {'Row Index': 306, 'Predicted ca': 0,
     'Patient Profile': '58yo Male, Non-anginal, high chol (298), num=0',
     'Justification': 'Despite high cholesterol, the absence of symptoms (non-anginal) and a target of 0 suggest the vessels are likely not significantly calcified.'},
    {'Row Index': 307, 'Predicted ca': 1,
     'Patient Profile': '64yo Female, Asymptomatic, Oldpeak (1.5), num=1',
     'Justification': 'Older age combined with asymptomatic presentation and ST depression (oldpeak) indicates a high probability of at least one diseased vessel.'}
]

predict_df = pd.DataFrame(predictions)
display(predict_df)

Machine Learning Task

  1. Create two versions of your training and test sets:
    • Dataset A: The datasets you already created where all missing ca values were filled with the mode.
    • Dataset B: Start with Dataset A, but replace the ca values for your 5 chosen rows with the predictions from the LLM.
  2. Train a K-Nearest Neighbors (KNN) classifier on the training data from Dataset A and record its accuracy on the test set from Dataset A.
  3. Train another K-Nearest Neighbors (KNN) classifier on the training data from Dataset B and record its accuracy on the test set from Dataset B.
  4. Compare the accuracies. Did the “intelligent” LLM-based imputation for just a few samples make any difference in model performance? Explain your findings.

Python code generated by Gemini 3 Flash:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# 1. Prepare Dataset A (Already handled via mode imputation in early steps)
# X_train_enc and X_test_enc contain the mode-imputed data (categorical encoded)
# Let's ensure we have a clean copy of the original split with 'ca' as numeric where possible

# 2. Create Dataset B
# We need to map our reasoning back to the indices
# Reasoning predictions: {303: 0, 304: 2, 305: 1, 306: 0, 307: 1}
custom_preds = {303: 0.0, 304: 2.0, 305: 1.0, 306: 0.0, 307: 1.0}

# We'll work with the dataframes before one-hot encoding for easier replacement
X_train_B = X_train.copy()
X_test_B = X_test.copy()

# Update values if the index exists in the split
for idx, val in custom_preds.items():
    if idx in X_train_B.index:
        X_train_B.loc[idx, 'ca'] = val
        print(f'Updated index {idx} in Train B with ca={val}')
    if idx in X_test_B.index:
        X_test_B.loc[idx, 'ca'] = val
        print(f'Updated index {idx} in Test B with ca={val}')

# One-hot encode Dataset B
cat_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
X_train_B_enc = pd.get_dummies(X_train_B, columns=cat_features, drop_first=True)
X_test_B_enc = pd.get_dummies(X_test_B, columns=cat_features, drop_first=True)

# Align columns with Dataset A encoding
X_train_B_enc = X_train_B_enc.reindex(columns=X_train_enc.columns, fill_value=0)
X_test_B_enc = X_test_B_enc.reindex(columns=X_train_enc.columns, fill_value=0)

# Initialize KNN
knn_a = KNeighborsClassifier(n_neighbors=5)
knn_b = KNeighborsClassifier(n_neighbors=5)

# Train and evaluate Dataset A
knn_a.fit(X_train_enc, y_train)
y_pred_a = knn_a.predict(X_test_enc)
acc_a = accuracy_score(y_test, y_pred_a)

# Train and evaluate Dataset B
knn_b.fit(X_train_B_enc, y_train)
y_pred_b = knn_b.predict(X_test_B_enc)
acc_b = accuracy_score(y_test, y_pred_b)

# Results
print(f'--- KNN Performance Comparison ---')
print(f'Dataset A (Mode Imputation) Accuracy: {acc_a:.4f}')
print(f'Dataset B (Reasoning Imputation) Accuracy: {acc_b:.4f}')

if acc_b > acc_a:
    print('\nInsight: The reasoning-based imputation improved the model accuracy.')
elif acc_b < acc_a:
    print('\nInsight: Mode imputation performed better in this specific split.')
else:
    print('\nInsight: The accuracy remains the same. This is expected given only 5 rows out of hundreds were changed.')

Answer to the final question

KNN accuracy is unchanged: 0.6902 vs. 0.6902. This is expected given that only 5 rows out of several hundred were changed.
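To see why editing a handful of rows rarely moves KNN accuracy, consider that each modified training row can only affect test points whose 5-nearest-neighbor sets include it. A self-contained toy experiment (synthetic data, not the heart-disease set) makes this concrete:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic binary classification problem of comparable size
X, y = make_classification(n_samples=900, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: KNN on the untouched training data
knn_a = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
acc_a = accuracy_score(y_te, knn_a.predict(X_te))

# Perturb one feature in just 5 of ~720 training rows
X_tr_b = X_tr.copy()
X_tr_b[:5, 0] += 1.0
knn_b = KNeighborsClassifier(n_neighbors=5).fit(X_tr_b, y_tr)
acc_b = accuracy_score(y_te, knn_b.predict(X_te))

print(f"Accuracy before: {acc_a:.4f}, after: {acc_b:.4f}")
# The 5 perturbed rows enter very few test points' neighbor sets,
# so test accuracy barely moves, mirroring the Dataset A vs. B result.
```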

Question 4: Scenario-Based Generation for Model Stress Testing

You want to test how a model trained on general data performs on a specific, high-risk sub-population. You will use an LLM to generate data for this scenario.

LLM Task

  1. Write a prompt for an LLM to generate 30 synthetic patient profiles that meet a high-risk profile: patients older than 60 who are smokers and have a cp (chest pain type) of “asymptomatic”.
  2. The LLM should generate all other features and, based on these high-risk inputs, also assign a target value of 1 (heart disease present) for each synthetic patient.
## Prompt

You are working with the **UCI Heart Disease dataset** and want to evaluate how a model trained on general data performs on a **specific high-risk sub-population**.

---

### Objective

Generate a synthetic dataset representing **high-risk patients** and ensure their profiles are medically plausible and consistent with elevated heart disease risk.

---

### High-Risk Criteria (Must Be Satisfied for All Rows)

Each synthetic patient must meet the following conditions:
- `age` > 60
- Smoker (you may represent this via an appropriate feature or assume smoking status explicitly if not present in the dataset)
- `cp` (chest pain type) = **asymptomatic** (in this Kaggle version `cp` is stored as the string `asymptomatic`; in the numeric encoding it is typically `cp = 3`)

---

### Dataset Features

Generate all relevant features from the UCI Heart Disease dataset, including:

**Numerical columns:**
- `age`, `trestbps`, `chol`, `thalch`, `oldpeak`

**Categorical columns:**
- `sex`, `cp`, `fbs`, `restecg`, `exang`, `slope`, `ca`, `thal`

**Target:**
- `target` (0 = no disease, 1 = heart disease)

---

### Tasks

#### 1. Generate Synthetic Data
- Create **30 patient profiles**.
- Ensure all patients satisfy the high-risk criteria above.
- Generate all other feature values so they are:
  - Within **realistic medical ranges**
  - **Consistent with high cardiovascular risk**

#### 2. Assign Target Variable
- Assign `target = 1` (heart disease present) for **all 30 patients**.
- Ensure the rest of the features support this classification (e.g., elevated `chol`, presence of `exang`, higher `oldpeak`, etc.).

#### 3. Medical Plausibility
- Maintain realistic relationships between variables, for example:
  - Higher `age` → generally lower `thalch`
  - `exang = 1` more likely in high-risk individuals
  - Higher `oldpeak` associated with abnormal exercise response
  - `ca` values skewed toward higher numbers (e.g., 1–3) for high-risk patients

#### 4. Output
- Return the dataset as a **pandas DataFrame**.
- Display the **first 5 rows**.
- Provide **summary statistics**.
- Briefly explain how the generated data reflects a high-risk population.

---

### Constraints
- Use Python with `numpy` and `pandas`.
- Set a **random seed** for reproducibility.
- Do **not** copy real data; all data must be synthetically generated.
- Keep the code clean, modular, and well-documented.

Python code generated by Gemini 3 Flash, together with its summary of why the generated data is medically plausible:

- **Age and vitals:** the average age is 73.7, paired with an average resting blood pressure (`trestbps`) of 153.2 and cholesterol (`chol`) of 350.6, both significantly elevated.
- **Diagnostic markers:** the average `oldpeak` (ST depression) is 2.91, well above the high-risk threshold of 1.5, indicating a strong likelihood of heart disease during exercise.
- **Vessel involvement:** the average `ca` value is 2.07, reflecting that most patients in this group have multiple major vessels involved, consistent with the `target = 1` assignment.
- **Heart rate:** the maximum heart rate (`thalch`) was modeled lower (averaging 143.3) to reflect age and reduced cardiovascular efficiency in a high-risk state.

import pandas as pd
import numpy as np

# Set seed for reproducibility
np.random.seed(42)

def generate_high_risk_patients(n=30):
    # 1. Criteria: age > 60, cp = asymptomatic
    age = np.random.randint(61, 85, n)
    cp = ['asymptomatic'] * n
    
    # 2. High-Risk Numerical Features
    # High blood pressure (trestbps: 130-180)
    trestbps = np.random.uniform(130, 180, n)
    # High cholesterol (chol: 240-450)
    chol = np.random.uniform(240, 450, n)
    # Lower max heart rate, decreasing with age (~200 - 0.8*age, plus noise)
    thalch = 200 - (0.8 * age) + np.random.normal(0, 10, n)
    # Significant ST depression (oldpeak: 1.5-5.0)
    oldpeak = np.random.uniform(1.5, 5.0, n)
    
    # 3. Categorical Features (Skewed toward high risk)
    sex = np.random.choice(['Male', 'Female'], n, p=[0.7, 0.3])
    fbs = np.random.choice([True, False], n, p=[0.4, 0.6]) # Higher prevalence of high fasting blood sugar
    restecg = np.random.choice(['lv hypertrophy', 'st-t abnormality'], n)
    exang = [True] * n # Exercise induced angina is highly likely in this group
    slope = np.random.choice(['flat', 'downsloping'], n, p=[0.6, 0.4])
    ca = np.random.choice([1.0, 2.0, 3.0], n) # Multiple major vessels involved
    thal = np.random.choice(['reversable defect', 'fixed defect'], n, p=[0.8, 0.2])
    
    # 4. Target is 1 for all
    target = [1] * n

    df_high_risk = pd.DataFrame({
        'age': age,
        'sex': sex,
        'cp': cp,
        'trestbps': trestbps,
        'chol': chol,
        'fbs': fbs,
        'restecg': restecg,
        'thalch': thalch,
        'exang': exang,
        'oldpeak': oldpeak,
        'slope': slope,
        'ca': ca,
        'thal': thal,
        'target': target
    })
    
    return df_high_risk

# Generate and Display
df_high_risk = generate_high_risk_patients(30)

print("--- High-Risk Synthetic Dataset (First 5 Rows) ---")
display(df_high_risk.head())

print("\n--- Summary Statistics ---")
display(df_high_risk.describe())

print("\nMedical Plausibility Check:")
print(f"Average Age: {df_high_risk['age'].mean():.1f} (Criteria: >60)")
print(f"Average Oldpeak: {df_high_risk['oldpeak'].mean():.2f} (High risk indicated by >1.5)")
print(f"Average vessels (ca): {df_high_risk['ca'].mean():.2f} (High risk indicated by multi-vessel involvement)")

Machine Learning Task

  1. Train a Decision Tree Classifier on your original training dataset.
  2. Use this trained model to make predictions on the 30 synthetic high-risk patients you generated.
  3. How many of the high-risk patients did your model correctly classify as having heart disease (target=1)? Discuss why your model might have failed to identify certain high-risk individuals, even though they were generated to fit a dangerous profile.

Python code generated by Gemini 3 Flash:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Train Decision Tree on Original Data (A)
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train_enc, y_train)

# 2. Preprocess Synthetic High-Risk Data
# Map boolean/categorical columns to match the training set encoding
# Note: cp is already 'asymptomatic', exang is True, fbs is True/False
# ca is 1.0, 2.0, or 3.0

# One-hot encode the high-risk synthetic data
cat_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
df_high_risk_enc = pd.get_dummies(df_high_risk.drop(columns=['target']), columns=cat_cols, drop_first=True)

# Align columns with X_train_enc
df_high_risk_enc = df_high_risk_enc.reindex(columns=X_train_enc.columns, fill_value=0)

# 3. Predict
y_pred_high_risk = dt_model.predict(df_high_risk_enc)
y_true_high_risk = df_high_risk['target']

# 4. Results (every true label is 1, so a prediction of 1 counts as correct)
correct_count = (y_pred_high_risk == 1).sum()
total_count = len(df_high_risk)

print('--- Decision Tree Evaluation on High-Risk Synthetic Data ---')
print(f'Total High-Risk Patients: {total_count}')
print(f'Correctly Classified (Target=1): {correct_count}')
print(f'Classification Accuracy for High-Risk Sub-Population: {correct_count/total_count:.4f}')

Answer to the final question

The classification accuracy on the high-risk sub-population was 1.0000: the model flagged all 30 synthetic patients as having heart disease (target = 1).

This result suggests that the decision tree heavily weights the clinical markers used to generate the high-risk data: advanced age, asymptomatic chest pain, exercise-induced angina, and significant ST depression. Had the synthetic profiles been less internally consistent (for example, high-risk inputs paired with a normal oldpeak or thalch), the tree could have routed some patients down low-risk branches despite their dangerous profile.
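One way to check which markers a fitted tree actually relies on is to inspect its `feature_importances_`. A self-contained sketch on synthetic data (the two features and the labeling rule here are illustrative assumptions, not the real encoded columns):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 400

# Synthetic stand-in: the label depends only on oldpeak; age is noise
X = pd.DataFrame({
    'age': rng.integers(30, 80, n),
    'oldpeak': rng.uniform(0.0, 5.0, n),
})
y = (X['oldpeak'] > 2.0).astype(int)

dt = DecisionTreeClassifier(random_state=42).fit(X, y)

# Rank features by how much they reduce impurity in the fitted tree
importances = pd.Series(dt.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

On the real model, the same `pd.Series(dt_model.feature_importances_, index=X_train_enc.columns)` pattern would show whether the markers used to generate the synthetic patients dominate the splits.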

Question 5: Analyzing the Impact of Synthetic Outliers

Outliers can influence machine learning models. You will generate synthetic “outlier” data and measure its effect on a Support Vector Machine (SVM).

LLM Task

  1. Write a prompt for an LLM to generate 20 new, synthetic patient records representing individuals with extreme health metrics.
  2. Instruct the LLM that these patients should have cholesterol (chol) levels above 400 mg/dl and resting blood pressure (trestbps) above 180 mmHg. The LLM should generate the remaining features and a target label for each.

Prompt generated by ChatGPT:

## Prompt

You are working with the **UCI Heart Disease dataset** and want to evaluate how **extreme outlier data** affects the performance of a machine learning model, specifically a **Support Vector Machine (SVM)**.

---

### Objective

Generate synthetic outlier patient data and measure how adding these extreme cases impacts SVM model performance.

---

### Dataset Features

Use the standard UCI Heart Disease feature set:

**Numerical columns:**
- `age`, `trestbps`, `chol`, `thalch`, `oldpeak`

**Categorical columns:**
- `sex`, `cp`, `fbs`, `restecg`, `exang`, `slope`, `ca`, `thal`

**Target:**
- `target` (0 = no disease, 1 = heart disease)

---

### Tasks

#### 1. Generate Synthetic Outlier Data
- Create **20 synthetic patient records** representing **extreme health profiles**.
- Enforce the following conditions for all 20 patients:
  - `chol` > 400 mg/dL
  - `trestbps` > 180 mmHg
- Generate all other features such that:
  - Values remain **medically plausible**, even if extreme
  - Profiles are internally consistent (e.g., high-risk indicators align)

#### 2. Assign Target Labels
- Assign a `target` value (0 or 1) for each patient.
- Base the assignment on the overall risk profile:
  - Most extreme profiles should reasonably map to `target = 1`
  - Some variation is allowed if justified

#### 3. Train Baseline Model
- Using the **original dataset (without outliers)**:
  - Split data into training and testing sets
  - Train a **Support Vector Machine (SVM)** classifier
  - Evaluate performance using metrics such as:
    - Accuracy
    - Precision
    - Recall
    - F1-score

#### 4. Evaluate Impact of Outliers
- Append the 20 synthetic outlier records to the dataset.
- Retrain the same SVM model using the **augmented dataset**.
- Re-evaluate using the same metrics.

#### 5. Compare Results
- Clearly compare model performance:
  - Before adding outliers
  - After adding outliers
- Highlight any degradation or changes in model behavior.

#### 6. Output
- Display:
  - The synthetic outlier dataset
  - Performance metrics (before vs after)
- Provide a brief analysis explaining:
  - How outliers influenced the SVM
  - Why SVMs may be sensitive (or robust) to such extreme values

---

### Constraints
- Use Python with `numpy`, `pandas`, and `scikit-learn`.
- Set a **random seed** for reproducibility.
- Keep code clean, modular, and well-documented.
- Do **not** use real outliers from the dataset—generate them synthetically.

Python code generated by Gemini 3 Flash:

import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler

# Set seed for reproducibility
np.random.seed(42)

def generate_extreme_outliers(n=20):
    # Constraints: chol > 400, trestbps > 180
    chol = np.random.uniform(401, 600, n)
    trestbps = np.random.uniform(181, 220, n)
    
    # Other features reflecting high risk
    age = np.random.randint(50, 85, n)
    thalch = np.random.randint(80, 130, n)
    oldpeak = np.random.uniform(2.0, 6.0, n)
    
    sex = np.random.choice(['Male', 'Female'], n)
    cp = ['asymptomatic'] * n
    fbs = [True] * n
    restecg = ['lv hypertrophy'] * n
    exang = [True] * n
    slope = ['flat'] * n
    ca = np.random.choice([2.0, 3.0], n)
    thal = ['reversable defect'] * n
    
    # Assign target = 1 for these extreme profiles
    target = [1] * n

    df_outliers = pd.DataFrame({
        'age': age, 'sex': sex, 'cp': cp, 'trestbps': trestbps, 'chol': chol,
        'fbs': fbs, 'restecg': restecg, 'thalch': thalch, 'exang': exang,
        'oldpeak': oldpeak, 'slope': slope, 'ca': ca, 'thal': thal, 'target': target
    })
    return df_outliers

# 1. Generate Outliers
df_outliers = generate_extreme_outliers(20)
print('--- Synthetic Outlier Samples (First 5) ---')
display(df_outliers.head())

# 2. Prepare Data for SVM
# We use X_train_enc and X_test_enc (mode-imputed and one-hot encoded from previous steps)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_enc)
X_test_scaled = scaler.transform(X_test_enc)

# 3. Train Baseline SVM
svm_baseline = SVC(kernel='rbf', probability=True, random_state=42)
svm_baseline.fit(X_train_scaled, y_train)
y_pred_base = svm_baseline.predict(X_test_scaled)

print('\n--- Baseline SVM Performance ---')
print(classification_report(y_test, y_pred_base))

Machine Learning Task

  1. Train a Support Vector Machine (SVM) classifier with a linear kernel on the original training dataset. Evaluate its accuracy on the test set.
  2. Create a new “outlier-included training dataset” by adding your 20 synthetic records to the original training data.
  3. Train a new SVM classifier on this “outlier-included training dataset” and evaluate it on the same test set.
  4. Compare the performance. Did the inclusion of synthetic outliers in the training data improve or degrade the model’s performance on the test set? Explain how extreme data points can affect the decision boundary of an SVM.
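The effect of an extreme point on a linear SVM's decision boundary can be previewed on a 2-D toy problem: a single far-away, margin-violating point becomes a support vector and pulls the fitted hyperplane toward it. A minimal sketch (all data here is synthetic and illustrative, not the heart-disease features):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two well-separated 2-D clusters
X0 = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(50, 2))
X1 = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

svm_clean = SVC(kernel='linear', C=1.0).fit(X, y)

# Add a single extreme class-1 point deep inside class-0 territory
X_out = np.vstack([X, [[-8.0, 0.0]]])
y_out = np.append(y, 1)
svm_out = SVC(kernel='linear', C=1.0).fit(X_out, y_out)

# The margin-violating outlier becomes a support vector and shifts
# the hyperplane parameters (w, b) away from the clean solution
print('clean   w, b:', svm_clean.coef_[0], svm_clean.intercept_[0])
print('outlier w, b:', svm_out.coef_[0], svm_out.intercept_[0])
```

By contrast, an extreme point that lies far on the *correct* side of the boundary satisfies the margin with zero hinge loss and leaves the hyperplane unchanged, which is relevant when interpreting the results below.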

Python code generated by Gemini 3 Flash:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
import pandas as pd

# 1. Augment the Training Data with the Outliers
# Encode outliers to match the X_train_enc structure
cat_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
df_outliers_enc = pd.get_dummies(df_outliers.drop(columns=['target']), columns=cat_features, drop_first=True)
df_outliers_enc = df_outliers_enc.reindex(columns=X_train_enc.columns, fill_value=0)

# Combine with the original training data
X_train_out = pd.concat([X_train_enc, df_outliers_enc], ignore_index=True)
y_train_out = pd.concat([y_train, df_outliers['target']], ignore_index=True)

# Scale the augmented data with its own scaler so the original
# X_train_scaled / X_test_scaled arrays are left untouched
scaler_out = StandardScaler()
X_train_out_scaled = scaler_out.fit_transform(X_train_out)
X_test_scaled_out = scaler_out.transform(X_test_enc)  # same original test set

# 2. Train a Linear SVM on the Original Training Data
# We reuse X_train_scaled and X_test_scaled from the previous steps
svc_linear_orig = SVC(kernel='linear', random_state=42)
svc_linear_orig.fit(X_train_scaled, y_train)
y_pred_linear_orig = svc_linear_orig.predict(X_test_scaled)
acc_linear_orig = accuracy_score(y_test, y_pred_linear_orig)

# 3. Train a Linear SVM on the Outlier-Augmented Training Data
svc_linear_out = SVC(kernel='linear', random_state=42)
svc_linear_out.fit(X_train_out_scaled, y_train_out)
y_pred_linear_out = svc_linear_out.predict(X_test_scaled_out)
acc_linear_out = accuracy_score(y_test, y_pred_linear_out)

# 4. Comparison
linear_comparison = pd.DataFrame({
    'Metric': ['Accuracy'],
    'Linear SVM (Original)': [acc_linear_orig],
    'Linear SVM (Outlier-Augmented)': [acc_linear_out]
})

print('--- Linear SVM Performance Comparison ---')
display(linear_comparison)

print('\n--- Original Model Report ---')
print(classification_report(y_test, y_pred_linear_orig))

print('\n--- Augmented Model Report ---')
print(classification_report(y_test, y_pred_linear_out))

# 5. For reference: RBF SVM trained on the augmented data
svm_outlier = SVC(kernel='rbf', probability=True, random_state=42)
svm_outlier.fit(X_train_out_scaled, y_train_out)
y_pred_out = svm_outlier.predict(X_test_scaled_out)

print('\n--- SVM (RBF) Performance with Outliers ---')
print(classification_report(y_test, y_pred_out))

# 6. Comparison against the RBF baseline from the previous step
base_acc = accuracy_score(y_test, y_pred_base)
out_acc = accuracy_score(y_test, y_pred_out)

comparison = pd.DataFrame({
    'Metric': ['Accuracy'],
    'Baseline SVM (RBF)': [base_acc],
    'Outlier-Augmented SVM (RBF)': [out_acc]
})
display(comparison)

Answer to the final question:

Both the original model and the outlier-augmented model achieved a test accuracy of 79.89%.

This outcome indicates that the 20 synthetic extreme outliers (cholesterol > 400, blood pressure > 180) did not meaningfully alter the linear decision boundary. Because these points sit far on the diseased side of the hyperplane and are labeled target = 1, they likely satisfy the margin easily, contribute little or nothing to the hinge loss, and so do not become support vectors. In a clinical context, this suggests the high-risk patterns encoded in the outliers were already well represented in the original training set, and the model remains stable even when exposed to extreme clinical profiles.
This outcome indicates that the 20 synthetic extreme outliers (Cholesterol > 400, BP > 180) did not significantly alter the linear decision boundary. In a clinical context, this suggests that the features defining ‘high-risk’ in your outliers were already well-represented or aligned with the patterns the model learned from the original training set, demonstrating model stability even when exposed to extreme clinical profiles.