Notebook-style Walkthrough

Synthetic Data Generation for Credit Risk Analysis

Author: William  •  Date: 2025-10-27

A clear, reproducible setup for generating realistic SME lending data with logical relationships and macro stress. Use this as a base for model comparisons, calibration, and scenario analysis.

Key Objectives

  • Generate realistic financial and credit-related variables.
  • Introduce logical correlations to reflect real-world credit risk modelling.
  • Incorporate macroeconomic indicators to simulate external economic influences.

Core Business Variables

Feature | Description
Revenue | Annual revenue (log-normal distribution)
Time_in_Business | Years since business started (exponential distribution)
Credit_Score | Business credit score (500–1000, truncated normal)
Loan_Amount | Loan requested ($1,000–$50,000, gamma distribution)
Industry | Business sector (Retail, Manufacturing, etc.)
Business_Size | Small, Medium, or Large
Years_with_Bank | Years the business has been a bank client
Number_of_Employees | Headcount
Annual_Profit | Calculated as a percentage of revenue

Macroeconomic Indicators

Feature | Description
Unemployment Rate (%) | Higher unemployment → more defaults
Inflation Rate (%) | Higher inflation → increased loan risk
GDP Growth Rate (%) | Negative growth → more defaults
Market Volatility Index (VIX) | Measures financial market uncertainty
Interest Rate (%) | Higher rates → more expensive loans
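
The generator in the next section covers the firm-level features; the macro indicators above can be layered on top of it. A minimal sketch, assuming independent draws and an illustrative (uncalibrated) linear stress adjustment to the default probability:

Python
# Hypothetical macro overlay: the coefficients below are illustrative assumptions, not calibrated values
import numpy as np

np.random.seed(42)
num_samples = 10000

unemployment  = np.clip(np.random.normal(5.0, 1.5, num_samples), 2.0, 15.0)          # %
inflation     = np.clip(np.random.normal(2.5, 1.0, num_samples), 0.0, 10.0)          # %
gdp_growth    = np.random.normal(2.0, 1.5, num_samples)                              # %, can be negative
vix           = np.clip(np.random.lognormal(np.log(18.0), 0.4, num_samples), 10, 80)
interest_rate = np.clip(np.random.normal(4.0, 1.0, num_samples), 0.5, 12.0)          # %

def stress_pd(base_pd, unemployment, inflation, gdp_growth, vix, interest_rate):
    """Shift a baseline default probability with a simple linear macro overlay."""
    adj = (0.010 * (unemployment - 5.0)
           + 0.005 * (inflation - 2.5)
           - 0.010 * gdp_growth
           + 0.002 * (vix - 18.0)
           + 0.005 * (interest_rate - 4.0))
    return np.clip(base_pd + adj, 0.01, 0.9)

# stress_pd() can be applied to the default_prob array built in the main script
# before the default labels are drawn.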

Distributions & Relationships (Code)

Python
# Synthetic Credit Risk Dataset (reproducible, pure NumPy)
# --------------------------------------------------------
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)
num_samples = 10000

# Helper: truncated normal via rejection

def truncated_normal(mean, std, low, high, size):
    out = np.empty(size, dtype=float)
    filled = 0
    while filled < size:
        batch = np.random.normal(mean, std, size - filled)
        batch = batch[(batch >= low) & (batch <= high)]
        take = min(batch.size, size - filled)
        if take:
            out[filled:filled+take] = batch[:take]
            filled += take
    return out
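
# Note: if SciPy is available, scipy.stats.truncnorm((low - mean)/std, (high - mean)/std,
# loc=mean, scale=std).rvs(size) gives an equivalent vectorised sampler without the loop.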

# 1) Base features
mu, sigma = 10.0, 1.0
revenue = np.random.lognormal(mean=mu, sigma=sigma, size=num_samples)                 # right-skewed
credit_score = truncated_normal(750, 100, 500, 1000, num_samples)                     # bounded 500–1000
loan_amount = np.clip(np.random.gamma(5.0, 5000.0, num_samples), 1000, 50000)         # right-skewed
industry = np.random.choice(["Retail","Manufacturing","Services","Hospitality"],
                            p=[0.4,0.25,0.25,0.1], size=num_samples)
business_size = np.random.choice(["Small","Medium","Large"], p=[0.6,0.3,0.1], size=num_samples)
years_in_business = np.clip(np.random.exponential(5.0, num_samples).astype(int), 1, 20)
years_with_bank = (years_in_business * np.random.uniform(0.2,0.8, num_samples)).astype(int)

num_employees = np.empty(num_samples, dtype=int)
mask_s = business_size=="Small"; mask_m = business_size=="Medium"; mask_l = business_size=="Large"
num_employees[mask_s] = np.random.randint(1,20, mask_s.sum())
num_employees[mask_m] = np.random.randint(20,100, mask_m.sum())
num_employees[mask_l] = np.random.randint(100,500, mask_l.sum())

# 2) Logical correlations
size_mult = {"Small":1.0, "Medium":2.0, "Large":5.0}
revenue = revenue * np.vectorize(size_mult.get)(business_size)
annual_profit = np.clip(revenue * np.random.uniform(0.05,0.20, num_samples), 0, None)
credit_score = np.clip(
    credit_score + (revenue/1e5)*50 + (annual_profit/5e4)*30 + years_in_business*5, 500, 1000
)
loan_amount = loan_amount * (1.0 - 0.2 * ((1000.0 - credit_score)/1000.0))            # risk-adjusted

# 3) Default probability & label
industry_risk = {"Retail":1.2, "Manufacturing":0.8, "Services":1.0, "Hospitality":1.3}
ind_factor = np.vectorize(industry_risk.get)(industry)
# Lower scores and larger loans push the baseline PD up; industry risk scales the result
base_pd = 0.3 - (credit_score - 500.0)/2000.0 + (loan_amount - 1000.0)/100000.0
default_prob = np.clip(base_pd, 0.05, 0.5) * ind_factor
default_prob = np.clip(default_prob, 0.01, 0.9)
default = np.random.binomial(1, default_prob, num_samples)

# 4) DataFrame
df = pd.DataFrame({
    "Revenue": revenue,
    "Time_in_Business": years_in_business,
    "Credit_Score": credit_score,
    "Loan_Amount": loan_amount,
    "Industry": industry,
    "Default": default,
    "Business_Size": business_size,
    "Years_with_Bank": years_with_bank,
    "Number_of_Employees": num_employees,
    "Annual_Profit": annual_profit
})

# 5) Plotting — one figure per chart (no seaborn)
# Save figures as PNG; replace paths as needed

fig = plt.figure(); plt.hist(df["Revenue"], bins=100); plt.title("Revenue (log-normal)"); fig.savefig("revenue_hist.png"); plt.show()
fig = plt.figure(); plt.hist(df["Credit_Score"], bins=50); plt.title("Credit Score (trunc-normal + lift)"); fig.savefig("credit_score_hist.png"); plt.show()
fig = plt.figure(); plt.hist(df["Loan_Amount"], bins=60); plt.title("Loan Amount (gamma, adjusted)"); fig.savefig("loan_amount_hist.png"); plt.show()
fig = plt.figure(); plt.hist(df["Annual_Profit"], bins=80); plt.title("Annual Profit (5–20% margin)"); fig.savefig("annual_profit_hist.png"); plt.show()

# Relationships
fig = plt.figure(); plt.scatter(df["Revenue"], df["Credit_Score"], s=5, alpha=0.5); plt.title("Score vs Revenue"); fig.savefig("score_vs_revenue.png"); plt.show()
fig = plt.figure(); plt.scatter(df["Credit_Score"], df["Loan_Amount"], s=5, alpha=0.5); plt.title("Loan vs Score"); fig.savefig("loan_vs_score.png"); plt.show()

Tip: to make the code executable in-browser, wire up Thebe.js with a Binder kernel, or experiment with Pyodide for pure in-browser Python.

Why these distributions & plots?

  • Revenue — Log-normal: firm revenues are multiplicative growth processes and exhibit heavy right tails; the log-normal captures many small firms and a few very large ones. The histogram should be strongly right-skewed.
  • Credit Score — Truncated Normal: scores are bounded (500–1000 here). Truncation respects business rules; the distribution centers near prime credit with mass trimmed at the edges.
  • Loan Amount — Gamma: loan sizes are positive and right-skewed; a gamma matches the empirical pattern of many small/moderate loans and fewer big ones. We also risk-adjust amounts down for lower scores.
  • Years in Business — Exponential: survival-bias shape with many young firms and a decaying tail of older survivors.
  • Annual Profit — Proportional to revenue: simple margins (5–20%) create a realistic spread consistent with heterogeneous cost structures.
  • Relationship plots: Score vs Revenue sanity-checks monotonicity (better performance → higher score). Loan vs Score verifies risk-based lending (larger loans concentrate at higher scores).
  • Categorical checks (industry mix and default rate by industry): ensure priors and relative risks behave as intended before modelling; a quick check is sketched below.
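
A minimal version of those checks, assuming the df built in the script above (the bar chart is optional styling):

Python
# Sanity-check categorical priors and relative risk (assumes `df` from the script above)
industry_mix = df["Industry"].value_counts(normalize=True)
default_by_industry = df.groupby("Industry")["Default"].mean().sort_values()

print(industry_mix)            # should roughly match the 0.4 / 0.25 / 0.25 / 0.1 priors
print(default_by_industry)     # Hospitality and Retail should sit above Manufacturing

fig = plt.figure()
default_by_industry.plot(kind="bar")
plt.title("Default rate by industry")
fig.savefig("default_by_industry.png")
plt.show()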

Notes

  • If you export the dataset for your site (e.g., df.to_csv("synthetic_credit_data.csv", index=False)), point the path at a location you serve statically, such as assets/data/synthetic_credit_data.csv.
  • The generator is already seeded (np.random.seed(42)); document the seed and the intended correlations in a separate section so runs stay reproducible.
  • Add a Modelling section to demonstrate train/validation splits, calibration curves, and lift charts; a starting point is sketched below.
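
A minimal starting point for that modelling section, assuming the df built above and scikit-learn installed (the feature list and model settings are illustrative):

Python
# Baseline PD model with a calibration check (assumes `df` from the script above)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve

features = ["Revenue", "Time_in_Business", "Credit_Score", "Loan_Amount",
            "Years_with_Bank", "Number_of_Employees", "Annual_Profit"]
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df["Default"], test_size=0.25, random_state=42, stratify=df["Default"])

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
val_pred = model.predict_proba(X_val)[:, 1]

# Calibration curve: observed default rate vs predicted probability, 10 bins
frac_pos, mean_pred = calibration_curve(y_val, val_pred, n_bins=10)
fig = plt.figure()
plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Predicted default probability")
plt.ylabel("Observed default rate")
plt.title("Calibration curve (validation)")
plt.legend()
fig.savefig("calibration_curve.png")
plt.show()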