Synthetic Data Generation Workflow - Bank Marketing Data¶


Synthetic data is artificially generated data that mimics the statistical properties of a real dataset (e.g. column structure, distributions, relationships between variables). It is useful for sharing data safely, augmenting datasets, and testing pipelines without exposing sensitive information.

This notebook shows a synthetic data generation workflow using the Synthetic Data Vault (SDV) library and a marketing dataset.

The dataset represents a bank marketing campaign from the UCI Machine Learning Repository, with 4,119 records, 21 variables, and a mix of numeric and categorical types.

The goal is to generate a synthetic version of the dataset that preserves the statistical properties of the original, including column distributions, inter-variable correlations, and valid category values.

The final model used in this workflow is SDV's CopulaGANSynthesizer, a hybrid approach that combines Gaussian copula marginal fitting with CTGAN-based joint distribution learning. Four synthesizers were evaluated in total; see the Synthetic Data Generation section for a discussion of the tradeoffs.

The notebook is organized as follows:

  • Data Description: discussion of context, interesting variables, and missing values
  • Data Loading: schema definition, ingestion, and data summary
  • Synthetic Data Generation: metadata detection, model training, and sampling
  • Synthetic Data Validation: distribution comparison, correlation analysis, quality scores, and constraint checks
  • Save Outputs: save synthetic dataset, fitted model, and metadata to disk
  • Conditional Generation: brief demo of SDV's conditional generation functionality

Data Description¶


The dataset comes from direct marketing campaigns run by a Portuguese bank between 2008 and 2013. Each record represents one client contact (by phone), with the target variable y indicating whether the client subscribed to a term deposit.

Variables fall into three groups:

  • Client attributes: age, job, marital status, education, and whether the client has credit in default, a housing loan, or a personal loan
  • Campaign contact info: contact type, month, day of week, call duration, and number of contacts in the current campaign
  • Economic context: employment variation rate, consumer price index, consumer confidence index, Euribor 3-month rate, and number of employees

Note on pdays: This variable records the number of days since the client was last contacted in a previous campaign. The value 999 means they were never previously contacted, which accounts for about 96% of rows. This is a sentinel value, not missing data, and it makes pdays effectively bimodal, which will show up in the validation results.
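The sentinel is easy to make explicit in code. A minimal sketch on a toy stand-in for pdays (not the real column):

```python
import pandas as pd

# Toy stand-in for pdays: 96% sentinel values, a few real contact gaps
pdays = pd.Series([999] * 96 + [3, 6, 6, 12])

never_contacted = pdays.eq(999)          # boolean mask for the sentinel
print(never_contacted.mean())            # 0.96: fraction never contacted
print(pdays[~never_contacted].tolist())  # [3, 6, 6, 12]: the genuine gaps
```

Splitting the two modes like this (a flag plus the genuine values) is a common alternative encoding; here we keep the raw column, so the synthesizer must learn the bimodal shape itself.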

Note on missing values: Some columns (e.g. default, housing, and loan) contain an unknown category which we treat as a valid category rather than imputing or converting to NaN. This preserves the information that the value was unknown at the time of contact. Otherwise, the dataset has no true missing values.
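In pandas, this means leaving unknown as an ordinary category level. A small sketch of the difference versus converting it to missing:

```python
import pandas as pd

housing = pd.Series(['yes', 'no', 'unknown', 'yes'], dtype='category')

# Kept as a level: no missing values, 'unknown' carries information
print(housing.isna().sum())              # 0
print(list(housing.cat.categories))      # ['no', 'unknown', 'yes']

# The alternative (masking to NA) would discard that information
as_na = housing.astype(object).mask(housing == 'unknown')
print(as_na.isna().sum())                # 1
```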

Data Loading¶


We define a schema with data types for each column and then load the data.

In [1]:
import pandas as pd
import numpy as np
from minieda import summarize
import sdv
from sdv.metadata import Metadata
from sdv.single_table import CopulaGANSynthesizer
from sdv.evaluation.single_table import evaluate_quality
from sdv.evaluation.single_table import run_diagnostic

from sdmetrics.single_table import NewRowSynthesis
from sdmetrics.single_table import DCRBaselineProtection

from sdv.sampling import Condition

import matplotlib.pyplot as plt
import os
import warnings

warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)

print(pd.__version__)
print(sdv.__version__)
2.2.3
1.34.2
In [2]:
# Function for printing tables
def style_table(df, caption, precision=1, caption_side="bottom", font_size="16px", font_weight="normal"):
    return (df.style
            .format(precision=precision).set_caption(caption)
            .set_table_styles([{"selector": "caption", 
                                "props": [("caption-side", caption_side),
                                          ("font-size", font_size),
                                          ("font-weight", font_weight)]}]))
In [3]:
data_schema = {
    'age':              'int64',      # client age
    'job':              'category',   # type of job (admin, blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown)
    'marital':          'category',   # marital status (divorced, married, single, unknown)
    'education':        'category',   # education level (basic.4y, basic.6y, basic.9y, high.school, illiterate, professional.course, university.degree, unknown)
    'default':          'category',   # has credit in default? (yes, no, unknown)
    'housing':          'category',   # has housing loan? (yes, no, unknown)
    'loan':             'category',   # has personal loan? (yes, no, unknown)
    'contact':          'category',   # contact communication type (cellular, telephone)
    'month':            'category',   # last contact month of year (jan–dec)
    'day_of_week':      'category',   # last contact day of the week (mon, tue, wed, thu, fri)
    'duration':         'int64',      # last contact duration in seconds — benchmark only, not for predictive modeling
    'campaign':         'int64',      # number of contacts performed during this campaign (includes last contact)
    'pdays':            'int64',      # days since client was last contacted from previous campaign (999 = not previously contacted)
    'previous':         'int64',      # number of contacts performed before this campaign
    'poutcome':         'category',   # outcome of previous marketing campaign (failure, nonexistent, success)
    'emp.var.rate':     'float64',    # employment variation rate — quarterly indicator
    'cons.price.idx':   'float64',    # consumer price index — monthly indicator
    'cons.conf.idx':    'float64',    # consumer confidence index — monthly indicator
    'euribor3m':        'float64',    # euribor 3 month rate — daily indicator
    'nr.employed':      'float64',    # number of employees — quarterly indicator
    'y':                'category',   # target: has the client subscribed a term deposit? (yes, no)
}

# Load data with schema
data = pd.read_csv('data/bank-additional.csv', sep=";", dtype=data_schema)

# Convert yes/no columns to True/False
bool_cols = ['y']
for col in bool_cols:
    data[col] = data[col].map({'yes': True, 'no': False})
In [4]:
summarize(data)
Out[4]:
dtype count unique unique_pct missing missing_pct zero zero_pct top freq mean std min 50% max skew
age int64 4119 67 1.63 0 0.0 0 0.00 40.11 10.31 18.0 38.0 88.0 0.72
campaign int64 4119 25 0.61 0 0.0 0 0.00 2.54 2.57 1.0 2.0 35.0 4.0
nr.employed float64 4119 11 0.27 0 0.0 0 0.00 5166.48 73.67 4963.6 5191.0 5228.1 -1.08
euribor3m float64 4119 234 5.68 0 0.0 0 0.00 3.62 1.73 0.64 4.86 5.04 -0.72
cons.conf.idx float64 4119 26 0.63 0 0.0 0 0.00 -40.5 4.59 -50.8 -41.8 -26.9 0.29
cons.price.idx float64 4119 26 0.63 0 0.0 0 0.00 93.58 0.58 92.2 93.75 94.77 -0.22
emp.var.rate float64 4119 10 0.24 0 0.0 0 0.00 0.08 1.56 -3.4 1.1 1.4 -0.73
previous int64 4119 7 0.17 0 0.0 3523 85.53 0.19 0.54 0.0 0.0 6.0 4.02
pdays int64 4119 21 0.51 0 0.0 2 0.05 960.42 191.92 0.0 999.0 999.0 -4.78
duration int64 4119 828 20.10 0 0.0 1 0.02 256.79 254.7 0.0 181.0 3643.0 3.29
job category 4119 12 0.29 0 0.0 0 0.00 admin. 1012
day_of_week category 4119 5 0.12 0 0.0 0 0.00 thu 860
month category 4119 10 0.24 0 0.0 0 0.00 may 1378
contact category 4119 2 0.05 0 0.0 0 0.00 cellular 2652
poutcome category 4119 3 0.07 0 0.0 0 0.00 nonexistent 3523
loan category 4119 3 0.07 0 0.0 0 0.00 no 3349
housing category 4119 3 0.07 0 0.0 0 0.00 yes 2175
default category 4119 3 0.07 0 0.0 0 0.00 no 3315
education category 4119 8 0.19 0 0.0 0 0.00 university.degree 1264
marital category 4119 4 0.10 0 0.0 0 0.00 married 2509
y category 4119 2 0.05 0 0.0 0 0.00 False 3668

Synthetic Data Generation¶


We use CopulaGANSynthesizer as the primary synthesizer. It combines Gaussian copula marginal fitting with CTGAN-based joint distribution learning, which produces better fidelity on both column distributions and categorical relationships than GaussianCopulaSynthesizer alone (Column Shapes 0.88 vs 0.74, categorical pair trends 0.67 vs 0.43).

Three other synthesizers were evaluated (code and results not shown in this notebook):

  • GaussianCopulaSynthesizer produced good validity and privacy scores but systematically miscalibrated heavily imbalanced categorical columns, inflating the unknown category in default and marital to become the dominant value in the synthetic data.
  • CTGANSynthesizer failed to reproduce inter-variable correlations at this dataset size, producing a near-flat correlation matrix.
  • TVAESynthesizer dropped minority categories entirely for several columns including loan and marital.

CopulaGAN avoids all three failure modes while accepting a noticeable tradeoff in nearest-neighbor privacy score (0.37 vs 0.53 for Gaussian Copula) — synthetic rows sit closer to real rows than random noise.

We sample 4,119 rows (same as the original dataset).

In [5]:
metadata = Metadata.detect_from_dataframe(data=data, table_name='bank_marketing')

# CopulaGAN requires categorical columns to be `object` dtype instead of pandas' `category`
data_copulagan = data.copy()
cat_cols = data.select_dtypes('category').columns
data_copulagan[cat_cols] = data_copulagan[cat_cols].astype(object)

# Instantiate and fit the model
synthesizer = CopulaGANSynthesizer(metadata, epochs=150, verbose=True)
synthesizer.fit(data_copulagan)

# Generate synthetic data, with the same number of rows as the original dataset
synthetic_data = synthesizer.sample(num_rows=data_copulagan.shape[0])

# Restore category dtype to match real data (CopulaGAN outputs object dtype)
synthetic_data[cat_cols] = synthetic_data[cat_cols].astype('category')

print(synthetic_data.shape)
Gen. (-00.64) | Discrim. (-00.49): 100%|█████████████████████████████████████████████| 150/150 [01:39<00:00,  1.51it/s]
(4119, 21)
In [6]:
synthetic_data.head()
Out[6]:
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 49 technician single university.degree unknown no no cellular jul mon ... 7 999 0 nonexistent 1.4 94.500 -42.3 4.967 5228.1 False
1 50 blue-collar married basic.4y no yes no telephone jun mon ... 2 999 0 nonexistent 1.4 93.950 -36.3 4.145 5228.1 False
2 50 blue-collar single basic.6y no yes no cellular jul tue ... 6 999 0 nonexistent 1.2 93.546 -35.9 4.955 5228.1 False
3 36 technician married basic.9y no yes yes cellular jul wed ... 4 999 0 nonexistent 1.4 94.097 -36.7 4.980 5228.1 False
4 88 admin. single university.degree no yes no cellular apr thu ... 1 4 1 failure -1.7 93.007 -26.9 1.085 5007.6 True

5 rows × 21 columns

In [7]:
summarize(synthetic_data)
Out[7]:
dtype count unique unique_pct missing missing_pct zero zero_pct top freq mean std min 50% max skew
age int64 4119 68 1.65 0 0.0 0 0.00 47.81 14.07 20.0 46.0 88.0 0.73
campaign int64 4119 33 0.80 0 0.0 0 0.00 3.16 4.15 1.0 2.0 34.0 3.75
nr.employed float64 4119 1453 35.28 0 0.0 0 0.00 5142.09 86.9 4963.6 5178.9 5228.1 -0.72
euribor3m float64 4119 1462 35.49 0 0.0 0 0.00 3.36 1.73 0.64 4.31 4.99 -0.36
cons.conf.idx float64 4119 221 5.37 0 0.0 0 0.00 -39.34 5.6 -50.8 -41.3 -26.9 0.56
cons.price.idx float64 4119 1392 33.79 0 0.0 0 0.00 93.56 0.59 92.2 93.49 94.77 -0.03
emp.var.rate float64 4119 49 1.19 0 0.0 19 0.46 -0.15 1.65 -3.4 0.7 1.4 -0.55
previous int64 4119 7 0.17 0 0.0 3090 75.02 0.41 0.92 0.0 0.0 6.0 3.2
pdays int64 4119 190 4.61 0 0.0 59 1.43 858.2 336.89 0.0 999.0 999.0 -1.99
duration int64 4119 1240 30.10 0 0.0 0 0.00 482.64 546.12 4.0 307.0 3643.0 2.83
job category 4119 12 0.29 0 0.0 0 0.00 blue-collar 889
day_of_week category 4119 5 0.12 0 0.0 0 0.00 tue 1166
month category 4119 10 0.24 0 0.0 0 0.00 may 920
contact category 4119 2 0.05 0 0.0 0 0.00 cellular 3016
poutcome category 4119 3 0.07 0 0.0 0 0.00 nonexistent 3178
loan category 4119 3 0.07 0 0.0 0 0.00 no 2695
housing category 4119 3 0.07 0 0.0 0 0.00 yes 2561
default category 4119 3 0.07 0 0.0 0 0.00 no 3186
education category 4119 8 0.19 0 0.0 0 0.00 high.school 1153
marital category 4119 4 0.10 0 0.0 0 0.00 married 1927
y category 4119 2 0.05 0 0.0 0 0.00 False 3168

Synthetic Data Validation¶


In this section, we use several methods to check how the synthetic data compares to the original dataset.

Per-variable Distribution Comparison¶

We plot each column of the real and synthetic data side-by-side using KDE curves for numeric columns and grouped bar plots for categorical columns.

A well-fitting synthetic dataset should show close overlap throughout.

The table below the plots summarizes mean, std, min, and max for the numeric columns.

There are noticeable differences in the means of several features, e.g. duration, previous, and emp.var.rate.

In [8]:
categorical_cols = ['job', 'marital', 'education', 'default', 'housing', 'loan',
                    'contact', 'month', 'day_of_week', 'poutcome']
boolean_cols     = ['y']
numeric_cols     = ['age', 'duration', 'campaign', 'pdays', 'previous',
                    'emp.var.rate', 'cons.price.idx', 'cons.conf.idx',
                    'euribor3m', 'nr.employed']

all_cols = categorical_cols + boolean_cols + numeric_cols
n_cols   = len(all_cols)

# Set up grid
n_grid_cols = 3
n_grid_rows = int(np.ceil(n_cols / n_grid_cols))

fig, axes = plt.subplots(n_grid_rows, n_grid_cols, figsize=(18, n_grid_rows * 4))
axes = axes.flatten()

for i, col in enumerate(all_cols):
    ax = axes[i]

    if col in numeric_cols:
        # Overlapping KDE plots for numeric columns
        data[col].plot.kde(ax=ax, label='Real', color='steelblue', linewidth=2)
        synthetic_data[col].plot.kde(ax=ax, label='Synthetic', color='coral',
                                      linewidth=2, linestyle='--')
        ax.set_title(col, fontsize=13)
        ax.legend()
        ax.tick_params(axis='both', labelsize=11)

    else:
        # Grouped bar chart for categorical and boolean columns
        real_counts = data[col].value_counts(normalize=True).sort_index()
        synth_counts = synthetic_data[col].value_counts(normalize=True).sort_index()

        # Align categories in case synthetic has different order
        all_categories = sorted(set(real_counts.index) | set(synth_counts.index),
                                key=str)
        real_counts  = real_counts.reindex(all_categories, fill_value=0)
        synth_counts = synth_counts.reindex(all_categories, fill_value=0)

        x = np.arange(len(all_categories))
        width = 0.35

        ax.bar(x - width/2, real_counts.values,  width, label='Real',
               color='steelblue', alpha=0.8)
        ax.bar(x + width/2, synth_counts.values, width, label='Synthetic',
               color='coral', alpha=0.8)

        ax.set_xticks(x)
        ax.set_xticklabels(all_categories, rotation=45, ha='right', fontsize=11)
        ax.set_title(col, fontsize=13)
        ax.legend()

# Hide any unused subplots
for j in range(i + 1, len(axes)):
    axes[j].set_visible(False)

plt.suptitle('Per-Variable Distribution: Real vs Synthetic', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()
[Figure: per-variable distribution plots (KDE curves and grouped bars), real vs synthetic]
In [9]:
real_stats  = data[numeric_cols].describe().T[['mean', 'std', 'min', 'max']]
synth_stats = synthetic_data[numeric_cols].describe().T[['mean', 'std', 'min', 'max']]

real_stats.columns  = ['real_mean', 'real_std', 'real_min', 'real_max']
synth_stats.columns = ['synth_mean', 'synth_std', 'synth_min', 'synth_max']

summary = pd.concat([real_stats, synth_stats], axis=1)
style_table(summary.round(3), caption="Numeric Column Statistics: Real vs Synthetic")
Out[9]:
Numeric Column Statistics: Real vs Synthetic
  real_mean real_std real_min real_max synth_mean synth_std synth_min synth_max
age 40.1 10.3 18.0 88.0 47.8 14.1 20.0 88.0
duration 256.8 254.7 0.0 3643.0 482.6 546.1 4.0 3643.0
campaign 2.5 2.6 1.0 35.0 3.2 4.1 1.0 34.0
pdays 960.4 191.9 0.0 999.0 858.2 336.9 0.0 999.0
previous 0.2 0.5 0.0 6.0 0.4 0.9 0.0 6.0
emp.var.rate 0.1 1.6 -3.4 1.4 -0.1 1.6 -3.4 1.4
cons.price.idx 93.6 0.6 92.2 94.8 93.6 0.6 92.2 94.8
cons.conf.idx -40.5 4.6 -50.8 -26.9 -39.3 5.6 -50.8 -26.9
euribor3m 3.6 1.7 0.6 5.0 3.4 1.7 0.6 5.0
nr.employed 5166.5 73.7 4963.6 5228.1 5142.1 86.9 4963.6 5228.1

Correlation Structure¶


Beyond individual column distributions, a synthetic dataset should preserve relationships between variables. We first compare numeric correlation matrices visually using heatmaps. Then we use SDV's evaluate_quality function to compute Column Pair Trends scores.

Column Pair Trends evaluates two types of relationships:

  • Numeric-numeric pairs (Correlation Similarity): how well linear correlations between numeric columns are reproduced
  • Categorical/mixed pairs (Contingency Similarity): how well associations between categorical columns, and between categorical and numeric columns, are preserved

Each row in the tables below represents a column pair. The Score measures how well the relationship was preserved in the synthetic data, from 0 (not preserved) to 1 (perfectly preserved). The tables below show only column pairs that meet SDV's association threshold; pairs where the real data shows a weak relationship are excluded.

We also compute mean scores to summarize performance of each of the pair types.
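For the numeric-numeric case, the score is simple to reproduce by hand: SDMetrics computes CorrelationSimilarity as 1 - |r_real - r_synth| / 2, so an exactly preserved correlation scores 1.0 and a fully reversed one scores 0.0. A toy sketch:

```python
import numpy as np
import pandas as pd

def correlation_similarity(real: pd.DataFrame, synth: pd.DataFrame,
                           col_a: str, col_b: str) -> float:
    """Score in [0, 1]; 1 means the pairwise correlation is perfectly preserved."""
    r_real = real[col_a].corr(real[col_b])
    r_synth = synth[col_a].corr(synth[col_b])
    return 1 - abs(r_real - r_synth) / 2

rng = np.random.default_rng(0)
x = rng.normal(size=500)
real = pd.DataFrame({'a': x, 'b': x + rng.normal(scale=0.2, size=500)})
synth = pd.DataFrame({'a': rng.normal(size=500), 'b': rng.normal(size=500)})

print(correlation_similarity(real, real, 'a', 'b'))   # 1.0: identical data
print(correlation_similarity(real, synth, 'a', 'b'))  # lower: correlation lost
```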

In [10]:
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

real_corr  = data[numeric_cols].corr()
synth_corr = synthetic_data[numeric_cols].corr()

# Shared color scale
vmin, vmax = -1, 1

for ax, corr, title in zip(axes, [real_corr, synth_corr], ['Real Data', 'Synthetic Data']):
    im = ax.imshow(corr, cmap='coolwarm', vmin=vmin, vmax=vmax, aspect='auto')
    ax.set_xticks(range(len(numeric_cols)))
    ax.set_yticks(range(len(numeric_cols)))
    ax.set_xticklabels(numeric_cols, rotation=45, ha='right', fontsize=11)
    ax.set_yticklabels(numeric_cols, fontsize=11)
    ax.set_title(title, fontsize=14)

    # Annotate cells with correlation values
    for i in range(len(numeric_cols)):
        for j in range(len(numeric_cols)):
            ax.text(j, i, f'{corr.iloc[i, j]:.2f}',
                    ha='center', va='center', fontsize=8,
                    color='white' if abs(corr.iloc[i, j]) > 0.6 else 'black')

# Dedicated colorbar axis
cbar_ax = fig.add_axes([0.92, 0.18, 0.02, 0.7])  # [left, bottom, width, height]
fig.colorbar(im, cax=cbar_ax)

plt.suptitle('Correlation Matrix: Real vs Synthetic', fontsize=16)
plt.tight_layout(rect=[0, 0, 0.91, 1])
plt.show()
[Figure: annotated correlation matrix heatmaps, real vs synthetic]
In [11]:
quality_report = evaluate_quality(
    real_data=data,
    synthetic_data=synthetic_data,
    metadata=metadata,
    verbose=False
)

pair_trends = quality_report.get_details(property_name='Column Pair Trends')
pair_trends_nona = pair_trends.dropna(subset=['Score'])

# Split into correlation (numeric-numeric) and contingency (categorical) pairs
correlation_pairs = pair_trends_nona[pair_trends_nona['Metric'] == 'CorrelationSimilarity']
contingency_pairs = pair_trends_nona[pair_trends_nona['Metric'] == 'ContingencySimilarity']
In [12]:
style_table(correlation_pairs[['Column 1', 'Column 2', 'Score', 'Real Correlation', 'Synthetic Correlation']],
            caption="Numeric-Numeric Pairs (Correlation Similarity)")
Out[12]:
Numeric-Numeric Pairs (Correlation Similarity)
  Column 1 Column 2 Score Real Correlation Synthetic Correlation
195 emp.var.rate cons.price.idx 0.8 0.8 0.4
197 emp.var.rate euribor3m 0.8 1.0 0.5
198 emp.var.rate nr.employed 0.8 0.9 0.4
201 cons.price.idx euribor3m 0.9 0.7 0.4
207 euribor3m nr.employed 0.8 0.9 0.5
In [13]:
style_table(contingency_pairs[['Column 1', 'Column 2', 'Score', 'Real Association']],
            caption="Categorical/Mixed Pairs (Contingency Similarity)")
Out[13]:
Categorical/Mixed Pairs (Contingency Similarity)
  Column 1 Column 2 Score Real Association
21 job education 0.6 0.4
90 housing loan 0.8 0.7
119 contact month 0.7 0.6
126 contact emp.var.rate 0.6 0.7
127 contact cons.price.idx 0.6 0.7
128 contact cons.conf.idx 0.7 0.4
129 contact euribor3m 0.8 0.5
130 contact nr.employed 0.7 0.5
138 month emp.var.rate 0.4 0.7
139 month cons.price.idx 0.2 0.7
140 month cons.conf.idx 0.3 0.6
141 month euribor3m 0.5 0.5
142 month nr.employed 0.4 0.6
164 duration y 0.7 0.4
174 pdays previous 0.8 0.6
175 pdays poutcome 0.8 0.9
181 pdays y 0.8 0.3
182 previous poutcome 0.8 0.7
189 poutcome emp.var.rate 0.6 0.4
190 poutcome cons.price.idx 0.6 0.4
191 poutcome cons.conf.idx 0.7 0.4
192 poutcome euribor3m 0.7 0.4
193 poutcome nr.employed 0.6 0.4
194 poutcome y 0.9 0.3
199 emp.var.rate y 0.7 0.3
203 cons.price.idx y 0.6 0.3
206 cons.conf.idx y 0.7 0.4
208 euribor3m y 0.7 0.4
209 nr.employed y 0.7 0.4
In [14]:
print(f"Numeric-Numeric Pairs Mean Score: {correlation_pairs['Score'].mean():.3f}")
print(f"Categorical/Mixed Pairs Mean Score: {contingency_pairs['Score'].mean():.3f}")
Numeric-Numeric Pairs Mean Score: 0.798
Categorical/Mixed Pairs Mean Score: 0.644

SDV Quality Scores¶


SDV's evaluate_quality function scores each column individually on how well its distribution was reproduced ('Column Shapes') and each column pair on how well their relationship was preserved ('Column Pair Trends').

We already examined the pair trends above; here we look at the per-column scores.
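Both per-column metrics are straightforward to approximate by hand: KSComplement is 1 minus the two-sample Kolmogorov-Smirnov statistic, and TVComplement is 1 minus the total variation distance between category frequencies. A sketch (assuming scipy is available):

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def ks_complement(real: pd.Series, synth: pd.Series) -> float:
    """1 minus the two-sample KS statistic; 1.0 = identical distributions."""
    return 1 - ks_2samp(real, synth).statistic

def tv_complement(real: pd.Series, synth: pd.Series) -> float:
    """1 minus total variation distance between category frequencies."""
    p = real.value_counts(normalize=True)
    q = synth.value_counts(normalize=True)
    cats = p.index.union(q.index)
    return 1 - 0.5 * (p.reindex(cats, fill_value=0)
                      - q.reindex(cats, fill_value=0)).abs().sum()

rng = np.random.default_rng(0)
real_num = pd.Series(rng.normal(40, 10, size=1000))
print(ks_complement(real_num, real_num))            # 1.0

real_cat = pd.Series(['yes'] * 90 + ['no'] * 10)
synth_cat = pd.Series(['yes'] * 80 + ['no'] * 20)
print(tv_complement(real_cat, synth_cat))           # ≈ 0.9
```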

In [15]:
quality_report.get_details(property_name='Column Shapes')
Out[15]:
Column Metric Score
0 age KSComplement 0.764020
1 job TVComplement 0.879825
2 marital TVComplement 0.858704
3 education TVComplement 0.866472
4 default TVComplement 0.968682
5 housing TVComplement 0.873513
6 loan TVComplement 0.841224
7 contact TVComplement 0.911629
8 month TVComplement 0.871328
9 day_of_week TVComplement 0.837096
10 duration KSComplement 0.749697
11 campaign KSComplement 0.932993
12 pdays KSComplement 0.889051
13 previous TVComplement 0.894877
14 poutcome TVComplement 0.916242
15 emp.var.rate KSComplement 0.864530
16 cons.price.idx KSComplement 0.817674
17 cons.conf.idx KSComplement 0.815246
18 euribor3m KSComplement 0.837825
19 nr.employed KSComplement 0.738043
20 y TVComplement 0.878611

Privacy and Memorization Checks¶


We run two checks to confirm that the synthetic data doesn't reproduce records from the training data.

  • New Row Score (NewRowSynthesis): checks whether any synthetic rows are exact copies of real rows. A score of 1.0 means every synthetic row is new.
  • Nearest-Neighbor Privacy Score (DCRBaselineProtection): measures how close synthetic rows are to their nearest real neighbors, relative to a random data baseline. A score of 1.0 means the synthetic data is no closer to the real data than random noise (the best possible privacy). A score near 0.5 is typical for a well-trained model.

The New Row Score of 1.0 confirms no exact copies. The Nearest-Neighbor Privacy Score of 0.417 is somewhat below the 0.5 typical of a well-trained model: the synthetic data has a median distance of 0.148 to its nearest real neighbors, compared with 0.354 for randomly generated data. In other words, synthetic rows sit closer to real rows than random noise would, which is expected for a model that has learned the data distribution well. It does not indicate memorization of individual records, which the New Row Score of 1.0 rules out.
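The DCR (distance to closest record) mechanics behind this score can be sketched with toy numeric data. This is a simplified Euclidean version; the real metric also normalizes columns and handles categoricals:

```python
import numpy as np

rng = np.random.default_rng(0)

def median_dcr(candidate: np.ndarray, real: np.ndarray) -> float:
    """Median distance-to-closest-record: for each candidate row,
    the Euclidean distance to its nearest real row."""
    dists = np.linalg.norm(candidate[:, None, :] - real[None, :, :], axis=2)
    return float(np.median(dists.min(axis=1)))

real = rng.normal(size=(300, 5))
synthetic = real + rng.normal(scale=0.2, size=(300, 5))  # hugs the real data
random_baseline = rng.uniform(real.min(axis=0), real.max(axis=0),
                              size=(300, 5))

# Score is the ratio of median DCRs, capped at 1 (1 = as private as noise)
score = min(median_dcr(synthetic, real) / median_dcr(random_baseline, real), 1.0)
print(f"{score:.2f}")  # well below 1: synthetic rows sit close to real rows
```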

In [16]:
metadata_dict = metadata.to_dict()['tables']['bank_marketing']

def to_str_categories(df):
    return df.copy().assign(**{
        col: df[col].astype(str) 
        for col in df.select_dtypes('category').columns
    })

new_row_score = NewRowSynthesis.compute(
    real_data=to_str_categories(data),
    synthetic_data=to_str_categories(synthetic_data),
    metadata=metadata_dict,
    numerical_match_tolerance=0.01,
    synthetic_sample_size=None
)

nn_privacy_score = DCRBaselineProtection.compute_breakdown(
    real_data=data,
    synthetic_data=synthetic_data,
    metadata=metadata_dict
)

print(f"New Row Score: {new_row_score:.3f}")
print(f"Nearest-Neighbor Privacy Score: {nn_privacy_score['score']:.3f}")
print(f"Synthetic median DCR to real data: {nn_privacy_score['median_DCR_to_real_data']['synthetic_data']:.3f}")
print(f"Random baseline median DCR to real data: {nn_privacy_score['median_DCR_to_real_data']['random_data_baseline']:.3f}")
New Row Score: 1.000
Nearest-Neighbor Privacy Score: 0.417
Synthetic median DCR to real data: 0.148
Random baseline median DCR to real data: 0.354

SDV Validity Checks¶


To conclude this section, we run SDV's built-in diagnostic report, which performs basic checks across the entire dataset and flags any problem areas. It checks:

  • Data Structure: the synthetic table has the same column names as the real data.
  • Data Validity: synthetic values respect the original data's column types and value ranges.

We expect both scores to be close to 100%.

In [17]:
diagnostic_report = run_diagnostic(
    real_data=data,
    synthetic_data=synthetic_data,
    metadata=metadata,
    verbose=True
)
Generating report ...

(1/2) Evaluating Data Validity: |███████████████████████████████████████████████████| 21/21 [00:00<00:00, 1613.70it/s]|
Data Validity Score: 100.0%

(2/2) Evaluating Data Structure: |█████████████████████████████████████████████████████| 1/1 [00:00<00:00, 333.41it/s]|
Data Structure Score: 100.0%

Overall Score (Average): 100.0%

Synthetic Data Quality Summary¶


The table below summarizes the key validation metrics computed in this section.

  • Data Validity and Data Structure should score close to 100%.
  • Column Shapes, Column Pair Trends, and the privacy metrics don't have a fixed target; however, higher is better for quality scores, while a Nearest-Neighbor Privacy Score near 0.5 indicates a good balance between utility and privacy.
In [18]:
summary = {
    'Column Shapes':                    quality_report.get_properties().set_index('Property').loc['Column Shapes', 'Score'],
    'Column Pair Trends (numeric)':     correlation_pairs['Score'].mean(),
    'Column Pair Trends (categorical)': contingency_pairs['Score'].mean(),
    'Data Validity':                    diagnostic_report.get_properties().set_index('Property').loc['Data Validity', 'Score'],
    'Data Structure':                   diagnostic_report.get_properties().set_index('Property').loc['Data Structure', 'Score'],
    'New Row Score':                    new_row_score,
    'Nearest-Neighbor Privacy':         nn_privacy_score['score'],
}

summary_df = pd.DataFrame.from_dict(summary, orient='index', columns=['Score'])
style_table(summary_df, caption='Synthetic Data Quality Summary', precision=2)
Out[18]:
Synthetic Data Quality Summary
  Score
Column Shapes 0.86
Column Pair Trends (numeric) 0.80
Column Pair Trends (categorical) 0.64
Data Validity 1.00
Data Structure 1.00
New Row Score 1.00
Nearest-Neighbor Privacy 0.42

Save Outputs¶


For reproducibility, we save the following items:

  • The synthetic dataset
  • Fitted synthesizer
  • Metadata

The synthesizer can be reloaded to generate additional synthetic rows without retraining.

In [19]:
os.makedirs('outputs', exist_ok=True)

# Save synthetic data to CSV
synthetic_data.to_csv('outputs/synthetic_data.csv', index=False)

# Save the fitted synthesizer for reproducibility
synthesizer.save('outputs/copulagan_synthesizer.pkl')

# Save metadata to JSON
metadata.save_to_json('outputs/metadata.json', mode='overwrite')

Conditional Generation¶


Beyond generating a full synthetic dataset, SDV supports conditional generation: we sample rows that match specified demographic or behavioral criteria while drawing the remaining columns from the learned distribution. This can be useful for generating targeted test datasets, augmenting underrepresented segments, or simulating specific customer profiles.

Conditional generation respects statistical patterns learned from the data but does not enforce logical constraints between columns. In practice the model learns these relationships reasonably well: for example, conditioning on job='retired' yields a mean age of about 46, with older clients well represented. Implausible combinations can still appear, however, such as retired clients in their twenties. SDV's constraint framework can enforce hard rules for such cases, but it is outside the scope of this notebook.
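Where only a handful of hard rules matter, a lightweight alternative to the constraint framework is post-hoc filtering: oversample, then drop rows that violate the rule. A sketch with a hypothetical plausibility rule (the 50-year threshold is an illustrative assumption, not something in the data):

```python
import pandas as pd

MIN_RETIREMENT_AGE = 50  # hypothetical rule, for illustration only

# Stand-in for conditionally sampled rows
rows = pd.DataFrame({
    'age': [28, 61, 45, 70],
    'job': ['retired', 'retired', 'admin.', 'retired'],
})

implausible = (rows['job'] == 'retired') & (rows['age'] < MIN_RETIREMENT_AGE)
plausible = rows[~implausible].reset_index(drop=True)
print(len(plausible))  # 3: the 28-year-old 'retired' row is dropped
```

In practice you would request more rows than needed from sample_from_conditions and filter down, since the rejection rate depends on how often the model produces violations.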

In [20]:
# Basic example: generate 200 rows for young clients (age 25)
young_condition = Condition(
    num_rows=200,
    column_values={'age': 25}
)
young_synthetic = synthesizer.sample_from_conditions(conditions=[young_condition])
young_synthetic[cat_cols] = young_synthetic[cat_cols].astype('category') # Restore category dtype to match real data (CopulaGAN outputs object dtype)
print(young_synthetic.shape)
young_synthetic.head()
Sampling conditions: 100%|███████████████████████████████████████████████████████████| 200/200 [00:04<00:00, 48.81it/s]
(200, 21)

Out[20]:
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 25 blue-collar married university.degree unknown yes unknown telephone may thu ... 2 999 0 failure 1.0 93.997 -36.5 4.928 5169.5 False
1 25 unemployed single university.degree no no yes cellular mar wed ... 2 999 0 nonexistent -1.1 93.548 -50.8 1.560 5091.6 True
2 25 blue-collar married unknown unknown no yes cellular jul tue ... 2 999 0 nonexistent 1.4 94.004 -42.9 4.962 5228.1 False
3 25 admin. single basic.9y unknown unknown no cellular sep fri ... 6 999 0 nonexistent -1.7 93.955 -42.0 4.846 5228.1 False
4 25 blue-collar divorced high.school no yes no cellular may fri ... 2 999 0 nonexistent 1.4 94.455 -42.0 4.972 5228.1 False

5 rows × 21 columns

In [21]:
# Complex example: generate 200 rows for retired clients, contacted by cellular in May, with no previous campaign contact
complex_condition = Condition(
    num_rows=200,
    column_values={
        'job':      'retired',
        'contact':  'cellular',
        'month':    'may',
        'poutcome': 'nonexistent',
        'previous': 0
    }
)
complex_synthetic = synthesizer.sample_from_conditions(conditions=[complex_condition])
complex_synthetic[cat_cols] = complex_synthetic[cat_cols].astype('category') # Restore category dtype to match real data (CopulaGAN outputs object dtype)
print(complex_synthetic.shape)
complex_synthetic.head()
Sampling conditions: 100%|███████████████████████████████████████████████████████████| 200/200 [00:07<00:00, 28.25it/s]
(200, 21)

Out[21]:
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 33 retired married basic.9y no no no cellular may tue ... 1 999 0 nonexistent 0.6 94.559 -35.9 4.453 5228.1 False
1 50 retired married basic.9y unknown yes no cellular may fri ... 8 999 0 nonexistent 0.1 94.122 -41.8 4.915 5194.5 False
2 61 retired married professional.course no yes yes cellular may tue ... 1 999 0 nonexistent 1.4 93.980 -47.2 1.141 4989.6 True
3 44 retired married university.degree no no yes cellular may wed ... 2 999 0 nonexistent 1.4 93.095 -42.7 4.966 5228.1 False
4 45 retired married unknown unknown yes no cellular may fri ... 1 999 0 nonexistent 1.4 94.059 -42.9 4.890 5191.3 False

5 rows × 21 columns

In [22]:
summarize(complex_synthetic)
Out[22]:
dtype count unique unique_pct missing missing_pct zero zero_pct top freq mean std min 50% max skew
age int64 200 47 23.5 0 0.0 0 0.0 46.22 12.19 24.0 45.0 88.0 0.92
campaign int64 200 15 7.5 0 0.0 0 0.0 2.91 3.96 1.0 2.0 30.0 4.49
nr.employed float64 200 139 69.5 0 0.0 0 0.0 5138.65 87.1 4963.6 5177.45 5228.1 -0.79
euribor3m float64 200 168 84.0 0 0.0 0 0.0 3.46 1.67 0.64 4.32 4.99 -0.53
cons.conf.idx float64 200 80 40.0 0 0.0 0 0.0 -40.49 4.91 -50.8 -41.75 -26.9 0.74
cons.price.idx float64 200 182 91.0 0 0.0 0 0.0 93.51 0.59 92.2 93.46 94.77 -0.08
emp.var.rate float64 200 46 23.0 0 0.0 3 1.5 -0.45 1.73 -3.4 -0.4 1.4 -0.37
previous int64 200 1 0.5 0 0.0 200 100.0 0.0 0.0 0.0 0.0 0.0 0.0
pdays int64 200 6 3.0 0 0.0 1 0.5 974.51 153.37 0.0 999.0 999.0 -6.14
duration int64 200 183 91.5 0 0.0 0 0.0 492.77 531.43 5.0 327.0 3643.0 2.69
job category 200 1 0.5 0 0.0 0 0.0 retired 200
day_of_week category 200 5 2.5 0 0.0 0 0.0 wed 65
month category 200 1 0.5 0 0.0 0 0.0 may 200
contact category 200 1 0.5 0 0.0 0 0.0 cellular 200
poutcome category 200 1 0.5 0 0.0 0 0.0 nonexistent 200
loan category 200 3 1.5 0 0.0 0 0.0 no 123
housing category 200 3 1.5 0 0.0 0 0.0 yes 133
default category 200 3 1.5 0 0.0 0 0.0 no 148
education category 200 8 4.0 0 0.0 0 0.0 high.school 51
marital category 200 4 2.0 0 0.0 0 0.0 married 91
y category 200 2 1.0 0 0.0 0 0.0 False 164