Data Generation¶
Generative Model¶
The synthetic data follows a sparse linear model:

$$\mathbf{y} = A\mathbf{z},$$

where \(A \in \mathbb{R}^{m \times n}\) is a mixing matrix with unit-norm columns and \(\mathbf{z} \in \mathbb{R}^n\) is \(k\)-sparse (exactly \(k\) non-zero entries).
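The generative process can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the support is chosen uniformly at random and the non-zero values are drawn from a standard normal, both of which are assumptions here.

```python
import numpy as np

def sample_sparse_observation(A, k, rng):
    """Sample a k-sparse latent z and its observation y = A @ z."""
    m, n = A.shape
    z = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)  # exactly k active latents
    z[support] = rng.standard_normal(k)             # assumed value distribution
    y = A @ z                                       # noiseless linear mixing
    return z, y

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 10))
A /= np.linalg.norm(A, axis=0)  # unit-norm columns, as in the model above
z, y = sample_sparse_observation(A, k=3, rng=rng)
```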
ID / OOD Split¶
Compositional generalisation is tested by controlling which combinations of active latents appear at train vs. test time. The split is defined by the first latent (\(z_0\)):
| Setting | Active latents | Used for |
|---|---|---|
| A | \(z_0\) active + \((k-1)\) from ID pool | ID training |
| B | \(k\) from ID pool (no \(z_0\)) | ID training |
| C | \(z_0\) active + \((k-1)\) from OOD pool | OOD test |
Key idea
The ID and OOD pools are disjoint subsets of \(\{z_1, \ldots, z_{n-1}\}\), so OOD samples contain \(z_0\) paired with latents never seen active alongside it during training.
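The three settings can be sketched as support-sampling rules. Everything below is illustrative: the pool sizes, the helper name `sample_support`, and the uniform sampling scheme are assumptions, not the library's API.

```python
import numpy as np

def sample_support(setting, k, rng, id_pool, ood_pool):
    """Pick the active latent indices for one sample.

    A: z_0 + (k-1) from the ID pool   (ID training)
    B: k from the ID pool, no z_0     (ID training)
    C: z_0 + (k-1) from the OOD pool  (OOD test)
    """
    if setting == "A":
        return [0] + list(rng.choice(id_pool, size=k - 1, replace=False))
    if setting == "B":
        return list(rng.choice(id_pool, size=k, replace=False))
    if setting == "C":
        return [0] + list(rng.choice(ood_pool, size=k - 1, replace=False))
    raise ValueError(f"unknown setting: {setting}")

n, k = 10, 3
latents = np.arange(1, n)                     # {z_1, ..., z_{n-1}}
id_pool, ood_pool = latents[:5], latents[5:]  # disjoint subsets (illustrative split)
rng = np.random.default_rng(0)
support_c = sample_support("C", k, rng, id_pool, ood_pool)
```

Because the pools are disjoint, setting C pairs \(z_0\) only with latents it was never active alongside during training.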
Usage¶
```python
from src.data import generate_datasets

train, val, ood, A = generate_datasets(
    seed=0,
    num_latents=10,   # n: number of latent sources
    k=3,              # sparsity level
    n_samples=2000,   # ID training samples
    input_dim=None,   # m: observation dim (defaults to num_latents // 2)
)

Z_train, Y_train, labels_train = train  # Z: sparse codes, Y: observations
Z_ood, Y_ood, labels_ood = ood
# A: ground-truth mixing matrix (m x n)
```
Parameters¶
| Parameter | Description | Default |
|---|---|---|
| `num_latents` | \(n\) — number of latent sources | required |
| `k` | number of simultaneously active sources | required |
| `n_samples` | number of ID training samples | required |
| `input_dim` | \(m\) — observation dimension | `num_latents // 2` |
| `seed` | random seed for reproducibility | required |
Mixing Matrix¶
`generate_matrix(m, n)` draws entries i.i.d. from \(\mathcal{N}(0, 1)\) and normalises each column to unit \(\ell_2\) norm. Such Gaussian matrices satisfy the restricted isometry property (RIP) with high probability when \(m = \mathcal{O}(k \log(n/k))\).
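A sketch of this construction, assuming the behaviour described above (the optional `rng` argument is an addition for reproducibility, not part of the documented signature):

```python
import numpy as np

def generate_matrix(m, n, rng=None):
    """Gaussian mixing matrix with unit-l2-norm columns (sketch)."""
    rng = rng or np.random.default_rng()
    A = rng.standard_normal((m, n))       # i.i.d. N(0, 1) entries
    return A / np.linalg.norm(A, axis=0)  # normalise each column to unit l2 norm

A = generate_matrix(5, 10, np.random.default_rng(0))
```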