Gemma Creative Transfer - Atlas

Creative reasoning often improves when a model is externally staged: one role frames, one explores, one critiques. Trace Distillation treats that scaffold as temporary training infrastructure, not the product.

Premise

The project tests whether a multi-role creative pipeline can become supervised fine-tuning data for a Gemma-family open-weight model.

The scaffold has three roles:

Curiosity maps constraints, unknowns, tensions, adjacent domains, and edge cases before solutions appear.
Creativity generates structurally different branches inside that frame, not cosmetic variants.
Critic scores branches, names failure modes, and routes weak work back to framing or revision.

Each run produces a structured trace: task, frame, branches, critique, revision path, accepted answer, rejected paths, failure labels, and evaluator notes.

The hypothesis is narrow: if scaffolded runs reliably outperform direct answers, filtered traces may transfer part of that behavior into native inference.

Why It Matters

External scaffolds improve control but add latency, orchestration failure, and prompt dependence. They can also make the system look more capable than the base model actually is.

The useful question is not whether staged prompting works, but whether the behavior survives when the staging is removed.

Success would mean a fine-tuned model shows, under plain inference:

stronger problem framing without being told to frame
more distinct branches under ambiguous briefs
less decorative “creative” language
sharper critique of weak candidates
better repair after failure
improved final usefulness without visible role prompts

The proof is not better prompting. It is scaffold-like behavior under native inference.

How It Works

1. Scaffolded generation

A task enters Curiosity first. No solutions are allowed. The role identifies what matters, what is unknown, where the tension lives, and which domains may transfer.

Creativity then generates multiple branches from that frame. Critic scores them, names specific defects, and decides whether to revise locally or return upstream.
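The control flow described above can be sketched as a small loop. This is a minimal illustration, not the project's implementation: the `Verdict` fields, the route names ("accept", "revise", "reframe"), the `revise` callable, and the `max_rounds` cap are all assumptions introduced here, and each role is passed in as a plain function so that any model call can stand behind it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    scores: List[float]   # one score per branch, higher is better (assumed scale)
    defects: List[str]    # specific defects named by the Critic
    route: str            # hypothetical routes: "accept", "revise", "reframe"


def run_scaffold(
    task: str,
    curiosity: Callable[[str], str],
    creativity: Callable[[str, str], List[str]],
    critic: Callable[[str, str, List[str]], Verdict],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> dict:
    """Run Curiosity -> Creativity -> Critic, following the Critic's route."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    frame = curiosity(task)  # framing only, no solutions at this stage
    history = []
    best = ""
    for _ in range(max_rounds):
        branches = creativity(task, frame)
        verdict = critic(task, frame, branches)
        history.append({"frame": frame, "branches": branches, "verdict": verdict})
        best_index = max(range(len(branches)), key=lambda i: verdict.scores[i])
        best = branches[best_index]
        if verdict.route == "accept":
            return {"answer": best, "history": history}
        if verdict.route == "revise":
            # Local repair: fix the strongest branch using the named defects.
            return {"answer": revise(best, verdict.defects), "history": history}
        # "reframe": return upstream and rebuild the frame with the defects in view.
        frame = curiosity(task + "\nKnown defects: " + "; ".join(verdict.defects))
    return {"answer": best, "history": history}
```

The returned `history` keeps every frame, branch set, and verdict, which is the raw material a serialized trace would be built from.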

2. Trace serialization

Each run becomes a compact training record:

input task
curiosity frame
candidate branches
critic judgments
revision path
accepted final answer
rejected paths
failure labels
evaluator notes

The trace must show causal value: a frame changes the branches, critique changes the revision, and revision improves the answer. Otherwise it is training noise.
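One way to picture such a record is a small dataclass serialized to one JSON line per run. The field names mirror the list above, but the class name, the types, and the JSONL choice are assumptions for illustration. The `has_causal_value` check is only a crude structural proxy; whether a frame really changed the branches still needs a human or model judgment.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Trace:
    task: str
    frame: str
    branches: List[str]
    critique: List[str]
    revision_path: List[str]
    accepted_answer: str
    rejected_paths: List[str] = field(default_factory=list)
    failure_labels: List[str] = field(default_factory=list)
    evaluator_notes: str = ""

    def to_jsonl(self) -> str:
        """Serialize the trace as a single JSON line."""
        return json.dumps(asdict(self), ensure_ascii=False)

    def has_causal_value(self) -> bool:
        """Cheap structural proxy (an assumption, not a real quality check):
        branches differ, critique and revision exist, and the accepted
        answer is not just an unrevised branch."""
        return (
            len(set(self.branches)) > 1
            and bool(self.critique)
            and bool(self.revision_path)
            and self.accepted_answer not in self.branches
        )
```

A trace that fails this proxy is almost certainly noise; a trace that passes it still has to earn its place through review.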

3. Dataset filtering

Only traces with visible decision value enter the SFT set. Weak frames, vague critique, generic brainstorming, verbose process notes, and ornamental “creativity” are discarded.

The target is not role vocabulary. The target is functional transition: frame before solving, diverge with constraints, critique with specificity, repair with intent.

A major failure mode is format imitation. The model may learn to narrate a process instead of performing it. Training records should minimize unnecessary labels, and evaluation should penalize visible Curiosity–Creativity–Critic language unless requested.
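The penalty on visible role language could be approximated with a simple label detector. The sketch below is an assumption-heavy illustration: the regular expression only catches role names used as labels (followed by a colon), and the `penalty` weight and function names are hypothetical.

```python
import re

# Matches role names used as labels, e.g. "Critic:", not the ordinary word "critic".
ROLE_LABEL = re.compile(r"\b(curiosity|creativity|critic)\s*:", re.IGNORECASE)


def role_leak_count(output: str) -> int:
    """Count role labels that leaked into a model output."""
    return len(ROLE_LABEL.findall(output))


def penalized_score(raw_score: float, output: str,
                    roles_requested: bool = False, penalty: float = 0.5) -> float:
    """Subtract a fixed penalty per leaked label unless the task asked for roles."""
    if roles_requested:
        return raw_score
    return max(0.0, raw_score - penalty * role_leak_count(output))
```

A detector like this only catches the most literal form of format imitation. Narrated process without role names ("First, I will frame the problem...") would need a rater or a stronger classifier.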

4. Internalization testing

Compare six conditions:

base model, direct answer
base model, scaffolded
base model, self-critique prompt
base model, best-of-N sampling
fine-tuned model, direct answer
fine-tuned model, scaffolded

The key condition is the fine-tuned direct answer. If it approaches scaffolded performance without visible staging, the process has partially internalized. If the base model with a self-critique prompt or best-of-N sampling matches it, distillation is unnecessary, because a cheaper inference-time method achieves the same result.
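The six conditions and the decision rule can be written down explicitly. In this sketch the condition identifiers, the per-condition `score` dictionary, and the tolerance `tol` that defines "approaches" and "matches" are assumptions; the source does not fix a threshold.

```python
CONDITIONS = [
    ("base", "direct"),
    ("base", "scaffolded"),
    ("base", "self_critique"),
    ("base", "best_of_n"),
    ("finetuned", "direct"),
    ("finetuned", "scaffolded"),
]

KEY = ("finetuned", "direct")

# Every comparison that involves the key condition.
key_pairs = [(KEY, other) for other in CONDITIONS if other != KEY]


def interpret(score: dict, tol: float = 0.05) -> str:
    """Apply the decision rule to one aggregate score per condition."""
    finetuned_direct = score[KEY]
    cheap_baseline = max(score[("base", "self_critique")],
                         score[("base", "best_of_n")])
    if cheap_baseline >= finetuned_direct - tol:
        return "distillation unnecessary"
    if finetuned_direct >= score[("base", "scaffolded")] - tol:
        return "partially internalized"
    return "not internalized"
```

Checking the cheap baselines first reflects the order of the argument: internalization only counts as a result once prompting and sampling tricks have been ruled out.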

Evaluation uses blinded pairwise comparison with 2–3 human raters. Scores cover branch distinctness, hybrid irreducibility, critique specificity, repair quality, final usefulness, and verbosity control. Holdout tasks test whether the model learned a transferable habit or only overfit to Arvolve-style briefs.
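A blinded pairwise protocol mostly comes down to hiding which condition produced which output and then mapping votes back. The following sketch assumes raters vote "left", "right", or "tie", that a strict majority of the 2–3 raters decides each pair, and that ties are excluded from the win rate; none of these choices is specified in the source.

```python
import random
from collections import Counter
from typing import List, Tuple


def blind_pair(output_a: str, output_b: str, rng: random.Random) -> Tuple[str, str, bool]:
    """Randomize left/right placement so position does not reveal the condition."""
    flipped = rng.random() < 0.5
    left, right = (output_b, output_a) if flipped else (output_a, output_b)
    return left, right, flipped


def unblind(vote: str, flipped: bool) -> str:
    """Map a rater's 'left'/'right'/'tie' vote back to 'a'/'b'/'tie'."""
    if vote == "tie":
        return "tie"
    picked_left = vote == "left"
    return "b" if picked_left == flipped else "a"


def majority(votes: List[str]) -> str:
    """Return the strict-majority vote across raters, otherwise 'tie'."""
    top, count = Counter(votes).most_common(1)[0]
    return top if count > len(votes) / 2 else "tie"


def win_rate(results: List[str], side: str = "a") -> float:
    """Share of decided comparisons won by `side`; 0.5 if nothing was decided."""
    decided = [r for r in results if r != "tie"]
    return sum(r == side for r in decided) / len(decided) if decided else 0.5
```

The same functions can be applied per scoring dimension (branch distinctness, critique specificity, and so on) by collecting one vote per dimension instead of one overall vote.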

Next

Build a 50–100 task benchmark across product design, speculative systems, visual concepting, and strategic writing, with 20% held out from trace generation.
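The 20% holdout can be drawn per domain so that each of the four areas is represented on both sides of the split. Stratifying by domain is an assumption made here, as are the task dictionary shape (`id`, `domain`) and the fixed seed.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def split_holdout(tasks: List[Dict], holdout_frac: float = 0.2,
                  seed: int = 0) -> Tuple[List[Dict], List[Dict]]:
    """Split tasks into (trace_generation, holdout), stratified by domain."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for task in tasks:
        by_domain[task["domain"]].append(task)

    trace_generation, holdout = [], []
    for domain in sorted(by_domain):          # sorted for reproducibility
        items = by_domain[domain][:]
        rng.shuffle(items)
        k = round(len(items) * holdout_frac)  # holdout size for this domain
        holdout.extend(items[:k])
        trace_generation.extend(items[k:])
    return trace_generation, holdout
```

Holdout tasks must never pass through the scaffold during trace generation; otherwise the comparison no longer separates a transferable habit from memorized briefs.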

Generate scaffolded traces, manually keep the top 20–30%, fine-tune a compact open-weight model, then run blinded comparisons across the test conditions.
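If reviewers attach a numeric rating to each trace, the keep step reduces to a ranked cut. This is only a sketch: the source describes the selection as manual, so the rating scale, the `(trace, rating)` pair format, and the default fraction of 0.25 are assumptions.

```python
import math
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def keep_top(rated: List[Tuple[T, float]], frac: float = 0.25) -> List[T]:
    """Keep the highest-rated fraction of traces (rounded up)."""
    if not 0 < frac <= 1:
        raise ValueError("frac must be in (0, 1]")
    ranked = sorted(rated, key=lambda pair: pair[1], reverse=True)
    return [trace for trace, _ in ranked[:math.ceil(len(ranked) * frac)]]
```

A fixed fraction is a budget, not a quality bar. If fewer than 20% of traces show real decision value, the cut should shrink rather than admit weak records.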

The next proof is behavioral: native outputs must frame better, branch cleaner, critique sharper, and recover from weak ideas without being explicitly staged.

Generation Prompts

Image Prompt: Minimal technical research interface: three abstract cognition modules (Curiosity, Creativity, Critic) feeding structured trace cards into a central open-weight model core, with branching paths, rejected candidates, evaluator scores, and a four-condition benchmark grid. Dark graphite UI, thin white vector lines, amber signal accents, premium cinematic lighting, no humanoid robots.
