Shared Feature Dynamics Plan

Goal

Build a CLI-first analysis stack that tracks stable learned features across checkpoints and uses those features as the main dynamical objects for studying circuit formation.

The stack should support:

This is the next layer beyond:

Scope

The initial scope is one stage at a time.

Stage order:

  1. layer_2_post_mlp
  2. final_norm
  3. other residual stages later
  4. MLP hidden states only after the residual-stage stack is stable

Do not start with:

Why This Layer Exists

The current local feature analysis is useful for checkpoint-pair inspection, but it does not provide stable feature identities across training.

Without stable feature IDs, we cannot cleanly answer:

The shared-feature stack should convert the current analysis from:

to:

Core Entities

1. Shared Feature Basis

One basis per stage.

This is the canonical object that defines feature IDs.

Fields:

2. Feature

A feature is one latent dimension in the shared basis.

Fields:

3. Feature Trajectory Row

One feature at one checkpoint.

Fields:

4. Feature Birth Event

Fields:

5. Feature Diff Row

Fields:

6. Feature Patch Result

Fields:

7. Feature Lineage Edge

Fields:

Examples:

Commands

shared-feature-fit

Purpose

Fit one shared feature basis on pooled activations from multiple checkpoints for one stage.
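
A minimal sketch of the pooling step, assuming cached activations are available as plain lists keyed by checkpoint step. Drawing the same number of rows from every checkpoint is an assumption made here so that no single checkpoint dominates the basis; `pool_activations`, `per_checkpoint`, and `seed` are hypothetical names, not existing repo API.

```python
import random
from typing import Sequence

Vector = Sequence[float]


def pool_activations(
    activations_by_step: dict[int, list[Vector]],
    per_checkpoint: int,
    seed: int = 0,
) -> list[tuple[int, Vector]]:
    """Draw the same number of activation rows from every checkpoint.

    Returns (step, row) pairs so the origin of each pooled row is kept.
    Raises instead of silently under-sampling a checkpoint.
    """
    if per_checkpoint < 1:
        raise ValueError("per_checkpoint must be at least 1")
    rng = random.Random(seed)  # fixed seed keeps the pool reproducible
    pooled: list[tuple[int, Vector]] = []
    for step in sorted(activations_by_step):
        rows = activations_by_step[step]
        if len(rows) < per_checkpoint:
            raise ValueError(
                f"step {step} has {len(rows)} rows, need {per_checkpoint}"
            )
        pooled.extend((step, row) for row in rng.sample(rows, per_checkpoint))
    return pooled
```

The pooled rows would then be passed to whatever basis-fitting backend is chosen; the fit itself is out of scope for this sketch.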

Inputs

Behavior

Outputs

Validation

Fail if:

There must be no hidden fallback to a different stage or a different checkpoint set.
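
The no-fallback rule could be enforced with a check along these lines. This is a sketch under assumptions: `validate_fit_request`, `SharedFeatureFitError`, and the `available` mapping (stage name to cached checkpoint steps) are hypothetical names, and the specific failure conditions shown are illustrative rather than the plan's full list.

```python
from typing import Iterable, Mapping, Sequence


class SharedFeatureFitError(ValueError):
    """Raised when a fit request cannot be satisfied exactly as asked."""


def validate_fit_request(
    stage_name: str,
    checkpoint_steps: Sequence[int],
    available: Mapping[str, Iterable[int]],
) -> list[int]:
    """Return the requested steps unchanged, or raise. Never substitute."""
    if not checkpoint_steps:
        raise SharedFeatureFitError("no checkpoints requested")
    if len(set(checkpoint_steps)) != len(checkpoint_steps):
        raise SharedFeatureFitError("duplicate checkpoint steps requested")
    if stage_name not in available:
        # Do not fall back to a neighbouring stage.
        raise SharedFeatureFitError(
            f"stage {stage_name!r} has no cached activations"
        )
    have = set(available[stage_name])
    missing = [s for s in checkpoint_steps if s not in have]
    if missing:
        # Do not quietly fit on the subset that happens to exist.
        raise SharedFeatureFitError(
            f"stage {stage_name!r} is missing checkpoints {missing}"
        )
    return list(checkpoint_steps)
```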

feature-trajectory-sweep

Purpose

Encode every checkpoint in one shared basis and write stable feature trajectories.
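
A rough sketch of the sweep, assuming an `encode` callable that maps one activation vector into the shared basis. The two metrics shown (`mean_activation`, `active_fraction`) and the row keys are placeholders chosen for illustration; they are not the required metric list.

```python
import json
from typing import Callable, Iterable, Iterator, Sequence

Vector = Sequence[float]


def trajectory_rows(
    stage_name: str,
    activations_by_step: dict[int, list[Vector]],
    encode: Callable[[Vector], Sequence[float]],
    active_threshold: float = 0.0,
) -> Iterator[dict]:
    """Yield one row per (checkpoint, feature) pair, all in the same basis."""
    for step in sorted(activations_by_step):
        codes = [encode(x) for x in activations_by_step[step]]
        if not codes:
            raise ValueError(f"no activations for step {step}")
        num_features = len(codes[0])
        if any(len(c) != num_features for c in codes):
            raise ValueError(f"inconsistent code width at step {step}")
        for feature_id in range(num_features):
            values = [c[feature_id] for c in codes]
            active = sum(v > active_threshold for v in values)
            yield {
                "stage": stage_name,
                "step": step,
                "feature_id": feature_id,  # stable across checkpoints
                "mean_activation": sum(values) / len(values),
                "active_fraction": active / len(values),
            }


def to_jsonl(rows: Iterable[dict]) -> str:
    """Serialize rows as JSON Lines, one row per line."""
    return "".join(json.dumps(row, sort_keys=True) + "\n" for row in rows)
```

Because `feature_id` is just the index into the shared basis, the same ID refers to the same latent dimension at every step, which is what makes the rows comparable across training.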

Inputs

Behavior

Outputs

Required Metrics

For each checkpoint-feature pair:

feature-birth-analyze

Purpose

Detect birth, stabilization, and drift of features over training.

Inputs

Metrics Supported Initially

Behavior

Birth should be defined formally, not visually.
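
As an illustration of what a formal definition can look like, the sketch below treats a feature as born at the first checkpoint where a chosen metric stays at or above a threshold for several consecutive checkpoints. This is one possible formalization, offered as an assumption; it is not necessarily the recommended rule, and `birth_step`, `threshold`, and `min_consecutive` are hypothetical names.

```python
from typing import Optional, Sequence


def birth_step(
    steps: Sequence[int],
    values: Sequence[float],
    threshold: float,
    min_consecutive: int = 2,
) -> Optional[int]:
    """Return the first step of the first sustained run above threshold.

    `steps` must be in training order and aligned with `values`.
    Returns None if the metric never stays above threshold long enough.
    """
    if len(steps) != len(values):
        raise ValueError("steps and values must have the same length")
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    run_start: Optional[int] = None
    run_length = 0
    for step, value in zip(steps, values):
        if value >= threshold:
            if run_length == 0:
                run_start = step
            run_length += 1
            if run_length >= min_consecutive:
                return run_start
        else:
            run_length = 0  # a dip resets the run
    return None
```

Requiring a sustained run means a single noisy spike does not register as a birth, which is the practical difference from eyeballing a plot.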

Recommended rule:

Outputs

feature-compare

Purpose

Compare source and target checkpoints in the same basis.
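
A small sketch of a same-basis comparison, assuming each checkpoint has already been reduced to one scalar per feature (for example a mean activation). Ranking by absolute change is an illustrative choice only, and `rank_feature_changes` is a hypothetical name.

```python
def rank_feature_changes(
    source: dict[int, float],
    target: dict[int, float],
    top_k: int = 10,
) -> list[dict]:
    """Rank features by absolute change between two checkpoints."""
    if source.keys() != target.keys():
        # Different key sets mean the two sides are not in the same basis.
        raise ValueError("source and target must share the same feature IDs")
    rows = [
        {
            "feature_id": fid,
            "source": source[fid],
            "target": target[fid],
            "delta": target[fid] - source[fid],
        }
        for fid in source
    ]
    # Largest absolute change first; feature_id breaks ties deterministically.
    rows.sort(key=lambda r: (-abs(r["delta"]), r["feature_id"]))
    return rows[:top_k]
```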

Inputs

Behavior

Rank features by:

Outputs

feature-patch

Purpose

Run causal interventions in feature space.

Inputs

Behavior

For selected features:

Outputs

Required Reporting

feature-lineage

Purpose

Map important features to concrete components.

Inputs

Behavior

Initial lineage methods:

Outputs

Graph and Plot Outputs

The CLI should also export graph- and plot-ready files from the start.

Plot Files

These should be generated from compact plot-data JSON files so we can later reuse them in a UI.

Graph JSON

Graph outputs should contain:

Node types:

Edge types:

File Layout

Recommended output layout under one run:

artifacts/runs/<run_name>/analysis/shared_features/<stage_name>/
  shared_feature_basis.pt
  shared_feature_basis.json
  shared_feature_basis_features.json
  feature_trajectories.jsonl
  feature_checkpoint_summary.json
  feature_split_profiles.json
  feature_births.json
  feature_birth_summary.json
  feature_compare_<source>_vs_<target>.json
  feature_patch_<source>_vs_<target>.json
  feature_lineage_<step>.json
  graphs/
  plots/
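
The layout above can be captured in a couple of path helpers so that every command writes to the same place. This is a sketch; the function names are hypothetical, and `source` and `target` are taken as opaque strings because their exact format is not fixed here.

```python
from pathlib import PurePosixPath


def shared_feature_dir(run_name: str, stage_name: str) -> PurePosixPath:
    """Directory that holds all shared-feature outputs for one stage."""
    return (
        PurePosixPath("artifacts/runs")
        / run_name
        / "analysis"
        / "shared_features"
        / stage_name
    )


def compare_path(
    run_name: str, stage_name: str, source: str, target: str
) -> PurePosixPath:
    """Path of the feature-compare output for one checkpoint pair."""
    name = f"feature_compare_{source}_vs_{target}.json"
    return shared_feature_dir(run_name, stage_name) / name


def trajectories_path(run_name: str, stage_name: str) -> PurePosixPath:
    """Path of the per-stage trajectory file."""
    return shared_feature_dir(run_name, stage_name) / "feature_trajectories.jsonl"
```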

Implementation Strategy

Build the pieces in dependency order, but deliver them as one milestone.

1. Shared Basis Backend

Implement:

2. Trajectory Sweep

Implement:

3. Birth Analysis

Implement:

4. Diff

Implement:

5. Patch

Implement:

6. Lineage

Implement:

7. Plot / Graph Export

Implement:

Integration With Existing Stack

Reuse existing infrastructure wherever possible.

Should reuse:

Should not duplicate:

Initial Stage Targets

Use this exact order:

  1. layer_2_post_mlp
  2. final_norm

Reason:

Technical Risks

1. Basis Too Dense

If the active fraction stays very high, the feature basis is not sparse enough to support clean claims about individual features.

The fit step therefore needs explicit fit-quality reporting.
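
A sketch of what a fit-quality report might compute, assuming "active fraction" means the share of (sample, feature) entries above a threshold. That definition, the `max_active_fraction` cutoff, and the report keys are assumptions made for illustration.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def fit_quality(
    inputs: Matrix,
    codes: Matrix,
    reconstructions: Matrix,
    max_active_fraction: float,
    active_threshold: float = 0.0,
) -> dict:
    """Summarize sparsity and reconstruction quality of a fitted basis."""
    if not (len(inputs) == len(codes) == len(reconstructions)) or not inputs:
        raise ValueError("inputs, codes, reconstructions must align and be non-empty")
    total_entries = sum(len(c) for c in codes)
    active_entries = sum(v > active_threshold for c in codes for v in c)
    squared_error = sum(
        (x - y) ** 2
        for row, recon in zip(inputs, reconstructions)
        for x, y in zip(row, recon)
    )
    num_values = sum(len(row) for row in inputs)
    active_fraction = active_entries / total_entries
    return {
        "active_fraction": active_fraction,
        "reconstruction_mse": squared_error / num_values,
        # Flag, do not hide: downstream commands can refuse a dense basis.
        "too_dense": active_fraction > max_active_fraction,
    }
```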

2. Feature IDs Still Unstable

If checkpoint pooling is too narrow or activations are poorly normalized, a feature ID may still fail to track one stable feature family across training.

3. Approximate Patching

Feature-space patching requires decoding back to residual space.

The decode is approximate, so every patch result must report reconstruction error.
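
The sketch below shows the encode, swap, decode round trip and where the reconstruction error comes from. It assumes a linear encoder with a ReLU and a linear decoder, which is a stand-in chosen for illustration and not a statement about the actual basis; all function names are hypothetical.

```python
from typing import Iterable, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def matvec(matrix: Matrix, vector: Vector) -> list[float]:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode(x: Vector, encoder: Matrix) -> list[float]:
    """Residual -> feature space (ReLU encoder is an assumption)."""
    return [max(0.0, v) for v in matvec(encoder, x)]


def patch_features(
    target_x: Vector,
    source_x: Vector,
    encoder: Matrix,  # shape: num_features x d_model
    decoder: Matrix,  # shape: d_model x num_features
    feature_ids: Iterable[int],
) -> dict:
    """Copy selected feature activations from source into target."""
    target_code = encode(target_x, encoder)
    source_code = encode(source_x, encoder)
    patched_code = list(target_code)
    for fid in feature_ids:
        patched_code[fid] = source_code[fid]
    baseline = matvec(decoder, target_code)  # unpatched round trip
    mse = sum((a - b) ** 2 for a, b in zip(target_x, baseline)) / len(target_x)
    return {
        "patched_residual": matvec(decoder, patched_code),
        # Error of the round trip alone, before any feature is swapped.
        "reconstruction_mse": mse,
    }
```

A large `reconstruction_mse` means the decoded residual already differs from the real one before patching, so any downstream effect cannot be attributed to the swapped features alone.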

4. Lineage Noise

Feature lineage will be noisy if it is computed before filtering down to causally meaningful features.

Lineage should therefore run only on features that pass a threshold and rank highly.
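
One way to apply that filter is to select lineage candidates from patch results, as sketched here. The assumption is that each feature has a single scalar causal effect from patching; `select_lineage_candidates`, `min_abs_effect`, and `top_k` are hypothetical names.

```python
def select_lineage_candidates(
    patch_effects: dict[int, float],
    min_abs_effect: float,
    top_k: int,
) -> list[int]:
    """Keep features whose patch effect clears a threshold, strongest first."""
    if top_k < 0:
        raise ValueError("top_k must be non-negative")
    kept = [
        (fid, effect)
        for fid, effect in patch_effects.items()
        if abs(effect) >= min_abs_effect
    ]
    # Strongest effect first; feature_id breaks ties deterministically.
    kept.sort(key=lambda pair: (-abs(pair[1]), pair[0]))
    return [fid for fid, _ in kept[:top_k]]
```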

Success Criteria

This milestone is successful if we can say, for one stage:

That would be the first genuinely feature-dynamical layer for the repo and a much stronger basis for answering the SGD-selection question.