How to Erase an AI’s Conscience in 45 Minutes

Removing refusals from open-weight LLMs used to require understanding transformer internals. Now it’s a pip install away. A new tool called Heretic locates the exact directions in a model’s weight space that encode refusal behavior, then surgically removes them, no retraining needed.


Safety alignment in large language models is implemented as a learned behavior baked into weights during RLHF and DPO post-training. That means it can be removed the same way any other learned behavior can be disrupted: by identifying which directions in activation space encode it, then surgically suppressing those directions in the weight matrices themselves. This is called abliteration, and a new tool called Heretic automates the entire process from end to end.

This isn’t vaporware. The numbers are concrete, the tool is public and pip-installable, and the author publishes decensored models on Hugging Face so the output can be inspected directly. Let’s look at what it actually does.


The Math in Plain Terms

Abliteration is grounded in Arditi et al. (2024). The intuition: when a model generates a refusal, specific directions in residual stream activation space are strongly activated. Find those directions, then modify the weight matrices so they can no longer express them. The model retains its general capabilities but loses the ability to refuse.

Step 1: Find the refusal direction per layer. Run two sets of prompts through the model: “harmful” ones (things the model would refuse) and “harmless” ones (normal requests). At each transformer layer, record the residual stream activation at the first token position. The refusal direction for that layer is the difference of means between the two sets. Simple, fast, no training required.
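
In code, step 1 is just a mean and a subtraction. A minimal sketch in PyTorch, assuming the residual-stream activations have already been captured into tensors of shape (num_prompts, num_layers, d_model); the capture itself (forward hooks on each layer) is omitted, and the function name is hypothetical, not Heretic’s actual internals:

import torch

def refusal_directions(harmful_acts, harmless_acts):
    # harmful_acts, harmless_acts: (num_prompts, num_layers, d_model) residual
    # stream activations recorded at the chosen token position.
    diff = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)  # (num_layers, d_model)
    # Normalize each layer's direction to a unit vector.
    return diff / diff.norm(dim=-1, keepdim=True)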

Step 2: Orthogonalize weight matrices against those directions. For each layer, take the attention output projection (o_proj) and the MLP down-projection (down_proj), and subtract their components along the refusal direction. The math is standard: for a matrix $W$ and unit direction $\hat{r}$, the modified matrix is $W' = W - \hat{r}\hat{r}^T W$. Every output of $W'$ then has zero component along $\hat{r}$, by construction.

This is a closed-form operation — no gradient descent, no forward passes for training. Just linear algebra on the weight tensors.
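
A sketch of that projection, again in PyTorch. Weight matrices here follow the (out_features, in_features) layout used by torch.nn.Linear, and the optional scaling factor anticipates the per-layer weights discussed below; this is an illustration of the math, not Heretic’s actual code:

import torch

def orthogonalize(W, r_hat, weight=1.0):
    # W: (d_model, d_in) output-side weight matrix (e.g. o_proj or down_proj).
    # r_hat: (d_model,) unit refusal direction for this layer.
    # With weight=1.0 this is W' = W - r r^T W; smaller weights ablate partially.
    return W - weight * torch.outer(r_hat, r_hat) @ W

Applied to o_proj and down_proj in each layer, that one line is the entire weight modification; everything else is about choosing how strongly to apply it, and where.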


What Heretic Adds

Prior abliteration implementations (several listed in the repo) are essentially the above with fixed hyperparameters: ablate all layers equally, use the top refusal direction, done. Heretic’s contribution is treating those hyperparameters as a search problem.

Flexible per-layer ablation weights. Instead of a constant weight across all layers, Heretic applies a parametrized kernel: a curve described by max_weight, max_weight_position, min_weight, and min_weight_distance. This means the optimizer can decide, say, to ablate middle layers heavily and leave early/late layers nearly untouched — which often reflects where refusal actually lives in a given model.
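
The exact functional form of the kernel isn’t spelled out here, but a piecewise-linear version conveys the idea. Treat the shape below as an assumption; only the four parameter names come from Heretic:

def ablation_weight(layer_frac, max_weight, max_weight_position, min_weight, min_weight_distance):
    # layer_frac: the layer's relative position in [0, 1].
    # The weight peaks at max_weight when layer_frac == max_weight_position
    # and falls off linearly to min_weight over a distance of min_weight_distance.
    distance = abs(layer_frac - max_weight_position)
    if distance >= min_weight_distance:
        return min_weight
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)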

Float-valued direction index with interpolation. Rather than picking an integer-indexed refusal direction, the index is a continuous float. Fractional values linearly interpolate between the two nearest directions. This significantly expands the search space: instead of choosing among $N$ discrete directions (one per layer), the optimizer explores a continuous manifold of interpolated directions.
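
Interpolation itself is a one-liner. A sketch, assuming the per-layer directions are stacked into a single tensor (whether Heretic renormalizes after blending is an assumption):

import torch

def pick_direction(directions, index):
    # directions: (num_layers, d_model) unit refusal directions, one per layer.
    # index: float in [0, num_layers - 1]; fractional values blend neighbours.
    lo = int(index)
    hi = min(lo + 1, directions.shape[0] - 1)
    frac = index - lo
    blended = (1.0 - frac) * directions[lo] + frac * directions[hi]
    return blended / blended.norm()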

Separate parameters for attention vs. MLP. The two component types are ablated independently. In practice, MLP interventions tend to degrade model quality more than attention interventions, so separating them lets the optimizer be more conservative on MLP weights while being more aggressive on attention weights.

Bayesian optimization via Optuna. All of the above parameters are optimized jointly using Tree-structured Parzen Estimation (TPE), a sample-efficient Bayesian optimization algorithm. The objective co-minimizes two things: refusal rate on harmful prompts and KL divergence from the original model on harmless prompts. These are genuinely competing objectives, and the optimizer finds Pareto-efficient tradeoffs without human intervention.
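
The Optuna scaffolding for such a search is compact. The sketch below is illustrative: the parameter ranges and the dummy evaluate() are invented stand-ins for Heretic’s internals, and only the Optuna calls reflect the real library API:

import optuna

NUM_LAYERS = 48  # illustrative

def evaluate(params):
    # Stand-in for the real evaluation: ablate the model with these parameters,
    # count refusals on harmful prompts, measure KL divergence on harmless ones.
    refusals = abs(params["max_weight"] - 1.0) * 100
    kl = params["max_weight"] ** 2
    return refusals, kl

def objective(trial):
    params = {
        "direction_index": trial.suggest_float("direction_index", 0.0, NUM_LAYERS - 1),
        "max_weight": trial.suggest_float("max_weight", 0.0, 1.5),
        "max_weight_position": trial.suggest_float("max_weight_position", 0.0, 1.0),
        "min_weight": trial.suggest_float("min_weight", 0.0, 1.0),
        "min_weight_distance": trial.suggest_float("min_weight_distance", 0.05, 1.0),
    }
    return evaluate(params)  # two values: refusal count, KL divergence

study = optuna.create_study(
    directions=["minimize", "minimize"],  # co-minimize both objectives
    sampler=optuna.samplers.TPESampler(),
)
study.optimize(objective, n_trials=100)
print(study.best_trials)  # the Pareto front of refusal/KL tradeoffs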


The Numbers

The headline benchmark is on google/gemma-3-12b-it, compared against two manually produced abliterations from experienced practitioners:

Model                                     Refusals (out of 100)   KL Divergence
Original                                  97                      0 (baseline)
mlabonne/gemma-3-12b-it-abliterated-v2    3                       1.04
huihui-ai/gemma-3-12b-it-abliterated      3                       0.45
p-e-w/gemma-3-12b-it-heretic              3                       0.16

All three abliterations hit the same refusal suppression floor. Heretic’s KL divergence is 2.8× lower than the better of the two manual abliterations (huihui-ai) and 6.5× lower than the other (mlabonne). KL divergence here measures how much the output distribution on normal prompts has shifted from the original model, making it a proxy for capability degradation. Lower is better.
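
For reference, the metric itself is simple to compute from the two models’ logits on the same harmless prompts. A sketch; the averaging scheme (all token positions, mean over the batch) is an assumption rather than Heretic’s exact protocol:

import torch.nn.functional as F

def mean_kl(original_logits, modified_logits):
    # Both tensors: (batch, seq_len, vocab_size) logits on harmless prompts.
    log_p = F.log_softmax(original_logits, dim=-1)    # original model
    log_q = F.log_softmax(modified_logits, dim=-1)    # abliterated model
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)  # KL(p || q) per token
    return kl.mean().item()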

These numbers were produced on PyTorch 2.8 with an RTX 5090, so exact values may differ on other hardware, but the ordering should hold.


Getting Started

Requirements: Python 3.10+, PyTorch 2.2+ (install the appropriate CUDA/ROCm variant for your hardware before running pip).

pip install heretic-llm
heretic Qwen/Qwen3-4B-Instruct-2507

That’s it. Heretic will download the model, benchmark your hardware to pick an optimal batch size, then run the optimization loop. At the end, it offers to save the model locally, push it to Hugging Face, or drop into an interactive chat session.

For Llama-3.1-8B on an RTX 3090 with default config, expect roughly 45 minutes. Larger models scale accordingly.

Configuration is via CLI flags (heretic --help) or a TOML file based on the provided config.default.toml. You can control things like the number of Optuna trials, the prompt sets used for computing refusal directions, component targeting (attention only, MLP only, or both), and output quantization.


What It Supports (and Doesn’t)

Heretic works on most dense transformer models and several MoE architectures. The author lists support for many multimodal models as well. What it explicitly does not yet handle:

  • SSMs and hybrid models (Mamba, Jamba, etc.) — the residual stream semantics differ
  • Models with inhomogeneous layers — the parametrization assumes layer-wise regularity
  • Certain novel attention mechanisms — edge cases in newer architectures

If your target model is a standard decoder-only transformer from Hugging Face (Llama, Qwen, Gemma, Mistral, etc.), you’re almost certainly fine.


Honest Assessment

What’s good: The automated hyperparameter search is a genuine improvement over hand-tuned abliterations. The KL divergence metric is a reasonable proxy for quality preservation, and the results back it up. The code is clean, the approach is principled, and the author publishes the models so you can verify claims.

What to be aware of: The benchmark is one model family, and abliteration quality is notoriously model-dependent. A technique that works beautifully on Gemma-12B may produce a noticeably degraded model on a different architecture. The 45-minute runtime on an RTX 3090 is also for an 8B model — expect proportionally longer for 30B+ models, and potentially hours.

The deeper tradeoff: Abliteration doesn’t selectively remove “bad” refusals while keeping “good” ones. It suppresses refusal behavior globally. The model becomes less likely to refuse anything, including prompts where refusal might actually be useful. Whether that’s acceptable depends entirely on the deployment context — a local research instance is very different from a production API.

Heretic is a well-engineered tool that does what it claims. If you’re working with open-weight models and want to understand or modify their alignment behavior, it’s the most capable automated option currently available.


Heretic is licensed under AGPL-3.0. Source: github.com/p-e-w/heretic
