Contrastive targeted SFT as a mechinterp method - has anyone mapped causal dependency interactions this way? [D]
Hi All, I've been running experiments on targeted SFT for specific capability dimensions on a 31B model. After running small training run to prime the model slightly in the direction I want, then ran a judge across 40 domains scoring six independent quality dimensions. One dimension consistently scored weakest across five runs. I am now training contrastive variants from the same checkpoint - examples with that dimension deep vs examples with it deliberately shallow, same everything else. The plan is to see if I can find the difference between the the two checkpoints to locate the circuit, then ablate those heads and measure which OTHER dimensions degrade. The idea is that if ablating dimension A's circuit causes dimension B's judge score to drop, there's a causal dependency in the network, B reads from A's residual stream output. And If I can do this for each dimension and build a causal dependency graph of how capabilities relate inside the model. Then use that graph to determine optimal training order for future rounds (train upstream nodes first, and would help me know which downstream nodes get better signal). A few specific questions: Has anyone done iterative targeted SFT guided by circuit tracing between rounds, and or by trying somewhat contrastive approaches to try to find any areas in the network? I can find papers on circuit discovery and papers on targeted SFT separately which somewhat validate this idea, but not the closed loop where mechinterp findings from a round determine training strategy for the next, and or what circuits may interact with each other in isolated scenarios, and how specific orders of training in specific directions may change how things behave. For the contrastive ablation - does anyone have any tips on what can work best in this area or could bring out more analysis? When tracing downstream dependencies via ablation, how do you distinguish direct from indirect effects? If ablating circuit A degrades dimension C, that could be A > C directly or A > B > C through an intermediate. Does anyone have a practical method for resolving this beyond ablating at multiple layers? After elemental training rounds, I plan to test whether dimensions compose naturally by running prompts that require causal chaining between two dimensions. For pairs that fail, I'm considering activation steering (injecting both dimension vectors simultaneously) as a diagnostic, if steering fixes it, possibly it's a routing problem, if not, could be a capability gap. Has anyone combined steering with fine tuning diagnostics like this? For context I don't have a ML background, I am self taught through running experiments, but from what I am learning purely from first principle understanding and experiments, it feels that if you can map these circuits and their direct second, third and so on order interactions in isolated directions (for say a group of related strengths/weaknesses you're directly trying to isolate and steer, wouldn't this be a potentially way to isolate circuits for stronger training runs? Btw if anyone has any general topics or links that are super interesting around anything related to this I'd be fascinated to see and learn about! If there's established methodology for any of this that I'm reinventing badly, I'd genuinely appreciate being pointed to it. I am so fascinated with this, it seems that if you can somehow eventually solve this problem, you could create better possible behaviour control or targeted understanding easier? submitted by /u/Substantial_Diver469 [link] [comments]
Comments
No comments yet. Start the discussion.