DeGAML-LLM: Decoupling Generalization and Adaptation in Meta-Learning for Large Language Models
Abstract
Fine-tuning large language models (LLMs) for downstream tasks remains expensive, even with parameter-efficient methods like Low-Rank Adaptation (LoRA). Meta-learning approaches such as Model-Agnostic Meta-Learning for LLMs (MAML-en-LLM) and Amortized Bayesian Meta-Learning for LoRA (ABMLL) have therefore emerged as promising routes to rapid downstream adaptation. However, these methods fundamentally couple two distinct objectives: learning generalizable initializations and enabling efficient task-specific adaptation. We argue that this coupling limits both the quality of the learned representations and the efficiency of adaptation.
In this paper, we introduce DeGAML-LLM (Decoupled Generalization and Adaptation Meta-Learning for Large Language Models), a novel framework that explicitly separates these two objectives through dedicated parameter spaces. Specifically, we maintain a generalization module that learns task-agnostic representations across the task distribution and an adaptation module that specializes in rapid task-specific adjustment. Extensive experiments on commonsense-reasoning, mathematics, logic, social-reasoning, medical, and coding benchmarks across model scales demonstrate that DeGAML-LLM outperforms existing meta-learning and standard multi-task baselines.
Method Overview
Internal Transformer architecture modifications. (a) MAML-en-LLM updates all weights via gradients from a meta-learned initialization. (b) ABMLL samples low-rank LoRA adapters from a learned Bayesian posterior distribution. (c) DeGAML-LLM uses a dedicated generator to predict initial adapter weights, which an RL policy then refines without backpropagating gradients into the generator.
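To make the decoupling in (c) concrete, below is a minimal PyTorch sketch of the two-stage flow. The `generator` and `policy` stand-ins, the 512-dimensional task embedding, and the rank-8 adapter shapes are all illustrative assumptions, not the paper's implementation; the point is that `detach()` stops gradients from the refinement stage from reaching the generator.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two modules; shapes are assumed
# (512-dim task embedding, rank-8 LoRA on a 768-dim weight matrix).
task_embedding = torch.randn(1, 512)       # encoded task prompt
n_adapter = 2 * 768 * 8                    # flattened LoRA A and B factors

generator = nn.Linear(512, n_adapter)      # generalization module (sketch)
policy = nn.Linear(n_adapter, n_adapter)   # adaptation module (sketch)

# Stage 1 (generalization): predict initial adapter weights from the prompt.
theta_init = generator(task_embedding)

# Stage 2 (adaptation): refine the weights. detach() blocks gradient
# backpropagation into the generator, keeping the two objectives decoupled.
theta_refined = theta_init.detach() + policy(theta_init.detach())
```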
Generalization Module
Learns to generate LoRA adapter parameters from task prompts using a hyperconvolutional decoder trained on checkpoint trajectories. Captures cross-task structural knowledge without encoding any specific adaptation trajectory.
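As a rough sketch of this module's input/output contract, the code below maps a task-prompt embedding to LoRA factors for a single target weight matrix. The class name, linear decoder, and dimensions are assumptions for illustration; the actual decoder described above is hyperconvolutional and trained on checkpoint trajectories.

```python
import torch
import torch.nn as nn

class LoRAGenerator(nn.Module):
    # Illustrative stand-in for the hyperconvolutional decoder: a linear map
    # from a task-prompt embedding to LoRA factors (A, B) for one layer.
    def __init__(self, embed_dim=512, d_model=768, rank=8):
        super().__init__()
        self.rank, self.d_model = rank, d_model
        self.to_A = nn.Linear(embed_dim, rank * d_model)
        self.to_B = nn.Linear(embed_dim, d_model * rank)

    def forward(self, task_emb):
        A = self.to_A(task_emb).view(self.rank, self.d_model)  # (r, d)
        B = self.to_B(task_emb).view(self.d_model, self.rank)  # (d, r)
        return A, B  # adapted weight: W + B @ A, as in standard LoRA
```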
Adaptation Module
Refines the generated parameters via an RL policy that selects among four adaptation families: Test-Time Training (TTT), Test-Time Scaling (TTS), LoRA Mixing, and Latent-Space Optimization.
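A hedged sketch of the selection step follows, assuming a simple linear policy head over a 512-dimensional state (both are assumptions; the actual state features and policy network are not specified here). It draws from a categorical distribution over the four families and returns the chosen family with its log-probability, as needed for a policy-gradient update.

```python
import torch
import torch.nn as nn

ADAPTATION_FAMILIES = ["TTT", "TTS", "LoRA Mixing", "Latent-Space Optimization"]

class AdaptationPolicy(nn.Module):
    # Categorical policy over the four adaptation families; the 512-dim
    # state and single linear head are illustrative assumptions.
    def __init__(self, state_dim=512):
        super().__init__()
        self.head = nn.Linear(state_dim, len(ADAPTATION_FAMILIES))

    def forward(self, state):
        dist = torch.distributions.Categorical(logits=self.head(state))
        action = dist.sample()  # pick one adaptation family
        return ADAPTATION_FAMILIES[action.item()], dist.log_prob(action)

policy = AdaptationPolicy()
family, log_prob = policy(torch.randn(512))
# In a REINFORCE-style update, -log_prob would be scaled by the reward
# observed after applying the chosen family (e.g., task accuracy).
```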
Experimental Results
In-Domain Tasks (Commonsense Reasoning)
Qwen2.5-1.5B-Instruct
| Method | ARC-c | ARC-e | HellaSwag | BoolQ | PIQA | WinoGrande | Avg |
|---|---|---|---|---|---|---|---|
| No Meta-Train LoRA | 74.5 | 84.4 | 55.8 | 55.6 | 65.6 | 48.2 | 64.0 |
| Union Train LoRA | 63.2 | 73.9 | 48.9 | 55.1 | 47.8 | 61.3 | 58.3 |
| ABMLL | 69.9 | 83.2 | 51.1 | 63.2 | 54.3 | 52.9 | 62.4 |
| MAML-en-LLM | 66.0 | 84.3 | 59.3 | 58.7 | 68.1 | 56.8 | 65.5 |
| DeGAML-LLM (Ours) | 73.7 | 88.4 | 57.2 | 58.8 | 70.7 | 57.3 | 67.7 |
| Δ (vs MAML-en-LLM) | +7.7 | +4.1 | -2.1 | +0.1 | +2.6 | +0.5 | +2.2 |
| Δ (vs ABMLL) | +3.8 | +5.2 | +6.1 | -4.4 | +16.4 | +4.4 | +5.3 |
| Δ (vs No Meta-Train) | -0.8 | +4.0 | +1.4 | +3.2 | +5.1 | +9.1 | +3.7 |
| Δ (vs Union Train) | +10.5 | +14.5 | +8.3 | +3.7 | +22.9 | -4.0 | +9.4 |
Qwen2.5-0.5B-Instruct
| Method | ARC-c | ARC-e | HellaSwag | BoolQ | PIQA | WinoGrande | Avg |
|---|---|---|---|---|---|---|---|
| No Meta-Train LoRA | 40.7 | 59.4 | 23.4 | 22.1 | 66.2 | 35.7 | 41.2 |
| Union Train LoRA | 39.7 | 47.4 | 26.3 | 14.7 | 51.1 | 50.5 | 38.3 |
| ABMLL | 37.6 | 54.4 | 26.5 | 62.2 | 37.6 | 34.5 | 42.1 |
| MAML-en-LLM | 47.7 | 63.7 | 36.3 | 46.2 | 67.7 | 50.1 | 51.9 |
| DeGAML-LLM (Ours) | 55.5 | 74.7 | 48.3 | 58.7 | 60.1 | 52.8 | 58.4 |
| Δ (vs MAML-en-LLM) | +7.8 | +11.0 | +12.0 | +12.5 | -7.6 | +2.7 | +6.5 |
| Δ (vs ABMLL) | +17.9 | +20.3 | +21.8 | -3.5 | +22.5 | +18.3 | +16.3 |
| Δ (vs No Meta-Train) | +14.8 | +15.3 | +24.9 | +36.6 | -6.1 | +17.1 | +17.2 |
| Δ (vs Union Train) | +15.8 | +27.3 | +22.0 | +44.0 | +9.0 | +2.3 | +20.1 |
Out-of-Domain Tasks
Qwen2.5-1.5B-Instruct
| Method | GSM-8K | MATH | DivLogicEval | SocialIQA | CodeMMLU | JAMA | Avg |
|---|---|---|---|---|---|---|---|
| Union Train LoRA | 34.2 | 32.2 | 24.1 | 51.4 | 34.7 | 34.7 | 36.1 |
| ABMLL | 28.7 | 15.9 | 26.9 | 66.3 | 39.6 | 28.5 | 34.3 |
| MAML-en-LLM | 35.6 | 43.5 | 31.2 | 68.7 | 42.3 | 32.5 | 42.3 |
| DeGAML-LLM (Ours) | 51.4 | 46.9 | 31.4 | 69.5 | 44.6 | 41.5 | 47.5 |
| Δ (vs MAML-en-LLM) | +15.8 | +3.4 | +0.2 | +0.8 | +2.3 | +9.0 | +5.3 |
| Δ (vs ABMLL) | +22.7 | +31.0 | +4.5 | +3.2 | +5.0 | +13.0 | +13.2 |
| Δ (vs Union Train) | +17.2 | +14.7 | +7.3 | +18.1 | +9.9 | +6.8 | +11.4 |
Qwen2.5-0.5B-Instruct
| Method | GSM-8K | MATH | DivLogicEval | SocialIQA | CodeMMLU | JAMA | Avg |
|---|---|---|---|---|---|---|---|
| Union Train LoRA | 15.6 | 6.8 | 20.3 | 39.5 | 29.8 | 29.9 | 23.6 |
| ABMLL | 20.4 | 7.1 | 23.7 | 53.1 | 28.2 | 16.8 | 24.9 |
| MAML-en-LLM | 29.1 | 26.3 | 25.1 | 54.9 | 34.1 | 26.4 | 32.6 |
| DeGAML-LLM (Ours) | 30.3 | 24.5 | 28.7 | 55.1 | 35.6 | 31.2 | 34.2 |
| Δ (vs MAML-en-LLM) | +1.2 | -1.8 | +3.6 | +0.2 | +1.5 | +4.8 | +1.6 |
| Δ (vs ABMLL) | +9.9 | +17.4 | +5.0 | +2.0 | +7.4 | +14.4 | +9.3 |
| Δ (vs Union Train) | +14.7 | +17.7 | +8.4 | +15.6 | +5.8 | +1.3 | +10.6 |
Key Findings:
- DeGAML-LLM consistently outperforms baselines across both model scales
- Particularly strong on out-of-domain tasks: +15.8 on GSM-8K and +9.0 on JAMA for the 1.5B model (vs MAML-en-LLM)
- Larger in-domain gains on the 0.5B model (+6.5 avg vs +2.2 for 1.5B over MAML-en-LLM) indicate the method is especially effective when model capacity is limited
- Average improvement of +5.3 points over MAML-en-LLM on out-of-domain tasks (1.5B)
Ablation Study
Impact of generalization and adaptation stages. Base Model denotes the frozen pretrained LLM without any LoRA adapters. Generalization evaluates performance using generated LoRA parameters without task-specific refinement. Adaptation applies RL-based refinement to the generated parameters.
In-Domain Tasks
Qwen2.5-1.5B-Instruct
| Stage | ARC-c | ARC-e | HellaSwag | BoolQ | PIQA | WinoGrande | Avg |
|---|---|---|---|---|---|---|---|
| Base Model | 71.5 | 83.0 | 50.9 | 56.3 | 45.8 | 50.6 | 59.6 |
| + Generalization | 73.0 (+2.1%) | 83.7 (+0.8%) | 56.2 (+10.4%) | 55.2 (-2.0%) | 56.4 (+23.1%) | 50.2 (-0.8%) | 62.5 (+4.9%) |
| + Adaptation | 73.7 (+1.0%) | 88.4 (+5.6%) | 57.2 (+1.8%) | 58.8 (+6.5%) | 70.7 (+25.4%) | 57.3 (+14.1%) | 67.7 (+8.3%) |
Qwen2.5-0.5B-Instruct
| Stage | ARC-c | ARC-e | HellaSwag | BoolQ | PIQA | WinoGrande | Avg |
|---|---|---|---|---|---|---|---|
| Base Model | 38.3 | 54.8 | 26.5 | 37.0 | 16.6 | 50.2 | 37.2 |
| + Generalization | 42.7 (+11.5%) | 63.2 (+15.3%) | 25.9 (-2.3%) | 44.9 (+21.4%) | 47.6 (+186.7%) | 50.0 (-0.4%) | 45.7 (+22.9%) |
| + Adaptation | 55.5 (+30.0%) | 74.7 (+18.2%) | 48.3 (+86.5%) | 58.7 (+30.7%) | 60.1 (+26.3%) | 52.8 (+5.6%) | 58.4 (+27.8%) |
Out-of-Domain Tasks
Qwen2.5-1.5B-Instruct
| Stage | GSM-8K | MATH | DivLogicEval | SocialIQA | CodeMMLU | JAMA | Avg |
|---|---|---|---|---|---|---|---|
| Base Model | 51.8 | 30.3 | 28.3 | 65.9 | 42.6 | 38.9 | 42.9 |
| + Generalization | 32.6 (-37.1%) | 40.1 (+32.3%) | 28.6 (+1.1%) | 68.6 (+4.1%) | 44.1 (+3.5%) | 39.5 (+1.5%) | 42.2 (-1.6%) |
| + Adaptation | 51.4 (+57.7%) | 46.9 (+17.0%) | 31.4 (+9.8%) | 69.5 (+1.3%) | 44.6 (+1.1%) | 41.5 (+5.1%) | 47.5 (+12.6%) |
Qwen2.5-0.5B-Instruct
| Stage | GSM-8K | MATH | DivLogicEval | SocialIQA | CodeMMLU | JAMA | Avg |
|---|---|---|---|---|---|---|---|
| Base Model | 15.2 | 2.8 | 22.4 | 50.8 | 32.4 | 23.8 | 24.5 |
| + Generalization | 20.8 (+36.8%) | 24.1 (+760.7%) | 21.0 (-6.3%) | 33.5 (-34.1%) | 29.1 (-10.2%) | 11.7 (-50.8%) | 25.7 (+4.9%) |
| + Adaptation | 30.3 (+45.7%) | 24.5 (+1.7%) | 28.7 (+36.7%) | 55.1 (+64.5%) | 35.6 (+22.3%) | 31.2 (+166.7%) | 34.2 (+33.1%) |
Ablation Insights:
- The generalization module alone provides substantial improvements over the base model on in-domain tasks
- The adaptation module further refines performance, especially on complex tasks (HellaSwag: +86.5% for 0.5B)
- Out-of-domain gains can be dramatic: on MATH (0.5B), the generated parameters alone lift accuracy from 2.8 to 24.1 (+760.7%), which adaptation then refines further
- Decoupling lets the adaptation policy recover from suboptimal generated parameters (e.g., GSM-8K at 1.5B drops 37.1% after generalization, and adaptation then recovers +57.7%)
BibTeX
@article{vetcha2025degaml,
  title={DeGAML-LLM: Decoupling Generalization and Adaptation in Meta-Learning for Large Language Models},
  author={Vetcha, Nitin and Xu, Binqian and Liu, Dianbo},
  year={2025},
  url={https://github.com/nitinvetcha/DeGAML-LLM}
}