LLM Fine-Tuning · Mathematical Reasoning

Advanced LLM Fine-Tuning: Mathematical Reasoning

Loss-based difficulty filtering over 100K examples, yielding a +34.2-point improvement over random selection (30.0% → 64.2% on MATH-500). Fine-tuned Qwen2.5-3B-Instruct with DeepSpeed ZeRO-3 and Flash Attention 2 on an NVIDIA H200. Model published on the Hugging Face Hub.

LLM Fine-Tuning · +34.2% MATH-500 · DeepSpeed ZeRO-3 · Flash Attention 2 · CUDA · HuggingFace · LLaMA-Factory

Overview

This project investigates data quality over data quantity for LLM fine-tuning. Rather than training on more examples, we use a novel loss-based difficulty filtering algorithm to select only the most informative 15K examples from a 100K pool — targeting the "Zone of Proximal Development" where samples are challenging but learnable.

The base model is Qwen2.5-3B-Instruct, fine-tuned on the AceReason-1.1-SFT dataset. Through loss-based selection we achieved a +34.2-point improvement over random selection (30.0% → 64.2% on MATH-500) while gaining entirely new problem-solving capability on AIME 2025 (0% → 3.33%).

Data Selection Strategy

We implemented an automated loss-based difficulty filtering strategy using the base model's cross-entropy loss to identify optimally challenging training examples from AceReason-100K.

01 · Loss Computation — The base Qwen2.5-3B model computes cross-entropy loss for all 100K examples. Only output/solution tokens are counted; prompt tokens are masked.
02 · Difficulty Stratification — Examples are sorted by loss. The bottom 25% (too easy) and top 25% (too hard/noisy) are excluded; the middle 50% (optimal) is selected.
03 · Uniform Sampling — 15,000 examples are uniformly sampled from the optimal middle 50%, giving balanced representation across the learning zone.
Zone of Proximal Development: Learning is most effective when tasks are slightly beyond current ability. By selecting examples where the model shows partial understanding, we maximize learning signal and avoid both trivial problems and incomprehensible noise.
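The three steps above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: it assumes per-example losses have already been computed with the base model (mean cross-entropy over solution tokens only, prompt tokens masked), and the function name `select_by_difficulty` is hypothetical.

```python
import random

def select_by_difficulty(losses, k=15_000, lower_q=0.25, upper_q=0.75, seed=0):
    """Stratify examples by base-model loss and sample from the middle band.

    losses  : list of (example_id, loss) pairs, where loss is the mean
              cross-entropy over solution tokens only (prompt tokens masked).
    k       : number of examples to select.
    lower_q : fraction dropped as "too easy" (lowest-loss examples).
    upper_q : cutoff above which examples are dropped as "too hard/noisy".
    """
    ranked = sorted(losses, key=lambda pair: pair[1])    # easiest -> hardest
    n = len(ranked)
    band = ranked[int(n * lower_q):int(n * upper_q)]     # keep the middle 50%
    rng = random.Random(seed)                            # reproducible sampling
    return rng.sample(band, min(k, len(band)))           # uniform within band

# Toy demo: 100 synthetic examples with pseudo-random losses, select 20.
toy = [(i, loss) for i, loss in
       enumerate(random.Random(1).choices(range(1000), k=100))]
picked = select_by_difficulty(toy, k=20)
```

In the real run, this selection was applied to AceReason-100K, keeping 15,000 middle-band examples.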

Complete Benchmark Results

| Benchmark | Baseline (Original) | Random Selection | Advanced (Ours) | vs Random |
|---|---|---|---|---|
| MATH-500 | 63.40% | 30.00% | 64.20% | +34.2% ↑ |
| AIME 2024 | 3.33% | 0.00% | 3.33% | +3.33% ↑ |
| AIME 2025 | 0.00% | 0.00% | 3.33% | +3.33% ↑ (new!) |
| GPQA-Diamond | 29.29% | 27.27% | 19.19% | −10.1% ↓ |
| MMLU-Redux-2 | 57.05% | 29.82% | 31.11% | −25.9% ↓ |
Highlights: MATH-500 64.2% (ours) vs 63.4% (baseline) vs 30.0% (random) · +34.2 points over random selection · AIME 2025 3.33% (was 0%).

Key Findings

  • Data selection quality is critical — Advanced selection (64.2%) dramatically outperformed random selection (30.0%) by +34.2 percentage points, validating loss-based filtering over naive sampling.
  • Random selection is harmful — Random selection caused catastrophic forgetting, dropping 33.4 points below baseline (63.4% → 30.0%). Poor data selection can be worse than no fine-tuning at all.
  • Specialization involves trade-offs — The advanced model gained +0.8 points on MATH-500 but dropped 10.1 points on GPQA and 25.9 points on MMLU relative to baseline. Training on 100% mathematics overwrites general knowledge in a 3B-parameter model.
  • Advanced selection enables new capabilities — Only the advanced model solved AIME 2025 problems (0% → 3.33%), demonstrating that well-selected training data unlocks previously unsolvable problems.

Training Configuration

Base Model: Qwen2.5-3B-Instruct
Learning Rate: 1e-5 (ultra-conservative)
Effective Batch: 16 (2 per-device × 8 accumulation steps)
Epochs: 3
LR Scheduler: Cosine
Warmup Ratio: 0.1
Sequence Length: 16,384 tokens
Precision: BFloat16
Framework: DeepSpeed ZeRO-3
Attention: Flash Attention 2
Pipeline: LLaMA-Factory
Dataset: 15K selected from 100K
Ultra-conservative LR (1e-5): initial experiments with LR = 5e-4 (50× higher) caused complete performance collapse. The ultra-low learning rate was essential for preserving baseline performance while achieving specialization gains.
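The configuration table above maps onto roughly the following setup. This is a sketch: key names follow Hugging Face / LLaMA-Factory conventions (`per_device_train_batch_size`, `cutoff_len`, `flash_attn`), and the DeepSpeed config filename is hypothetical, not the project's actual file.

```python
# Illustrative hyperparameters mirroring the configuration table. Key names
# follow Hugging Face / LLaMA-Factory conventions; this is an assumption,
# not the project's actual config file.
train_config = {
    "model_name_or_path": "Qwen/Qwen2.5-3B-Instruct",
    "learning_rate": 1e-5,              # 50x lower than the 5e-4 run that collapsed
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,   # effective batch = 2 x 8 = 16
    "num_train_epochs": 3,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "cutoff_len": 16_384,               # max sequence length in tokens
    "bf16": True,                       # BFloat16 mixed precision
    "flash_attn": "fa2",                # Flash Attention 2
    "deepspeed": "ds_z3_config.json",   # ZeRO-3 stage config (hypothetical filename)
}

# Effective batch size = per-device batch x gradient accumulation steps.
effective_batch = (train_config["per_device_train_batch_size"]
                   * train_config["gradient_accumulation_steps"])
```

With gradient accumulation, eight micro-batches of 2 are accumulated before each optimizer step, so one H200 sees an effective batch of 16 without holding it in memory at once.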

Conclusions

  • Data quality >> quantity — loss-based difficulty filtering validated with a +34.2-point gain over random selection
  • Baseline maintained — 64.2% vs 63.4% on MATH-500 (+0.8 points) despite specialized training
  • New capabilities unlocked — AIME 2025: 0% → 3.33% (previously unsolvable)
  • Trade-offs documented — GPQA: −10.1 points, MMLU: −25.9 points (expected for domain-specific fine-tuning in small models)
  • Multi-task training or larger models are needed to maintain broad capability while specializing