Shakti LLMs

A 4B-parameter vision-language model optimized for enterprise and edge deployment

Introduction

Shakti-VLM-4B is a 4B-parameter vision-language model developed by SandLogic, designed for tasks involving both visual and textual inputs. It supports a wide range of applications such as document question answering, chart understanding, OCR-based extraction, and general visual reasoning. The model is trained using a three-stage pipeline that separates language modeling, visual alignment, and full fine-tuning. It uses architectural choices such as QK-normalization, hybrid layer normalization, and extended positional embeddings to improve training stability and efficiency. Shakti-VLM-4B achieves strong results across standard multimodal benchmarks while remaining optimized for enterprise and edge deployment.

Model Capabilities

Shakti-VLM-4B is trained to handle a range of vision-language tasks:

Document Visual Question Answering (DocVQA)

Answering questions based on the content of scanned or digital documents.

Text-based Visual QA

Handling tasks like OCR-VQA and TextVQA that involve reasoning over text present in images.

Chart and Diagram Interpretation

Understanding structured visual data like charts and tables.

Multimodal Reasoning

Answering complex questions that require understanding both the image and the accompanying text.

Mathematical and Scientific QA

Reasoning over math or science questions with visual context, e.g., figures, tables, and diagrams.

These capabilities make Shakti-VLM-4B suitable for enterprise scenarios such as automated document workflows, visual form understanding, and structured data extraction.

Architecture

Shakti-VLM-4B follows a standard vision-language model structure with three core components:

Vision Encoder

The vision encoder is a transformer-based model with:

48 layers
1920 hidden dimension
24 attention heads

It uses dynamic patch sizes (from 14×14 to 32×32) to process images of various resolutions, supporting flexible input for both dense documents and high-resolution visuals.
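To make the effect of patch size concrete, the short sketch below counts how many patches an image produces at the two ends of the stated range. It is only an illustration: the function names are hypothetical, and rounding partial patches up (as if the image were padded) is an assumption, since the page does not describe how edges are handled.

```python
def patch_grid(height: int, width: int, patch: int) -> tuple:
    """Return (rows, cols) of the patch grid for a square patch size.

    Assumption: partial patches at the border are kept (ceiling division),
    as if the image were padded. The source does not specify this.
    """
    rows = -(-height // patch)  # ceiling division without floats
    cols = -(-width // patch)
    return rows, cols


def num_patches(height: int, width: int, patch: int) -> int:
    """Total number of patches, i.e. visual positions before projection."""
    rows, cols = patch_grid(height, width, patch)
    return rows * cols


# A 448x448 image at the smallest and largest patch sizes mentioned above.
for patch in (14, 32):
    print(patch, num_patches(448, 448, patch))
```

For a 448×448 input, 14×14 patches give a 32×32 grid (1,024 patches) while 32×32 patches give a 14×14 grid (196 patches), so smaller patches preserve more detail for dense documents at the cost of a longer sequence.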

Stability Improvements:

QK-Normalization in the attention mechanism
Hybrid normalization strategy (Pre-LayerNorm + RMSNorm)
Enhanced Rotary Position Embeddings (RoPE) with 2D positional bias
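
As an illustration of the first item, the sketch below shows one common form of QK-normalization: queries and keys are L2-normalized before the dot product, so every attention logit is bounded by a temperature regardless of activation magnitude. This is an assumption for illustration only. The page does not say which variant Shakti-VLM-4B uses (other implementations apply LayerNorm or RMSNorm to queries and keys), and the `scale` value here is hypothetical.

```python
import math


def l2_normalize(vec, eps=1e-6):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def qk_norm_attention_weights(query, keys, scale=10.0):
    """Attention logits and weights for one query over several keys.

    `scale` stands in for a learnable temperature (hypothetical value).
    Because q and k are unit length, each logit lies in [-scale, scale].
    """
    q = l2_normalize(query)
    logits = []
    for key in keys:
        k = l2_normalize(key)
        logits.append(scale * sum(a * b for a, b in zip(q, k)))
    return logits, softmax(logits)
```

Without the normalization step, a query of magnitude 1,000 and a key of magnitude 500 would produce a logit of 500,000 and a saturated softmax; with it, the logit cannot exceed `scale`, which is the stability benefit usually attributed to this technique.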

Projection Layer

Maps visual features from the encoder into visual tokens that can be combined with text embeddings. These visual tokens are concatenated with text and fed to the decoder.
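
The data flow described here can be sketched in a few lines. The example assumes a single bias-free linear projection and places visual tokens before text tokens; neither detail is stated on this page, and the tiny dimensions and function names are purely illustrative.

```python
def project(features, weight):
    """Apply a linear map to each feature row.

    features: list of rows, each of length d_vision
    weight:   d_vision x d_model matrix (list of rows)
    Assumption: one linear layer without bias; the real layer may differ.
    """
    d_model = len(weight[0])
    return [
        [sum(f * weight[i][j] for i, f in enumerate(row)) for j in range(d_model)]
        for row in features
    ]


def build_decoder_input(visual_features, text_embeddings, weight):
    """Project visual features to the decoder width and concatenate with text.

    Assumption: visual tokens come first in the sequence.
    """
    visual_tokens = project(visual_features, weight)
    return visual_tokens + text_embeddings


# Toy sizes: 2 visual features of width 3, decoder width 2, 3 text tokens.
visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
w = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
text = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
sequence = build_decoder_input(visual, text, w)
print(len(sequence), len(sequence[0]))
```

The point of the projection is only that both modalities end up with the same embedding width, so the decoder can attend over one combined sequence.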

Decoder

Based on the Shakti-2.5B language model, designed to handle long-context sequences (up to 32,768 tokens). Supports multimodal generation and reasoning when both image and text tokens are present.

Dataset Details

Shakti-VLM-4B is trained using a diverse collection of datasets organized across three distinct training stages:

Stage 1: Decoder Pretraining

Text-only datasets for language modeling and long-context understanding

Stage 2: Vision-Language Alignment

Paired image-text datasets for aligning visual and textual embeddings

Stage 3: Fine-Tuning

Multimodal tasks including instruction-tuning and human feedback datasets

Stage	Phase	Datasets
Stage 1	Pre-Training	Dolma (Books subset), The Stack, FineWebEdu-dedup
Stage 2	Pre-Training	OBELICS, PDFA, LAION-400M, LAION COCO, COYO, DocVQA, TextCaptions, Visual-7W, OCR-VQA, DataComp, COCOCaption
Stage 3	Fine-Tuning	PDFA, Docmatrix, Leopard-instruct, MIMIC-IT, ScienceQA, RLAIF-V-Dataset, LLaVA-CoT-100k, the_cauldron, rlaif-v_formatt

Training Details

Shakti-VLM-4B uses a three-stage training approach to separate concerns of language understanding, multimodal alignment, and end-to-end task learning.
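
The split of trainable and frozen components across the three stages, as described in the subsections that follow, can be summarized in a small configuration sketch. The class and field names are hypothetical; only the frozen/trainable assignments and the learning rates come from this page.

```python
from dataclasses import dataclass

COMPONENTS = frozenset({"vision_encoder", "projection", "decoder"})


@dataclass(frozen=True)
class StageConfig:
    name: str
    trainable: frozenset  # components updated during this stage
    initial_lr: float     # starting learning rate for the cosine schedule


STAGES = (
    # Stage 1: text-only decoder pretraining; vision side frozen.
    StageConfig("decoder_pretraining", frozenset({"decoder"}), 2e-4),
    # Stage 2: alignment; decoder frozen.
    StageConfig(
        "vision_language_alignment",
        frozenset({"vision_encoder", "projection"}),
        4e-5,
    ),
    # Stage 3: end-to-end fine-tuning; everything trainable.
    StageConfig("end_to_end_finetuning", COMPONENTS, 4e-5),
)


def frozen_components(stage: StageConfig) -> frozenset:
    """Components whose weights are not updated in the given stage."""
    return COMPONENTS - stage.trainable


for stage in STAGES:
    print(stage.name, sorted(frozen_components(stage)))
```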

Stage 1: Decoder Pretraining (Text-only)

Focuses on training the text decoder independently on large-scale text-only data to build strong language understanding and handle long context lengths.

The vision encoder and projection layers are frozen
Rotary position embeddings with dynamic scaling for long sequences
Cosine learning rate schedule starting at 2e-4

This stage ensures the decoder can handle complex language tasks before adding visual inputs.
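
For readers unfamiliar with cosine scheduling, the sketch below shows the standard cosine decay formula starting from the 2e-4 rate mentioned above. The total step count, the final learning rate of zero, and the absence of warmup are all assumptions; the page gives only the starting value and the schedule type.

```python
import math


def cosine_lr(step, total_steps, initial_lr, final_lr=0.0):
    """Cosine decay from initial_lr at step 0 to final_lr at total_steps.

    Assumptions: no warmup phase and final_lr defaults to 0.0; the source
    does not state either.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (initial_lr - final_lr) * cosine


# Hypothetical 1,000-step run starting at the Stage 1 rate of 2e-4.
for step in (0, 500, 1000):
    print(step, cosine_lr(step, 1000, 2e-4))
```

The rate falls slowly at first, fastest around the midpoint (where it is half the starting value), and flattens out again near the end of training.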

Stage 2: Vision-Language Alignment

Aligns visual features with language representations by training the vision encoder and projection layer while keeping the decoder frozen.

Projection layer maps visual embeddings into tokens compatible with the decoder
Sequence length of 32,768 tokens and image size of 448×448 with dynamic resizing
Learning rate of 4e-5 with cosine scheduling

This enables cross-modal alignment without disrupting the decoder's pre-learned language modeling.

Stage 3: End-to-End Fine-Tuning

Trains the full model — encoder, projection, and decoder — together on multimodal datasets to adapt to real-world tasks.

Includes instruction tuning and in-context instruction learning
Uses RLHF and Direct Preference Optimization (DPO) to refine outputs
Learning rate of 4e-5 with cosine scheduling; weight decay of 0.01

This helps the model generalize to tasks like visual question answering, document analysis, and visual reasoning.
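
Since Stage 3 mentions Direct Preference Optimization, here is a minimal sketch of the standard DPO loss for a single preference pair. It follows the published DPO objective rather than anything specific to Shakti-VLM-4B; the `beta` value and the example log-probabilities are hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. beta is a hypothetical setting.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss decreases as the policy raises the likelihood of the preferred response relative to the rejected one, measured against the reference model, which is how preference data refines outputs without a separately trained reward model.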

Benchmark Results and Comparison

Shakti-VLM-4B is evaluated across diverse multimodal benchmarks, showing strong performance across document understanding, OCR, chart reasoning, and general vision-language tasks.

Evaluation Categories

• Document QA: DocVQA, InfoVQA
• Text-based QA: TextVQA, OCRBench
• Chart/Diagram Interpretation: ChartQA
• Multimodal Reasoning: MMMU, MMStar, MMVet
• Mathematical Reasoning: MathVista, MathVerse
• Real-world Scenarios: RealWorldQA

Benchmark	Shakti-VLM-4B	InternVL2-1B	Phi-3-Vision-4B	MiniCPM-V-2.6-8B	Qwen2VL-7B	Qwen-2.5VL-7B
MMMU val	59.78	47.9	46.1	49.8	54.1	58.6
DocVQA test	92.92	89.2	-	90.8	94.5	95.7
InfoVQA test	77.3	67.0	-	-	76.5	82.6
ChartQA test	85.56	74.4	70.9	80.1	84.3	84.9
TextVQA val	80.75	78.8	70.5	72.7	74.1	79.7
OCRBench	849	788	639	852	845	864
MME sum	2340.99	2064.1	1508.0	2348.4	2326.8	-
MMStar	62.33	-	-	57.5	60.7	63.9
MMMU Pro val	37.47	-	-	-	-	41
VQA v2 val	78	-	-	-	-	-
AI2D	83.83	78.9	76.7	-	-	-
RealWorldQA	71.18	60.7	58.8	-	70.1	-
MathVista (testmini)	48.5	58.6	44.5	60.6	58.2	68.2
MMT-Bench (test)	66.26	-	-	-	63.7	63.6
MMVet	62.3	55.7	44.1	60	62.0	67.1
HallusionBench	47.9	41.9	39	48.1	50.6	52.9
MMBench (test)	81.7	78.6	73.6	-	83.0	82.6
MathVision	19.05	17.8	17.4	16.1	16.3	25.07
MathVerse	28.78	32	24.1	25.7	31.9	-
OlympiadBench	1.3	1.1	-	-	-	-
BLINK	50.11	46.1	58.3	53	53.2	-
MTVQA	16.02	15.3	-	-	25.6	-
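
As a quick way to read the table, the sketch below tallies head-to-head results between Shakti-VLM-4B and one larger model, Qwen2VL-7B, using only rows where both scores are reported. The numbers are copied from the table above; the helper name is hypothetical, and the comparison assumes higher is better on every benchmark listed.

```python
# (benchmark, Shakti-VLM-4B score, Qwen2VL-7B score), copied from the table.
# Rows where either model has no reported score are omitted.
SCORES = [
    ("MMMU val", 59.78, 54.1),
    ("DocVQA test", 92.92, 94.5),
    ("InfoVQA test", 77.3, 76.5),
    ("ChartQA test", 85.56, 84.3),
    ("TextVQA val", 80.75, 74.1),
    ("OCRBench", 849, 845),
    ("MME sum", 2340.99, 2326.8),
    ("MMStar", 62.33, 60.7),
    ("RealWorldQA", 71.18, 70.1),
    ("MathVista (testmini)", 48.5, 58.2),
    ("MMT-Bench (test)", 66.26, 63.7),
    ("MMVet", 62.3, 62.0),
    ("HallusionBench", 47.9, 50.6),
    ("MMBench (test)", 81.7, 83.0),
    ("MathVision", 19.05, 16.3),
    ("MathVerse", 28.78, 31.9),
    ("BLINK", 50.11, 53.2),
    ("MTVQA", 16.02, 25.6),
]


def tally(rows):
    """Split benchmark names into wins and losses for the first model."""
    wins = [name for name, ours, theirs in rows if ours > theirs]
    losses = [name for name, ours, theirs in rows if ours < theirs]
    return wins, losses


wins, losses = tally(SCORES)
print(len(wins), "wins;", len(losses), "losses")
print("Trails on:", ", ".join(losses))
```

A tally like this ignores the size of each margin, so it complements the table rather than replacing it.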

Key Observations

✓ Strong Multimodal Reasoning: Scores 59.78 on MMMU val, the highest among all comparison models.
✓ Document and Visual Intelligence: Demonstrates robust capabilities in document understanding (DocVQA, TextVQA, InfoVQA) and advanced visual reasoning (ChartQA, MMStar, MMVet).
✓ Real-world Applicability: Achieves the highest reported score on RealWorldQA (71.18), highlighting practical effectiveness in everyday scenarios.
✓ Efficient and Competitive: Outperforms larger models on most shared benchmarks (11 of 18 against Qwen2VL-7B, 8 of 13 against MiniCPM-V-2.6-8B) despite having fewer parameters, though it trails both on MathVista, HallusionBench, and BLINK.
✓ Balanced Generalization: Maintains competitive performance across document, OCR, chart, and general vision-language tasks, reflecting broad capabilities.

Conclusion

Shakti-VLM-4B is a capable, efficient, and enterprise-ready vision-language model. Its modular three-stage training, long-context support (up to 32,768 tokens), and parameter-efficient architecture let it match or exceed larger models such as Qwen2VL-7B and MiniCPM-V-2.6-8B on most of the multimodal benchmarks reported above. By combining hybrid normalization, enhanced positional embeddings, and instruction tuning, it offers a strong option among lightweight VLMs for edge deployment and real-world applications.