A 4B-parameter vision-language model optimized for enterprise and edge deployment

Introduction

Shakti-VLM-4B is a 4B-parameter vision-language model developed by SandLogic, designed for tasks involving both visual and textual inputs. It supports a wide range of applications such as document question answering, chart understanding, OCR-based extraction, and general visual reasoning. The model is trained using a three-stage pipeline that separates language modeling, visual alignment, and full fine-tuning. It uses architectural choices like QK-normalization, hybrid layer normalization, and extended positional embeddings to improve training stability and efficiency. Shakti-VLM-4B achieves strong results across standard multimodal benchmarks while being optimized for enterprise and edge deployment.

Model Capabilities

Shakti-VLM-4B is trained to handle a range of vision-language tasks:

Document Visual Question Answering (DocVQA)

Answering questions based on the content of scanned or digital documents.

Text-based Visual QA

Handling tasks like OCR-VQA and TextVQA that involve reasoning over text present in images.

Chart and Diagram Interpretation

Understanding structured visual data like charts and tables.

Multimodal Reasoning

Answering complex questions that require understanding both the image and the accompanying text.

Mathematical and Scientific QA

Reasoning over math or science questions with visual context, e.g., figures, tables, and diagrams.

These capabilities make Shakti-VLM-4B suitable for enterprise scenarios such as automated document workflows, visual form understanding, and structured data extraction.

Architecture

Shakti-VLM-4B follows a standard vision-language model structure with three core components:

Vision Encoder

The vision encoder is a transformer-based model with:

  • 48 layers
  • 1920 hidden dimension
  • 24 attention heads

It uses dynamic patch sizes (from 14×14 to 32×32) to process images of various resolutions, supporting flexible input for both dense documents and high-resolution visuals.
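The encoder figures above imply some simple arithmetic, sketched here in Python. The hidden dimension, head count, and patch-size range come from this model card; the helper names are hypothetical, and the sketch assumes the hidden dimension is split evenly across heads and that images are padded up to a whole number of patches. The 448×448 resolution is the one quoted for Stage 2 training.

```python
import math

HIDDEN_DIM = 1920  # from the model card
NUM_HEADS = 24     # from the model card


def head_dim(hidden_dim: int, num_heads: int) -> int:
    """Per-head width, assuming the hidden dimension splits evenly across heads."""
    if hidden_dim % num_heads != 0:
        raise ValueError("hidden_dim must be divisible by num_heads")
    return hidden_dim // num_heads


def num_patches(height: int, width: int, patch: int) -> int:
    """Patch count for one image, assuming padding up to a multiple of `patch`."""
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    return rows * cols


print(head_dim(HIDDEN_DIM, NUM_HEADS))  # 80
print(num_patches(448, 448, 14))        # 1024
print(num_patches(448, 448, 32))        # 196
```

Under these assumptions, moving from the smallest to the largest patch size cuts the patch count for the same image by roughly a factor of five, which is one plausible reason to vary patch size with input resolution.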

Stability Improvements:

  • QK-Normalization in the attention mechanism
  • Hybrid normalization strategy (Pre-LayerNorm + RMSNorm)
  • Enhanced Rotary Position Embeddings (RoPE) with 2D positional bias
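To illustrate why QK-normalization helps stability, here is a minimal pure-Python sketch for a single attention head. The model card does not say which normalization is applied to queries and keys, so the use of an RMS-style norm without a learned gain is an assumption, and all function names are hypothetical.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Rescale a vector to unit root-mean-square (learned gain omitted)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def qk_norm_attention_weights(query, keys):
    """Attention weights of one query over several keys, normalizing q and k first."""
    q = rms_norm(query)
    dim = len(q)
    scores = []
    for key in keys:
        k = rms_norm(key)
        scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(dim))
    return softmax(scores)


# Large-magnitude inputs that would saturate an unnormalized softmax.
query = [100.0, -200.0, 300.0, 50.0]
keys = [[500.0, 100.0, -50.0, 20.0], [10.0, -20.0, 30.0, 5.0]]
print(qk_norm_attention_weights(query, keys))
```

Because both vectors are normalized before the dot product, each score is bounded by the square root of the head dimension regardless of how large the raw activations grow, so the softmax does not collapse onto a single key.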

Projection Layer

Maps visual features from the encoder into visual tokens that can be combined with text embeddings. These visual tokens are concatenated with text and fed to the decoder.
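The following toy sketch shows the shape of this step. The model card does not describe the projection layer's internals or the token ordering, so the single linear layer, the choice to place visual tokens before text, and all names here are assumptions made for illustration.

```python
def project(features, weight, bias):
    """Linear projection: each visual feature (length d_vis) becomes a token (length d_model).

    `weight` holds d_model rows of length d_vis; `bias` has length d_model.
    """
    tokens = []
    for feat in features:
        tokens.append([
            sum(f * w for f, w in zip(feat, row)) + b
            for row, b in zip(weight, bias)
        ])
    return tokens


def build_decoder_input(visual_features, text_embeddings, weight, bias):
    """Project visual features, then concatenate them with the text embeddings."""
    visual_tokens = project(visual_features, weight, bias)
    return visual_tokens + text_embeddings  # assumed order: image tokens first


# Toy sizes: d_vis = 3, d_model = 2, two visual features, one text embedding.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.0, 0.5]
visual = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
text = [[0.1, 0.2]]

sequence = build_decoder_input(visual, text, weight, bias)
print(len(sequence))  # 3
print(sequence[0])    # [1.0, 5.5]
```

The key property is that projected visual tokens and text embeddings share the same width, so the decoder can treat them as one sequence.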

Decoder

Based on the Shakti-2.5B language model, designed to handle long-context sequences (up to 32,768 tokens). Supports multimodal generation and reasoning when both image and text tokens are present.

Dataset Details

Shakti-VLM-4B is trained using a diverse collection of datasets organized across three distinct training stages:

Stage 1: Decoder Pretraining

Text-only datasets for language modeling and long-context understanding

Stage 2: Vision-Language Alignment

Paired image-text datasets for aligning visual and textual embeddings

Stage 3: Fine-Tuning

Multimodal tasks including instruction-tuning and human feedback datasets

Datasets by stage (Pre-Training covers Stage 1 and Stage 2; Fine-Tuning is Stage 3):
  • Dolma (Books subset)
  • The Stack
  • FineWebEdu-dedup
  • OBELICS
  • PDFA
  • LAION-400M
  • LAION COCO
  • COYO
  • DocVQA
  • TextCaptions
  • Visual-7W
  • OCR-VQA
  • DataComp
  • COCOCaption
  • PDFA
  • Docmatrix
  • Leopard-instruct
  • MIMIC-IT
  • ScienceQA
  • RLAIF-V-Dataset
  • LLaVA-CoT-100k
  • the_cauldron
  • rlaif-v_formatt

Training Details

Shakti-VLM-4B uses a three-stage training approach to separate concerns of language understanding, multimodal alignment, and end-to-end task learning.
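The split of trainable and frozen components across the three stages, as described in the stage sections of this card, can be summarized in a small sketch. The component and stage identifiers are hypothetical labels chosen for illustration, not names from the actual training code.

```python
from dataclasses import dataclass

COMPONENTS = frozenset({"vision_encoder", "projection", "decoder"})


@dataclass(frozen=True)
class StagePlan:
    name: str
    trainable: frozenset  # components that receive gradient updates


STAGES = (
    StagePlan("decoder_pretraining", frozenset({"decoder"})),
    StagePlan("vision_language_alignment", frozenset({"vision_encoder", "projection"})),
    StagePlan("end_to_end_finetuning", COMPONENTS),
)


def frozen_components(stage: StagePlan) -> frozenset:
    """Everything not listed as trainable is frozen for that stage."""
    return COMPONENTS - stage.trainable


for stage in STAGES:
    print(stage.name, "trains", sorted(stage.trainable),
          "freezes", sorted(frozen_components(stage)))
```

In a real framework the same plan would typically be applied by toggling gradient tracking on each component's parameters before a stage begins.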

Stage 1: Decoder Pretraining (Text-only)

Focuses on training the text decoder independently on large-scale text-only data to build strong language understanding and handle long context lengths.

  • The vision encoder and projection layers are frozen
  • Rotary position embeddings with dynamic scaling for long sequences
  • Cosine learning rate schedule starting at 2e-4

This stage ensures the decoder can handle complex language tasks before adding visual inputs.
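A cosine learning rate schedule of the kind mentioned above can be sketched as follows. Only the starting rate of 2e-4 comes from this model card; the total step count, the final rate of zero, and the absence of warmup are assumptions made for illustration.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 2e-4, min_lr: float = 0.0) -> float:
    """Cosine decay from `peak_lr` at step 0 down to `min_lr` at `total_steps`."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 10_000  # hypothetical; the model card does not state a step count
for step in (0, 2_500, 5_000, 7_500, 10_000):
    print(step, cosine_lr(step, TOTAL_STEPS))
```

The same function covers the later stages by passing `peak_lr=4e-5`, the rate quoted for Stage 2 and Stage 3.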

Stage 2: Vision-Language Alignment

Aligns visual features with language representations by training the vision encoder and projection layer while keeping the decoder frozen.

  • Projection layer maps visual embeddings into tokens compatible with the decoder
  • Sequence length of 32,768 tokens and image size of 448×448 with dynamic resizing
  • Learning rate of 4e-5 with cosine scheduling

This enables cross-modal alignment without disrupting the decoder's pre-learned language modeling.

Stage 3: End-to-End Fine-Tuning

Trains the full model — encoder, projection, and decoder — together on multimodal datasets to adapt to real-world tasks.

  • Includes instruction tuning and in-context instruction learning
  • Uses RLHF and Direct Preference Optimization (DPO) to refine outputs
  • Learning rate of 4e-5 with cosine scheduling; weight decay of 0.01

This helps the model generalize to tasks like visual question answering, document analysis, and visual reasoning.

Benchmark Results and Comparison

Shakti-VLM-4B is evaluated on a diverse set of multimodal benchmarks and shows strong performance in document understanding, OCR, chart reasoning, and general vision-language tasks.

Evaluation Categories

Document QA: DocVQA, InfoVQA
Text-based QA: TextVQA, OCRBench
Chart/Diagram Interpretation: ChartQA
Multimodal Reasoning: MMMU, MMStar, MMVet
Mathematical Reasoning: MathVista, MathVerse
Real-world Scenarios: RealWorldQA
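Benchmark | Shakti-VLM-4B | InternVL2-1B | Phi-3-Vision-4B | MiniCPM-V-2.6-8B | Qwen2VL-7B | Qwen-2.5VL-7B
MMMU val | 59.78 | 47.9 | 46.1 | 49.8 | 54.1 | 58.6
DocVQA test | 92.92 | 89.2 | - | 90.8 | 94.5 | 95.7
InfoVQA test | 77.3 | 67.0 | - | - | 76.5 | 82.6
ChartQA test | 85.56 | 74.4 | 70.9 | 80.1 | 84.3 | 84.9
TextVQA val | 80.75 | 78.8 | 70.5 | 72.7 | 74.1 | 79.7
OCRBench | 849 | 788 | 639 | 852 | 845 | 864
MME sum | 2340.99 | 2064.1 | 1508.0 | 2348.4 | 2326.8 | -
MMStar | 62.33 | - | - | 57.5 | 60.7 | 63.9
MMMU Pro val | 37.47 | - | - | - | - | 41
VQA v2 val | 78 | - | - | - | - | -
AI2D | 83.83 | 78.9 | 76.7 | - | - | -
RealWorldQA | 71.18 | 60.7 | 58.8 | - | 70.1 | -
MathVista (testmini) | 48.5 | 58.6 | 44.5 | 60.6 | 58.2 | 68.2
MMT-Bench (test) | 66.26 | - | - | - | 63.7 | 63.6
MMVet | 62.3 | 55.7 | 44.1 | 60 | 62.0 | 67.1
HallusionBench | 47.9 | 41.9 | 39 | 48.1 | 50.6 | 52.9
MMBench (test) | 81.7 | 78.6 | 73.6 | - | 83.0 | 82.6
MathVision | 19.05 | 17.8 | 17.4 | 16.1 | 16.3 | 25.07
MathVerse | 28.78 | 32 | 24.1 | 25.7 | 31.9 | -
OlympiadBench | 1.3 | 1.1 | - | - | - | -
BLINK | 50.11 | 46.1 | 58.3 | 53 | 53.2 | -
MTVQA | 16.02 | 15.3 | - | - | 25.6 | -

A "-" marks a result that is not reported.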

Key Observations

Strong Multimodal Reasoning: Excels on complex reasoning benchmarks like MMMU (59.78%), outperforming all comparison models.
Document and Visual Intelligence: Demonstrates robust capabilities in document understanding (DocVQA, TextVQA, InfoVQA) and advanced visual reasoning (ChartQA, MMStar, MMVet).
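Real-world Applicability: Achieves the highest RealWorldQA score (71.18) among the compared models with reported results, highlighting practical effectiveness in everyday scenarios.
Efficient and Competitive: Matches or outperforms larger models (Qwen2VL-7B, MiniCPM-V-2.6-8B) on benchmarks such as MMMU, ChartQA, and TextVQA despite having fewer parameters, while trailing them on MathVista.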
Balanced Generalization: Maintains consistently high performance across diverse tasks, reflecting broad capabilities.

Conclusion

Shakti-VLM-4B is a capable, efficient, and enterprise-ready vision-language model. Its modular three-stage training, long-context support (up to 32,768 tokens), and compact 4B-parameter architecture deliver competitive results across a wide range of multimodal benchmarks. By combining hybrid normalization, enhanced positional embeddings, and instruction tuning, it is a strong option among lightweight VLMs and is well suited to edge deployment and real-world applications.