A 1B-parameter vision-language model optimized for efficient multimodal learning

Introduction

Shakti-VLM-1B is a 1B-parameter vision-language model developed by SandLogic, optimized for efficient multimodal learning with a strong emphasis on data and parameter efficiency. It is designed to support enterprise-scale and edge deployment across vision-language tasks such as document understanding, chart interpretation, OCR-based extraction, and multimodal reasoning. Despite using only 487 billion training tokens, Shakti-VLM-1B achieves competitive performance by combining architectural innovations like QK-Normalization, hybrid normalization, and enhanced positional embeddings.

Model Capabilities

Shakti-VLM-1B is trained to handle a range of vision-language tasks:

Document Visual Question Answering (DocVQA)

Answering questions based on the content of scanned or digital documents.

Text-based Visual QA

Handling tasks like OCR-VQA and TextVQA that involve reasoning over text present in images.

Chart and Diagram Interpretation

Understanding structured visual data like charts and tables.

Multimodal Reasoning

Answering complex questions that require understanding both the image and the accompanying text.

Mathematical and Scientific QA

Reasoning over math or science questions with visual context, e.g., figures, tables, and diagrams.

These capabilities make Shakti-VLM-1B suitable for enterprise scenarios such as automated document workflows, visual form understanding, and structured data extraction.

Architecture

Shakti-VLM-1B follows a standard vision-language model structure with three core components:

Vision Encoder

The vision encoder is a transformer-based model with:

  • 36 layers
  • 1536 hidden dimension
  • 16 attention heads

It uses dynamic patch sizes (from 14×14 to 32×32) to process images of various resolutions.
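As a rough illustration (not the model's actual tiling code), the visual-token count implied by a given patch size can be sketched as:

```python
import math

def num_patches(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patches (visual tokens before any merging)."""
    return math.ceil(height / patch) * math.ceil(width / patch)

# For a 448x448 image (the resolution used in Stage 2 training):
small = num_patches(448, 448, 32)   # coarse 32x32 patches -> fewer tokens
large = num_patches(448, 448, 14)   # fine 14x14 patches   -> more tokens
print(small, large)  # 196 1024
```

Coarser patches trade spatial detail for a shorter visual-token sequence, which is why a dynamic patch size helps when input resolutions vary.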

Stability Improvements:

  • QK-Normalization in the attention mechanism
  • Hybrid normalization strategy (Pre-LayerNorm + RMSNorm)
  • Enhanced Rotary Position Embeddings (RoPE) with 2D positional bias
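A minimal sketch of QK-Normalization, assuming the common formulation in which queries and keys are L2-normalized before the dot product (the exact variant and scaling used in Shakti-VLM-1B may differ):

```python
import math

def l2_normalize(v, eps=1e-6):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v)) + eps
    return [x / norm for x in v]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def qk_norm_attention(q, keys, scale):
    # Normalizing q and each k bounds every logit to [-scale, scale],
    # preventing the attention-logit blow-up that destabilizes training.
    qn = l2_normalize(q)
    logits = [scale * sum(a * b for a, b in zip(qn, l2_normalize(k)))
              for k in keys]
    return softmax(logits)

weights = qk_norm_attention([3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]], scale=10.0)
print(weights)  # attention weights over the two keys, summing to 1
```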

Projection Layer

Maps visual features from the encoder into visual tokens that can be combined with text embeddings. These visual tokens are concatenated with text and fed to the decoder.
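A shape-level sketch of this step; the dimensions and projection matrix below are illustrative stand-ins (the real encoder output is 1536-dim, and the decoder width is not stated here):

```python
import random

random.seed(0)

VISION_DIM, DECODER_DIM = 8, 12   # stand-ins for the real encoder/decoder widths

# Hypothetical linear projection from encoder features to decoder embeddings.
W = [[random.gauss(0, 0.02) for _ in range(DECODER_DIM)]
     for _ in range(VISION_DIM)]

def project(visual_features):
    """Map each vision-encoder feature vector to a visual token embedding."""
    return [[sum(f[i] * W[i][j] for i in range(VISION_DIM))
             for j in range(DECODER_DIM)] for f in visual_features]

visual_features = [[1.0] * VISION_DIM for _ in range(196)]   # e.g. 196 patches
text_embeddings = [[0.0] * DECODER_DIM for _ in range(32)]   # 32 text tokens

visual_tokens = project(visual_features)
decoder_input = visual_tokens + text_embeddings  # concatenated decoder input
print(len(decoder_input), len(decoder_input[0]))  # 228 12
```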

Decoder

The decoder is based on the Shakti-2.5B language model and is designed to handle long-context sequences of up to 32,768 tokens. It supports multimodal generation and reasoning when both image and text tokens are present.

Dataset Details

Shakti-VLM-1B is trained using a diverse collection of datasets organized across three distinct training stages:

Stage 1 (Pre-Training):
  • Dolma (Books subset)
  • The Stack
  • FineWebEdu-dedup

Stage 2 (Pre-Training):
  • OBELICS
  • PDFA
  • LAION-400M

Stage 3 (Fine-Tuning):
  • PDFA
  • Docmatrix
  • Leopard-instruct
  • MIMIC-IT

Stage 1 (Decoder Pretraining)

Text-only datasets for language modeling and long-context understanding

Stage 2 (Vision-Language Alignment)

Paired image-text datasets for aligning visual and textual embeddings

Stage 3 (Fine-Tuning)

Multimodal tasks including instruction-tuning and human feedback datasets

Training Details

Shakti-VLM-1B uses a three-stage training approach that separates the concerns of language understanding, multimodal alignment, and end-to-end task learning.

Stage 1: Decoder Pretraining (Text-only)

Focuses on training the text decoder independently on large-scale text-only data.

  • Vision encoder and projection layers are frozen
  • Rotary position embeddings with dynamic scaling for long sequences
  • Cosine learning rate schedule starting at 3e-4
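The Stage 1 schedule can be sketched as a standard cosine decay from the stated 3e-4 peak (the warmup length and minimum learning rate below are illustrative assumptions, not values from the source):

```python
import math

def cosine_lr(step, total_steps, base_lr=3e-4, warmup=100, min_lr=0.0):
    """Cosine learning-rate decay with a linear warmup phase."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(100, 10_000))     # peak of the schedule: 3e-4
print(cosine_lr(10_000, 10_000))  # fully decayed, approaches min_lr
```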

Stage 2: Vision-Language Alignment

Aligns visual features with language representations.

  • Vision encoder and projection layer trained, decoder frozen
  • Sequence length of 32,768 tokens, image size 448×448
  • Learning rate of 2e-5 with cosine scheduling
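The freezing pattern across all three stages can be summarized as follows (component names are illustrative labels for the three modules described above):

```python
def trainable_modules(stage: int) -> set:
    """Which components receive gradient updates in each training stage."""
    if stage == 1:   # decoder pretraining: vision encoder and projection frozen
        return {"decoder"}
    if stage == 2:   # vision-language alignment: decoder frozen
        return {"vision_encoder", "projection"}
    if stage == 3:   # end-to-end fine-tuning: everything trains
        return {"vision_encoder", "projection", "decoder"}
    raise ValueError(f"unknown stage: {stage}")

print(trainable_modules(2))
```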

Stage 3: End-to-End Fine-Tuning

Trains the full model on multimodal datasets.

  • Includes instruction tuning and in-context learning
  • Uses RLHF and Direct Preference Optimization (DPO)
  • Learning rate of 4e-5 with cosine scheduling, weight decay 0.01
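For reference, the Direct Preference Optimization objective used in this stage can be sketched as the standard DPO loss (the β value and log-probabilities below are illustrative, not Shakti-specific):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - logp_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The loss falls below log 2 when the policy prefers the chosen response
# more strongly than the reference model does, and rises above it otherwise.
print(dpo_loss(-5.0, -9.0, -6.0, -7.0))   # positive margin -> low loss
print(dpo_loss(-9.0, -5.0, -6.0, -7.0))   # negative margin -> high loss
```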

Benchmark Results

Shakti-VLM-1B is evaluated across diverse multimodal benchmarks, frequently matching or exceeding larger models despite having only 1 billion parameters.

Benchmark        Shakti-VLM-1B   MolmoE-1B   InternVL2-1B   SmolVLM-2.25B   Qwen-2VL-2B   Qwen-2.5VL-3B
MMMU (val)           42.5           34.9         36.7           38.8            41.1           53.1
DocVQA (test)        87.96          77.7         81.7           81.6            90.1           93.9
InfoVQA (test)       56.8           53.9         50.9           43.5            65.5           77.1
ChartQA (test)       79.56          78           72.9           62.2            73.5           -
TextVQA (val)        80.75          78.8         70.5           72.7            79.7           79.3
OCRBench            798            684          754            701             794            -
MME (sum)          1910.62        1782.2       1794.4         1801.9          1872            -
MMStar               50.13          40.2         39.4           42.1            48             55.9
MathVista            46.2           34           37.7           44.6            -              62.3
RealWorldQA          64.82          60.4         50.3           -               62.9           -

Key Performance Highlights

  • Multimodal Reasoning: 42.5% on MMMU and 50.13% on MMStar, the best scores among the 1B-2B models evaluated
  • Document & Text QA: Strong results on DocVQA (87.96%) and TextVQA (80.75%)
  • Chart Understanding: Highest reported score on ChartQA (79.56%)
  • Mathematical Reasoning: 46.2% on MathVista, the best among the 1B-2B models evaluated
  • Real-world Applications: Highest reported score on RealWorldQA (64.82%)

Conclusion

Shakti-VLM-1B is a compact and efficient vision-language model built for real-world applications across document understanding, chart reasoning, OCR, and multimodal question answering. With a thoughtful three-stage training strategy and architectural optimizations, it delivers strong performance while using fewer parameters and less training data compared to larger models.

Despite its 1B parameter size, Shakti-VLM-1B matches or outperforms several models in the 2B-3B range on key benchmarks. Its design makes it well-suited for enterprise and edge deployments that require both accuracy and efficiency in handling text and visual inputs together.
