A 1B-parameter vision-language model optimized for efficient multimodal learning
Shakti-VLM-1B is a 1B-parameter vision-language model developed by SandLogic, optimized for efficient multimodal learning with a strong emphasis on data and parameter efficiency. It is designed to support enterprise-scale and edge deployment across vision-language tasks such as document understanding, chart interpretation, OCR-based extraction, and multimodal reasoning. Despite using only 487 billion training tokens, Shakti-VLM-1B achieves competitive performance by combining architectural innovations like QK-Normalization, hybrid normalization, and enhanced positional embeddings.
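Of the innovations listed above, QK-Normalization is the most self-contained to illustrate: queries and keys are L2-normalized before the attention dot-product, which bounds the attention logits and helps stabilize training. The sketch below is a pure-Python illustration, not the model's actual implementation; the function names and the `scale` factor are assumptions.

```python
import math

def l2_normalize(vec, eps=1e-6):
    """Scale a vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec)) + eps
    return [x / norm for x in vec]

def qk_norm_logit(q, k, scale=1.0):
    """Attention logit between a normalized query and key.
    With unit-norm q and k the logit is bounded in [-scale, scale],
    regardless of the raw magnitudes of q and k."""
    qn, kn = l2_normalize(q), l2_normalize(k)
    return scale * sum(a * b for a, b in zip(qn, kn))
```

Because the logit no longer grows with the raw query/key magnitudes, softmax attention cannot saturate the way it can with unnormalized projections.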
Shakti-VLM-1B is trained to handle a range of vision-language tasks:
- Answering questions based on the content of scanned or digital documents.
- Handling tasks like OCR-VQA and TextVQA that involve reasoning over text present in images.
- Understanding structured visual data such as charts and tables.
- Answering complex questions that require understanding both an image and its accompanying text.
- Reasoning over math or science questions with visual context, e.g., figures, tables, and diagrams.
These capabilities make Shakti-VLM-1B suitable for enterprise scenarios such as automated document workflows, visual form understanding, and structured data extraction.
Shakti-VLM-1B follows a standard vision-language model structure with three core components:

- **Vision encoder**: a transformer-based model that uses dynamic patch sizes (from 14×14 to 32×32) to process images at a range of resolutions.
- **Projector**: maps visual features from the encoder into visual tokens that can be combined with text embeddings; these visual tokens are concatenated with the text tokens and fed to the decoder.
- **Text decoder**: based on the Shakti-2.5B language model and designed to handle long-context sequences (up to 32,768 tokens); supports multimodal generation and reasoning when both image and text tokens are present.
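The flow through these components can be sketched in a few lines of plain Python. Only the 14×14–32×32 patch-size range and the concatenation step come from the description above; the helper names, the candidate patch sizes, and the visual-token budget are assumptions for illustration.

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping patches tiling an image."""
    return (height // patch_size) * (width // patch_size)

def choose_patch_size(height, width, max_visual_tokens=1024,
                      candidates=(14, 16, 20, 24, 28, 32)):
    """Pick the smallest patch size in the 14-32 range whose patch count
    fits the token budget (budget and candidate set are assumptions)."""
    for p in candidates:
        if num_patches(height, width, p) <= max_visual_tokens:
            return p
    return candidates[-1]

def build_decoder_input(visual_tokens, text_tokens):
    """Projected visual tokens are concatenated with the text embeddings
    before being fed to the decoder (lists stand in for embeddings)."""
    return visual_tokens + text_tokens
```

Under these assumptions, a 448×448 input already fits the budget at 14×14 patches (32 × 32 = 1,024 patches), while an 896×896 input falls back to 28×28 patches.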
Shakti-VLM-1B is trained using a diverse collection of datasets organized across three distinct training stages:
| Stage | Phase | Data |
|---|---|---|
| Stage 1 | Pre-Training | Text-only datasets for language modeling and long-context understanding |
| Stage 2 | Pre-Training | Paired image-text datasets for aligning visual and textual embeddings |
| Stage 3 | Fine-Tuning | Multimodal tasks, including instruction-tuning and human-feedback datasets |
Shakti-VLM-1B uses a three-stage training approach to separate concerns of language understanding, multimodal alignment, and end-to-end task learning.
1. **Stage 1 – Language understanding**: trains the text decoder independently on large-scale text-only data.
2. **Stage 2 – Multimodal alignment**: aligns visual features with language representations.
3. **Stage 3 – End-to-end task learning**: trains the full model on multimodal datasets.
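The staged schedule above can be expressed as a freeze/unfreeze policy. The component names and the exact freezing choices (e.g., whether the decoder is frozen during alignment) are assumptions; only the stage-to-data pairing is taken from the training description.

```python
# Stage -> (trainable components, data source). The freezing policy is an
# assumption; the stage/data pairing follows the training description.
SCHEDULE = {
    1: ({"decoder"}, "large-scale text-only data"),
    2: ({"vision_encoder", "projector"}, "paired image-text data"),
    3: ({"vision_encoder", "projector", "decoder"}, "multimodal task data"),
}

def trainable(component, stage):
    """True if `component` receives gradient updates in `stage`."""
    components, _ = SCHEDULE[stage]
    return component in components
```

Separating the stages this way lets the decoder's language ability be established first, then held steady while visual features are aligned to it, before everything is tuned jointly.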
Shakti-VLM-1B is evaluated across diverse multimodal benchmarks, frequently matching or exceeding larger models despite having only 1 billion parameters.
| Benchmark | Shakti-VLM-1B | MolmoE-1B | InternVL2-1B | SmolVLM-2.25B | Qwen-2VL-2B | Qwen-2.5VL-3B |
|---|---|---|---|---|---|---|
| MMMU val | 42.5 | 34.9 | 36.7 | 38.8 | 41.1 | 53.1 |
| DocVQA test | 87.96 | 77.7 | 81.7 | 81.6 | 90.1 | 93.9 |
| InfoVQA test | 56.8 | 53.9 | 50.9 | 43.5 | 65.5 | 77.1 |
| ChartQA test | 79.56 | 78 | 72.9 | 62.2 | 73.5 | - |
| TextVQA val | 80.75 | 78.8 | 70.5 | 72.7 | 79.7 | 79.3 |
| OCRBench | 798 | 684 | 754 | 701 | 794 | - |
| MME sum | 1910.62 | 1782.2 | 1794.4 | 1801.9 | 1872 | - |
| MMStar | 50.13 | 40.2 | 39.4 | 42.1 | 48 | 55.9 |
| MathVista | 46.2 | 34 | 37.7 | 44.6 | - | 62.3 |
| RealWorldQA | 64.82 | 60.4 | 50.3 | - | 62.9 | - |
Shakti-VLM-1B is a compact and efficient vision-language model built for real-world applications across document understanding, chart reasoning, OCR, and multimodal question answering. With a thoughtful three-stage training strategy and architectural optimizations, it delivers strong performance while using fewer parameters and less training data compared to larger models.
Despite its 1B parameter size, Shakti-VLM-1B matches or outperforms several models in the 2B-3B range on key benchmarks. Its design makes it well-suited for enterprise and edge deployments that require both accuracy and efficiency in handling text and visual inputs together.
⚡ Faster token generation than comparably sized models, reflecting its parameter- and data-efficient design