A 4B-parameter vision-language model optimized for enterprise and edge deployment
Shakti-VLM-4B is a 4B-parameter vision-language model developed by SandLogic, designed for tasks involving both visual and textual inputs. It supports applications such as document question answering, chart understanding, OCR-based extraction, and general visual reasoning. The model is trained using a three-stage pipeline that separates language modeling, visual alignment, and full fine-tuning. It uses architectural choices such as QK-normalization, hybrid layer normalization, and extended positional embeddings to improve training stability and efficiency. Shakti-VLM-4B achieves strong results across standard multimodal benchmarks while remaining optimized for enterprise and edge deployment.
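To illustrate why QK-normalization helps stability, here is a minimal sketch of one common form of it: the query and key vectors are RMS-normalized before the dot product, so attention logits stay bounded however large the raw activations grow. This is an assumption about the general technique, not a description of Shakti-VLM-4B's actual implementation; the function names are hypothetical, and the learnable gain that real layers usually apply after normalization is omitted.

```python
import math


def rms_normalize(vec, eps=1e-6):
    """Scale a vector so its root-mean-square magnitude is about 1."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def qk_norm_attention_weights(query, keys):
    """Attention weights for one query over several keys.

    Both the query and each key are normalized before the dot product
    (hypothetical helper; no learnable gain, single head).
    """
    scale = 1.0 / math.sqrt(len(query))
    q = rms_normalize(query)
    logits = [
        scale * sum(a * b for a, b in zip(q, rms_normalize(k)))
        for k in keys
    ]
    return softmax(logits)


keys = [[1.0, 2.0, 3.0, 4.0], [-1.0, -2.0, -3.0, -4.0], [4.0, -3.0, 2.0, -1.0]]
small = qk_norm_attention_weights([1.0, 2.0, 3.0, 4.0], keys)
large = qk_norm_attention_weights([1000.0, 2000.0, 3000.0, 4000.0], keys)
```

In this sketch `small` and `large` are nearly identical: scaling the query by 1000 does not sharpen the attention distribution, which is the property that keeps logits from blowing up during training.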
Shakti-VLM-4B is trained to handle a range of vision-language tasks:
- Answering questions based on the content of scanned or digital documents.
- Handling tasks like OCR-VQA and TextVQA that involve reasoning over text present in images.
- Understanding structured visual data like charts and tables.
- Answering complex questions that require understanding both the image and the accompanying text.
- Reasoning over math or science questions with visual context, e.g., figures, tables, and diagrams.
These capabilities make Shakti-VLM-4B suitable for enterprise scenarios such as automated document workflows, visual form understanding, and structured data extraction.
Shakti-VLM-4B follows a standard vision-language model structure with three core components:
The vision encoder is a transformer-based model. It uses dynamic patch sizes (from 14×14 to 32×32) to process images of various resolutions, supporting flexible input for both dense documents and high-resolution visuals.
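As a rough illustration of how patch size trades off against the number of patches the encoder must process, the sketch below counts patches for a given image and picks a patch size that fits a budget. Only the 14 to 32 range comes from the description above; the candidate sizes, the budget parameter, and the "smallest size that fits" policy are assumptions made for the example, not the model's actual selection rule.

```python
import math


def patch_grid(width, height, patch_size):
    """Return (columns, rows, total patches); partial edge patches count as whole."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    return cols, rows, cols * rows


def choose_patch_size(width, height, max_patches, candidates=(14, 16, 24, 28, 32)):
    """Pick the smallest candidate patch size whose patch count fits the budget.

    Hypothetical policy: smaller patches keep more detail, so prefer them
    whenever the resulting patch count stays within max_patches.
    """
    for size in sorted(candidates):
        if patch_grid(width, height, size)[2] <= max_patches:
            return size
    return max(candidates)  # Fall back to the coarsest patches.


print(patch_grid(448, 448, 14))             # (32, 32, 1024)
print(choose_patch_size(1792, 1792, 4096))  # 28
```

Under these assumptions a 448×448 image stays at the finest 14×14 patches, while a 1792×1792 image moves to 28×28 patches to stay within the same order of sequence length.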
The projection layer maps visual features from the encoder into visual tokens that can be combined with text embeddings. These visual tokens are concatenated with the text embeddings and fed to the decoder.
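The projection and concatenation step can be sketched as follows. This is a toy example under stated assumptions: a single linear layer stands in for the projector (the real one may be deeper), visual tokens are placed before the text embeddings, and the helper names `project` and `build_decoder_input` are hypothetical.

```python
def project(features, weight, bias):
    """Map each visual feature (length d_vis) to a token (length d_text).

    weight holds d_text rows of length d_vis; bias has length d_text.
    """
    tokens = []
    for feat in features:
        token = [
            sum(f * w for f, w in zip(feat, row)) + b
            for row, b in zip(weight, bias)
        ]
        tokens.append(token)
    return tokens


def build_decoder_input(visual_tokens, text_embeddings):
    """Concatenate visual tokens and text embeddings along the sequence axis."""
    width = len(text_embeddings[0])
    if any(len(tok) != width for tok in visual_tokens):
        raise ValueError("visual tokens must match the text embedding width")
    return visual_tokens + text_embeddings


# Toy sizes: one visual feature with d_vis = 3, text embeddings with d_text = 2.
features = [[1.0, 2.0, 3.0]]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.5, -1.0]
text = [[0.0, 0.0], [1.0, 1.0]]

visual_tokens = project(features, weight, bias)
sequence = build_decoder_input(visual_tokens, text)
```

The point of the sketch is the shape contract: after projection, visual tokens have the same width as text embeddings, so the decoder can treat the concatenated sequence uniformly.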
The text decoder is based on the Shakti-2.5B language model and is designed to handle long-context sequences (up to 32,768 tokens). It supports multimodal generation and reasoning when both image and text tokens are present.
Shakti-VLM-4B is trained using a diverse collection of datasets organized across three distinct training stages:
- Stage 1: Text-only datasets for language modeling and long-context understanding
- Stage 2: Paired image-text datasets for aligning visual and textual embeddings
- Stage 3: Multimodal tasks including instruction-tuning and human feedback datasets
| Pre-Training | Pre-Training | Fine-Tuning |
|---|---|---|
| Stage 1 | Stage 2 | Stage 3 |
Shakti-VLM-4B uses a three-stage training approach to separate concerns of language understanding, multimodal alignment, and end-to-end task learning.
Stage 1 focuses on training the text decoder independently on large-scale text-only data to build strong language understanding and handle long context lengths.
This stage ensures the decoder can handle complex language tasks before adding visual inputs.
Stage 2 aligns visual features with language representations by training the vision encoder and projection layer while keeping the decoder frozen.
This enables cross-modal alignment without disrupting the decoder's pre-learned language modeling.
Stage 3 trains the full model (encoder, projection, and decoder) together on multimodal datasets to adapt to real-world tasks.
This helps the model generalize to tasks like visual question answering, document analysis, and visual reasoning.
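The three stages can be summarized as a schedule of which components receive gradient updates. The sketch below encodes only what the descriptions above state; the component and stage identifiers are hypothetical names chosen for the example, and in Stage 1 the vision encoder and projection layer are listed as not trained simply because that stage uses text-only data.

```python
from dataclasses import dataclass

COMPONENTS = frozenset({"vision_encoder", "projection", "decoder"})


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    trainable: frozenset


STAGES = (
    # Stage 1: the decoder alone, on text-only data.
    Stage("stage_1", "text-only", frozenset({"decoder"})),
    # Stage 2: vision encoder and projection are trained, decoder stays frozen.
    Stage("stage_2", "paired image-text", frozenset({"vision_encoder", "projection"})),
    # Stage 3: everything is trained end to end.
    Stage("stage_3", "multimodal", COMPONENTS),
)


def frozen_components(stage):
    """Components that receive no gradient updates in the given stage."""
    return COMPONENTS - stage.trainable


for stage in STAGES:
    print(stage.name, "trains", sorted(stage.trainable),
          "and freezes", sorted(frozen_components(stage)))
```

In a real training loop, a schedule like this would typically drive which parameters have gradients enabled before each stage begins.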
Shakti-VLM-4B is evaluated on a range of multimodal benchmarks covering document understanding, OCR, chart reasoning, and general vision-language tasks. The table compares it with other open vision-language models; a dash marks a score that is not available.
| Benchmark | Shakti-VLM-4B | InternVL2-1B | Phi-3-Vision-4B | MiniCPM-V-2.6-8B | Qwen2VL-7B | Qwen-2.5VL-7B |
|---|---|---|---|---|---|---|
| MMMU val | 59.78 | 47.9 | 46.1 | 49.8 | 54.1 | 58.6 |
| DocVQA test | 92.92 | 89.2 | - | 90.8 | 94.5 | 95.7 |
| InfoVQA test | 77.3 | 67.0 | - | - | 76.5 | 82.6 |
| ChartQA test | 85.56 | 74.4 | 70.9 | 80.1 | 84.3 | 84.9 |
| TextVQA val | 80.75 | 78.8 | 70.5 | 72.7 | 74.1 | 79.7 |
| OCRBench | 849 | 788 | 639 | 852 | 845 | 864 |
| MME sum | 2340.99 | 2064.1 | 1508.0 | 2348.4 | 2326.8 | - |
| MMStar | 62.33 | - | - | 57.5 | 60.7 | 63.9 |
| MMMU Pro val | 37.47 | - | - | - | - | 41 |
| VQA v2 val | 78 | - | - | - | - | - |
| AI2D | 83.83 | 78.9 | 76.7 | - | - | - |
| RealWorldQA | 71.18 | 60.7 | 58.8 | - | 70.1 | - |
| MathVista (testmini) | 48.5 | 58.6 | 44.5 | 60.6 | 58.2 | 68.2 |
| MMT-Bench (test) | 66.26 | - | - | - | 63.7 | 63.6 |
| MMVet | 62.3 | 55.7 | 44.1 | 60 | 62.0 | 67.1 |
| HallusionBench | 47.9 | 41.9 | 39 | 48.1 | 50.6 | 52.9 |
| MMBench (test) | 81.7 | 78.6 | 73.6 | - | 83.0 | 82.6 |
| MathVision | 19.05 | 17.8 | 17.4 | 16.1 | 16.3 | 25.07 |
| MathVerse | 28.78 | 32 | 24.1 | 25.7 | 31.9 | - |
| OlympiadBench | 1.3 | 1.1 | - | - | - | - |
| BLINK | 50.11 | 46.1 | 58.3 | 53 | 53.2 | - |
| MTVQA | 16.02 | 15.3 | - | - | 25.6 | - |
Shakti-VLM-4B is a capable, efficient vision-language model aimed at enterprise use. Its modular three-stage training, long-context support (up to 32,768 tokens), and parameter-efficient architecture let it compete with larger models across a wide range of multimodal benchmarks: in the comparison above it posts the highest score on benchmarks such as MMMU val, ChartQA test, and TextVQA val. Hybrid normalization, extended positional embeddings, and instruction-tuned fine-tuning contribute to this performance at a 4B-parameter scale, making the model well suited to edge deployment and real-world applications.