Shakti-100M Small Language Model

A compact language model built for smart devices and edge deployment

Introduction

Shakti-100M is a compact language model built for smart devices such as mobile phones, IoT systems, and other edge hardware. Unlike large models that need powerful servers and internet access, Shakti-100M runs directly on your device. It offers fast responses, strong privacy, and low power usage, making it ideal for offline, real-time applications. With only 100 million parameters, it brings natural language capabilities to low-resource environments, enabling smarter user experiences without cloud dependency.

Model Capabilities

Core Task Handling

Optimized for everyday NLP tasks such as text generation, completion, summarization, and basic question answering.

Lightweight Deployment

Tailored for ultra-low-power devices and offline systems with minimal memory and compute requirements.

Privacy-Preserving

Enables fully on-device processing to ensure data privacy in personal assistants, healthcare apps, and private chat interfaces.
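As an illustration of fully on-device use, the following is a minimal inference sketch using the Hugging Face transformers API, assuming the weights are available as a local checkpoint. The directory name ./shakti-100m is a placeholder rather than an official model ID, and the prompt and generation settings are arbitrary.

```python
# Minimal on-device inference sketch. "./shakti-100m" is a placeholder for a
# local Hugging Face-format checkpoint directory, not an official model ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "./shakti-100m"  # hypothetical local path with weights and tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16)
model.eval()

prompt = "Summarize: Edge devices process data locally instead of sending it to a server."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```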

Architecture

Shakti-100M includes 100 million parameters and 10 layers, offering a solid balance between performance and efficiency. It uses a model dimension of 640 and a feed-forward network (FFN) dimension of 2560 to handle complex language tasks; a minimal structural sketch follows the feature list below.

Key Architectural Features

  • Block Sparse Attention: Reduces memory and computation load during long-context processing while preserving accuracy.
  • Rotary Positional Embeddings (RoPE): Provides effective token position awareness without fixed sinusoidal patterns.
  • Sliding Window Attention Cache: Enables real-time streaming capabilities.
  • Pre-Normalization and SiLU Activation: Ensures numerical stability and gradient flow during deep model training.
  • LayerNorm and Dropout: Used throughout the stack for improved generalization.
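The sketch below shows how these pieces fit together in a standard pre-norm decoder block with the stated dimensions (10 layers, model dimension 640, FFN dimension 2560). It is illustrative only: plain multi-head attention stands in for the block-sparse and sliding-window attention, RoPE is omitted, and the head count of 10 is an assumption not given in the text.

```python
# Illustrative pre-norm decoder block using the stated Shakti-100M dimensions.
# Plain multi-head attention stands in for block-sparse / sliding-window
# attention, and RoPE is omitted; the head count is an assumption.
import torch
import torch.nn as nn

D_MODEL, N_LAYERS, D_FFN, N_HEADS = 640, 10, 2560, 10

class DecoderBlock(nn.Module):
    def __init__(self, d_model=D_MODEL, d_ffn=D_FFN, n_heads=N_HEADS, dropout=0.1):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)            # pre-normalization
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ffn_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                         # SiLU feed-forward network
            nn.Linear(d_model, d_ffn), nn.SiLU(), nn.Linear(d_ffn, d_model),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.dropout(attn_out)                    # residual around attention
        x = x + self.dropout(self.ffn(self.ffn_norm(x)))  # residual around FFN
        return x

blocks = nn.ModuleList(DecoderBlock() for _ in range(N_LAYERS))
x = torch.randn(1, 16, D_MODEL)                           # (batch, sequence, d_model)
for block in blocks:
    x = block(x)
print(x.shape)  # torch.Size([1, 16, 640])
```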

Dataset Details

Shakti-100M was trained using a diverse set of lightweight and instruction-focused datasets, carefully selected to ensure strong generalization, efficient task execution, and alignment with user intent—while maintaining a compact footprint suitable for edge deployment.

Pre-Training

Utilized general-purpose corpora such as Common Crawl to establish foundational language understanding across diverse topics and linguistic styles.

Supervised Fine-Tuning (SFT)

Employed instruction-tuned datasets to enhance performance on everyday tasks like summarization, dialogue, and instruction following (see the formatting sketch after this list):

  • Cosmopedia v2
  • OpenHermes-2.5-H4
  • Self-oss-instruct-sc2-H4
  • Instruct-data-basics-smolim-H4
  • Magma-Pro-300K-Filtered-H4
  • Everyday-conversations-llama3.1-2k
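Because these datasets differ in layout, one practical step is to normalize every record into a single chat-style training string. The sketch below assumes each example exposes an instruction and a response field; the field names, special tokens, and sample content are illustrative, not the exact schema of the datasets above.

```python
# Normalize heterogeneous instruction data into one SFT training format.
# Field names, special tokens, and sample content are illustrative assumptions.
def to_chat_format(example: dict) -> str:
    """Render one instruction/response pair as a single training string."""
    return (
        f"<|user|>\n{example['instruction'].strip()}\n"
        f"<|assistant|>\n{example['response'].strip()}"
    )

samples = [
    {"instruction": "Summarize: The battery lasts two days on a single charge.",
     "response": "The battery offers roughly two days of use per charge."},
    {"instruction": "What is the capital of France?",
     "response": "Paris."},
]

sft_corpus = [to_chat_format(s) for s in samples]
print(sft_corpus[0])
```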

Direct Preference Optimization (DPO)

Used preference-labeled data from UltraFeedback Binarized to align outputs with user expectations in a computationally efficient manner, ensuring responsiveness on low-resource and edge devices.
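For DPO, each training example reduces to a prompt paired with one preferred and one rejected response. The sketch below shows that shape; the field names and contents are illustrative, not the exact schema of UltraFeedback Binarized.

```python
# Shape of the (prompt, chosen, rejected) triples used for preference training.
# Field names and contents are illustrative, not the dataset's exact schema.
preference_pairs = [
    {
        "prompt": "Explain what an edge device is in one sentence.",
        "chosen": "An edge device processes data locally, close to where it is produced.",
        "rejected": "It is a device.",
    },
]

for pair in preference_pairs:
    # DPO trains the model to prefer "chosen" over "rejected" for each prompt.
    assert {"prompt", "chosen", "rejected"} <= pair.keys()
```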

Training Details

Shakti-100M was trained in multiple stages, progressing from general language understanding to specialized instruction following and human preference alignment.

Phase 1: Pre-Training

Trained on diverse general-purpose corpora using a next-token prediction objective with a compact Transformer architecture, Rotary Positional Embeddings (RoPE), and mixed-precision training (FP16 + bfloat16) to build foundational language understanding optimized for low-resource settings.
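As a concrete illustration of this objective, the sketch below computes the shifted next-token cross-entropy loss inside a mixed-precision autocast context. The tiny embedding-plus-linear model, the vocabulary size, and the use of bfloat16 alone are illustrative stand-ins for the full training setup.

```python
# Next-token prediction sketch: logits at position t are trained to predict
# token t+1, computed under mixed-precision autocast. Model and sizes are
# illustrative stand-ins for the full Transformer training setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, vocab_size)).to(device)
tokens = torch.randint(0, vocab_size, (2, 32), device=device)  # (batch, sequence)

with torch.autocast(device_type=device, dtype=torch.bfloat16):
    logits = model(tokens)                                     # (batch, seq, vocab)
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab_size),                # predictions at 0..T-2
        tokens[:, 1:].reshape(-1),                             # targets shifted by one
    )
print(loss.item())
```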

Phase 2: Supervised Fine-Tuning (SFT)

Fine-tuned on instruction-based datasets focused on everyday tasks such as summarization, simple Q&A, and dialogue to improve task adherence and conversational accuracy.
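A detail worth making explicit, and an assumption here rather than something stated above, is that the SFT loss is often computed only on response tokens, with prompt positions masked out. The sketch below shows that masking with PyTorch's ignore_index; the token ids and stand-in logits are made up.

```python
# SFT loss-masking sketch: prompt tokens are set to -100 so cross-entropy only
# scores the response. Token ids and the random logits are made-up stand-ins.
import torch
import torch.nn.functional as F

prompt_ids = torch.tensor([11, 42, 7, 93])        # hypothetical tokenized prompt
response_ids = torch.tensor([5, 88, 19, 2])       # hypothetical tokenized response

input_ids = torch.cat([prompt_ids, response_ids])
labels = input_ids.clone()
labels[: len(prompt_ids)] = -100                  # ignored by the loss

vocab_size = 100
logits = torch.randn(len(input_ids), vocab_size)  # stand-in for model outputs

loss = F.cross_entropy(
    logits[:-1],                                  # predict token t+1 from position t
    labels[1:],                                   # shifted targets, prompt part masked
    ignore_index=-100,
)
print(loss.item())
```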

Phase 3: Direct Preference Optimization (DPO)

Aligned using human preference-labeled outputs from lightweight tasks to enhance relevance and clarity while maintaining computational efficiency suitable for real-time inference on mobile and embedded devices.
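The DPO objective itself is compact: the policy is rewarded for widening its log-likelihood margin between the chosen and rejected responses relative to a frozen reference model. The sketch below evaluates that loss on placeholder log-probabilities; the beta value and all numbers are illustrative.

```python
# DPO loss on a single preference pair. All log-probabilities are placeholder
# numbers rather than real model outputs; beta is an assumed hyperparameter.
import torch
import torch.nn.functional as F

beta = 0.1                                        # preference temperature (assumed)

# Summed log-probabilities of each full response under policy and reference.
policy_chosen, policy_rejected = torch.tensor(-12.0), torch.tensor(-15.0)
ref_chosen, ref_rejected = torch.tensor(-13.0), torch.tensor(-14.0)

chosen_margin = policy_chosen - ref_chosen        # log-ratio for the chosen answer
rejected_margin = policy_rejected - ref_rejected  # log-ratio for the rejected answer

loss = -F.logsigmoid(beta * (chosen_margin - rejected_margin))
print(loss.item())
```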

Benchmark Results and Comparison

The Shakti-100M model demonstrates strong benchmark performance despite being significantly smaller than many competing models. It consistently matches or outperforms larger models in key evaluations, highlighting the effectiveness of its optimized training process on a carefully curated dataset.

Key Success Factor

A critical factor in Shakti-100M's success is the size and composition of its pre-training data. With a 1T token pre-training dataset, Shakti-100M balances breadth of coverage against its compact parameter budget, delivering strong performance across diverse tasks and underscoring the value of strategic data selection and curation.

| Benchmark | Shakti-100M | Boomer-634M | SmolLM-135M | SmolLM-360M | AMD-Llama-135M |
| --- | --- | --- | --- | --- | --- |
| MMLU | 25.96 | 25.91 | 30.2 | 34.4 | 23.02 |
| BigBenchHard | 30.12 | 21.11 | 23 | 24.4 | 18.71 |
| IFEval | 24.3 | 22.22 | 15.9 | 19.8 | 22 |
| Hellaswag | 51.34 | 39.24 | 41.2 | 51.8 | 30.48 |
| Anli | 21.34 | 27.5 | - | - | 30.73 |
| Piqa | 69.2 | 62.57 | 68.4 | 71.6 | 64.20 |
| OpenbookQA | 37.9 | 35.76 | 34 | 37.2 | 30.73 |
| Truthfulqa (MC2) | 29.2 | 27.57 | - | - | 22.56 |
| WinoGrande | 61.3 | 50.67 | 51.3 | 52.8 | 50.12 |
| ARC Easy | 45.8 | 62.57 | 42.4 | 50.1 | 43.64 |
| SQuAD | 31.5 | 57.5 | - | - | 25 |
| MedQA | 28.3 | 14 | 11.02 | 12.36 | 15.57 |
| GPQA | 14.9 | 12.1 | 9.89 | 11 | 12.4 |
| Bool Q | 29.4 | 22.9 | 17.3 | 21.3 | 23.54 |
| SocialQA | 23.34 | 14.5 | 16.9 | 19 | 19.1 |
| CommonsenseQA | 35.8 | 29 | 32.7 | 35.3 | 22.56 |
| Trivia QA | 15.3 | 2.73 | 4.3 | 9.1 | 7.54 |
| GSM8K | 9.2 | 1.67 | 1 | 1.69 | - |
| MATH | 13.9 | 23.38 | 14 | 19 | 20.64 |
| Humaneval | 7.8 | - | - | - | 5.1 |

Scores are percentages; a dash indicates no reported result.

Performance Highlights

  • Strong reasoning: Outperforms larger models on BigBenchHard (30.12%)
  • Common sense: Excellent performance on Hellaswag (51.34%) and WinoGrande (61.3%)
  • Knowledge tasks: Competitive results on PIQA (69.2%) and CommonsenseQA (35.8%)
  • Medical domain: Superior performance on MedQA (28.3%) versus the compared models
  • Efficiency: Achieves strong results with only 100M parameters
  • Balanced performance: Consistent results across diverse benchmark categories

Conclusion

Shakti-100M is a compact language model designed for efficient performance on edge devices. With only 100 million parameters, it supports common language tasks like summarization and question answering while running entirely on-device. The model uses a lightweight architecture with features like block sparse attention and rotary embeddings to balance speed and accuracy. Through a structured training process including pre-training, supervised fine-tuning, and preference alignment, Shakti-100M delivers reliable results in low-resource environments. Its strong performance across benchmarks shows that small models, when trained effectively, can meet real-world needs without relying on cloud infrastructure.