Shakti LLMs

The largest model in the Shakti SLM series, optimized for conversations, summarization, and edge deployment

Introduction

Shakti-500M is the largest model in the Shakti SLM series, with 500 million parameters. It is designed for conversational tasks, summarization of long documents, and question answering. The model is intended for both cloud and edge environments, including mobile and low-power devices. It uses quantization-aware training so that the same model can run in lower-precision formats (such as int8 or int4) without losing much accuracy.
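To make the int8 and int4 trade-off concrete, here is a minimal sketch of the "fake quantization" step that quantization-aware training typically simulates in the forward pass. The symmetric per-tensor scheme, the function name fake_quantize, and the sample weights are assumptions for illustration; the exact scheme used for Shakti-500M is not described here.

```python
def fake_quantize(values, num_bits):
    """Symmetric per-tensor fake quantization (assumed scheme):
    map floats to signed integers, then back to floats."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for int8, 7 for int4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized], scale


weights = [0.82, -0.41, 0.05, -0.97, 0.33]    # hypothetical weights
for bits in (8, 4):
    dequantized, scale = fake_quantize(weights, bits)
    error = max(abs(w - d) for w, d in zip(weights, dequantized))
    print(f"int{bits}: scale={scale:.4f} max_abs_error={error:.4f}")
```

The rounding error is bounded by half the scale, so int4 (15 levels) loses more precision than int8 (255 levels). Training with this step in the loop lets the weights adapt to that error.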

Model Capabilities

Multilingual Understanding

The Shakti models use a multilingual tokenizer trained on a broad range of languages, enabling accurate parsing and representation of inputs in Indian languages (e.g., Hindi, Kannada, Telugu, Tamil) and global languages (e.g., Spanish, French, German, English). Fine-tuning allows domain- and language-specific adaptation for multi-regional deployments.

Instruction Following

Optimized on instruction datasets like Cosmopedia and Magpie-Pro, the models handle multi-turn dialogue, instructional tasks, summarization, classification, and QA efficiently.

Long-Context Handling

The model uses Block Sparse Attention and a sliding-window cache to process and attend to sequences of up to 4,096 tokens, supporting document QA and long-thread chat scenarios.
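As a rough illustration of the sliding-window idea, the sketch below keeps only the most recent key/value entries, so memory stays constant as the sequence grows. The class name SlidingWindowCache and the window size of 4 are hypothetical; the window size used by Shakti-500M is not stated here.

```python
from collections import deque


class SlidingWindowCache:
    """Keeps only the most recent `window` key/value entries."""

    def __init__(self, window):
        self.window = window
        self.entries = deque(maxlen=window)   # oldest entries are evicted automatically
        self.next_position = 0

    def append(self, key, value):
        self.entries.append((self.next_position, key, value))
        self.next_position += 1

    def visible_positions(self):
        """Positions the next token could attend to."""
        return [position for position, _, _ in self.entries]


cache = SlidingWindowCache(window=4)          # hypothetical window size
for token in ["a", "b", "c", "d", "e", "f"]:
    cache.append(key=f"k_{token}", value=f"v_{token}")

print(cache.visible_positions())              # [2, 3, 4, 5]
```

Because each new token evicts the oldest entry, the per-step cost depends on the window size rather than on the total length of the stream.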

Code Understanding and Generation

Benchmarked on HumanEval and instruction-tuned on coding datasets, the model supports function generation, code completion, and syntax correction.

Architecture

Shakti-500M utilizes a 24-layer decoder-only Transformer with a hidden size of 2,048 and 16 attention heads. It incorporates the following enhancements:

Key Architectural Features

• Block Sparse Attention: Reduces memory and computation load during long-context processing while preserving accuracy.
• Rotary Positional Embeddings (RoPE): Provides effective token position awareness without fixed sinusoidal patterns.
• Sliding Window Attention Cache: Enables real-time streaming capabilities.
• Pre-Normalization and SiLU Activation: Ensures numerical stability and gradient flow during deep model training.
• LayerNorm and Dropout: Used throughout the stack for improved generalization.
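The RoPE item above can be illustrated with a small, dependency-free sketch. It rotates consecutive pairs of dimensions by position-dependent angles and checks the property that makes RoPE useful: the query-key score depends only on the relative offset between positions. The interleaved pairing, the base of 10000, and the sample vectors are common defaults assumed for illustration, not confirmed details of Shakti-500M.

```python
import math


def rope(vector, position, base=10000.0):
    """Rotate consecutive dimension pairs by position-dependent angles.
    Assumes an even number of dimensions."""
    dim = len(vector)
    rotated = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)     # lower frequency for later pairs
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        rotated.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]     # hypothetical query vector
k = [1.1, 0.4, -0.6, 0.9]     # hypothetical key vector

# Same relative offset (3) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(abs(score_near - score_far) < 1e-9)
```

Since the rotation encodes position in the angle rather than in an added vector, no fixed table of position embeddings is needed.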

Dataset Details

The Shakti-500M model undergoes pre-training on diverse corpora to develop general language understanding and knowledge across various domains. The supervised fine-tuning (SFT) phase adapts the model for instruction-based applications, enhancing problem-solving, conversational AI, and coding capabilities. RLHF further refines responses through human feedback, ensuring contextual relevance and accuracy.

Pre-Training Corpus

Approximately 2T tokens drawn from diverse and high-quality sources:

Common Crawl
TXT360 (curated instruction corpora)

Supervised Fine-Tuning (SFT) Datasets

Focused on instruction-following and task-specific capabilities:

The Tome
Infinity-Instruct

RLHF Dataset

UltraFeedback with binary preference annotations for model output ranking to improve response quality and alignment with human preferences.

Training Details

Shakti-500M was trained in multiple stages, progressing from general language understanding to specialized instruction following and human preference alignment.

Phase 1: Pre-Training

Conducted on approximately 2T tokens using a next-token prediction objective, with mixed-precision training (FP16 + bfloat16). Block-sparse attention and RoPE were integrated during this phase to establish foundational language understanding across diverse domains.
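The next-token prediction objective can be sketched as a mean cross-entropy in which the logits at position t are scored against the token at position t+1. This is a minimal pure-Python illustration with a hypothetical three-token vocabulary, not the training code used for the model.

```python
import math


def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy of predicting token t+1 from the logits at position t."""
    total, count = 0.0, 0
    # The last position has no next token, so zip() drops its logits.
    for logits, target in zip(logits_per_position, token_ids[1:]):
        peak = max(logits)                    # subtract the max for numerical stability
        log_normalizer = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_normalizer - logits[target]
        count += 1
    return total / count


tokens = [0, 2, 1]                                 # hypothetical token ids
uniform = [[0.0, 0.0, 0.0] for _ in tokens]        # model with no preference
confident = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]

print(next_token_loss(uniform, tokens))            # equals ln(3) for a 3-token vocabulary
print(next_token_loss(confident, tokens) < next_token_loss(uniform, tokens))
```

A model that assigns high logits to the correct next tokens gets a lower loss than one that guesses uniformly, and that difference is the signal pre-training minimizes.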

Phase 2: Supervised Fine-Tuning (SFT)

Focused on task-specific instruction sets. Tasks span summarization, question answering, and conversational AI to improve the model's ability to follow instructions and generate contextually appropriate responses.

Phase 3: Direct Preference Optimization (DPO)

Preference optimization on UltraFeedback preference pairs (a chosen and a rejected response per prompt) to improve response helpfulness and relevance, so that outputs align with human expectations and preferences.
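The phase heading names Direct Preference Optimization. As a minimal sketch of the standard DPO objective for a single preference pair (not necessarily the exact training code used here), the loss compares how much the policy has raised the log-probability of the chosen response relative to a frozen reference model against the same quantity for the rejected response. The function name, the beta value of 0.1, and the log-probabilities are hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) equals softplus(-z), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: the loss is ln(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy that raised the chosen response and lowered the rejected one.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(round(baseline, 4))     # 0.6931
print(improved < baseline)    # True
```

The loss falls as the policy favors the chosen response more than the reference does, which is how binary preference annotations can drive training without sampling from the model during optimization.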

Benchmark Results and Comparison

Shakti-500M performs strongly across a variety of tasks, holding its own against both similar-sized and larger models. Among the five models compared in the table below, it posts the highest reported score on 8 of the 13 benchmarks (BigBenchHard, HellaSwag, ANLI, OpenbookQA, WinoGrande, SQuAD, GSM8K, and MATH). It was trained using a well-curated and balanced dataset, which helps it make the most of its compact and optimized design.

Performance Advantage

While other models have an edge in certain areas (Llama 3.2 1B on IFEval and PIQA, Qwen2.5-0.5B on MMLU, MedMCQA, and TriviaQA), Shakti-500M offers a strong balance between performance and efficiency, making it well suited to low-resource environments like mobile devices, IoT systems, and edge computing.

Benchmark (%)	Shakti-500M	Boomer-1B	Boomer-634M	Qwen2.5-0.5B	Llama 3.2 1B
MMLU	38.90	25.92	25.23	47.5	32.2
BigBenchHard	33.1	28.65	21.11	20.3	30.93
IFEval	36.62	23.81	22.22	27.9	59.5
HellaSwag	68.56	31.66	34.08	52.1	41.2
ANLI	40.70	32.57	27.5	26.85	22.56
PIQA	74.59	60.78	62.57	72.50	80.64
MedMCQA	32.61	17.56	37.50	42.5	37.57
OpenbookQA	39.80	22.56	35.76	30.73	37
WinoGrande	60.67	45.79	51.07	56.3	60
SQuAD	71.40	67	57.5	52.94	49.2
TriviaQA	31.11	1.5	0.91	41.6	4.44
GSM8K	9.2	1.67	1	1.69	-
MATH	31.97	-	23.38	19.5	-
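To read the table at a glance, the following small script tallies which model has the highest reported score on each benchmark. The scores are copied from the table above, with None standing in for entries reported as "-". The variable names are illustrative only.

```python
models = ["Shakti-500M", "Boomer-1B", "Boomer-634M", "Qwen2.5-0.5B", "Llama 3.2 1B"]

# Scores copied from the table above; None marks an unreported entry ("-").
scores = {
    "MMLU":         [38.90, 25.92, 25.23, 47.5, 32.2],
    "BigBenchHard": [33.1, 28.65, 21.11, 20.3, 30.93],
    "IFEval":       [36.62, 23.81, 22.22, 27.9, 59.5],
    "HellaSwag":    [68.56, 31.66, 34.08, 52.1, 41.2],
    "ANLI":         [40.70, 32.57, 27.5, 26.85, 22.56],
    "PIQA":         [74.59, 60.78, 62.57, 72.50, 80.64],
    "MedMCQA":      [32.61, 17.56, 37.50, 42.5, 37.57],
    "OpenbookQA":   [39.80, 22.56, 35.76, 30.73, 37],
    "WinoGrande":   [60.67, 45.79, 51.07, 56.3, 60],
    "SQuAD":        [71.40, 67, 57.5, 52.94, 49.2],
    "TriviaQA":     [31.11, 1.5, 0.91, 41.6, 4.44],
    "GSM8K":        [9.2, 1.67, 1, 1.69, None],
    "MATH":         [31.97, None, 23.38, 19.5, None],
}

leaders = {}
for benchmark, row in scores.items():
    # Treat unreported entries as lower than any reported score.
    best = max(range(len(models)),
               key=lambda i: row[i] if row[i] is not None else float("-inf"))
    leaders[benchmark] = models[best]

shakti_leads = [b for b, m in leaders.items() if m == "Shakti-500M"]
print(f"{len(shakti_leads)} of {len(scores)}: {', '.join(shakti_leads)}")
```

Unreported entries are skipped rather than counted as zero, so the tally only reflects comparisons where a score was actually published.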

Key Performance Highlights

✓ Strong reasoning: Highest scores among the compared models on HellaSwag (68.56%) and BigBenchHard (33.1%)

✓ Knowledge tasks: Highest score on SQuAD (71.40%) and second-highest on PIQA (74.59%)

✓ Instruction following: 36.62% on IFEval, second only to Llama 3.2 1B (59.5%)

✓ Medical domain: 32.61% on MedMCQA, ahead of Boomer-1B (17.56%) but behind the other three models

✓ Mathematical reasoning: Highest reported scores on MATH (31.97%) and GSM8K (9.2%)

✓ Balanced performance: Consistent results across diverse benchmark categories

Conclusion

Shakti-500M stands out as a well-rounded, efficient small language model optimized for real-world use. With strong multilingual support, instruction following, and long-context handling, it excels in tasks like conversations, summarization, question answering, and code generation. Its architecture combines advanced techniques like block sparse attention, RoPE, and quantization-aware training, enabling smooth performance across cloud and edge devices, including mobile and low-power environments. Backed by diverse pretraining data and refined through SFT and RLHF, Shakti-500M delivers reliable and context-aware outputs. While not the largest model, it competes well against both peer and larger models in benchmark evaluations, offering a solid balance of accuracy, efficiency, and deployment flexibility for multilingual and domain-specific applications.