Chinese AI startup DeepSeek has introduced a new generative AI model claiming performance that rivals or surpasses leading US models like GPT-4, but at a fraction of the cost. This challenges the common belief that top-tier AI requires massive computing resources and investment.
Low Cost, High Performance
DeepSeek claims it trained its V3 model for just US$6 million using 2,000 Nvidia H800 GPUs, compared with the estimated US$80–100 million spent on GPT-4 and the 16,000 H100 GPUs used for Meta’s LLaMA 3. Despite the lower cost, benchmarks suggest V3 performs on par with, or better than, GPT-4 on reasoning tasks.
However, the cited US$6 million likely covers only compute costs, not the full expenditure. The secret to V3’s efficiency lies in model design and training data.
Training Data & Reinforcement Learning
DeepSeek has two versions: V3 and the more advanced R1. R1 competes with OpenAI’s o1 and reportedly outperforms it on reasoning tasks. Unlike traditional supervised fine-tuning, DeepSeek relies heavily on reinforcement learning (RL), reducing dependence on labelled data while enhancing reasoning ability.
During inference, R1 deliberates for 1–2 minutes, explaining its reasoning and providing users with clear, logical outputs. These deliberation traces were recorded and used to fine-tune V3, significantly boosting its capabilities. Notably, there are allegations that DeepSeek used model distillation, training R1 by querying OpenAI’s o1 at scale and learning from its responses.
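Model distillation of this kind typically trains a "student" model to match a "teacher" model's output distribution rather than hard labels. As a minimal, self-contained sketch (not DeepSeek's actual pipeline), the core objective is a KL divergence between temperature-softened teacher and student outputs:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature > 1 'softens' the
    distribution so the teacher's relative preferences stay visible."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions --
    the quantity a student minimises during distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# The loss is zero when the student reproduces the teacher's distribution,
# and positive otherwise -- gradient descent on it pulls the student
# towards the teacher's behaviour.
teacher = [2.0, 0.5, -1.0]
```

In a real pipeline the teacher's responses (or logits) would be collected at scale and the student trained on them with an optimiser; this toy shows only the loss being minimised.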
Key Innovations Driving Efficiency
DeepSeek’s cost savings and performance stem from several technical breakthroughs:
- Automated RL Fine-Tuning: Enhances reasoning while focusing compute resources on the most valuable training examples
- Advanced Mixture of Experts (MoE): Unlike GPT-4’s reported 8 experts, DeepSeek uses 256 highly specialised experts, activating only 8 per task for greater efficiency
- Lower Precision Computation: Utilises FP8 (8-bit floating point) instead of the standard FP16, reducing memory use and compute cost per operation
- Sparsity: Activates just 30–40% of the model per task, lowering memory and compute costs by a reported 2.5× or more
- Hardware-level Optimisation: Custom pipeline parallelism reduces communication overhead, improving GPU efficiency and cutting training time
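The MoE routing idea can be sketched in a few lines: a gating network scores all 256 experts, but only the top 8 run, so per-token compute scales with the number of active experts rather than the full model. The function below is an illustrative top-k gate, not DeepSeek's implementation:

```python
import math
import random

def top_k_routing(gate_logits, k=8):
    """Select the k highest-scoring experts and renormalise their gate
    weights with a softmax over just those k. Only the selected experts
    execute, so compute grows with k, not with the total expert count."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    m = max(gate_logits[i] for i in chosen)          # numerical stability
    exps = {i: math.exp(gate_logits[i] - m) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# One token routed across 256 experts with 8 active -- the ratio the
# article cites. The returned weights mix the 8 experts' outputs.
random.seed(0)
gate_logits = [random.gauss(0.0, 1.0) for _ in range(256)]
weights = top_k_routing(gate_logits, k=8)
```

A production router also balances load across experts so no single expert is overwhelmed; that machinery is omitted here.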
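Similarly, the FP8 trade-off can be illustrated by simulating its reduced mantissa. E4M3, a common FP8 format, stores only 3 mantissa bits; the sketch below rounds a Python float to that precision to show the accuracy cost (real FP8 training also narrows the exponent range and relies on hardware support and per-tensor scaling, which this toy omits):

```python
import math

def simulate_fp8(x, mantissa_bits=3):
    """Crudely simulate FP8 (E4M3) precision by rounding the mantissa of
    a Python float to `mantissa_bits` stored bits plus the implicit
    leading bit. Illustration only, not a real FP8 encoder."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)               # x == m * 2**e with 0.5 <= |m| < 1
    steps = 2 ** (mantissa_bits + 1)   # representable mantissa levels
    return math.ldexp(round(m * steps) / steps, e)

# Values snap to the nearest representable level; worst-case relative
# error is about 1/16 (6.25%) -- the price paid for halving memory and
# bandwidth relative to FP16.
approx = simulate_fp8(0.123456)
```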
Overall, DeepSeek showcases how smart engineering and training approaches can deliver top AI performance with significantly reduced resources.