
Multi-Head Latent Attention Is The Powerful Engine Behind DeepSeek

A deep dive into DeepSeek’s innovative attention mechanism that makes its LLMs so good

Dr. Ashish Bamania
Level Up Coding

Image generated with DALL-E 3

DeepSeek is all over the news.

Its models have achieved state-of-the-art performance across multiple benchmarks (educational, factuality, math and reasoning, coding) and are competing head-to-head with OpenAI’s o1.

Since its release, U.S. technology stocks in the S&P 500 have lost nearly $1 trillion in value.

However, many popular beliefs about it are not correct.

A $6 million training cost for DeepSeek-R1 is being peddled all across the internet, but this is far from the truth (the official figures remain undisclosed).

That figure actually refers to the official training run of DeepSeek-V3 (the predecessor of R1), and even then it excludes the costs of the prior research and ablation experiments on architectures, algorithms, and data that went into V3.

Even though the widely quoted cost figure is misleading, DeepSeek-V3 was still trained with significantly fewer resources than the models of other prominent players in the LLM market.
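Before going further, a rough sketch of the idea in the title may help: Multi-Head Latent Attention compresses keys and values into a small shared latent vector per token and reconstructs them per head, so only that latent needs to be cached during generation. The PyTorch module below is a minimal, simplified sketch of that compression idea, not DeepSeek’s actual implementation; the class name, layer names, and dimensions are illustrative assumptions, and details such as decoupled rotary embeddings and query compression are omitted.

```python
# Minimal, illustrative sketch of the core idea behind Multi-Head Latent
# Attention (MLA): keys and values are reconstructed from a small shared
# per-token latent, so only that latent needs to be cached at inference time.
# All names and sizes here are hypothetical simplifications.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMLA(nn.Module):
    def __init__(self, d_model=1024, n_heads=8, d_head=128, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.w_q = nn.Linear(d_model, n_heads * d_head, bias=False)
        # Shared down-projection: one small latent per token instead of
        # full per-head keys and values.
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)
        # Per-head up-projections rebuild keys and values from the latent.
        self.w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_out = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        latent = self.w_down_kv(x)  # (b, t, d_latent): the only tensor that would be cached
        k = self.w_up_k(latent).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_up_v(latent).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, self.n_heads * self.d_head)
        return self.w_out(out)

x = torch.randn(2, 16, 1024)
print(SimplifiedMLA()(x).shape)  # torch.Size([2, 16, 1024])
```

With these illustrative sizes, each token’s cache shrinks from 2 × 8 × 128 = 2,048 key/value entries to a single 128-dimensional latent, which is where the memory savings during generation come from.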

