Notes of a CPU Designer @cpu

SemiAnalysis has published a great new article about DeepSeek:
DeepSeek Debates: Chinese Leadership On Cost, True Training Cost, Closed Model Margin Impacts
Read on SemiAnalysis

The article examines DeepSeek's rapid rise and its impact on the AI market.

One of the most debated questions is whether training the DeepSeek-V3 model really cost only $6M.
The authors argue that the real costs are much higher:

We believe the pre-training number is nowhere near the actual amount spent on the model. We are confident their hardware spend is well over $500M over the company’s history. To develop new architecture innovations, during the model development, there is a considerable spend on testing new ideas, new architecture ideas, and ablations.

The article also covers DeepSeek's technical achievements, such as Multi-Token Prediction (MTP), Multi-head Latent Attention (MLA), and Mixture-of-Experts (MoE). MTP optimizes the training process, while MLA and MoE lower inference costs and improve model performance by cutting unnecessary computation.

Particular attention is paid to the GPU situation, to DeepSeek's and High-Flyer's investments in Nvidia H100/H800 accelerators, and to the effect of US export controls on hardware shipments to China.

All the details are in the article, and the most interesting part, as usual, is hidden behind the paywall🐱

P.S. An important remark was added in the comments:
"But wait, the DeepSeek paper itself says exactly this: they simply multiplied the number of GPU hours by 2 bucks:"

Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.
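The arithmetic in the quoted passage is easy to check. The sketch below is not from the paper or the article; it only re-adds the GPU-hour figures quoted above and multiplies them by the paper's assumed $2 per GPU hour rental price. All constant and function names are made up for illustration.

```python
# Figures copied from the DeepSeek-V3 paper excerpt quoted above.
# Names are hypothetical; only the numbers come from the quote.
PRETRAIN_GPU_HOURS = 2_664_000        # pre-training stage
CONTEXT_EXT_GPU_HOURS = 119_000       # context length extension
POST_TRAIN_GPU_HOURS = 5_000          # post-training
RENTAL_USD_PER_GPU_HOUR = 2.0         # the paper's assumed H800 rental price
CLUSTER_GPUS = 2048                   # H800 GPUs in the cluster
GPU_HOURS_PER_TRILLION_TOKENS = 180_000


def total_gpu_hours() -> int:
    """Sum of the three training stages, in H800 GPU hours."""
    return PRETRAIN_GPU_HOURS + CONTEXT_EXT_GPU_HOURS + POST_TRAIN_GPU_HOURS


def training_cost_usd(price_per_gpu_hour: float = RENTAL_USD_PER_GPU_HOUR) -> float:
    """Rental-equivalent cost of the official training run only."""
    return total_gpu_hours() * price_per_gpu_hour


def days_per_trillion_tokens() -> float:
    """Wall-clock days to process one trillion tokens on the full cluster."""
    return GPU_HOURS_PER_TRILLION_TOKENS / CLUSTER_GPUS / 24


if __name__ == "__main__":
    print(f"total GPU hours: {total_gpu_hours():,}")
    print(f"cost at $2/GPU-hour: ${training_cost_usd():,.0f}")
    print(f"days per trillion tokens: {days_per_trillion_tokens():.1f}")
```

The totals match the quote (2.788M GPU hours, $5.576M, about 3.7 days per trillion tokens). That is exactly the commenter's point: the headline figure is a rental-price estimate for the final run, and by the paper's own wording it excludes prior research and ablation experiments.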


www.tgoop.com/cpu_design/289

3.27K views · Николай, edited Feb 3 at 14:25


Create: 2025-02-03
Last Update: 2025-10-25 09:28:32
