How to store DeepSeek's hundreds of billions of parameters? Is GPU high-bandwidth memory capacity the bottleneck?

As artificial intelligence technology develops rapidly, the training and inference of large models have become the core driving force behind the deployment of AI applications. However, with the exponential growth of model parameter counts (for example, DeepSeek R1's 671 billion parameters), storage and compute bottlenecks are becoming increasingly prominent. Finding a balance between high performance and low cost has become a key challenge for major technology companies.

1、 DeepSeek's Challenge and Innovation: Breaking Through the Storage Bottleneck

DeepSeek R1, a large model with 671 billion parameters, has attracted wide market attention for its strong reasoning ability and low cost. However, as model complexity increases, the context data that must be cached during inference (such as the KV Cache) grows rapidly, and GPU high-bandwidth memory capacity becomes a bottleneck. The traditional remedy is to expand memory by adding DRAM, but this significantly raises inference cost.

By offloading cached data from GPU memory and DRAM to storage arrays with its innovative caching technology, DeepSeek has cut the cost of serving large models by an order of magnitude. This "trade storage for compute" strategy not only relieves memory pressure but also markedly reduces service latency and end-user cost. It does, however, place higher demands on the storage layer: the storage system must deliver extremely high bandwidth, low latency, and massive scalability to support fast reads and writes of a large-scale KV Cache.

Image source: network

2、 Huacun Zhigu TGStor A1800: High-Performance Storage Built for AI

The Huacun Zhigu TGStor A1800 is a next-generation high-performance AI storage system designed for exactly this demand. Its core capabilities align closely with DeepSeek's technical requirements, providing strong storage support for the training and inference of large AI models.

Ultimate performance: meeting the fast read/write demands of the KV Cache

The TGStor A1800 delivers up to 12 million IOPS and 400 GB/s of bandwidth per enclosure through NDS direct access to the NPU and the DataTurbo high-performance file acceleration engine. This level of performance readily satisfies the read/write demands of a large-scale KV Cache during DeepSeek inference and significantly reduces service latency. Using a CNC separation architecture and a DPU to offload CPU work, the TGStor A1800 reads and writes data directly to disk, lightening the load on the CPU and memory and further improving data-loading efficiency.

Massive scalability: supporting the smooth evolution of large models

The TGStor A1800 supports scale-out expansion to 512 controllers in a single cluster, with EB-scale storage capacity, meeting the growth needs of trillion-parameter-scale models. This scalability pairs naturally with DeepSeek's disk caching technology, providing strong storage support for long-sequence processing and multi-turn dialogue scenarios. Through flexible expansion with compute cards (such as DPUs and GPUs), the TGStor A1800 can also accelerate data processing and offload functions such as encryption, compression, and vector retrieval, further improving inference efficiency.
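To make the "trade storage for compute" idea concrete, below is a minimal Python sketch of a two-tier KV cache that keeps hot entries in (simulated) GPU memory and spills cold entries to a storage tier. The class and method names here are illustrative assumptions, not DeepSeek's or the TGStor A1800's actual interfaces; a real system would evict asynchronously and in large batches over a high-bandwidth fabric.

```python
import numpy as np
from pathlib import Path

class TieredKVCache:
    """Toy two-tier KV cache: hot blocks stay in memory (standing in
    for HBM/DRAM), cold blocks are offloaded to disk (standing in for
    the storage array). Names are hypothetical, not a vendor API."""

    def __init__(self, cache_dir: str, hot_capacity: int = 4):
        self.hot: dict[str, np.ndarray] = {}
        self.hot_capacity = hot_capacity
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, kv_block: np.ndarray) -> None:
        if len(self.hot) >= self.hot_capacity:
            self._evict_one()
        self.hot[key] = kv_block

    def get(self, key: str) -> np.ndarray:
        if key in self.hot:                      # fast path: in memory
            return self.hot[key]
        path = self.cache_dir / f"{key}.npy"
        if path.exists():                        # slow path: storage tier
            block = np.load(path)
            self.put(key, block)                 # promote back to hot tier
            return block
        raise KeyError(key)

    def _evict_one(self) -> None:
        # Evict the oldest hot entry to disk (FIFO for brevity; a real
        # system would use LRU and asynchronous, batched writes).
        key, block = next(iter(self.hot.items()))
        np.save(self.cache_dir / f"{key}.npy", block)
        del self.hot[key]

cache = TieredKVCache("/tmp/kv_cache", hot_capacity=2)
for turn in range(3):
    cache.put(f"turn-{turn}", np.zeros((2, 128, 64), dtype=np.float16))
print(cache.get("turn-0").shape)  # turn-0 was evicted, reloaded from disk
```

The economics follow from the tiering: DRAM capacity is expensive per gigabyte, while a storage array is cheap and effectively unbounded, so the slow path is acceptable whenever reloading a cached block is still far cheaper than recomputing it.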
Intelligent reasoning: optimizing long-term memory and vector retrieval

The TGStor A1800 has a built-in high-performance vector retrieval engine that supports vector data models, accelerating retrieval over large vector knowledge bases and reducing inference "hallucinations". This capability fits DeepSeek's multi-turn dialogue scenarios well: by caching historical dialogue content, it can significantly reduce inference latency. With support for RAG vector databases and KV Cache technology, the TGStor A1800 gives large AI models long-term memory, improving inference accuracy and efficiency.
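At its core, vector retrieval means finding the stored embeddings closest to a query embedding. The sketch below is a toy brute-force cosine-similarity top-k search that assumes nothing about the A1800's actual engine; a production engine would use an approximate index (such as HNSW) over billions of vectors.

```python
import numpy as np

def cosine_top_k(query: np.ndarray, corpus: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k corpus vectors most similar to the query."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                       # cosine similarity per corpus row
    return np.argsort(scores)[::-1][:k]  # highest-scoring indices first

# Toy knowledge base: each row is the embedding of one cached passage or
# past dialogue turn (dimension 8 here; real embeddings are much larger).
rng = np.random.default_rng(0)
knowledge_base = rng.normal(size=(100, 8)).astype(np.float32)
query = knowledge_base[42] + 0.05 * rng.normal(size=8)

hits = cosine_top_k(query, knowledge_base, k=3)
print(hits)  # index 42 ranks first; the retrieved text grounds the answer
```

Grounding generation in the retrieved passages, rather than in the model's parameters alone, is what reduces hallucinations in the RAG setup described above.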
3、 Collaborative Innovation Between DeepSeek and the TGStor A1800

The combination of DeepSeek and the TGStor A1800 not only resolves the storage bottleneck in large-model inference, but also redefines the cost-effectiveness of AI services through the co-design of storage and compute.

Trading storage for compute: reducing inference cost

DeepSeek's disk caching technology offloads the KV Cache to storage, while the TGStor A1800's high-performance storage ensures fast reads and writes of the cached data. This collaboration has reduced large-model inference cost by an order of magnitude, laying the groundwork for broadly affordable AI services.

Long-sequence processing: improving inference efficiency

In multi-turn dialogue and long-sequence scenarios, the TGStor A1800's multi-level KV caching combined with DeepSeek's disk caching can substantially reduce duplicate computation in the prefill stage, cutting inference latency by 78% and raising per-xPU-card throughput by 67% (a sketch of this prefix-reuse idea appears at the end of this article).

Intelligent reasoning: optimizing the user experience

Through the TGStor A1800's vector retrieval engine and RAG vector database support, DeepSeek can produce more accurate inference results, reduce "hallucinations", and offer users a smarter, more personalized service experience.

Image source: network

4、 Outlook: Deep Integration of Storage and Compute

As the parameter scale of large models grows further, the deep integration of storage and compute will become inevitable. Through the combined innovation of high-performance storage, intelligent caching, and vector retrieval, large AI models will achieve low-cost, efficient inference across a wider range of scenarios, accelerating the arrival of an era of broadly accessible AI.
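The prefill savings described in section 3 come from not recomputing KV entries for a prompt prefix (for example, a shared system prompt plus earlier turns) that previous requests already processed. The Python sketch below illustrates the idea under stated simplifying assumptions: fake_prefill() stands in for a real transformer forward pass, and the in-memory prefix_store stands in for the storage tier; none of these names come from DeepSeek or TGStor.

```python
import hashlib
import numpy as np

prefix_store: dict[str, np.ndarray] = {}  # would live on the storage tier

def prefix_key(tokens: list[int]) -> str:
    # Content-address the prefix so identical prefixes share one entry.
    return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()

def fake_prefill(tokens: list[int]) -> np.ndarray:
    # Placeholder for the expensive forward pass: one 64-dim "KV row"
    # per token.
    return np.ones((len(tokens), 64), dtype=np.float16)

def prefill_with_reuse(prefix: list[int], suffix: list[int]) -> np.ndarray:
    key = prefix_key(prefix)
    if key in prefix_store:
        prefix_kv = prefix_store[key]      # cache hit: skip recomputation
    else:
        prefix_kv = fake_prefill(prefix)   # cache miss: compute once, store
        prefix_store[key] = prefix_kv
    suffix_kv = fake_prefill(suffix)       # only the new tokens cost compute
    return np.concatenate([prefix_kv, suffix_kv], axis=0)

system_prompt = list(range(1000))          # shared across every request
kv1 = prefill_with_reuse(system_prompt, [7, 8, 9])   # computes the prefix
kv2 = prefill_with_reuse(system_prompt, [10, 11])    # reuses the prefix KV
print(kv1.shape, kv2.shape)  # (1003, 64) (1002, 64)
```

The longer the shared prefix relative to each new suffix, the larger the fraction of prefill compute that a hit avoids, which is the mechanism behind the latency and throughput gains the article cites.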