Building a Local Deepseek-R1 Powerhouse: A $6,000 Guide

Deepseek-R1 · LLM · Local AI · CPU Computing

Published on 1/29/2025

Hardware Setup

This build focuses on maximizing memory bandwidth, which is the main constraint when running large language models like Deepseek-R1 on CPUs. We'll be using two AMD EPYC CPUs and a massive amount of DDR5 RAM.

<h3>Motherboard</h3>
<p>We need a dual-socket motherboard to accommodate two EPYC CPUs and gain access to 24 RAM channels (12 per CPU). The <strong>Gigabyte MZ73-LM0 or MZ73-LM1</strong> is ideal. <a href="https://www.gigabyte.com/Enterprise/Server-Motherboard/MZ73-LM0-rev-3x">Gigabyte MZ73-LM0</a></p>

<h3>CPU</h3>
<p>Choose any <strong>AMD EPYC 9004 or 9005 CPU</strong>. Memory bandwidth is the bottleneck, so you don't need a top-of-the-line part; the 9115 or even the 9015 is suitable. <a href="https://www.newegg.com/p/N82E16819113865">Newegg EPYC CPUs</a></p>

<h3>RAM</h3>
<p>This is critical. The Q8 weights alone are roughly 700GB, so we need <strong>768GB of DDR5-RDIMM</strong> spread across all 24 channels: 24 x 32GB modules. <a href="https://www.newegg.com/nemix-ram-384gb/p/1X5-003Z-01FM7">Example RAM Kit</a></p>

<h3>Case</h3>
<p>A standard tower case might not be suitable. The <strong>Enthoo Pro 2 Server</strong> is a good choice because it has screw mounts for a full server motherboard. <a href="https://www.newegg.com/black-phanteks-enthoo-pro-2-server-edition-full-tower/p/N82E16811854127">Enthoo Pro 2 Server</a></p>

<h3>PSU</h3>
<p>The system's power usage is relatively low (under 400W), but you'll need a PSU with enough CPU power cables. The <strong>Corsair HX1000i</strong> works well. <a href="https://www.corsair.com/us/en/p/psu/cp-9020259-na/hx1000i-fully-modular-ultra-low-noise-platinum-atx-1000-watt-pc-power-supply-cp-9020259-na">Corsair HX1000i</a></p>

<h3>Heatsink</h3>
<p>AMD EPYC uses socket SP5, and most SP5 heatsinks are designed for 2U/4U server chassis. You may need to source a tower-compatible heatsink from eBay or AliExpress. <a href="https://www.ebay.com/itm/226499280220">Example Heatsink</a></p>

<h3>SSD</h3>
<p>Any 1TB or larger SSD will work. An NVMe drive is recommended for faster loading. Find one yourself!</p>

<p><strong>Important BIOS Tip:</strong> Set the number of NUMA groups to 0 in the BIOS to ensure the model is interleaved across all RAM chips, doubling throughput.</p>
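
<p>As a sanity check after changing this setting, you can confirm how Linux sees the memory layout with <code>numactl</code> (a rough sketch; the install command and exact output will vary by distro):</p>
<pre><code># Install numactl if needed (Debian/Ubuntu)
sudo apt install numactl

# With NUMA groups set to 0, the full 768GB should typically show up under a single node
numactl --hardware</code></pre>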

Software Setup

Now, let's get the software running.

<h3>llama.cpp</h3>
<p>Install <code>llama.cpp</code> following the instructions on their GitHub repository: <a href="https://github.com/ggerganov/llama.cpp">llama.cpp GitHub</a></p>
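
<p>If you're building from source, the standard CMake flow from the repository looks roughly like this (a sketch; check the README for current instructions and any CPU-specific build flags):</p>
<pre><code>git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# Binaries such as llama-cli and llama-server end up in build/bin</code></pre>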

<h3>Deepseek-R1 Model</h3>
<p>Download all files in the <code>Q8_0</code> folder from the following Hugging Face link: <a href="https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main">DeepSeek-R1 GGUF</a>. This will be around 700GB of data.</p>
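
<p>One convenient way to grab just those files is the Hugging Face CLI. The folder name below (<code>DeepSeek-R1-Q8_0</code>) is an assumption based on the repo layout at the time of writing, so verify it against the link above:</p>
<pre><code>pip install -U "huggingface_hub[cli]"

# Download only the Q8_0 shards (~700GB); the folder name may differ - check the repo
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
  --include "DeepSeek-R1-Q8_0/*" \
  --local-dir ./DeepSeek-R1-GGUF</code></pre>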

<h3>Running the Model</h3>
<p>For a quick demo, use this command:</p>
<pre><code>llama-cli -m ./DeepSeek-R1.Q8_0-00001-of-00015.gguf --temp 0.6 -no-cnv -c 16384 -p "&lt;|User|&gt;How many Rs are there in strawberry?&lt;|Assistant|&gt;"</code></pre>

<p>Once tested, use <code>llama-server</code> to host the model and pass requests from other software.</p>
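
<p>A minimal <code>llama-server</code> launch plus an OpenAI-style test request might look like this (host, port, and context size are illustrative choices, not requirements):</p>
<pre><code># Point at the first shard; the remaining shards are picked up automatically
llama-server -m ./DeepSeek-R1.Q8_0-00001-of-00015.gguf -c 16384 --host 0.0.0.0 --port 8080

# From another terminal
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "How many Rs are there in strawberry?"}], "temperature": 0.6}'</code></pre>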

Performance and Notes

This build achieves a generation speed of 6-8 tokens per second. Note that there's no GPU in this build. While GPUs offer faster generation, they either require aggressive quantization that can reduce quality, or a very expensive multi-GPU setup with over 700GB of combined VRAM to hold the Q8 weights. This CPU-focused build prioritizes quality by combining Q8 quantization with a large RAM capacity.
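
<p>As a rough sanity check on that number: generation speed here is bounded by memory bandwidth. A back-of-the-envelope estimate, assuming DDR5-4800 and DeepSeek-R1's ~37B active parameters per token at roughly one byte each under Q8:</p>
<pre><code># Theoretical bandwidth: 24 channels x 4800 MT/s x 8 bytes/transfer ≈ 921 GB/s
# Data read per generated token at Q8: ~37 GB of active weights
echo $((921 / 37))   # ≈ 24 tokens/s theoretical ceiling; 6-8 tokens/s in practice after overheads</code></pre>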
