Running DeepSeek-R1-Distill-Qwen-32B vs. DeepSeek-R1-Distill-Qwen-14B Locally

techman · February 14, 2025, 7:01am

Report on Running Deepseek AI Models: deepseek-r1:32b vs. deepseek-r1:14b

1. Introduction

This report evaluates the performance and capabilities of two AI models, deepseek-r1:32b (20GB) and deepseek-r1:14b (9GB), when run on a specific hardware setup. The primary focus is on their speed, knowledge retention, and resource utilization.

2. Machine Setup

CPU: AMD Ryzen 5 3600 (6 cores, 12 threads)
RAM: 64GB DDR4
GPU: Nvidia RTX 3060 with 12GB VRAM

3. AI Models Overview

deepseek-r1:32b (20GB): A larger model with broader knowledge.
deepseek-r1:14b (9GB): A smaller model with a smaller memory
footprint and faster responses.

4. Performance Analysis (Speed)

The deepseek-r1:32b model ran far slower than the deepseek-r1:14b
model. In chat mode it produced roughly one word per second, while
the 14B model responded much faster. Ollama’s terminal output
showed that prompt evaluation was brief for both models (about
52ms and 2.009s respectively), but the 32B model’s evaluation
duration was far longer: 13 minutes versus 12.6 seconds.
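A raw duration comparison only holds if both answers were of similar length, so a per-token rate is a fairer measure. The sketch below is a minimal illustration, assuming the timings come from the eval_count and eval_duration fields (the latter in nanoseconds) that Ollama reports for a generation. The token count and duration in the first example are made up, since the report does not state them.

```python
NS_PER_SECOND = 1_000_000_000


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation rate from a token count and a duration in nanoseconds."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count * NS_PER_SECOND / eval_duration_ns


def slowdown(slow_seconds: float, fast_seconds: float) -> float:
    """How many times longer the slow run took than the fast run."""
    return slow_seconds / fast_seconds


# Hypothetical values, for illustration only: 100 tokens in 2 seconds.
print(tokens_per_second(100, 2 * NS_PER_SECOND))  # 50.0

# Evaluation durations quoted in the report: 13 minutes vs 12.6 seconds.
print(f"{slowdown(13 * 60, 12.6):.0f}x slower")  # 62x slower
```

On the figures in the report, the 32B run took about 62 times as long as the 14B run.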

5. Knowledge Comparison

The deepseek-r1:32b model demonstrated a substantial increase in
knowledge retention compared to the 14B model. When queried about
the “Lemon” program by D. Richard Hipp, the 32B model provided a
detailed and accurate response, whereas the 14B model admitted its
lack of information.

6. Additional Observations (Power Usage and Resource Utilization)

The 32B model drew less GPU power (45W) than the 14B model (175W).
However, it loaded all 12 CPU threads to about 80%, indicating
higher CPU usage despite lower GPU consumption.
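To check whether these power figures are repeatable, GPU power draw can be logged while each model is generating. The sketch below is one possible approach, assuming nvidia-smi is on the PATH and queried with --query-gpu=power.draw,utilization.gpu,temperature.gpu in CSV form. The function names are hypothetical, and the sample line is made up to show the parsing.

```python
import subprocess

# nvidia-smi query: power draw (W), GPU utilisation (%), temperature (C).
QUERY = [
    "nvidia-smi",
    "--query-gpu=power.draw,utilization.gpu,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_sample(line: str) -> dict:
    """Parse one CSV line such as '45.12, 23, 44' into named floats."""
    power, util, temp = (field.strip() for field in line.split(","))
    return {
        "power_w": float(power),
        "gpu_util_pct": float(util),
        "temp_c": float(temp),
    }


def read_gpu() -> dict:
    """Run nvidia-smi once and return the reading for the first GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_sample(out.splitlines()[0])


# Made-up sample line, only to show the parsed shape.
print(parse_sample("45.12, 23, 44"))
```

Calling read_gpu() once a second during a 14B run and again during a 32B run would show whether the gap in power draw holds for the whole generation.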

7. Conclusion and Recommendations

The deepseek-r1:32b model offers superior knowledge retention but
is significantly slower on the given hardware. For users
prioritizing speed without compromising too much on knowledge
depth, the deepseek-r1:14b model may be more suitable. However,
those requiring advanced knowledge might consider upgrading their
hardware to handle larger models more efficiently.

This analysis provides insights into model performance and
resource utilization, aiding in informed decisions for AI
deployment based on hardware capabilities and application
requirements.

Author: Terry, Editor Deepseek-r1:14b

Belfry · February 14, 2025, 8:02am

Thanks for the comparison, Terry. Interesting that the larger model used less GPU power but more CPU. Is that repeatable and fairly consistent? The huge difference in evaluation time is fascinating too: 13 minutes vs. 13 seconds! I don’t know enough about it, but I wonder whether the model size (9GB vs 20GB) and the 12GB VRAM limit forced the 32b model to fall back to system RAM and CPU processing rather than being loaded fully into the GPU.

Cheers,

Belfry

techman · February 14, 2025, 8:49am

Hi Belfry,

As the model is 20GB and the GPU has only 12GB of VRAM, I think it’s obvious Ollama had to use the CPU as well, but why was the GPU power only 65W instead of 170W? Perhaps it was a timing issue, where the GPU and CPU had to run at a similar speed?

Cheers,
Terry

Belfry · February 15, 2025, 1:05am

Hi Terry,

I think you’re right that it could be a timing issue. The other thing that occurred to me overnight is that it might be more efficient for ollama to load part of the model to the GPU once only, and not continually push new data over the PCIe bus to the GPU, if it knows it’s going to have to use the system memory and CPU for the model as well. In that case, the GPU only ever has a smaller “one off” piece of the model to work with (and potentially longer to work with it, if it’s timing related too), which could explain the lower overall GPU power consumption.

It’d be interesting to rerun your experiments with watch -n 0.1 ollama ps running in a second terminal. Based on your results so far, if the smaller 14b model loads 100% into GPU for the whole process (with the corresponding higher GPU power use), and the 32b model loads some proportion into the GPU, I’m curious to see if that ratio of model in the CPU and GPU memory ever changes. If that ratio is fixed throughout, I wonder if ollama is savvy enough to just use CPU cycles for running the model rather than using those cycles to push more data to the GPU memory once the GPU is finished with that smaller piece.
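For anyone who would rather log the split than watch it, here is a rough Python equivalent of that watch command. It is only a sketch: it assumes the PROCESSOR column of ollama ps reads either like "100% GPU" or like "46%/54% CPU/GPU", and the function names are made up for illustration.

```python
import re
import subprocess
import time

# Assumed PROCESSOR column formats: "46%/54% CPU/GPU" or "100% GPU".
SPLIT_RE = re.compile(r"(\d+)%/(\d+)% CPU/GPU")
SINGLE_RE = re.compile(r"(\d+)% (CPU|GPU)")


def parse_processor(row: str):
    """Return (cpu_pct, gpu_pct) for one `ollama ps` row, or None."""
    match = SPLIT_RE.search(row)
    if match:
        return int(match.group(1)), int(match.group(2))
    match = SINGLE_RE.search(row)
    if match:
        pct = int(match.group(1))
        return (pct, 0) if match.group(2) == "CPU" else (0, pct)
    return None


def poll(interval_s: float = 0.1, samples: int = 50) -> set:
    """Run `ollama ps` repeatedly and collect every distinct split seen."""
    seen = set()
    for _ in range(samples):
        out = subprocess.run(["ollama", "ps"], capture_output=True, text=True).stdout
        for row in out.splitlines()[1:]:  # skip the header row
            split = parse_processor(row)
            if split is not None:
                seen.add(split)
        time.sleep(interval_s)
    return seen
```

If poll() returns a single pair after a full generation, the CPU/GPU ratio stayed fixed throughout. More than one pair would mean ollama moved parts of the model during the run.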

Cheers,

Belfry

techman · February 19, 2025, 4:37pm

Hi Belfry, apologies for my late reply. I had finished testing Linux distros for the new OS and had to reinstall Ubuntu Server onto a new 1TB NVMe drive for permanent use. Xorg, mouse, sound and all my apps then needed to be installed.

I actually upgraded Starlink to a monthly plan to do this, as it seems at least one slot (probably more) became available. Aldi cellphone internet was just too painfully slow to reinstall everything, including deepseek-r1:32b.

ollama ps
NAME             ID            SIZE   PROCESSOR        UNTIL
deepseek-r1:14b  ea35dfe18182  11 GB  100% GPU         4 minutes from now

tp@ubuntu:~$ ollama ps
NAME             ID            SIZE   PROCESSOR        UNTIL
deepseek-r1:32b  38056bbcbb2d  22 GB  46%/54% CPU/GPU  3 minutes from now

GPU temperature didn’t exceed 44°C, so thermal throttling was not an issue.

Belfry · February 20, 2025, 6:42am

Hi Terry,

Thanks for that. I’ll have to have a dig and see if there’s any further detail in the docs about how ollama works, but based on your results so far I think you’re probably spot on with lower power consumption being timing related. On the face of it, it looks like it’s loading as much of the model into the GPU as possible (54% of 22GB being 11.88GB), which then spends its time either slowing down to match the CPU or waiting around for the CPU to do its part.
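For what it’s worth, here is the arithmetic behind that split as a few lines of Python. It is just a sketch using the numbers from Terry’s ollama ps output and the card’s 12GB of VRAM; the function name is made up.

```python
def split_gb(total_gb: float, gpu_pct: float) -> tuple:
    """Split a loaded model size into (GPU share, CPU share) in GB."""
    gpu_gb = total_gb * gpu_pct / 100
    return gpu_gb, total_gb - gpu_gb


VRAM_GB = 12  # RTX 3060 in Terry's machine

# deepseek-r1:32b as reported by `ollama ps`: 22 GB, 46%/54% CPU/GPU.
gpu_gb, cpu_gb = split_gb(22, 54)
print(f"GPU: {gpu_gb:.2f} GB, system RAM: {cpu_gb:.2f} GB")
print("GPU share fits in VRAM:", gpu_gb <= VRAM_GB)
```

So roughly 11.88GB sits on the GPU, just under the 12GB limit, and the remaining 10.12GB or so is left to system RAM and the CPU.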

Cheers,

Belfry