$500[ish] AI Build Challenge

Hi all,

Updated 3 June with final results - see the posts below

I’m not 100% finished with the below, but I thought I’d pop it up early both to force me to document my progress so far, and in case anyone wanted to discuss it at tomorrow’s meeting.

A few weeks ago I made this comment in the Installing Deepseek-R1:14b AI Locally thread:

That’s been swimming around in my head ever since, and I’ll admit that at the time my comment was based on “the vibe” rather than any sort of hard evidence. As such, I felt it was time to put that idea to the test. I have set out to compare a couple of options around the $500 mark for running LLMs at home. This isn’t at all a scientific test, but more of an experiment to see how accessible local LLMs are without spending huge money on bleeding edge hardware.

Notably, $500 also gets you around 16 months of ChatGPT Plus after accounting for the exchange rate.


I’ve used deepseek-r1:14b in the following tests as it was the model being discussed in the thread. When evaluating “usability”, I’ve used 250 WPM as a reading benchmark (i.e., can the model generate text faster than I can read it?), and 90 WPM as a typing benchmark (i.e., can the model interpret text faster than I can type it?). DeepSeek’s documentation states that one English character is approximately 0.3 tokens. A quick search on the internet gives me the rule of thumb that words per minute is roughly characters per minute divided by 5 (i.e., the average English word is 5 characters long). The target tokens/second figures for “usability” are therefore:

Reading:

  • 250x5 = 1250 characters/minute
  • 1250 / 60 seconds = 20.83 characters/second
  • 20.83 characters/second @ 0.3 tokens/character = 6.25 tokens/second

Writing:

  • 90x5 = 450 characters/minute
  • 450 / 60 seconds = 7.5 characters/second
  • 7.5 characters/second @ 0.3 tokens/character = 2.25 tokens/second

The target figures in Ollama using deepseek-r1:14b:
Prompt Eval Rate > 2.25 tokens/second (i.e., the model can “read” as fast as the user can type)
Eval Rate > 6.25 tokens/second (i.e., the model can “type” as fast as the user can read)
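Those conversions boil down to one small formula, which can be sanity-checked in a couple of lines of Python (using the same rules of thumb: ~5 characters per word, ~0.3 tokens per character):

```python
# Rules of thumb from above: ~5 characters per word, ~0.3 tokens per character.
CHARS_PER_WORD = 5
TOKENS_PER_CHAR = 0.3

def wpm_to_tokens_per_sec(wpm: float) -> float:
    """Convert a words-per-minute figure to the equivalent tokens/second."""
    chars_per_sec = wpm * CHARS_PER_WORD / 60
    return chars_per_sec * TOKENS_PER_CHAR

# Reading target (eval rate): 250 WPM -> 1250 chars/min -> ~6.25 t/s
print(wpm_to_tokens_per_sec(250))
# Typing target (prompt eval rate): 90 WPM -> 450 chars/min -> ~2.25 t/s
print(wpm_to_tokens_per_sec(90))
```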


As noted in the Installing Deepseek AI Locally thread, the spare PC I’d been doing most of the playing around on is an ancient HP Z220 Workstation from mid-2012. This is the i7-3770 version, with 16GB RAM, SATA SSD, and a 4GB Nvidia GTX1050Ti (768 CUDA cores). The Z220 goes on eBay for $150-250, and the 4GB GTX1050Ti for $100-$150. Throw in an SSD upgrade and we’re not far from the $500 mark. This seems a reasonable baseline to start with.

Using Debian 12, Ollama v0.7.1, and the deepseek-r1:14b model, I measured a prompt eval rate of 5.07 tokens/s and an eval rate of 2.45 tokens/s.


The second baseline option was a 2020 MacBook Air. These are on eBay at the moment for around the $550-700 mark (hence the “ish” in my $500[ish] challenge). Mine is kitted out with 16GB RAM and is the 8/8 core version so it’s probably on the upper end of that range, if not higher.

Using macOS Sequoia 15.5, Ollama v0.7.1, and the deepseek-r1:14b model, I measured a prompt eval rate of 1.83 tokens/s and an eval rate of 6.08 tokens/s.

I have no idea whether this can be optimised further to use the Apple Silicon NPU. I don’t believe Ollama currently uses the built-in NPU on Apple Silicon devices, but the chip itself may be clever enough to offload those types of workloads. I’ll have to do some more reading about this.


The $500 challenge build is a Lenovo ThinkStation P330 G2 I picked up last week. This is the Xeon E-2174G version, with 64GB RAM, SSD, and no GPU (yet). This is also where I forked out the actual cash for my challenge and “ate the dogfood” now that I had a benchmark from the two existing devices.

Using Debian 12, Ollama v0.7.1, and the deepseek-r1:14b model, I measured a prompt eval rate of 8.55 tokens/s and an eval rate of 3.24 tokens/s.

I’m awaiting the arrival of a second hand Nvidia Tesla P4 8GB card in the coming week or two. It has double the VRAM of the GTX1050Ti, is a similar generation of card (although a slightly different GPU chip), and has 233% more CUDA cores (2560 vs 768). When researching cards, after seeing the supported GPU list I ended up basically looking for “an Nvidia card with compute capability above 5.0, with as many CUDA cores and as much VRAM as possible, for around $150”. That’ll bring the total build cost for the P330 G2 + Tesla P4 option to around $520.
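That search rule is easy to express in code. The shortlist below is purely illustrative - the specs are real but the prices are rough used-market guesses, not quotes - yet it shows the filter I was effectively applying:

```python
# Illustrative only: specs are real, prices are rough used-market guesses (AUD).
# Format: (name, compute capability, VRAM in GB, CUDA cores, approx. used price)
cards = [
    ("GTX 1050 Ti", 6.1, 4, 768, 120),
    ("GTX 1060", 6.1, 6, 1280, 140),
    ("Tesla P4", 6.1, 8, 2560, 150),
    ("Tesla K80", 3.7, 24, 4992, 100),  # cheap, but compute capability too old for ollama
]

# "compute capability above 5.0, max VRAM and CUDA cores, around $150"
shortlist = [c for c in cards if c[1] >= 5.0 and c[4] <= 160]
shortlist.sort(key=lambda c: (c[2], c[3]), reverse=True)

print(shortlist[0][0])  # → Tesla P4
```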


Summary of results

                          HP Z220 + GTX 1050Ti   MacBook Air M1   Lenovo P330 G2 (No GPU)   P330 G2 + Tesla P4
Approx. Cost (As tested)  $400                   $550-700+        $360                      $520
Prompt Eval Rate          5.07 t/s               1.83 t/s         8.55 t/s                  TBA
Eval Rate                 2.45 t/s               6.08 t/s         3.24 t/s                  TBA

I’ll update the table when I get the P4 card and can rerun the test. In the meantime, the MacBook M1 option can “read” and “write” almost as quickly as I can interact with it. It feels a lot more fluid than the Z220, and the P330 G2 without a GPU felt pretty close to the “comfortably usable” mark as well - I was a little surprised to see a result of ~3.2 t/s given how smooth it felt.

Hope that helps anyone wanting to look at general purpose LLMs for home. I don’t think it’s necessary to spend huge dollars unless you need to run absolutely enormous models. I’ll leave it to you to decide whether a little over a year of a commercial LLM service is better value, or whether the privacy and usability benefits are worth it (e.g., local data, access to uncensored and specialised models). I’d also love to hear from @techman about how far $500 would go on any of the services he’s used, and about his experiences with his build!

Cheers,

Belfry


As far as my new AI PC goes, I’ve been all talk and no go so far. Maybe the omens and portents will inspire me tomorrow?

To date I’ve spent:
2x RTX3060 cards with 12GB VRAM each @ $450
1x Ryzen 5 5500 CPU @ $130
2x 16GB DDR4 RAM ($220) to replace the 2x 64GB DDR4 from this machine, which will become my new NAS
1x new, unused Enthoo Pro case I purchased about 5 years ago for $250ish
1x 650W Corsair PSU I paid $450 for around 2014 (I bought 6 units at the time)
1x Crucial P3 M.2 NVMe SSD 1TB @ $90
Plus other bits I already have.

I’m not running anything less than qwen2.5-coder:32b (a 19GB image), as I consider it the smallest model that’s useful/reliable enough to tolerate. In the interim I’ve become used to the slow responses, so two cards should feel a lot faster.

For serious coding design I use Claude-Sonnet-3.7 via poe.com which has a generous free 3000 points a day. Designing my svd2db Nvim plugin used up 3000 points and I did the last 10% myself.

There are plenty of faster, smaller AIs out there, but in my experience they’re a joke and I’d be better off asking my Granny than those things.

So $500 would only be good for a deposit to me, and being an electronics tech, I won’t use any more old second hand gear as the tech is moving too fast nowadays.
Plus, in Australia electricity is now so expensive that old, hot Intel CPUs just can’t compete with Ryzen’s lower power costs imho.

On a positive note, I bet Chinese AI gear brings the prices right down to affordable levels in a year or so.

Here is a test from my current setup with only one RTX3060 with 12GB VRAM (it’s about 1.5 words a second). Two cards should be a lot faster.

>>> why is the sky blue
The sky appears blue due to a phenomenon called Rayleigh scattering. When sunlight enters Earth’s atmosphere, it collides with molecules of gases and
small particles in the air. Sunlight is made up of different colors, each represented by different wavelengths. Blue light waves are shorter than other
colors like red or green.

When light encounters these molecules, it scatters in all directions. The shorter blue wavelengths scatter more easily and at greater angles than
longer wavelengths like red. As a result, when we look up at the sky on a clear day, we see mostly scattered blue light coming from all over the
atmosphere, which makes the sky appear blue to our eyes.

This is why the sky often looks redder during sunrise and sunset—because at these times, sunlight has to pass through more of the Earth’s atmosphere to
reach us. The longer path causes most of the blue light to be scattered away, leaving mainly the reds, oranges, and yellows for us to see.

total duration:       2m17.729984949s
load duration:        32.894837ms
prompt eval count:    51 token(s)
prompt eval duration: 1.443903421s
prompt eval rate:     35.32 tokens/s
eval count:           196 token(s)
eval duration:        2m16.229421158s
eval rate:            1.44 tokens/s

Wish I’d seen this earlier (before coming to last night’s meeting)! I’m also interested in running LLMs locally, currently using Ollama on macOS. Still trying to think of practical uses for it, though.

@skalyan What sort of Mac are you running Ollama on, and what models have you tried? I’ve only run the test DeepSeek 14B model in the thread above on my M1 MacBook Air, and also tried out a few models in a VM on my M4 MBA. I also had a brief play with LM Studio in that same VM on the M4, but didn’t do any benchmarks. As I understand it, LM Studio will perform better on macOS because it uses features such as MLX which haven’t yet been implemented in Ollama. From what I can gather, Ollama will use the GPU component of Apple Silicon chips but not the NPU etc. However, I’m still trying to wrap my head around it all, so that may not be accurate. I also have no idea how good or bad UTM/qemu are at passing through the GPU and NPU to VMs, so I’ll probably run the next set of tests on the bare metal notebook and remove that variable entirely.

I got postage tracking details on the Tesla P4 today (yay!) so hopefully it’s only a week or two away. Once it arrives I’ll pop some more detail into the thread and may even run a shootout between the M4 MBA and the $500[ish] challenge build on some of the bigger models.

(I also have zero practical uses for any of this, but it’s an interesting plaything for now!)

Cool project!

Don’t forget everything the SaaS option includes: updates constantly being applied, the model being fed new data over and over, accessibility from almost any device, voice interaction, and more. Yes, of course I understand the idea of running it locally - privacy for one - but sometimes the SaaS option is the better one, even if it costs (a lot) more years down the line.

Also don’t forget running costs, like electricity, and hardware failure.

The Tesla P4 arrived earlier today. The card doesn’t come with any cooling as it’s meant to be cooled by the airflow in rack mount servers, so last week I also bought a blower fan and 3D-printed bracket from eBay so I could run the card in a tower machine. This slots neatly onto the end of the P4 and contains a 4-pin fan which connects directly to the P330’s motherboard and forces air through the P4’s heatsink. I probably could have pulled something from the parts bin and cobbled something together, but this was a neat pre-made solution for US$12.99 posted (AU$20.10 after the exchange rate). I’ve updated the $520 cost to $540 to reflect the addition of the blower fan. Pic attached below.

I’ll also note for anyone going down this route: save yourself from slowly going mad chasing unrelated, nondescript errors when trying to get the CUDA kernel modules running for that card - disable Secure Boot :angry:.

Disclaimer: There have been updates to Ollama (0.9.0), the DeepSeek R1 14B model, and the Nvidia drivers since my initial tests in my original post. The results will not be directly comparable, but will still provide an idea of the difference in performance.


As expected, the $500 challenge build performed far better with the Tesla P4 8GB card installed.

Using Debian 12, Ollama v0.9.0, and the deepseek-r1:14b model, I measured a prompt eval rate of 18.09 tokens/s and an eval rate of 5.59 tokens/s.


For the purpose of comparison only, I ran the same test on my MacBook Air M4 with 32GB RAM. Unsurprisingly, it was the fastest configuration tested. I was shocked to see the difference in prompt eval rate between the M1 and M4. If anyone is interested, I’ll rerun the M1 test using Ollama 0.9.0 and the latest version of the DeepSeek R1:14b model to see whether they’ve been somehow better optimised for Apple M-series processors in the updates last week.

Using macOS Sequoia 15.5, Ollama v0.9.0, and the deepseek-r1:14b model, I measured a prompt eval rate of 53.02 tokens/s and an eval rate of 10.74 tokens/s.

Updated - Summary of results

                          HP Z220 + GTX 1050Ti   MacBook Air M1   Lenovo P330 G2 (No GPU)   P330 G2 + Tesla P4   MacBook Air M4
Approx. Cost (As tested)  $400                   $550-700+        $360                      $540                 $2899
Prompt Eval Rate          5.07 t/s               1.83 t/s         8.55 t/s                  18.09 t/s            53.02 t/s
Eval Rate                 2.45 t/s               6.08 t/s         3.24 t/s                  5.59 t/s             10.74 t/s

I’ve done no performance tuning with any of these configurations but will probably muck around with the P330 a little more to see if I can squeeze that eval rate to above 6.25 tokens/sec.

Draw your own conclusions from my anecdotal experiment, but hopefully I’ve managed to put some substance behind the “for the cash outlay, a new (second hand/ex lease etc) PC is probably a far better starting point” comment a few weeks ago - one can build something pretty good for the homelab for around $500.

Please let me know if you’d like me to run any specific tests with any specific models. The completed P330 will be moved off the workbench to join the rest of the homelab equipment and will have OpenWebUI (in a Proxmox VM) connected to the Ollama API in the coming weeks as I have some more time.
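For anyone wanting to talk to the finished box remotely, OpenWebUI (or anything else on the network) just hits Ollama’s standard HTTP API on its default port 11434. A minimal sketch - the hostname p330 is a placeholder for wherever the machine lands on your LAN, and the durations Ollama returns are in nanoseconds:

```python
import json
import urllib.request

OLLAMA_HOST = "p330"  # placeholder: whatever the box is called on your LAN

def build_generate_request(host: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming request against Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        f"http://{host}:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def eval_rate(resp: dict) -> float:
    """tokens/s from the timing fields in the response (durations are nanoseconds)."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

req = build_generate_request(OLLAMA_HOST, "deepseek-r1:14b", "Why is the sky blue?")
# On a machine that can actually reach the box:
# resp = json.load(urllib.request.urlopen(req))
# print(resp["response"])
# print(f"{eval_rate(resp):.2f} tokens/s")
```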

This is from my Ryzen 3600 CPU with 64GB RAM and a single Nvidia RTX3060 GPU on this PC:

tp@ubuntu:~$ ollama run --verbose deepseek-r1:14b
>>> why is the sky blue
<think>
Okay, so I need to figure out why the sky is blue. Hmm, I've heard this question before, but I'm not exactly sure how it works. Let me think... I
remember something about light and molecules in the air.

Wait, isn't it related to sunlight? Yeah, sunlight comes into the Earth's atmosphere, right? But why does that make the sky appear blue? Maybe it has
something to do with how different colors of light behave when they hit the air.

I think I've heard terms like Rayleigh scattering before. What is that exactly? Scattering of light... Oh, right! When light interacts with particles
in the atmosphere, some of it gets scattered in different directions. But why does this make the sky blue?

So sunlight consists of many colors, each with different wavelengths. Blue and violet have shorter wavelengths compared to red or orange. I remember
that shorter wavelengths are scattered more by the gases and tiny particles in the air.

But wait, why don't we see violet instead of blue since violet has an even shorter wavelength than blue? Maybe our eyes are more sensitive to blue, so
it appears more prominent. Or perhaps the sun emits less violet light compared to blue, which also contributes to the sky being blue.

When the sunlight enters the Earth's atmosphere, the shorter blue wavelengths get scattered in all directions by nitrogen and oxygen molecules. This
scattering is what makes the sky appear blue because our eyes perceive this scattered blue light from all over the sky.

What about during sunrise or sunset? The sky turns red or orange then. That must be because the sunlight has to pass through more atmosphere, so the
shorter blue wavelengths are scattered out, leaving the longer red and orange wavelengths to dominate.

So, putting it together: Sunlight is made up of various colors. As it enters Earth's atmosphere, molecules scatter the light. Shorter blue wavelengths
are scattered more, making the sky appear blue. During sunrise or sunset, when the sun is low on the horizon, the path through the atmosphere is
longer, scattering out the blue and allowing reds to dominate.

I think that's the basic idea behind why the sky is blue. But I'm not entirely sure about all the details, like whether it's Rayleigh or Mie
scattering, but I believe Rayleigh applies to smaller particles and shorter wavelengths.
</think>

The sky appears blue due to a phenomenon known as Rayleigh scattering. Here's how it works:

1. **Sunlight Composition**: Sunlight is composed of various colors, each corresponding to different wavelengths. Blue and violet have shorter
wavelengths, while red and orange have longer ones.

2. **Scattering in the Atmosphere**: As sunlight enters Earth's atmosphere, it interacts with molecules and tiny particles. Shorter wavelengths (blue
and violet) are scattered more effectively by these molecules compared to longer wavelengths.

3. **Perception of Color**: Our eyes are more sensitive to blue light, and the sun emits more blue light than violet. This combination makes the sky
appear blue as the scattered blue light is perceived from all directions.

4. **Sunrise and Sunset**: During these times, sunlight passes through a greater thickness of atmosphere, scattering out shorter wavelengths (blue) and
allowing longer red and orange wavelengths to dominate the sky's color.

In summary, Rayleigh scattering causes shorter blue wavelengths to scatter more in the atmosphere, making the sky appear blue under normal conditions.

total duration:       23.644550464s
load duration:        32.830764ms
prompt eval count:    8 token(s)
prompt eval duration: 244.574817ms
prompt eval rate:     32.71 tokens/s
eval count:           690 token(s)
eval duration:        23.366344412s
eval rate:            29.53 tokens/s

Prompt Eval Rate vs Eval Rate
In the context of Large Language Model (LLM) performance evaluation, the terms “prompt eval rate” and “eval rate” refer to different stages of the evaluation process.

The “prompt eval rate” specifically refers to the speed at which the model processes the input prompt. For example, a prompt eval rate of 7.66 tokens per second indicates how quickly the model interprets and processes the given prompt.

On the other hand, “eval rate” generally refers to the speed at which the model generates output tokens. An eval rate of 4.89 tokens per second, for instance, signifies the rate at which the model produces its response after processing the prompt.

Both rates are crucial for understanding the performance of an LLM, as they provide insights into the efficiency of the model’s processing and response generation phases.
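As a worked example, both rates can be recomputed from the counts and durations in the --verbose output pasted earlier in the thread (the single-RTX3060 run); the only fiddly bit is parsing the Go-style duration strings Ollama prints:

```python
import re

def parse_duration(s: str) -> float:
    """Convert a Go-style duration string ('2m16.229421158s', '244.574817ms') to seconds."""
    total = 0.0
    for value, unit in re.findall(r"([\d.]+)(ms|s|m|h)", s):
        total += float(value) * {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}[unit]
    return total

# Figures from the single-RTX3060 run earlier in the thread
stats = {
    "prompt eval count": "51 token(s)",
    "prompt eval duration": "1.443903421s",
    "eval count": "196 token(s)",
    "eval duration": "2m16.229421158s",
}

prompt_rate = int(stats["prompt eval count"].split()[0]) / parse_duration(stats["prompt eval duration"])
eval_rate = int(stats["eval count"].split()[0]) / parse_duration(stats["eval duration"])
print(f"prompt eval rate: {prompt_rate:.2f} tokens/s")  # ≈ 35.32
print(f"eval rate: {eval_rate:.2f} tokens/s")           # ≈ 1.44
```

Those match the rates Ollama itself reported for that run.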

Final update on this project. I’ve been experimenting with several models of several sizes and quantization levels, and it seems that the bottleneck of this system is moving data in and out of memory and across the PCI-E bus. There’s a lot of “hurry up and wait” by the CPU and GPU, and a ton of data moving across the PCI-E bus, particularly for the models which are just ever so slightly larger than available VRAM (in this case > 8GB - for reference deepseek-r1:14b is 9GB).

My assumption is that various chunks of data are being moved onto the card for processing due to the whole lot not fitting on the card at once, and the bottleneck is between the CPU/RAM/GPU rather than those components themselves. I’m certainly not an expert on this and have only spent a few hours skimming documentation and forum posts elsewhere to try and get a better understanding of it all.

I haven’t played with variables such as context size during my tests, but understand they can have a dramatic impact on performance too. For now I’ve just left everything else as default.

In an attempt to squeeze the throughput of the PCI-E bus and memory a little higher, I’ve double checked that the RAM timings are correct for the DIMMs in the machine, and have subsequently been messing around with various combinations of Hyper-Threading on/off, threaded interrupts on/off, preemptible kernel on/off, pegging the Tesla P4 at max frequency (i.e., a mild overclock from defaults - nvidia-smi -ac 3003,1531 in my case), and limiting the number of threads that ollama is using. My intent was to reduce the time spent waiting for the PCI-E bus and memory to catch up with the GPU and CPU. Either this is yet another case of the tiniest bit of latency being my enemy, or maybe all I have is a hammer and everything looks like a nail… :thinking:
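For anyone wanting to replicate the clock-pinning part, it uses nvidia-smi’s standard application-clocks controls. The 3003,1531 memory/graphics pair is specific to my P4, so query the supported pairs on your own card first:

```shell
# List the (memory, graphics) clock pairs the card supports
nvidia-smi -q -d SUPPORTED_CLOCKS

# Pin application clocks to a supported pair (the max pair for my P4)
sudo nvidia-smi -ac 3003,1531

# Revert to default application clocks when done
sudo nvidia-smi -rac
```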

I won’t run through an exhaustive set of results, but still using deepseek-r1:14b as the baseline model for testing, the prompt eval rate has varied between 13.03 - 106.86 tokens/sec (-27.97% to +490.71% from my initial results), and the eval rate varied between 4.97 and 6.08 tokens/sec (-11.09% to +8.77% from my initial results).

I’m sure that would vary model to model as well. My point is that there’s potentially a lot of performance left on the table and if you go down this route yourself, invest some time in tweaking the settings to best match your hardware and the specific AI application you’re running on it.

Hopefully that’s inspired a few others in the group to have a play too. I’ll likely throw a few other things onto this machine over the next month or so just to try them out. A few ideas have popped into my head so far - trying out stable diffusion for image generation, processing an RTSP feed of a camera for computer vision applications, or installing Proxmox and dividing the Tesla P4 into vGPUs and hosting several parallel workloads. Let me know if there’s any interest in any of the above, or feel free to suggest anything else that might be a fun experiment during the remainder of June, and I’ll document the results for the group.

Cheers,

Belfry

I went looking for newer versions of Belfry’s Tesla P4 which is now 9 years old.

This is what Perplexity came up with.


Here is a comparison table of key specs and typical costs for the NVIDIA Pascal (Tesla P4), T4, A100, and H100 data center GPUs as of 2025:

GPU Model    Architecture  CUDA Cores  Memory (Type)  Memory Bandwidth  FP32 Perf.   INT8/Tensor Perf.  Power (TDP)  Typical Cost (USD)
Tesla P4     Pascal        2,560       8 GB GDDR5     192 GB/s          5.5 TFLOPS   22 TOPS (INT8)     75 W         ~$500–$1,000 (used)
T4           Turing        2,560       16 GB GDDR6    300 GB/s          8.1 TFLOPS   130 TOPS (INT8)    70 W         ~$1,880 (new)
A100 (80GB)  Ampere        6,912       80 GB HBM2e    2,039 GB/s        19.5 TFLOPS  1,248 TOPS (INT8)  400 W        $10,000–$18,000
H100 (PCIe)  Hopper        Not listed  80 GB HBM3     2,000–3,900 GB/s  51.2 TFLOPS  3,958 TOPS (INT8)  350–700 W    $25,000–$35,000

Notes:

  • Tesla P4: Pricing is for used/refurbished units, as new stock is rare.
  • T4: Listed price is retail/RRP for new units.
  • A100: Price varies by configuration and market; 80GB PCIe and SXM4 versions are most common.
  • H100: Price reflects PCIe version; SXM modules are typically more expensive.

These GPUs represent the evolution of NVIDIA’s data center lineup, with each generation offering substantial increases in AI and compute performance, memory capacity, and bandwidth.


… and price.

I went the dual RTX3060 Video Card route, so my specs are:

CUDA Cores  Memory      Memory Bandwidth  Power (TDP)  Cost (AUD new)
3,584       12GB GDDR6  360 GB/s          170 W        $450

All times 2.

The NVIDIA GeForce RTX 3060 has 3,584 CUDA cores and operates at a base clock of 1.32 GHz with a boost clock of 1.78 GHz. It features 12 GB of GDDR6 memory running at 15 Gbps over a 192-bit bus, providing 360 GB/s of memory bandwidth. The GPU is based on the GA106 chip manufactured by Samsung using an 8 nm process. Its TDP is set at 170 watts.

I’m running Ollama (and LM Studio) on an M2 MacBook Air. I’ve mainly tried the medium-sized Llama 2 model, as well as DeepSeek and a couple of Japanese models. If I had to pick one, I’d probably go with OLMo, as I like the idea of a language model that’s exclusively trained on public-domain text.

One practical use I’ve thought of for a local language model is to fine-tune it on my e-mails and end up with something that can draft e-mails in my own writing style. I’d probably use PanzaMail for this.

A less practical (but very fun) use of local LLMs is to run this recreation of Clippy (and yes, it does run on a Mac!). Crazy that we now have the technology to make virtual assistants that are actually useful, and yet no one makes animated virtual assistants anymore.

Thanks for the tip about OLMo - I hadn’t seen that before. I also like the idea of an LLM that’s trained on public-domain text and will check it out.

The Clippy project is hilarious too! I’d love to see some LLM integrations with the Clippy and BonziBuddys of the world!


I didn’t know about BonziBuddy! I did play around with Microsoft Agent for a while, though—my mum helped me to create an application where two characters (a robot and a genie) would tell each other jokes. Would be cool to create such applications using LLMs.


Have you seen the YouTube video of two AIs hosting a discussion about oil?

Surely an AI voice debate is the ultimate AI entertainment?

That’s very impressive. I think it’s funny that the female voice’s role seems to be to echo whatever the male voice says at the end of every statement. :laughing:

Here is the same test from a different PC. This one has an MSI mobo, a Ryzen 5 5500 CPU with 32GB RAM, and a Zotac RTX3060 video card. The OS is NixOS 25.

[tp@nixos:~]$ ollama run deepseek-r1:14b --verbose

why is the sky blue

… same as above

total duration:       6.216126849s
load duration:        12.635308ms
prompt eval count:    8 token(s)
prompt eval duration: 213.748344ms
prompt eval rate:     37.43 tokens/s
eval count:           193 token(s)
eval duration:        5.989102877s
eval rate:            32.23 tokens/s


I finally got this machine working again this morning after starting over yesterday using NixOS instead of Debian 12. Thanks to @techman for the help in his thread about his AI build last night. It turned out not to be a Linux kernel issue, but a CUDA kernel issue. I’ll give my configuration.nix a tidy up and paste it in the NixOS thread later today, but will document the AI-build-relevant stuff here.

I started from scratch this morning; the card itself came up in NixOS without a problem, but I then hit an issue that only appeared a few weeks ago, where NixOS compiles ollama’s CUDA support only for compute capability 7.5 and above.

As I was tracing things through step by step, -- Using CUDA architectures: 75;80;86;89;90;100;120 stood out in the build logs as the card I’m using is Pascal based (6.1) and absolutely supported by ollama.

I didn’t end up applying the fixes proposed in the GitHub issue, but they put me on the right track. Instead, I decided to investigate the concept of NixOS overrides a bit more and come up with my own fix that I could build into my configuration.nix and save for the future. All the cards I’m experimenting with are older Pascal (compute capability 6.1) parts, so I was fine hardcoding this fix, with a comment in configuration.nix reminding me to change it if needed in the future.

Given that I only started using NixOS yesterday, my very limited understanding is that the cudaArches variable is being pulled/generated from somewhere and used during the build of the ollama package. I’m not entirely sure where cudaArches comes from yet - it’s a curiosity to explore later! However, the following lines in configuration.nix override the cudaArches variable used by the CUDA compiler when ollama is built, and set it to 61 (for a Pascal / compute capability 6.1 card).

   nixpkgs.config.packageOverrides = pkgs: {
     ollama = pkgs.ollama.override {
       cudaArches = [ "61" ];
     };
   };

One sudo nixos-rebuild switch later, the following pops up during the compile:

-- CUDA Toolkit found
-- Using CUDA architectures: 61
-- The CUDA compiler identification is NVIDIA 12.8.93 with host compiler GNU 14.2.1

Woo!

After that, everything just worked.

Using the now standard “Why is the sky blue?” test on the deepseek-r1:14b model I’ve been using throughout the thread, I get a prompt eval rate of 18.33 t/s and an eval rate of 5.93 t/s. That’s a 1.3% and 6.0% increase on the numbers from the Debian 12 based run back on June 8. Probably not statistically significant, and a lot of variables changed (e.g., this run used a newer version of ollama (0.9.6) and of the deepseek-r1:14b model), so I’m not going to claim that NixOS is “faster”, but it certainly seems to be harnessing the card’s capabilities in roughly the same way.

This post was more about me dipping my toe into the world of NixOS than it was about the AI build, but I thought I’d post an update with the test results, and also document this somewhere in case someone happens to be following along and looking at using a surplus datacenter card such as the P4.


WOW, that’s some serious debugging, you the man Belfry!

Using my two RTX3060s and the same test with the same AI, I get about a 5x increase over your result:

Yeah, and you didn’t have to spend 5-6 hours trying to understand the problem, tracing things step by step, and then trying to figure out a solution when almost entirely unfamiliar with the way the underlying OS works just to get 20% of the results :joy:. Lesson learned - just buy a new card :grin:!

For the record, I’m back on kernel 6.12.39, nvidia driver 570.153.02, and CUDA version 12.8. Thanks for your help last night!