$500[ish] AI Build Challenge

Belfry, it’s a rare pleasure ‘helping’ someone as capable as you!

… And it’s “buy two cards” :wink:

Linux nixos 6.15.7 #1-NixOS SMP PREEMPT_DYNAMIC Thu Jul 17 16:44:05 UTC 2025 x86_64
Driver Version: 570.153.02 CUDA Version: 12.8

Oh coolies, my fav GPU is back in stock at pccasegear.com.

https://www.pccasegear.com/products/54481/asus-geforce-rtx-3060-dual-oc-12gb

Main features:

  • $459 and free shipping
  • 12GB VRAM
  • Made by ASUS
  • Dead quiet
  • You get 24GB VRAM for $918 with two cards

The most important item (imho) is lots of video RAM. Forget GPUs with 8GB, they’re no use for AI.

The RTX3060 is still a hot performer even among the current models:
https://www.pccasegear.com/products/68041/pny-geforce-rtx-4070-super-verto-dual-oc-12gb

  • $869
  • 12GB VRAM

I found reference to this in the source for ‘ollama-cuda’ at

That’s as far down the chain as I went too (linked in my post above). I might go back to cudaPackages in a few weeks and have a dig if I get time. I was meandering through there late last night (and cudaPackages11 to see if an older version of the drivers and CUDA would suit the card better).

With your mention of 12GB and 2 cards, I tried to slip a spare 4GB GTX1050Ti into the machine just to see if I could get it going. I couldn’t get Debian to see both cards, and I had the same issue with NixOS. I suspect it’s probably a hardware issue (limited PCIe lanes or some such) rather than a Linux issue.


Next step will be to take my shiny new configuration.nix and drop the GTX1050Ti into another machine I have here, feed it the config, and see NixOS work its magic to deliver me a working system. I might give that a go next week if I have time. It’d be a good way to test out the whole essence of NixOS → insert configuration → receive reproducible build.

(Edit: actually, I’ve got some free time now. I’ll drop the configuration.nix into the Z220 + GTX1050Ti setup mentioned in my initial post and see what happens. Will post the results into the NixOS thread as that’s where it’ll be more relevant.)

I couldn’t get the GTX1050Ti going in the machine at the same time as the Tesla P4 (I think this is a PCIe lanes issue rather than a Linux or ollama issue), so I’m still operating with the single Tesla P4 8GB card.

However, I still felt as if there was some performance being left on the table and started to experiment with optimising ollama further. I really wanted to squeeze that eval rate to > 6.25 tokens/second (see post #1)!

I appreciate that this thread has become a bit of a blog while I try to wrap my head around it all, but perhaps it’ll be useful as other HLB members start to play with their own self-hosted LLMs!

There are two features listed in the Ollama FAQ which I decided to investigate further this morning - Flash Attention and K/V Cache Quantisation.

I’m still wrapping my head around it all, but my basic understanding of these two parameters is:

  • Flash Attention computes attention in small blocks that fit in the GPU’s fast on-chip memory rather than writing the full attention matrix out to VRAM, so it uses GPU memory more efficiently (performance enhancing), and
  • K/V Cache Quantisation reduces the precision of the key/value cache, which in turn reduces key/value memory use, particularly where large context sizes are involved (see this excellent blog post by the person who implemented the changes in Ollama recently). The result is a performance increase for key quantisation and a performance decrease for value quantisation. The net outcome, however, is that memory use for the cache is roughly halved for q8_0 (8-bit integers) and cut to roughly 1/3-1/4 for q4_0 (4-bit integers) vs. the default f16 (16-bit floating-point precision). Ultimately that means there’s more memory free to load more of the same model onto the GPU, load a larger model, etc.

These two parameters are controlled by the OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE variables, respectively.

Setting the K/V Cache parameter depends on having Flash Attention enabled. The blog post referenced above states that there’s no downside to enabling Flash Attention anyway, and that it should end up as an Ollama default. If nothing else, enabling Flash Attention seems like a very worthwhile endeavour.


Back to the P330 challenge build. It’s now running NixOS, so I ran some tests with various combinations of Ollama environment variables set in configuration.nix, with nothing more than a sudo nixos-rebuild switch in between. For testing, I used the same deepseek-r1:14b model I’ve been using for this whole experiment so far.

With no parameters set, Ollama (0.9.6) defaults to an f16 K/V cache (i.e. no quantisation) with Flash Attention disabled. Setting OLLAMA_FLASH_ATTENTION to 1 enables Flash Attention, and valid options for OLLAMA_KV_CACHE_TYPE include q8_0 and q4_0.

As the deepseek-r1:14b model loaded, the ollama logs confirmed the changes each time, for example:

ollama[11256]: llama_context: flash_attn    = 0

vs.

ollama[1273]: llama_context: flash_attn    = 1

and

ollama[4202]: llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1, padding = 256
ollama[4202]: llama_kv_cache_unified:      CUDA0 KV buffer size =   544.00 MiB
ollama[4202]: llama_kv_cache_unified:        CPU KV buffer size =   224.00 MiB
ollama[4202]: llama_kv_cache_unified: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB

vs.

ollama[1273]: llama_kv_cache_unified: kv_size = 4096, type_k = 'q4_0', type_v = 'q4_0', n_layer = 48, can_shift = 1, padding = 256
ollama[1273]: llama_kv_cache_unified:      CUDA0 KV buffer size =   166.50 MiB
ollama[1273]: llama_kv_cache_unified:        CPU KV buffer size =    49.50 MiB
ollama[1273]: llama_kv_cache_unified: KV self size  =  216.00 MiB, K (q4_0):  108.00 MiB, V (q4_0):  108.00 MiB
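
The cache sizes in these log lines can be reproduced with a bit of arithmetic. Below is a rough sketch rather than anything taken from Ollama itself. It assumes 1024 cached elements per token per layer (a figure I've back-calculated from the f16 log line, and `elems_per_token` is just my own name for it), and it uses the ggml block layouts where q8_0 and q4_0 pack 32 values plus one f16 scale into 34 and 18 bytes respectively. The q8_0 figure is a prediction, since no q8_0 log is shown above.

```python
# Rough K/V cache size estimate, checked against the f16 and q4_0 log lines.
MIB = 1024 * 1024

# (bytes per block, elements per block) for each cache type.
BLOCK_LAYOUT = {
    "f16": (2, 1),     # plain 16-bit floats, no blocking
    "q8_0": (34, 32),  # 32 int8 values + one f16 scale
    "q4_0": (18, 32),  # 32 4-bit values + one f16 scale
}


def kv_cache_mib(cache_type, kv_size=4096, n_layer=48, elems_per_token=1024):
    """Return (K MiB, V MiB, total MiB) for one cache type.

    kv_size and n_layer come from the log; elems_per_token is an
    assumption inferred from the 384 MiB f16 figure.
    """
    block_bytes, block_elems = BLOCK_LAYOUT[cache_type]
    elements = kv_size * n_layer * elems_per_token
    one_side = elements // block_elems * block_bytes / MIB
    return one_side, one_side, 2 * one_side


for cache_type in BLOCK_LAYOUT:
    k, v, total = kv_cache_mib(cache_type)
    print(f"{cache_type}: K {k:.2f} MiB, V {v:.2f} MiB, total {total:.2f} MiB")
```

Under those assumptions the f16 and q4_0 totals come out at 768 MiB and 216 MiB, matching the logs, so q4_0 lands at a bit over a quarter of the f16 size because of the per-block scale overhead.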

Using the HLB standard “Why is the sky blue?” test, I got the following results:

Flash Attention + KV Cache Experiment Results

                   Ollama defaults          Flash Attention    Flash Attention    Flash Attention
                   (Flash Attention         (+ default f16     + K/V Cache        + K/V Cache
                   disabled + f16           K/V Cache)         q8_0               q4_0
                   K/V Cache)
Prompt Eval Rate   17.80 t/s                19.83 t/s          21.17 t/s          19.63 t/s
Eval Rate          5.95 t/s                 6.27 t/s           6.15 t/s           6.72 t/s
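
To put those numbers in relative terms, here's a quick sketch that works out the percentage change of each run against the Ollama defaults. The rates are copied straight from the table; the dictionary keys and the `percent_change` helper are just my own labels.

```python
# (prompt eval rate, eval rate) in tokens/second, copied from the table.
results = {
    "defaults": (17.80, 5.95),
    "flash attention": (19.83, 6.27),
    "flash attention + q8_0": (21.17, 6.15),
    "flash attention + q4_0": (19.63, 6.72),
}


def percent_change(new, old):
    """Percentage change of new relative to old."""
    return (new - old) / old * 100


base_prompt, base_eval = results["defaults"]
for name, (prompt_rate, eval_rate) in results.items():
    print(
        f"{name}: "
        f"prompt eval {percent_change(prompt_rate, base_prompt):+.1f}%, "
        f"eval {percent_change(eval_rate, base_eval):+.1f}%"
    )
```

For the eval rate that works out to roughly +5% for Flash Attention alone and roughly +13% with q4_0. Bear in mind this is a single run per configuration, so small differences could easily be noise.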

Woo! Enabling Flash Attention gets me to the magic 6.25 tokens/second goal I set myself for the $500 challenge. Given that I’m only using a single short prompt (Why is the sky blue?), and therefore a relatively small context, I don’t think these results reflect the performance gains that changing the K/V cache quantisation could deliver. However, the cache quantisation did allow me to squeeze a little more of the 14b model onto my 8GB card rather than relying on the CPU. The lower memory usage, which lets more of the processing happen on the GPU, therefore probably explains the slight performance bump rather than the K/V quantisation process itself - exactly what was highlighted in smcleod’s blog.

For example (Ollama defaults):

ollama ps
NAME               ID              SIZE     PROCESSOR          UNTIL               
deepseek-r1:14b    c333b7232bdb    10 GB    29%/71% CPU/GPU    27 seconds from now    

vs. (q4_0 K/V quantisation):

ollama ps
NAME               ID              SIZE     PROCESSOR          UNTIL              
deepseek-r1:14b    c333b7232bdb    10 GB    24%/76% CPU/GPU    4 minutes from now   

For those using NixOS, these parameters are controlled in configuration.nix, for example:

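  services.ollama = {
    enable = true;
    environmentVariables = {
      # Flash Attention must be enabled for the K/V cache type to take effect.
      OLLAMA_FLASH_ATTENTION = "1";
      # Valid values include "f16" (default), "q8_0" and "q4_0".
      OLLAMA_KV_CACHE_TYPE = "q4_0";
    };
  };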

As stated above, OLLAMA_KV_CACHE_TYPE is dependent on OLLAMA_FLASH_ATTENTION. There is no reason why you can’t enable Flash Attention and leave the K/V cache type unset - that’s exactly what I did in the second column of the table above.

As for the quotes around the 1 for the Flash Attention value: environment variables are always strings, and the NixOS option expects string values, so the quotes are needed. It works, it’s what any NixOS specific posts about it state, and Ollama interprets the value as a boolean on its side anyway.

If you’re running Ollama, please let me know what sort of results you get by enabling Flash Attention, and whether the precision trade-off of playing with the K/V cache is worth it (or allows loading a larger model than would ordinarily be available).
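
If you want to collect comparable numbers without copying them out of the terminal, here's a minimal sketch that asks Ollama's REST API for a response and turns the timing fields into tokens/second. It assumes Ollama is listening on its default port (11434) and uses the /api/generate endpoint with streaming off; the function names are my own, and I'd expect the rates to line up with what the verbose CLI output reports, though I haven't compared them side by side.

```python
import json
import urllib.request


def tokens_per_second(count, duration_ns):
    """Ollama reports durations in nanoseconds; return None if a field is missing."""
    if not count or not duration_ns:
        return None
    return count / (duration_ns / 1e9)


def summarise(response):
    """Pull the prompt eval and eval rates out of an /api/generate response."""
    return {
        "prompt_eval_rate": tokens_per_second(
            response.get("prompt_eval_count"), response.get("prompt_eval_duration")
        ),
        "eval_rate": tokens_per_second(
            response.get("eval_count"), response.get("eval_duration")
        ),
    }


def run_benchmark(model="deepseek-r1:14b", prompt="Why is the sky blue?",
                  host="http://localhost:11434"):
    """Send one non-streaming prompt to a local Ollama and return its rates."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return summarise(json.load(reply))
```

With Ollama running, calling `print(run_benchmark())` after each nixos-rebuild switch should give one set of rates per configuration.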
