Large Language Models, but Small (aka playing with a thin client)

NBN was down for me again today, so I ended up with a spare hour or so offline to embark on something a bit silly.

I’ve got a few tiny Dell Wyse 3040 thin clients at home which I’ve been gradually taking out of service. I picked these up second-hand years ago for $39, as they were a reasonable alternative to Raspberry Pis for my various use cases. The internals are an Atom Z8350 with 2GB RAM, 8GB eMMC, 3x USB 2.0, 1x USB 3.0, 2x DisplayPort, and 1x GbE port. Some models have 16GB of eMMC storage, and some have Wi-Fi, but my particular ones don’t. During the peak of the Pi shortage the 3040s were far cheaper (and easier to source) than Pis, especially once you added a case and PSU. These ran my Home Assistant instance and a few other small workloads which are now virtualised on a larger Proxmox machine. Power usage on the 3040 maxes out at around 4-5W, and they’ll comfortably run Linux.

A size comparison with a Raspberry Pi 2B+ is below:

For a bit of fun, I decided to take one of the spare 3040s and see how far I could get running an LLM. I took a similar approach to my other recent build and started with Debian 12. I abandoned that, as even a minimal install didn’t leave enough room for Ollama plus any of the small models. I switched to DietPi (a cut-down variant of Debian) because of its smaller footprint, and also because I already had the ISO handy and my NBN was down :joy:.

I eventually managed to shoehorn Ollama 0.9.2 onto the 3040, as well as the deepseek-r1:1.5b model.
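
The rough shape of the install, for anyone curious - this is a sketch assuming Ollama’s standard Linux install script and the model tag from the Ollama library, not an exact transcript of what I typed:

# install Ollama (fetches the binary and sets up the systemd service)
curl -fsSL https://ollama.com/install.sh | sh

# grab the model - roughly a 1GB download, so mind the 8GB eMMC
ollama pull deepseek-r1:1.5b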

Stats from @techman’s usual test question - “Why is the sky blue?” - which I’ve also come to adopt, are below:

total duration:       2m46.040839878s
load duration:        154.230844ms
prompt eval count:    9 token(s)
prompt eval duration: 3.947752175s
prompt eval rate:     2.28 tokens/s
eval count:           253 token(s)
eval duration:        2m41.933088422s
eval rate:            1.56 tokens/s

It was slooooooooooooooow. But, it worked.
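
For anyone wanting to reproduce the numbers: the stats above are just Ollama’s own timing output, which you get by adding --verbose to the run - something along these lines:

ollama run deepseek-r1:1.5b --verbose "Why is the sky blue?"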

For the sake of comparison, this is the same deepseek-r1:1.5b model running on the P330 G2 Workstation I’ve been playing with over the past month:

total duration:       1.454925236s
load duration:        18.571121ms
prompt eval count:    9 token(s)
prompt eval duration: 47.06159ms
prompt eval rate:     191.24 tokens/s
eval count:           89 token(s)
eval duration:        1.388811999s
eval rate:            64.08 tokens/s
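
If you’d rather script a comparison like this across machines than eyeball the CLI output, Ollama exposes the same timings via its local HTTP API - a minimal sketch, assuming the default port of 11434, and noting that the durations come back in nanoseconds:

curl -s http://localhost:11434/api/generate \
  -d '{"model": "deepseek-r1:1.5b", "prompt": "Why is the sky blue?", "stream": false}'
# the JSON response includes eval_count and eval_duration (ns),
# so eval rate = eval_count / (eval_duration / 1e9) tokens/s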

Would I recommend doing this? No - it’s utterly daft. Don’t do this.

Would I recommend the Wyse 3040 and thin clients more generally for extremely power-efficient running of small workloads (e.g., PiHole, AdGuard Home, ADS-B/AIS receivers, Home Assistant, etc.)? Absolutely - they’re brilliant for that use case, and dirt cheap on eBay too. These 3040s are a bit… quirky due to their UEFI implementation (limited? buggy? not sure. definitely… not quite standard.), but not insurmountably so. The Repurposing Thin Clients site is an incredible resource: it lists the internals of the various models that pop up for sale and covers the issues you’re likely to hit when getting Linux running or modifying them for your use case.

That’s surprisingly cool, I like it a LOT! That’s an affordable AI PC you could give to a young child to use as an educator for those times when the adults avoid them because they ask too many questions :slightly_frowning_face:

Your eval rate of 1.56 tokens/s is actually faster than the AI I use all the time (eval rate: 1.44 tokens/s), and I have become very used to it. I don’t know what I’ll do if my new dual-GPU PC is twice as fast!