Local first.
Models run on the laptop, not on someone else's API bill. Privacy is a side-effect; ownership is the point.
York University · Software Engineering · 4th Year
I'm a developer who learns by shipping. My work lives at the intersection of local AI/ML, full-stack engineering, and a quiet obsession with sustainable compute — the literal wattage of every token I generate.
Cloud APIs make compute feel weightless. It isn't. Every inference call is real electrons, real water, real heat. I measure watt-hours per token because what you can't measure, you can't respect. My models run on commodity laptops, quantized into GGUF and ONNX, served through Ollama and llama.cpp — and they're good enough. That's the whole point.
I treat compute efficiency as a critical software feature, not a footnote. A 4-bit model that answers in 200 ms at half the draw of its fp16 parent is usually the right call, and nobody else is going to tell you that, because nobody else is paying your power bill.
Before anything touches a cloud, it runs on my laptop. Ollama, llama.cpp, Docker, Nginx — the full stack, self-hosted. If it can't survive a laptop, it doesn't ship.
GGUF and ONNX, q4 to q8, profiled against real prompts. A 4-bit model that answers in 200 ms is worth more than an fp16 model that times out.
Energy-per-token, not just tokens-per-second. I instrument the draw because efficiency is a feature — the most underrated one in ML.
Containerize, reverse-proxy, document. A deploy is not the end — it's the start of the part where the software actually has to work.
Most LLM benchmarks are liars. They report tokens per second as if throughput were the only axis that mattered. But every token is real electrons, real heat, real water somewhere upstream. When you only optimize for speed, you optimize away the cost you can't see.
I started measuring because I had to. Running quantized models on a laptop battery forces a kind of honesty the cloud never will: the drain is right there, in the top-right corner of your screen, ticking down in real time. A Kill-A-Watt on the wall makes it even harder to lie to yourself.
The headline finding from my own rig: quantization matters more than the leaderboards admit. A q4 GGUF of a 7B model often answers at half the wattage of its fp16 parent, with a quality drop most real workloads won't notice. If you're measuring Wh/token — and not just tok/s — the math inverts in favor of the smaller, quantized model almost every time.
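A minimal sketch of what measuring Wh/token can look like, assuming power readings are logged as (seconds, watts) pairs from a wall meter or a battery gauge. The names Run, watt_hours, wh_per_token and tokens_per_second are hypothetical, and the sample numbers are placeholders chosen to show the arithmetic, not measurements from my rig.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One benchmark run: timestamped power readings plus tokens generated."""
    label: str
    samples: list  # list of (seconds_since_start, watts) tuples
    tokens: int


def watt_hours(samples):
    """Integrate power over time with the trapezoidal rule; returns Wh."""
    joules = 0.0
    for (t0, w0), (t1, w1) in zip(samples, samples[1:]):
        joules += (w0 + w1) / 2.0 * (t1 - t0)
    return joules / 3600.0  # 1 Wh = 3600 J


def wh_per_token(run):
    """Energy cost of a single generated token, in watt-hours."""
    return watt_hours(run.samples) / run.tokens


def tokens_per_second(run):
    """Throughput over the sampled window."""
    duration = run.samples[-1][0] - run.samples[0][0]
    return run.tokens / duration


# Placeholder readings (NOT real measurements): same token count,
# different power draw and duration.
runs = [
    Run("fp16", [(0.0, 60.0), (10.0, 62.0), (20.0, 61.0)], tokens=200),
    Run("q4", [(0.0, 30.0), (5.0, 31.0), (10.0, 30.0)], tokens=200),
]

for run in runs:
    print(f"{run.label}: {tokens_per_second(run):.1f} tok/s, "
          f"{wh_per_token(run) * 1000:.3f} mWh/token")
```

Reporting both columns side by side is the point: tok/s alone hides the draw, and average watts alone hides how long the run took. Integrating the samples captures both.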
This isn't an argument against big models. It's an argument for measurement. Once you can see the watts, you start making decisions the benchmark-only world can't even formulate as questions. Sustainable compute starts with a number on a screen.
Cloud model leaderboards are optimized for one thing: making you dependent. They collapse an entire decision space (latency, privacy, cost, control, durability) into a single "quality" score, and then ask you to rent back the answer in perpetuity.
Open weights break that frame. When the weights are on your disk, the model can't be repriced out from under you, can't be deprecated on a vendor's roadmap, and can't log your prompts to someone else's retention policy. The benchmark gap that looks large on a leaderboard often vanishes in practice once you quantize and run locally on the workloads you actually have.
The deeper point is about power. Cloud compute is a landlord relationship. Local compute is ownership. For software that has to keep working in five years — not just demo well next quarter — that distinction is everything.
There is a category of engineering judgment that only forms when the server going down means you get paged. Self-hosting my own projects (Docker, Nginx, Linux, the whole stack) has taught me more about reliability than any course on distributed systems.
Three rules I've internalized the hard way: containerize everything (no "works on my machine"), reverse-proxy with Nginx so you can swap services without downtime, and document your deploy before you need it at 2am — not after.
This essay is queued for full expansion. The thesis is simple: the best way to learn ops is to be your own SRE.
I'm seeking Software Engineering and Applied ML internships for 2026. If your team builds local-first, privacy-respecting, or performance-obsessed software — I'd like to hear from you.