END-TO-END AI PRODUCTS · NEED → AGENTIC FLOWS → INFRASTRUCTURE

Jan Hilgard

I build AI products end-to-end.
From the business need down to the infrastructure.

From the business need, through the agentic flows that run them, down to the infrastructure that makes them work: inference, proxies, data. Most people do one layer; I build the whole stack. 79 merged PRs into vllm-mlx and a self-built LTE proxy pool show how far down that goes. Exited Hosting90 in 2020.

01 ABOUT

20+ years building tech companies and products

I founded Hosting90 in 2002. Eighteen years of building from garage to a 25-person team to an international exit to WY Group in 2020.

Then I took a year off. Went deeper into the AI/ML stack: local LLMs, agentic workflows, inference infrastructure. Realized there's a massive gap between what AI labs publish and what a solo founder can actually run on their own infrastructure.

That gap is what interests me most right now.

So that's what I do now: I take a business need and build the whole working system around it — the agentic flows that run it and the infrastructure underneath. Most people do one layer; I span the full stack, from need to infra. As CTO at MirandaMedia Group I built and led the technical architecture and AI stack of three production AI products — Advanty, Margly, and Discury — agentic systems that make decisions, call tools, and carry out multi-step work on their own. Today I build independently (Surfaced, R&D) and contribute to vllm-mlx.

That span shows in how I chose the inference stack per workload across those three: Advanty's batch work ran fully on owned inference (Qwen 3.6 on vllm-mlx, Apple Silicon); Margly ran on frontier cloud (Google AI) for the reliability its agent orchestration needed; Discury orchestrated both. The 79 PRs I've merged into vllm-mlx — and the LTE proxy pool I built for data access — are how far down the stack I go when the economics demand it.

Alongside my own work, I do audits of inference economics and agentic workflows for AI startups and tech companies, and I'm open to advisory and fractional-CTO engagements.

An open question I'm working through: I'm building Surfaced to apply GEO (Generative Engine Optimization) — getting cited in Google's AI Overview — but I don't yet know how reproducible that is across different niches. Until I have my own proven case studies, it stays a project in development; a content offering without its own track record is just selling promises.

Based near Prague. Czech and English (written). I publish about LLM economics and infrastructure patterns.

02 HOW I BUILD

Most “AI agents” are demos. Production systems need the whole stack underneath.

I build across the whole stack — the agentic flows that run a product, and the inference, proxy, and data infrastructure underneath them. Here's what that looks like at the orchestration layer; the infrastructure proof is below.

  1. Multi-model orchestration

    Routing requests between local models (Qwen, Gemma) and cloud APIs (Claude, GPT) based on task complexity and cost. In production this saved 60–80% compared with a pure-cloud setup.

    Built into Discury's high-volume agent tasks — owned Qwen 3.6, with frontier cloud models only where task quality demanded the premium.

  2. Hermes-style tool calling

    No brittle prompt chains. Agent receives a tool set, decides on its own. Requires a strong reasoning model + correct tool granularity. Lessons learned from production deployment.

    Built into Margly for autonomous multi-step orchestration over merchant order, cost, and ad data.

  3. MCP-native architecture

    Model Context Protocol as foundation for tool integration. Practical patterns for context management, error recovery, and debugging multi-step agents.

  4. Production failure modes

    Tool calling loops, hallucinated calls, context window poisoning, infinite retry loops. What I've seen in production and how to fix it.

    Patterns derived from building and running three production AI products as CTO.

  5. Token economics of agentic systems

    Prefill vs generation cost. KV cache reuse. Speculative decoding for agent loops. Practical ROI analyses.

    Why Advanty and Discury were built on owned inference: measured ROI on M3 Ultra vs. cloud API per task class, with requests failing over automatically to public cloud whenever local inference was unavailable.
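
Several of the failure modes named above (tool-calling loops, hallucinated calls, infinite retries) can be bounded with a small guard around the agent loop. The following is a minimal sketch, not the code used in any of the products mentioned here: `ToolCall`, `LoopGuard`, `GuardError`, and the default limits are hypothetical names and values chosen for illustration.

```python
import json
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    arguments: dict


class GuardError(RuntimeError):
    """Raised when the agent loop should stop instead of executing a call."""


@dataclass
class LoopGuard:
    known_tools: frozenset
    max_steps: int = 20     # hypothetical step budget per agent run
    max_repeats: int = 3    # hypothetical cap on identical calls
    steps: int = 0
    seen: Counter = field(default_factory=Counter)

    def check(self, call: ToolCall) -> None:
        """Validate a proposed tool call before it is executed."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise GuardError("step budget exhausted")
        if call.name not in self.known_tools:
            raise GuardError(f"hallucinated tool: {call.name}")
        # Identical name + arguments repeated too often signals a loop.
        key = (call.name, json.dumps(call.arguments, sort_keys=True, default=str))
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            raise GuardError(f"repeated identical call: {call.name}")
```

In this sketch the caller runs `guard.check(call)` before every tool execution and treats `GuardError` as a signal to stop, summarize, or escalate rather than retry.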

I write about this regularly. If you have a production agentic workflow that's bleeding tokens or has failure mode issues, get in touch →

03 WHAT I BUILD

What I'm building now

SOLO · R&D

Surfaced

AI Search Visibility scanner + GEO content methodology.

Measures where a brand is missing from Google's AI Overview and generates content to fill the gap.

Stack: early R&D, not finalized.

Solo · R&D phase.

More info: jan.hilgard@gmail.com

Built as CTO at MirandaMedia Group

Three production AI products where I built and led the technical architecture and AI stack. Past role — not products of mine today.

Advanty

AI-powered competitive intelligence for marketing agencies.

Agents auto-tag ads, extract hooks, classify CTAs, and tag creatives — all as reliable structured outputs.

Stack: Qwen 3.6 on vllm-mlx (Apple Silicon M3 Ultra). A batch-friendly workload with reliable structured outputs — owned inference made sense economically and operationally.

Built and led the technical architecture and AI stack as CTO at MirandaMedia Group.

Margly

Shoptet e-commerce analytics for online merchants.

AI agents identify margin leaks, recommend pricing changes, auto-tag transactions, and run multi-step orchestration over orders, shipping, ad costs, and returns.

Stack: Google AI (Gemini). Chosen deliberately — Margly's complex multi-step tool calling and autonomous orchestration required frontier-model reliability that open-weights models didn't yet match at this task class.

Built and led the technical architecture and AI stack as CTO at MirandaMedia Group.

Discury

Customer intelligence — mines Reddit, Hacker News, and Product Hunt for pain points, trends, and market gaps.

Discovery and classification agents surface signals at high volume; summarization agents distill the nuance worth acting on.

Stack: hybrid orchestration. Discovery and classification agents on Qwen 3.6 / vllm-mlx (high-volume, batch-tolerant); final summarization and nuance-heavy reasoning on Google AI where the per-token premium was justified by output quality. Routing decided per agent task.

Built and led the technical architecture and AI stack as CTO at MirandaMedia Group.
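
Routing decided per agent task, as described for Discury, can be pictured as a lookup from task class to backend with a fallback when the local server is down. This is a sketch under assumptions, not Discury's actual code: the task-class names, the `Backend` labels, and the `route` function are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    LOCAL = "local"   # e.g. an owned vllm-mlx server
    CLOUD = "cloud"   # e.g. a hosted frontier-model API


# Hypothetical mapping of task class to preferred backend.
ROUTES = {
    "discovery": Backend.LOCAL,
    "classification": Backend.LOCAL,
    "summarization": Backend.CLOUD,
}


@dataclass
class Task:
    kind: str
    prompt: str


def route(task: Task, local_healthy: bool = True) -> Backend:
    """Pick a backend for one task; unknown kinds default to cloud."""
    preferred = ROUTES.get(task.kind, Backend.CLOUD)
    # Fail over to cloud when the local server is unavailable.
    if preferred is Backend.LOCAL and not local_healthy:
        return Backend.CLOUD
    return preferred
```

Keeping the table as plain data means the routing policy can be reviewed and changed without touching agent code.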

04 PROOF OF DEPTH

How far down the stack I go — when the economics demand it.

When the unit economics demand it, I go all the way down — to the inference layer and to the data/access layer. vllm-mlx (79 merged PRs) and a self-built LTE proxy pool are the two ends of that story: owned inference that made products affordable to run, and a residential-IP scraping stack that makes gated public data reachable. I do MLX out of efficiency necessity, not as a research specialty.

  • vllm-mlx core contributor

    79 merged PRs to open-source LLM inference for Apple Silicon (581+ stars). Primary implementer of the Anthropic Messages API (/v1/messages), the compatibility layer that makes vllm-mlx work with Claude Code and OpenCode.

    Main areas of work:

    • KV cache quantization: QuaRot live inference, asymmetric K/V bit quantization for prefix cache, TurboQuant R1 Hadamard rotation for outlier-free MoE weight quantization
    • Constrained decoding: JSON schema enforcement, thinking suppression, preamble handling, array-of-objects fixes
    • MLLM infrastructure: logits processor context, token duplication fixes, tools/tool_choice in chat templates
    • Production reliability: client disconnect detection, in-flight token credit on request abort, generation_tps batch stats
    • Streaming: UTF-8-safe incremental decode, tool calls with reasoning parser, leak fixes for Anthropic streaming
    github.com/waybarrios/vllm-mlx
  • Data-access infrastructure (the other end)

    A self-built LTE proxy pool — Raspberry Pis plus consumer MiFi modems on rotating CGNAT residential IPs — that puts scraping traffic on organically residential addresses, with a commercial proxy as hot fallback. Anti-detect scraping across Cloudflare / DataDome / Akamai. Real throughput from production pipelines (10k+ requests/day).

    Read: the LTE proxy pool
  • Production batch inference

    Apple M3 Ultra 256GB as primary inference machine. Workloads with 9:1 prefill/generation ratio (image classification, content tagging, structured extraction). 274 tok/s sustained throughput on Gemma 4 26B-A4B at concurrency 8.

  • Hardware economics

    Real ROI analyses: M3 Ultra vs RTX PRO 6000 Blackwell for different workload types. Cost-per-token calculations across cloud providers vs. owned infrastructure. Payback period modeling for hardware investments.

  • Local LLM deployment patterns

    vLLM, SGLang, llama.cpp, MLX. When to use which stack. Quantization tradeoffs. Multi-model serving. Auto-scaling on bare metal vs Kubernetes.
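
The payback modeling mentioned under hardware economics comes down to simple arithmetic: price the workload at cloud per-token rates, subtract the running cost of owned hardware, and divide the purchase price by the monthly saving. The sketch below assumes that simplified model. The function names are hypothetical, and every number in the usage lines is a placeholder, not a measured price or result; only the 9:1 prefill/generation split mirrors the workload shape described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    tasks_per_month: int
    prefill_tokens: int      # input tokens per task
    generation_tokens: int   # output tokens per task


def cloud_cost_per_month(w: Workload, usd_per_m_input: float,
                         usd_per_m_output: float) -> float:
    """Monthly cloud bill when input and output tokens are priced separately."""
    per_task = (w.prefill_tokens * usd_per_m_input
                + w.generation_tokens * usd_per_m_output) / 1_000_000
    return per_task * w.tasks_per_month


def payback_months(hardware_usd: float, owned_opex_per_month: float,
                   cloud_per_month: float) -> Optional[float]:
    """Months until the hardware purchase is recovered; None if it never is."""
    saving = cloud_per_month - owned_opex_per_month
    if saving <= 0:
        return None
    return hardware_usd / saving


# Placeholder inputs for illustration only (not real prices or volumes).
workload = Workload(tasks_per_month=1_000_000,
                    prefill_tokens=9_000, generation_tokens=1_000)
cloud = cloud_cost_per_month(workload, usd_per_m_input=1.0, usd_per_m_output=4.0)
months = payback_months(hardware_usd=10_000, owned_opex_per_month=500,
                        cloud_per_month=cloud)
```

A fuller model would also account for utilization, power, depreciation, and the engineering time needed to run the hardware, all of which this sketch leaves out.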

05 HOW I THINK

A few principles I work by

  1. Cost arbitrage as strategy

    Cost arbitrage is strategy, not preference. Whoever owns the inference stack competes on different terms than whoever pays the OpenAI bill. It is an engineering decision with P&L impact.

  2. Production > novelty

    Trends are expensive. Working production systems = long-term moat. Six months with one provider > three months chasing every new release.

  3. Bridge between tech and business

    I spent 18 years running a tech company — where the CEO chair meant understanding code and cash flow at the same time. Today, when I solve architecture, I see P&L consequences. When I talk to investors, I talk about KV cache too. This combination is rare and that's where the value lives.

  4. Bootstrap by choice, not by default

    I've had an exit. I know what the VC track looks like. I consciously choose bootstrap because for AI infrastructure tooling, profit beats scale. Not dogma — context-aware decision.

  5. Outcomes > activity

    20 years taught me shipping features ≠ creating value. I measure myself and projects by real outcomes (retention, margin, ARR), not activities (PRs, posts, meetings). This perspective only comes after several building/selling cycles.

If this resonates, we might be on the same wavelength.

06 TIMELINE

The journey from the start

  1. TODAY

    Current focus

    Core contributor to vllm-mlx. Building Surfaced (solo, R&D). Open to fractional-CTO and advisory work.

  2. 2025

    vllm-mlx core contributor

    79 merged PRs to vllm-mlx (open-source LLM inference for Apple Silicon, 581+ stars). Authored the Anthropic Messages API compatibility layer that makes vllm-mlx work with Claude Code. Main focus: KV cache quantization (QuaRot, asymmetric, TurboQuant), constrained decoding, MLLM infrastructure, production reliability.

  3. 2026

    Advanty

    Built and led the technical architecture and AI stack of Advanty — AI-powered competitive intelligence for marketing agencies — as CTO at MirandaMedia Group.

  4. 2026

    Margly + Discury

    Built and led the technical architecture and AI stack of Margly (e-commerce profitability analytics for Shoptet) and Discury (customer intelligence platform) as CTO at MirandaMedia Group.

  5. 2023

    Co-founded Lobot.chat

    AI customer-support chatbot for e-commerce. Live today, handed over to the operating team.

  6. 2022

    Shift to AI

    Started working with local LLMs and inference infrastructure.

  7. 2021

    Co-founded GuruWatch

    B2B monitoring dashboard for manufacturers and distributors tracking partner stock and pricing across e-shops. Live today, handed over.

  8. SEPTEMBER 2020

    Hosting90 exit

    Sale of Hosting90 systems s.r.o. to WY Group (operator of the Ignum brand). Transaction publicly announced.

    hostingy.net
  9. 2002

    Founded Hosting90

    Start of entrepreneurial journey in hosting and web services. Operated as Hosting90 systems s.r.o. (Company ID 28545711).

07 PAST PROJECTS

What I shipped before

CO-FOUNDED · HANDED OVER

Lobot.chat

AI customer-support chatbot for e-commerce — resolves up to 98% of inquiries without a human, recommends products and closes sales. Drops into Shopify, WooCommerce, Magento, PrestaShop or OpenCart via a JS snippet. Co-founded; I owned the technical build. Live today, now run by the team.

lobot.chat

CO-FOUNDED · HANDED OVER

GuruWatch

B2B monitoring dashboard for manufacturers and distributors — tracks partner stock levels and pricing across e-shops, with real-time alerts and historical price trends. Customers include Lenovo, Niceboy and Infinix. Co-founded; owned the data pipeline and infrastructure. Live today, handed over.

www.guruwatch.cz

FAQ

Frequently asked questions

Who is Jan Hilgard?
Jan Hilgard is an end-to-end AI product builder based near Prague, Czech Republic. He takes a business need and builds the whole working system around it — the agentic flows that run it and the infrastructure underneath (inference, proxies, data). His 79 merged PRs to vllm-mlx and a self-built LTE proxy pool show how far down the stack he goes. He founded Hosting90 in 2002 and exited it to WY Group in 2020.
What does Jan Hilgard build?
He builds AI products end-to-end — from the business need, through the agentic flows, down to the inference and data infrastructure. Today he builds independently — Surfaced, an early-stage R&D project for AI search visibility — and contributes to vllm-mlx. As CTO at MirandaMedia Group he built and led the technical architecture and AI stack of three production AI products: Advanty, Margly, and Discury. He's open to advisory and fractional-CTO work.
What is vllm-mlx and what is his role in it?
vllm-mlx is open-source LLM inference for Apple Silicon — a vLLM fork with an MLX backend. Jan has merged 79 PRs, including the Anthropic Messages API compatibility layer that makes it work with Claude Code, plus KV cache quantization and constrained decoding. For him it's proof of depth — how far down the stack he goes when the economics require owned inference — an efficiency necessity, not a research specialty.
How does he decide between owned and cloud inference?
It depends on the workload, and the three products he built as CTO at MirandaMedia Group show the range: Advanty ran fully on owned inference (Qwen 3.6 on vllm-mlx, Apple Silicon); Margly ran on frontier cloud (Google AI) for agent-orchestration reliability; Discury orchestrated both.
Is Jan Hilgard available for work?
He's open to fractional CTO engagements, advisory on inference economics or agentic architecture, short-term technical due diligence, and speaking or podcasts. The best contact is jan.hilgard@gmail.com.

Let's work together

Email or LinkedIn — written communication in Czech or English, same speed.
For calls, I'm strongest in Czech; English calls work best when scheduled with a clear agenda. I usually reply same day.

I'm open to

  • Fractional CTO engagements for AI / infrastructure startups
  • Advisory work where inference economics or agentic architecture decisions are in play
  • Short-term technical due diligence — AI products, inference stacks, scraping infrastructure
  • Speaking and podcasts on production AI infrastructure, owned-inference economics, or the Hosting90 → AI transition

Not currently looking for

  • Full-time relocation roles outside the Czech Republic
  • Projects requiring more than ~20 hours per week