In the rapidly evolving world of AI, major players are racing to build the strongest models. As a result, models keep getting bigger and more complex. But as many specialists point out, we’re approaching a point of diminishing returns, where larger and more complex models deliver smaller performance gains. Many experts now argue that LLMs (Large Language Models) alone won’t lead us to AGI (Artificial General Intelligence), and that new approaches are needed.
At the same time, some companies are investing in Small Language Models (SLMs), sparking the debate: are SLMs worth the investment if LLMs can do so much more?
LLMs (like ChatGPT, Claude, Gemini, Grok, or LLaMA) are powerful, general-purpose models trained on vast datasets. They excel in complex reasoning, broad context understanding, and creative problem-solving. However, this power comes with trade-offs: high computational costs, slower response times, and heavy infrastructure requirements.
SLMs, on the other hand, are nimble, specialized, and resource-efficient. Models like Mistral 7B or Phi-3 are designed for low-latency performance (faster response times) on edge devices, even smartphones. They are fine-tuned for specific tasks and offer more control (less bias and fewer hallucinations), simpler operations and maintenance, and greater transparency. Additionally, due to their smaller size, they are significantly cheaper to train and deploy.
Choosing between LLMs and SLMs depends on the context. But just like in modern software, where we use modules, microservices, and API ecosystems, the real opportunity may lie in combining both approaches. What if the key is not choosing one model, but orchestrating many?
Imagine an AI architecture where the LLM is no longer the sole engine behind every task. Instead, it acts as a context-aware router or orchestrator:
It understands the intent and context of the query
It selects the most appropriate SLM(s) or external tools (e.g. a math library or knowledge base)
It routes the task to the right modules
It aggregates, interprets, or refines the results (and potentially iterates with the same or different SLMs)
It composes a meaningful, final response for the user
This is not science fiction; it is a scalable, efficient paradigm. Just as CPUs hand off work to GPUs or route I/O to dedicated chips, the future of AI could be built on modular intelligence.
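To make the idea concrete, here is a minimal sketch of such an orchestration loop in Python. Everything in it is hypothetical: the intent labels, the specialist registry, and the keyword-based classify_intent function stand in for what would, in practice, be a generalist LLM and real SLM or tool endpoints.

```python
# Minimal sketch of an LLM-as-orchestrator loop (all names are hypothetical).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Specialist:
    name: str
    handler: Callable[[str], str]  # an SLM endpoint or a non-AI tool

# Execution layer: a registry of specialists keyed by the kind of work they do.
SPECIALISTS = {
    "math": Specialist("math-slm", lambda q: f"[math-slm answer to: {q}]"),
    "ocr": Specialist("ocr-slm", lambda q: f"[ocr-slm output for: {q}]"),
    "general": Specialist("general-slm", lambda q: f"[general answer to: {q}]"),
}

def classify_intent(query: str) -> str:
    """Routing layer: in practice this call would go to a generalist LLM;
    a simple keyword check stands in for it here."""
    if any(tok in query.lower() for tok in ("integral", "solve", "sum")):
        return "math"
    if "scan" in query.lower():
        return "ocr"
    return "general"

def orchestrate(query: str) -> str:
    intent = classify_intent(query)          # 1. understand intent and context
    specialist = SPECIALISTS[intent]         # 2. select the right module
    raw_result = specialist.handler(query)   # 3. route the sub-task
    # 4./5. aggregate and compose: the generalist LLM would normally rephrase
    # the raw result into a user-facing answer; a template stands in here.
    return f"(via {specialist.name}) {raw_result}"

print(orchestrate("Solve the integral of x^2"))
```

In a real system the routing step would itself be an LLM call, and the registry would point to network endpoints rather than in-process functions.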
This layered architecture opens the door to a thriving AI ecosystem:
Routing layer: A generalist LLM interprets the request and dynamically selects the best execution path
Execution layer: A mix of SLMs and non-AI components (e.g. logic engines, search APIs) handle specific sub-tasks
Feedback loop: The routing layer improves its strategy based on intermediate results
Composable outputs: A final response is assembled from multiple sources
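Continuing the same hypothetical sketch, the feedback loop and composable outputs could look roughly like this: if an intermediate result looks weak, the router reroutes to another module before assembling the final answer. The score function and the inline handlers are placeholders, not a real API.

```python
# Hypothetical sketch of the feedback loop and output composition.
def score(result: str) -> float:
    """Placeholder for the quality check the routing layer would perform on
    intermediate results (confidence scores, self-critique, validators...)."""
    return 0.3 if "unsure" in result else 0.9

def run_with_feedback(query: str, candidates: list) -> str:
    partial_outputs = []
    for handler in candidates:           # candidates ordered by expected fit
        result = handler(query)
        partial_outputs.append(result)
        if score(result) >= 0.8:         # good enough: stop rerouting
            break
    # Composable output: the final response is assembled from whatever the
    # specialists produced, then polished by the generalist LLM.
    return " | ".join(partial_outputs)

answer = run_with_feedback(
    "Extract the totals from this scanned invoice",
    [lambda q: "unsure: low-resolution scan",    # first SLM struggles
     lambda q: "totals: 1,240.50 EUR"],          # fallback module succeeds
)
print(answer)
```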
Each module can be independently developed by niche players; think of it as a super-app for AI.
These players might specialize in:
Specific tasks, e.g. image generation, audio generation, OCR, speech-to-text conversion, document format conversion, video generation, math problem solving, statistical analysis…
Specific domains, e.g. financial services, particular programming languages, vendor-specific tools with embedded knowledge/manuals (e.g. SAP, Salesforce…)
This approach allows you to keep your preferred LLM while gaining access to best-of-breed specialist models. While some routing already happens internally within LLMs, exposing this functionality to an open ecosystem would be far more powerful.
LLMs could offer default integrations with selected partners, while also enabling users to plug in third-party routing services, potentially as a paid feature. We could even imagine plugging in internal models in a secure, privacy-aware way, allowing companies to combine an internally controlled model (trained on sensitive or classified data) with the rapid evolution of general-purpose LLMs and specialized SLMs.
The analogy to the emerging trend of payment orchestration platforms is useful. Where merchants used to connect to a single PSP (payment service provider), many now use orchestration platforms that route payments to different PSPs depending on method, cost, or availability. This routing also increases reliability, enabling fallback if one PSP is down or a method is temporarily disabled. The same logic applies to AI orchestration: the router could optimize based on model quality, domain expertise, cost per call, or availability.
Given the resource-intensive nature of AI models, SLMs might adopt dynamic pricing based on load (i.e. higher prices during peak usage). This would allow the LLM to reroute intelligently, improving load balancing and reducing the need for overprovisioning.
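A routing policy along those lines might weigh quality, current price, and availability for each candidate model, as in the illustrative sketch below; the model names, numbers, and selection rule are invented purely for illustration.

```python
# Illustrative-only routing policy: pick the cheapest available specialist
# that meets a quality bar, so load spikes (reflected in dynamic prices)
# naturally push traffic toward less busy models.
CANDIDATES = [
    {"name": "finance-slm-a", "quality": 0.92, "price_per_call": 0.004, "available": True},
    {"name": "finance-slm-b", "quality": 0.88, "price_per_call": 0.001, "available": True},
    {"name": "finance-slm-c", "quality": 0.95, "price_per_call": 0.012, "available": False},
]

def pick_model(candidates, min_quality=0.85):
    usable = [c for c in candidates if c["available"] and c["quality"] >= min_quality]
    if not usable:
        raise RuntimeError("no specialist available; fall back to the generalist LLM")
    return min(usable, key=lambda c: c["price_per_call"])

print(pick_model(CANDIDATES)["name"])   # -> finance-slm-b
```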
This architecture enables organizations to combine cost-efficiency with high performance, retain control over sensitive data, and reduce hallucination and bias—flaws often seen in monolithic LLMs.
The next wave of AI innovation might be powered by collaboration, not competition. It’s similar to the evolution of Fintechs: after initially trying to disrupt traditional banks, many are now embedding their services within the customer layers of those same institutions.
In my view, this collaboration will give rise to a landscape of specialized AI models, much like a modular API ecosystem or super-app, with orchestration at its core. The result will be stronger than the sum of its parts, in contrast to the all-in-one approach pursued by some tech giants like OpenAI.
There may even be a unique opportunity for European players. Instead of trying to catch up in the race for massive models (a race hindered by high energy costs and infrastructure gaps), Europe could lead by creating orchestrated ecosystems of best-of-breed, niche models that are focused, efficient, and highly performant.
The real question isn’t whether to use a large or small model; it’s how to combine them intelligently for the best result.
