bryant 9 hours ago

A few weeks ago, I processed a product refund with Amazon via an agent. It was simple, straightforward, and surprisingly obvious that it was backed by a language model, judging by how it responded to my frustration at being asked so many questions. But in the end, it processed my refund without ever connecting me to a human being.

I don't know whether Amazon relies on LLMs or SLMs for this and similar interactions, but it makes a ton of financial sense to use SLMs for narrowly scoped agents. In use cases like customer service, most of an LLM's intelligence is wasted on the tasks these agents are built for.

Wouldn't surprise me if down the road we start recommending role-specific SLMs rather than general LLMs as a mitigation for both ethics and security risks, too.

  • automatic6131 6 hours ago

    You can (used to?) get a refund on Amazon with a normal CRUD app flow. Putting an SLM and a conversational interface over it is a backwards step.

    • DebtDeflation an hour ago

      Likewise, we've been building conversational interfaces for stuff like this for over a decade using traditional NLP techniques and orchestration rather than LLMs. Intent classification, named entity extraction, slot filling, and API calling.

      Processing returns is pretty standardized - identify the order and the item within it being returned, capture the reason for the return, check eligibility, capture whether they want a refund or a new item, and execute either. When you have a fully deterministic workflow with 5-6 steps, maybe 1 or 2 if-then branches, and then a single action at the end, I don't see the value of running an LLM in a loop, burning a crazy amount of tokens, and hoping it works at least 80% of the time when there are far simpler and cheaper ways of doing it that will work almost 100% of the time.
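
      For the skeptical, a minimal sketch of that flow in Python. The eligibility check is a stub and the keyword "classifier" stands in for a real intent model, so treat every name below as hypothetical:

          def is_eligible(order_id, item_id):
              return True  # stub: a real system would call an eligibility API

          def classify_reason(text):
              # Stand-in "intent classification": keyword matching over a closed set.
              for reason in ("damaged", "wrong item", "no longer needed"):
                  if reason in text.lower():
                      return reason
              return "other"

          def process_return(order_id, item_id, reason_text, wants_refund):
              if not is_eligible(order_id, item_id):
                  return "return denied"
              action = "refund" if wants_refund else "replacement"
              return f"{action} issued for {item_id} ({classify_reason(reason_text)})"

          print(process_return("114-55", "B0XYZ", "arrived damaged", wants_refund=True))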

      • zsyllepsis an hour ago

        True, we have been building conversational interfaces with traditional NLP. In my experience, they’ve been fairly fragile.

        Extending the example you gave: nicely packaged, fully deterministic workflows work great in demos. Then customers start going off the paved path. They ask about returning 3 items all at once, or a whole order. They get confused and provide a shipping number instead of an order number. They switch languages partway through the conversation because they get frustrated by all the follow-up questions.

        All of these absolutely can be handled through traditional NLP, but they require system designers to anticipate them, model the conversation, and design the system to react accordingly. And suddenly the 5-6 step deterministic workflow with a couple of if-branches… isn't.

    • sillytwice 4 hours ago

      Recently I sent a product under warranty to Amazon for repair using a shipping label, paying €42, and the next day they cancelled the label (they are investigating why) and the product was rejected at the Amazon warehouse. Now, following their instructions, I have to wait for the product to come back and resend it, paying another €42 that they promise to refund later. The product is an air conditioning unit for an elderly woman; here in southern Spain the temperature is very high (a heat wave), and I think the repair will take a long time (judging by the maker's repair service, described below).

      The product's maker, Cecotec, from whom I hope never to buy anything again, uses the following repair process: you have to create an account to submit your data, and then when you try to log in to the service page, the system announces that the account you created does not exist (I have tried several times, over several days, with both my and my wife's email and personal data). Furthermore, there is no way to get anyone at Cecotec to pick up the telephone.

      Another step that failed: I asked the Spanish postal service for some option to send the product again from its current location (where it is being held awaiting the sender's instructions) to the Amazon warehouse, avoiding the round trip back to me. They informed me that they cannot respond to my email because privacy law forbids it, perhaps because the product was sent under my wife's name and address. I don't suppose the air conditioning unit has any personal information attached to it.

      Buddhism says that you can learn from any experience; maybe the karma of this product is never to be repaired, and fate has decided to condemn the lonely old woman to suffer the heat wave, perhaps to burn off some bad karma from a past life.

      Fortunately, all that comes also goes, and in this case I am happy to be able to afford a new machine that produces cold air, so, kind reader, rest easy: there is no real problem. Furthermore, this experience can expand my empathy: it could be that for some people life doesn't work as it should, and for them this anecdote is the normal course of events, where one problem calls up another. For them, a stream of problems is the only REPL. To them I most sincerely wish peace of mind, and I hope their fortunes reverse.

      In this anecdote, human intelligence is not producing correct results, and that could create a hope: that small language models powering intelligent agents could provide better support.

      Today, in my current mental state, I envision that the contrary could occur: those systems could convert the bad things that happen once into the bad things that happen every day. So be careful what you wish for.

      My hope is that the greatest agents, ourselves, get together to solve whatever problems we have to cope with. But don't fool yourself: that requires real, deep human intelligence and hard human work.

      • sillytwice a minute ago

        Solving the stream of problems with a REPL (read-eval-print loop) looks like this in Python:

            while life:
                problem = read(bureaucracy, bad_luck, bad_ai)
                response = eval(problem, resources=few)
                print(response)  # spoiler: '404 Solution not found'

    • oblio 5 hours ago

      From our perspective as users. From the company's perspective? Net positive: they don't need to hire people.

      We're going to be so messed up in a decade or so when only 10-20-30% of the population is employable in decent jobs.

      People keep harping on about people moving on with their lives, but people don't. Many industrial heartlands in the developed world are wastelands compared to what they were: Wallonia in Belgium, Scotland in the UK, the Rust Belt in the US.

      People don't really move on, they suffer, sometimes for generations.

      • thatjoeoverthr 4 hours ago

        A CRUD flow is the actual automation, which was already digested into the economy by 2005 or so. PHP is not a guy in the back who types HTML really fast when you click a button :)

        The LLM, here, is the opposite: additional human labor to build the integrations, additional capital for chips, the heavy cost of inference, an additional skeuomorphic UI (it self-identifies as a chat/texting situation), and your wasted time. I would almost call it "make-work".

  • torginus 6 hours ago

    I just had my first experience with a customer service LLM. I needed to get my account details changed, and for that I needed to use the customer support chat.

    The LLM told me what information they needed and what the process was, after which I followed through the whole thing.

    After I went through it all, it reassured me that everything was in order and my request was being processed.

    For two weeks, nothing happened. I emailed the (human) support staff, and they responded that they could see no such request in their system. It turns out the LLM had hallucinated the entire customer flow and was just spewing BS at me.

    • dotancohen 5 hours ago

      This is reason number two why I always request the service ticket number.

      Reason number one being that when the rep feels you are going to hold them accountable, to the point of requesting such a number, they may decide you're not the type of client to pull shenanigans with. Maybe they suspect me of being a corporate QC agent? Either way, requesting such a number demonstrably reduces friction.

    • ttctciyf 6 hours ago

      There really should be some comeback for this type of enshAItification.

      We're supposed to think "oh it's an LLM, well, that's ok then"? A question we'll be asking more frequently as time goes on, I suspect.

    • koakuma-chan 4 hours ago

      You basically have to always use tool_choice="required" or the LLM will derail.
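
      For example, with an OpenAI-style chat completions API; the update_account_details tool below is a hypothetical stand-in for whatever the backend actually exposes:

          from openai import OpenAI

          client = OpenAI()
          resp = client.chat.completions.create(
              model="gpt-4o-mini",
              messages=[{"role": "user", "content": "Please change my account email."}],
              tools=[{
                  "type": "function",
                  "function": {
                      "name": "update_account_details",  # hypothetical backend tool
                      "parameters": {
                          "type": "object",
                          "properties": {"new_email": {"type": "string"}},
                          "required": ["new_email"],
                      },
                  },
              }],
              tool_choice="required",  # force a tool call instead of a reassuring reply
          )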

    • scarface_74 an hour ago

      That has nothing to do with an LLM. Any chat-based system, whether LLM or not, is going to interpret the human input and convert it to a standardized request for backend processing. This is just a badly written system.

    • thatjoeoverthr 4 hours ago

      The LLM is a smoke bomb they shot in your face :)

    • exe34 6 hours ago

      That's why I take screenshots of anything that I don't get an email confirmation for.

  • quietbritishjim 5 hours ago

    Air Canada famously lost a court case recently (though the actual interaction happened in 2022) where their chatbot promised a discount that they didn't actually offer. They tried to argue that the chatbot was a "separate legal entity that is responsible for its own actions"!! It still took that person a court case and countless hours to get the discount, so it's hardly a victory really.

    https://www.bbc.co.uk/travel/article/20240222-air-canada-cha...

    • nurettin 5 hours ago

      This is why the law in its current form is wrong in every country and jurisdiction.

      We need "cumulative cases" that work like this: you submit your complaints to existing cumulative cases or open a new one, these are vetted by prosecutors.

      They accumulate evidence over time, and once it reaches a respectable sum, a court case is opened (paid for by the corporation) and everyone receives what they are owed if/when the case is won. If the case loses, is appealed, and loses again, that cumulative case is banned.

      Cumulative cases would have greater repercussions for large corporate entities than "single person takes them to court for several months to fight for a $40 discount".

      And the people who complain rightfully eventually get a nice surprise in their bank accounts.

  • scarface_74 an hour ago

    That’s not really an LLM problem. Even sentiment analysis is like the ML version of “Hello World”.

    At most, that could be done with pre-LLM chatbots like classic Alexa, or the AWS equivalent Amazon Lex (which has sentiment analysis built in).

mike_hearn 4 hours ago

I wonder if NVIDIA are worried about how concentrated their AI customer base is. The paper is more like a personal blog post than a scientific investigation.

Anyway, I really love the idea, but many years of experience with decentralization / security / privacy projects make me think it probably won't happen. Their description of how to incorporate SLMs at the end gives the game away: it's a description of a large, complex project that requires fairly good data science skills. E.g. they just casually suggest you scrub the data of PII, run unsupervised clustering over the results to prepare data for fine-tuning with QLoRA, and then set up an automated pipeline to do this continuously. Sure. We'll get right on that.
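
To make the scale of that concrete, even just the QLoRA step is roughly this, a sketch with Hugging Face transformers + peft, where the base model and hyperparameters are illustrative picks and not from the paper:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,                    # the "Q" in QLoRA: 4-bit base weights
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Phi-3-mini-4k-instruct",   # stand-in SLM; pick your own
        quantization_config=bnb,
    )
    lora = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear",
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)       # train only the small adapter weights
    model.print_trainable_parameters()        # typically ~1% of the base model

And that sketch is just the adapter setup; the PII scrubbing, clustering, training loop, evaluation, and continuous retraining pipeline all sit around it.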

The history of computing is pretty simple: given a choice between spending more on hardware or more on developers, we always prefer more hardware. For NVIDIA this is a good thing, modulo the fact that nobody buys their hardware directly because it's too overpowered. But that's the way they've chosen to segment the market. Given a choice between using a sledgehammer to crack a nut or making a custom nutcracker, we're going to spend the money on outsourced LLMs every time.

Perhaps in the future LLMs will create these SLM factories for us! If you assume software is mostly written by LLMs down the line, then past experience about the expressed preferences of software teams might not apply. But we're not there yet.

flowerthoughts 6 hours ago

No mention of mixture-of-experts. Seems related. They do list a DeepSeek R1 distillate as an SLM. The introduction starts with a sales pitch, and there's a call-to-action at the end. This seems like marketing with source references sprinkled in.

That said, I also think the "Unix" approach to ML is right. We should see more splits; however, currently all these tools rely on strong language comprehension. Sure, we might be able to train a model on only English and delegate translation to another model, but that will certainly lose (much-needed) color. So if all of these agents need comprehensive language understanding anyway, to be able to communicate with each other, is an SLM really better than MoE?

What I'd love to "distill" out of these models is domain knowledge that is stale anyway. It's great that I can ask Claude to implement a React component, but why does the model that can do taxes so-so also try to write a React component so-so? Perhaps what's needed is a search engine to find agents. Now we're into expensive marketplace subscription territory, but that's probably viable for companies. It'll create a larger us-them chasm, though, and the winner takes all.

mg 6 hours ago

I wonder how the math turns out when we compare the energy use of local vs remote models from first principles.

A server needs energy to build, house, power, and maintain. It is optimized for throughput and can be used 100% of the time. To use the server, additional energy is needed to send packets through the internet.

A local machine needs energy to build and power. If it lives inside a person's phone or laptop, one could say housing and maintenance are free. It is optimized to have a nice form factor for personal use. It is used maybe 10% of the time or so. No energy for internet packets is needed when using the local machine.

My initial gut feeling is that the server will have much better energy efficiency, measured as the number of calculations it can do over its lifetime divided by the energy it needs over its lifetime. But I would love to see the actual math.
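
A crude version of that math; every number below is a made-up assumption, chosen only to see which effects dominate:

    # Lifetime energy per token, server vs. phone. All figures are guesses.
    EMBODIED_J = {"server": 5e9, "phone": 1e9}   # energy to build the device
    POWER_W    = {"server": 700, "phone": 5}     # draw while doing inference
    UTIL       = {"server": 0.9, "phone": 0.1}   # fraction of lifetime in use
    TOK_PER_S  = {"server": 5000, "phone": 20}   # batched vs. single-stream
    LIFE_S     = 5 * 365 * 24 * 3600             # five-year lifetime

    for kind in ("server", "phone"):
        tokens = TOK_PER_S[kind] * UTIL[kind] * LIFE_S
        joules = EMBODIED_J[kind] + POWER_W[kind] * UTIL[kind] * LIFE_S
        print(f"{kind}: ~{joules / tokens:.2f} J/token")
    # With these guesses the server wins by roughly 20x, almost entirely
    # due to utilization and batching; tweak the numbers and see.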

  • danhor 5 hours ago

    As the local machine is there anyway, only the increase in energy usage should be considered, while the server only exists for this use case (distributed across all users).

    The local machine is usually also highly constrained in computing power, energy (when battery-driven), and thermals, so I would expect the compute needed to be very different. The remote user will happily choose a large(r) model, while for the local use case a highly optimized (small) model will be chosen.

iagooar 5 hours ago

I think that part of the beauty of LLMs is their versatility in so many different scenarios. When I build my agentic pipeline, I can plug in any of the major LLMs, add a prompt to it, and have it go off to do its job.

Specialized, fine-tuned models sit somewhere in between LLMs and traditional procedural code. The fine-tuning process takes time and is a risk if it goes wrong. In the meantime, the LLMs by major providers get smarter every day.

Sure, latency and cost are a thing. But unless you have a very specific task performed at huge scale, you might be better off using an off-the-shelf LLM.

  • incrudible 3 hours ago

    > In the meantime, the LLMs by major providers get smarter every day.

    Are they though? Or are they just getting better at gaming benchmarks?

    Subjectively, there has been modest progress in the past year, but I'm curious to hear other anecdotes from people that aren't firmly invested in the hype.

    • iagooar an hour ago

      If you have used Sonnet 3.5, 3.7 and 4 in the last few months, you know how much the model has improved. I am achieving 3-5x the complexity with the latest Sonnet compared to what was possible with the earlier versions.

      They are getting much much better.

janpmz 9 hours ago

One could start with a large model for exploration during development, and then distill it down to a small model that covers the variety of the task and fits on a USB drive. E.g. when I use a model for gardening purposes, I could prune knowledge about other topics.
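
The distillation step could be the classic soft-label setup; a minimal sketch, with the temperature and the T*T scaling following Hinton et al. (2015), and the toy shapes purely illustrative:

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        # Soften both distributions with temperature T, then minimize their KL.
        soft_student = F.log_softmax(student_logits / T, dim=-1)
        soft_teacher = F.softmax(teacher_logits / T, dim=-1)
        return F.kl_div(soft_student, soft_teacher, reduction="batchmean") * T * T

    # Toy usage: batch of 4, vocab of 10; in practice these come from the two models.
    loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10))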

  • loktarogar 8 hours ago

    Pruning is exactly what you're looking for in a gardening SLM

  • dotancohen 5 hours ago

    What would you need an LLM for while gardening? I'm imagining problem solving, like asking "what worm looks like a small horse hair?" But that would require the LLM to know what a horse hair is. In other words, not a distilled model, but rather a model that contains pretty much anything our gardener's imagination might make analogies out of.

rayxi271828 6 hours ago

Wonder what I'm missing here. A smaller number of repetitive tasks - that's basically just simple coding + some RPA sprinkled on top, no?

Once you've settled down on a few well-known paths of action, wouldn't you want to freeze those paths and make it 100% predictable, for the most part?

  • fnordpiglet 3 hours ago

    An example I’ve done with a small language model: fine-tuning it to evaluate semi-structured user input for whether it’s likely sanctions evasion. The heuristic code for this is very complex, because of the natural language component, and has relatively poor precision and recall. A fine-tuned model classified the inputs pretty reliably and was fairly resilient to adversarial attacks, while heuristics - partly because of their full determinism - were very brittle over time.
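
    The serving side of such a classifier can be very small; a sketch, where the checkpoint name, labels, and example output are all hypothetical:

        from transformers import pipeline

        # Hypothetical fine-tuned SLM checkpoint for sanctions screening.
        classify = pipeline("text-classification",
                            model="my-org/slm-sanctions-screening")

        memo = "pymt for consulting svcs, fwd to acct of third party per instr"
        print(classify(memo))
        # e.g. [{'label': 'LIKELY_EVASION', 'score': 0.93}]  (illustrative output)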

    These sorts of cases, where heuristics are surprisingly incapable while the semantic flexibility of language models is powerful, are surprisingly common. Even flexible validation mechanisms that take human-entered, semi-fixed-form inputs and reformat them to the expected input work really reliably, and are much less frustrating for end users. Essentially any situation where abductive logic or natural language comes into the picture, a small model does really well.

    Both of those things, abductive logic and natural language, were largely unavailable as tools until recently. This pretty nearly rounds out the complete toolkit for making really robust, powerful, and usable systems. You sacrifice a perceived determinism by admitting abductive logic and nondeterminism, but in my experience that warm blanket of the mathematically inclined wasn't particularly robust in reality: systems often failed deterministically in ways that were complex and difficult, if not impossible, to avoid, where a little abductive reasoning could have made them remarkably simpler and more robust.

moqizhengz 6 hours ago

How can SLMs be the future of AI when we are not even sure whether LMs are the future of AI?

  • umtksa 4 hours ago

    I know this isn’t an answer to your question, more a general observation about LMs. However, I’d still like to say that a fine-tuned Qwen3 0.6B model can produce more effective and faster results than a raw Gemma 3 12B model. Maybe it’s because I’m not a programmer, but I believe being able to give commands in natural language adds a great deal of flexibility to software.

  • boxed 6 hours ago

    "Future" maybe means "next two months"? :P

sReinwald 4 hours ago

IMO, the paper commits an omission that undermines the thesis quite a bit: context window limitations are mentioned only once in passing (unless I missed something) and then completely ignored throughout the analysis of SLM suitability for agentic systems.

This is not a minor oversight - it's arguably, in my experience, the most prohibitive technical barrier to this vision. Consider the actual context requirements of modern agentic systems:

    - Claude 4 Sonnet's system prompt alone is reportedly roughly 25k tokens for the behavioral instructions and instructions for tool use
    - A typical coding agent needs: system instructions, tool definitions, current file context, broader context of the project it's working in. Additionally, you might also want to pull in documentation for any frameworks or API specs.
    - You're already at 5-10k tokens of "meta" content before any actual work begins
Most SLMs that can run on consumer hardware are capped at 32k or 128k context architecturally, but depending on what you consider a "common consumer electronic device", you'll never be able to make use of that window if you want inference at reasonable speeds. A 7B or 8B model like DeepSeek-R1-Distill or Salesforce xLAM-2-8b would take 8GB of VRAM at a Q4_K_M quant with a Q8_0 K/V cache at 128k context. IMO, that's not just simple consumer hardware in the sense of the broad computing market; it's enthusiast gaming hardware. Not to mention that performance degrades significantly before hitting those limits.
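
A back-of-envelope check of that K/V-cache math; the architecture numbers below are assumptions for a typical 8B GQA model, not measurements of any specific checkpoint:

    # K/V cache = 2 (K and V) x layers x kv_heads x head_dim x context x bytes.
    layers, kv_heads, head_dim = 32, 8, 128   # assumed 8B-class GQA architecture
    context_len = 131_072                     # 128k tokens
    bytes_per_val = 1                         # Q8_0 K/V cache

    kv_bytes = 2 * layers * kv_heads * head_dim * context_len * bytes_per_val
    print(f"K/V cache alone: {kv_bytes / 2**30:.0f} GiB")  # ~8 GiB
    # Add ~4-5 GiB of Q4_K_M weights and you're past any midrange consumer GPU.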

The "context rot" phenomenon is real: as the ratio of instructional/tool content to actual tasks content increases, models become increasingly confused, hallucinate non-existent tools or forget earlier context. If you have worked with these smaller models, you'll have experienced this firsthand - and big models like o3 or Claude 3.7/4 are not above that either.

Beyond context limitations, the paper's economic efficiency claims simply fall apart under system-level analysis. The authors present simplistic FLOP comparisons while ignoring critical inefficiencies:

    - Retry tax: a complex task that an LLM completes with a 90% success rate might take an SLM 3 or 4 attempts, each with full orchestration overhead (see the sketch below)
    - Task decomposition overhead: Splitting a task that an LLM might be able to complete in one call into five SLM sub-tasks means 5x context setup, inter-task communication costs, and multiplicative error rates
    - Infrastructure efficiency: Modern datacenters achieve PUE ratios near 1.1 with liquid cooling and >90% GPU utilization through batching. Consumer hardware? Gaming GPUs at 5-10% utilization, residential HVAC never designed for sustained compute, and 80-85% power conversion efficiency per device.
When you account for failed attempts, orchestration overhead and infrastructure efficiency, many "economical" SLM deployments likely consume more total energy than centralized LLM inference. It's telling that NVIDIA Research, with deep access to both datacenter and consumer GPU performance data, provides no actual system-level efficiency analysis.
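
The retry tax and the multiplicative error rates are easy to put numbers on; the success probabilities below are illustrative assumptions:

    p_llm, p_slm, subtasks = 0.90, 0.60, 5    # assumed per-attempt success rates

    expected_attempts = 1 / p_slm             # mean of a geometric distribution
    chained_success = p_slm ** subtasks       # all 5 sub-tasks must succeed

    print(f"~{expected_attempts:.1f} attempts per SLM sub-task")   # ~1.7
    print(f"{chained_success:.0%} end-to-end vs {p_llm:.0%} LLM")  # 8% vs 90%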

For a paper positioning itself as a comprehensive analysis of SLM viability in agentic systems, sidestepping both context limitations and true system economics while making sweeping efficiency claims feels intellectually dishonest. Though, perhaps I shouldn't be surprised that NVIDIA Research concludes that running language models on both server and consumer hardware represents the optimal path forward.

ewuhic 5 hours ago

Why is this a paper and not a blog post? Anyone who thinks it deserves to be a paper is either dumb or a snake-oil salesman.

chesterhunt20 3 hours ago

Absolutely agree: small language models are a crucial step toward scalable, efficient agentic AI. Their reduced computational footprint enables faster, on-device reasoning, which is essential for real-time decision-making in autonomous agents. Platforms like Legittmate AI are already exploring how small models can be fine-tuned for secure, task-specific workflows in legal and contract automation. With enhanced fine-tuning and optimization, SLMs are becoming increasingly capable without the heavy cost and latency of larger models. The future of agentic AI isn't just about power; it's about precision, privacy, and portability.

  • bglazer 2 hours ago

    Posting LLM-generated ads on HN is the fastest way for me to lose any respect or interest in a company