Instead of just generating the next response, it simulates entire conversation trees to find paths that achieve long-term goals.
How it works:
- Generates multiple response candidates at each conversation state
- Simulates how conversations might unfold down each branch (using the LLM to predict user responses)
- Scores each trajectory on metrics like empathy, goal achievement, coherence
- Uses MCTS with UCB1 to efficiently explore the most promising paths
- Selects the response that leads to the best expected outcome
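
For anyone who wants to see the loop above in code, here is a minimal sketch. It is not the project's implementation. The helpers `generate_candidates`, `simulate_user`, and `score`, the fixed candidate replies, and the parameters `max_depth` and `rollout_turns` are all hypothetical stand-ins for what would be LLM calls in a real system.

```python
import math
import random
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical stand-ins for LLM calls; a real system would prompt a model here.
CANDIDATES = ["I hear you. What happened?", "Here is a link.", "Just try harder."]
POSITIVE_RATE = {CANDIDATES[0]: 0.9, CANDIDATES[1]: 0.5, CANDIDATES[2]: 0.1}


def generate_candidates(history: List[str]) -> List[str]:
    return list(CANDIDATES)


def simulate_user(reply: str, rng: random.Random) -> str:
    # Toy user model: some replies are more likely to get a positive reaction.
    return "thanks" if rng.random() < POSITIVE_RATE[reply] else "not helpful"


def score(history: List[str]) -> float:
    # Toy metric: share of simulated user turns that were positive.
    # history = [user, bot, simulated user, bot, simulated user, ...]
    user_turns = history[2::2]
    return sum(t == "thanks" for t in user_turns) / max(len(user_turns), 1)


@dataclass
class Node:
    history: List[str]
    reply: Optional[str] = None          # bot reply that led to this node
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total: float = 0.0


def ucb1(child: Node, parent_visits: int, c: float = 1.4) -> float:
    if child.visits == 0:
        return math.inf                  # always try unvisited children first
    mean = child.total / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def search(user_message: str, iterations: int = 60, max_depth: int = 2,
           rollout_turns: int = 1, seed: int = 0):
    rng = random.Random(seed)
    root = Node(history=[user_message])
    for _ in range(iterations):
        # 1. Selection: follow the highest UCB1 score down to a leaf.
        node, depth = root, 0
        while node.children:
            parent = node
            node = max(parent.children, key=lambda ch: ucb1(ch, parent.visits))
            depth += 1
        # 2. Expansion: one child per candidate reply plus a predicted user turn.
        if depth < max_depth:
            for reply in generate_candidates(node.history):
                user = simulate_user(reply, rng)
                node.children.append(Node(node.history + [reply, user], reply, node))
            node = node.children[0]
        # 3. Simulation: play a few random turns past the leaf, then score.
        history = list(node.history)
        for _ in range(rollout_turns):
            reply = rng.choice(generate_candidates(history))
            history += [reply, simulate_user(reply, rng)]
        value = score(history)
        # 4. Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.total += value
            node = node.parent
    best = max(root.children, key=lambda ch: ch.visits)
    return best.reply, root


if __name__ == "__main__":
    reply, root = search("I had a rough day")
    print(reply, [(ch.reply, ch.visits) for ch in root.children])
```

Returning the most-visited root child is one common choice; picking the child with the highest mean score is another. Note that every expansion and rollout step would be an LLM call in practice, which is where the query count comes from.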
Limitations:
- Scoring is done by the same LLM that generates responses
- Branch pruning is naive - just threshold-based instead of something smarter like progressive widening
- Memory usage grows with tree size; there is currently no node recycling
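
For context on the progressive widening mentioned above: the usual idea is to cap a node's child count by a slowly growing function of its visit count, so new branches are only opened once the existing ones have been sampled. A rough sketch, with the constants `c` and `alpha` and both function names chosen here purely for illustration (exact formulas vary between papers):

```python
import math


def max_children(visits: int, c: float = 1.5, alpha: float = 0.5) -> int:
    # Child budget grows sublinearly with visits: ceil(c * visits^alpha), at least 1.
    return max(1, math.ceil(c * visits ** alpha))


def may_widen(num_children: int, visits: int) -> bool:
    # Only ask for another candidate reply while the node is under its budget.
    return num_children < max_children(visits)
```

With these illustrative defaults a node gets 1 child slot at 0 visits, 3 at 4 visits, and 6 at 16 visits, so candidate generation is spread over the search instead of being paid for up front.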
Dammit, I saw MCTS and thought it’d be something neat, but then it’s LLMs because of course every piece of tech news is LLMs.
Making the LLM use an LLM to figure out what to say almost feels like a pretty good tech news shitpost
overthinking how someone might react to what it could say in multiple branches with growing resource usage
This is it, if they can get it to reliably decide to just not say anything at all in the end then I have been fully replaced
Lol same, what I was thinking while reading the features was “wow, they found a way to simulate masking!”
They have finally done it: they’ve figured out a way to make LLMs even heavier computation-wise
I mean, LLMs have gotten orders of magnitude more efficient in just the past year, but using these types of approaches might also make it possible to use much smaller models and iterate on the result.
Interesting. I’m not sophisticated enough to judge this particular implementation but the concept of generating entire conversation trees to judge the quality of an output intrigues me for sure and I’d be interested in reading more about it and any research around it. Got any good links for further reading?
I think that’s an interesting approach as well. There are a bunch of research papers on using MCTS with LLMs, a few examples here:
https://arxiv.org/abs/2503.19309
https://arxiv.org/abs/2505.23229
https://arxiv.org/abs/2504.02426
This seems interesting, but 28 queries per response (in the demo shown) is a whole lot of compute