LLM for RL

Dawid Laszuk published on
2 min, 300 words

I've been thinking about reinforcement learning (RL) much more recently. It's always been on my mind, but lately it's been getting much more of my attention. Multiple reasons. The strongest one is hearing everything called an "agent" and then trying to understand what the author means by that. To me, it always comes back to the RL definition: an actor that observes some representation of an actual state and acts on it, receiving feedback. That's quite dry for most; they usually mean "an LLM does something in a loop". I guess that works as well. So, if it works in definitions, does it work in "practice"?
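That dry definition is really just a plain loop: observe, act, get feedback. A minimal, self-contained sketch with a toy environment (all names here are illustrative, not from any RL library):

```python
import random

random.seed(0)  # make the toy run reproducible

class ToyEnv:
    """Tiny environment: the state is a counter; reaching 3 gives reward."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Feedback is the next state, a reward, and a done flag.
        self.state = max(0, self.state + action)
        reward = 1.0 if self.state == 3 else 0.0
        return self.state, reward, self.state == 3

def random_actor(state):
    # Observes a representation of the state, returns an action.
    return random.choice([-1, 1])

env = ToyEnv()
state, total_reward, done = env.state, 0.0, False
while not done:
    action = random_actor(state)             # act on the observation
    state, reward, done = env.step(action)   # receive feedback
    total_reward += reward
```

Swap the random actor for anything that maps observations to actions (a Q-table, a neural net, or indeed an LLM in a loop) and the definition still holds.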

To be honest, I've been missing the opportunity to play around with RL a bit more. This might be the time. My old project, ai-traineree, which I haven't updated in 4 years(!) except for small chores, seems to be calling me.

When LLMs were getting popular, there were some attempts at using them as RL agents. Some of those attempts were around playing games, with Pokemon Red being the popular meme. At the time, I thought that was just a waste of time and resources: way too large a model for such a task. But as time went by and the definition of "big" changed, I'm now thinking it's actually a pretty good idea. Not to have the LLM decide on every action, but to augment the agent with planning or "memory" consolidation.

That's a long way of saying I'm starting to play with RL again. There are a lot of great, tiny models. Fine-tuning is trivial. Augmenting a DQN agent with small language models (SLMs) seems like a potential improvement. Or, if not, a fun side project. I think it's going to teach me something.
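To make the idea concrete, here is a hedged sketch of what "augmenting DQN with an SLM" might look like: the DQN side picks actions from Q-values as usual, while a small language model (stubbed out below, since the real call would depend on the chosen model) periodically consolidates recent transitions into a textual "memory" note. Everything here is hypothetical scaffolding, not a working agent:

```python
from collections import deque

def slm_summarize(transitions):
    # Placeholder for a small language model call; a real version might
    # prompt an SLM with the transitions. Here it's a crude summary.
    total = sum(r for _, _, r in transitions)
    return f"last {len(transitions)} steps earned reward {total:.1f}"

class AugmentedAgent:
    """DQN-style action selection plus an SLM memory-consolidation hook."""
    def __init__(self, n_actions, memory_size=32):
        self.n_actions = n_actions
        self.recent = deque(maxlen=memory_size)  # raw transitions
        self.memory_note = ""                    # consolidated summary

    def act(self, q_values):
        # Greedy action from Q-values (exploration omitted for brevity;
        # the Q-values would come from the DQN network).
        return max(range(self.n_actions), key=lambda a: q_values[a])

    def observe(self, state, action, reward):
        self.recent.append((state, action, reward))
        if len(self.recent) == self.recent.maxlen:
            # Consolidate raw experience into a note that a planner
            # (or the next prompt) could consume.
            self.memory_note = slm_summarize(self.recent)

agent = AugmentedAgent(n_actions=2, memory_size=4)
for t in range(4):
    a = agent.act([0.1, 0.9])
    agent.observe(state=t, action=a, reward=1.0)
```

The interesting design question is what the consolidated note feeds into: the SLM's summary could shape exploration, prune the replay buffer, or seed a planner, and that choice is exactly what I want to experiment with.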