reasoning models and in-context reinforcement learning

general reasoning capability and long-context models are reshaping the theoretical landscape of ai

after reading the ai co-scientist paper1, I learned something new about in-context reinforcement learning (icrl)2.

rl and icrl

rl is learning through trial and error. a trajectory is represented by a sequence of (observation, action, reward) tuples. the idea of icrl is to put trajectories from many episodes into the context so that the model implements some rl algorithm in the forward pass, and a better policy emerges without costly parameter updates.
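as a rough illustration, here is what "trajectories in the context" could look like when serialized as plain text. this is only a sketch under my own naming; the `Step` dataclass and `build_icrl_prompt` helper are hypothetical, not from any particular icrl paper.

```python
from dataclasses import dataclass

@dataclass
class Step:
    observation: str
    action: str
    reward: float

def build_icrl_prompt(episodes: list[list[Step]], new_observation: str) -> str:
    """serialize past trajectories into one context so the model can infer a
    better policy in the forward pass, without any parameter update."""
    lines = []
    for i, episode in enumerate(episodes):
        lines.append(f"episode {i}:")
        for step in episode:
            lines.append(f"  observation: {step.observation}")
            lines.append(f"  action: {step.action}")
            lines.append(f"  reward: {step.reward}")
    lines.append("new episode:")
    lines.append(f"  observation: {new_observation}")
    lines.append("  action:")  # the model's completion is the in-context policy's next action
    return "\n".join(lines)
```

the point is that policy improvement happens entirely in text: the only "update" is appending more episodes to the prompt.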

in light of language agents, the boundary between these terms is getting blurry. at the end of the day, learning by trial and error is about learning from a set of (cause, effect) pairs. ReAct3 is a useful mental baseline. the sequence could be reinterpreted as (see the sketch after this list):

  • previous state
  • planning
  • action
  • changed state
  • reflection
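a minimal sketch of one such step record, assuming everything is kept as text; `ReActStep` is a hypothetical name for illustration, not a structure defined in ReAct itself.

```python
from dataclasses import dataclass

@dataclass
class ReActStep:
    previous_state: str  # observation before acting
    planning: str        # internal reasoning about what to do next
    action: str          # the explicit output: tool call, motion, or word
    changed_state: str   # observation after the action lands
    reflection: str      # internal commentary on this (cause, effect) pair

# an episode is just a list of such records, i.e. a set of (cause, effect) pairs
# episode: list[ReActStep] = []
```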

internal/external duality

by the definition of agency4, the separation between the internal self and the external environment is important.

action is the cause, the explicit output; motion or word makes no difference. planning and reflection are internal actions applied to working memory and long-term memory. environment change is external state; memory is internal state. the environment could remain unchanged for a long time while the agent works on its internal states.
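one way to make the duality concrete, again as a sketch under my own naming (none of these types come from a specific paper): internal actions mutate only the agent's memory, while the external observation changes only when an action lands in the environment.

```python
from dataclasses import dataclass, field

@dataclass
class InternalState:
    working_memory: list[str] = field(default_factory=list)    # scratch space for planning
    long_term_memory: list[str] = field(default_factory=list)  # durable notes from reflection

@dataclass
class AgentState:
    internal: InternalState = field(default_factory=InternalState)  # changed by internal actions
    external_observation: str = ""  # changes only when an external action lands in the environment

def plan(state: AgentState, thought: str) -> None:
    """internal action: only working memory changes, the environment does not."""
    state.internal.working_memory.append(thought)

def reflect(state: AgentState, lesson: str) -> None:
    """internal action: consolidate a lesson into long-term memory."""
    state.internal.long_term_memory.append(lesson)
```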

reasoning and icrl

the most interesting part of the ai co-scientist paper is the meta-review agent.

This agent also enables the co-scientist’s continuous improvement by synthesizing insights from all reviews, identifying recurring patterns in tournament debates, and using these findings to optimize other agents’ performance in subsequent iterations.

a few connections could be made:

  • planning and reflection are special cases of reasoning.
  • planning makes the agent act better, which increases the probability of the agent having interesting and useful experiences. good experiences can feed both gradient-free in-context learning and gradient-based learning.
  • reflection is a form of icrl in language space, which is elegantly illustrated by the meta-review agent (see the sketch after this list).
  • effective reasoning over long context would benefit the general search and learning process.
  • faithfulness of reasoning is a critical research problem5.
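to make the "reflection as icrl" point concrete, here is a minimal sketch of a meta-review style loop as I read it, not the co-scientist implementation: reviews accumulate as text, get synthesized into guidance, and that guidance is fed back into the next prompt instead of into the weights. `call_model`, `meta_review_loop`, and the prompt wording are all assumptions.

```python
def call_model(prompt: str) -> str:
    """placeholder for an actual llm call; swap in whatever api you use."""
    raise NotImplementedError

def meta_review_loop(task: str, n_iterations: int) -> list[str]:
    reviews: list[str] = []
    outputs: list[str] = []
    guidance = ""
    for _ in range(n_iterations):
        # the synthesized guidance acts as the improved policy: it lives in the
        # context, not in the weights.
        output = call_model(f"{guidance}\n\ntask: {task}")
        review = call_model(f"critique this output:\n{output}")
        outputs.append(output)
        reviews.append(review)
        # meta-review step: synthesize recurring patterns across all reviews so far
        guidance = call_model(
            "summarize recurring weaknesses and give advice for the next attempt:\n"
            + "\n".join(reviews)
        )
    return outputs
```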

reward hacking

a numerical reward from a human-defined reward function would never solve the reward hacking problem, because projecting a high-dimensional semantic space down to a 1d numerical space is always too lossy to capture critical real-life nuances. 'you get what you wish for' is inevitable under a numerical reward framework.

reflection provides richer feedback. multiagent dynamics6 could afford hierarchical feedback as an emergent property in a properly initialized and evolved society of minds. this is the direction of better incentive design7 for next-generation ai models.


reference: