Rule-Based Reinforcement Learning for Document Image Classification with Vision Language Models

Michael Jungo, Andreas Fischer

公開日: 2025/9/26

Abstract

Rule-based reinforcement learning has been gaining popularity ever since DeepSeek-R1 has demonstrated its success through simple verifiable rewards. In the domain of document analysis, reinforcement learning is not as prevalent, even though many downstream tasks may benefit from the emerging properties of reinforcement learning, particularly the enhanced reason capabilities. We study the effects of rule-based reinforcement learning with the task of Document Image Classification which is one of the most commonly studied downstream tasks in document analysis. We find that reinforcement learning tends to have better generalisation capabilities to out-of-distritbution data, which we examine in three different scenarios, namely out-of-distribution images, unseen classes and different modalities. Our code is available at https://github.com/jungomi/vision-finetune.