HUANG Pei. Actor-Critic algorithm based on generalized advantage estimation in continuous time and space[J]. Journal of Neijiang Normal University, 2026, 41(4): 29-35. DOI: 10.13603/j.cnki.51-1621/z.2026.04.005

    Actor-Critic algorithm based on generalized advantage estimation in continuous time and space

Abstract: This paper proposes a novel Actor-Critic algorithm that integrates continuous-time reinforcement learning with Generalized Advantage Estimation (GAE), addressing the high policy-gradient variance and unstable convergence common in conventional continuous-time methods. By extending GAE's multi-step advantage estimation to the continuous-time domain, we redefine the advantage function in integral form and use it to optimize both the policy evaluation (PE) and policy gradient (PG) processes, reducing variance while preserving the accuracy of the continuous dynamics. Experimental results show that the improved algorithm performs strongly in the MuJoCo Ant-v4 simulation environment, achieving a significant reduction in reward variance at a comparable convergence speed. The proposed algorithm shows substantial application potential in complex control domains characterized by continuous action spaces and sparse rewards.
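The paper's equations are not reproduced on this page; as a minimal sketch of the idea described in the abstract, assuming the standard discrete-time GAE of Schulman et al. and introducing the illustrative symbols $\beta$ (continuous discount rate), $\rho$ (exponential decay rate playing the role of $\gamma\lambda$), and $\delta_t$ (continuous-time TD residual), which are not the paper's notation, the integral-form extension might look as follows. The discrete estimator

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

becomes, with the sum replaced by an integral and the geometric weights by an exponential kernel,

$$\hat{A}_t = \int_{0}^{\infty} e^{-\rho s}\,\delta_{t+s}\,\mathrm{d}s, \qquad \delta_t\,\mathrm{d}t = r(s_t,a_t)\,\mathrm{d}t + \mathrm{d}V(s_t) - \beta V(s_t)\,\mathrm{d}t.$$

As in discrete time, a smaller $\rho$ (longer effective horizon) lowers bias at the cost of higher variance, mirroring the bias-variance trade-off that GAE uses to reduce policy-gradient variance.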

       
