Advanced Maui Optical and Space Surveillance Technologies Conference

Dynamic Sensor Tasking for Space Situational Awareness via Reinforcement Learning


Abstract

This paper studies the Sensor Management (SM) problem for optical Space Object (SO) tracking. The tasking problem is formulated as a Markov Decision Process (MDP) and solved using Reinforcement Learning (RL) with an actor-critic policy gradient approach. The actor provides a policy that is stochastic over actions and is given by a parametric probability density function (pdf). The critic evaluates the policy by estimating the total reward, i.e. the value function for the problem. The parameters of the policy action pdf are optimized using gradients with respect to the reward function. Both the actor and the critic are modeled with deep (multi-layer) neural networks. The policy network takes the current state as input and outputs a probability for each possible action; the policy is executed by sampling actions according to these probabilities. The critic approximates the total reward with its own neural network, and this estimate is used to approximate the gradient of the policy network with respect to its parameters. The approach is used to find a non-myopic optimal policy for tasking optical sensors to estimate SO orbits. The reward function is based on reducing the uncertainty of the overall catalog to below a user-specified threshold, taken here as a 30 km total position error. A negative reward is given as long as any SO has a total position error above this threshold, penalizing policies that take longer to achieve the desired accuracy, and a positive reward is given once all SOs are below the catalog uncertainty threshold. The optimal policy therefore drives the catalog to the desired uncertainty in minimum time. The policy is trained in simulation by letting it task a single sensor and "learn" from its performance. The proposed approach is tested in simulation, and good performance is obtained with the actor-critic policy gradient method.
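For illustration only (not the authors' code), the sketch below shows one way the reward scheme and the actor-critic policy-gradient update described in the abstract could be organized in Python. The 30 km threshold and the negative/positive reward rule come from the abstract; the class names, network sizes, and the state/action encodings are assumptions made for the sketch.

```python
# Minimal actor-critic policy-gradient sketch for sensor tasking (illustrative only).
import torch
import torch.nn as nn

THRESHOLD_KM = 30.0  # catalog-wide total position-error threshold (from the abstract)

def reward(position_errors_km):
    """Negative reward while any SO exceeds the threshold, positive once all are below."""
    return 1.0 if max(position_errors_km) <= THRESHOLD_KM else -1.0

class Actor(nn.Module):
    """Policy network: catalog state -> probability of tasking the sensor to each SO."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions), nn.Softmax(dim=-1))

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Value network: catalog state -> estimated total (discounted) reward."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)

def update(actor, critic, opt_actor, opt_critic,
           state, action_logprob, r, next_state, gamma=0.99):
    """One actor-critic step: the critic's TD error drives both the value fit
    and the policy gradient (critic acts as a baseline for the actor)."""
    with torch.no_grad():
        target = r + gamma * critic(next_state)
    value = critic(state)
    td_error = target - value

    critic_loss = td_error.pow(2)
    actor_loss = -action_logprob * td_error.detach()

    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
```

In a training loop, the action (which SO to observe next) would be sampled from the actor's output distribution, the orbit estimator would propagate and update the catalog covariances, and `reward` would be evaluated on the resulting total position errors before calling `update`.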
