This project explored the effectiveness of intrinsic reward systems in reinforcement learning. I experimented with a hummingbird agent, trained in a Unity virtual environment to collect nectar from flowers. The standard approach rewards the agent only for accomplishing the primary goal (an extrinsic reward system), but I hypothesized that providing additional rewards for behaviors that might lead to the goal could improve training efficiency and performance.
I implemented a virtual environment in Unity in which a hummingbird agent navigated to collect nectar from flowers. The agent's observations included its position and orientation in 3D space, as well as a vector from the tip of its beak to the closest flower's nectar hitbox. The baseline model used a purely extrinsic reward system: the agent was rewarded only when it successfully collected nectar by positioning its beak inside a flower's nectar hitbox.
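To make this concrete, here is a minimal sketch of how the observation and extrinsic-reward logic might look as an ML-Agents C# agent. The class, field, tag, and reward names and values (e.g. `beakTip`, `nearestNectar`, `"nectar"`, and the reward of 1.0) are illustrative assumptions, not the project's exact code.

```csharp
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

// Illustrative sketch of the baseline (extrinsic-only) hummingbird agent.
public class HummingbirdAgent : Agent
{
    public Transform beakTip;        // assumed reference to the tip of the beak
    public Transform nearestNectar;  // assumed reference to the closest flower's nectar hitbox
    const float NectarReward = 1.0f; // assumed extrinsic reward for collecting nectar

    public override void CollectObservations(VectorSensor sensor)
    {
        // Position and orientation of the agent in 3D space (3 + 4 values)
        sensor.AddObservation(transform.localPosition);
        sensor.AddObservation(transform.localRotation.normalized);

        // Vector from the beak tip to the closest flower's nectar hitbox (3 values)
        sensor.AddObservation(nearestNectar.position - beakTip.position);
    }

    // Extrinsic reward: granted only when the beak reaches a nectar hitbox
    // (assumes the nectar colliders are triggers tagged "nectar").
    void OnTriggerEnter(Collider other)
    {
        if (other.CompareTag("nectar"))
        {
            AddReward(NectarReward);
        }
    }
}
```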
I designed a novel location-based intrinsic reward system that supplemented the primary goal with additional rewards for productive exploration. This involved creating triangular prism-shaped triggers arranged in a circular "pizza slice" pattern at the same Z-level as the flowers. When the agent entered these trigger zones, it received a small reward, and the trigger was then disabled for the remainder of that training episode.
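A minimal sketch of one such trigger zone is shown below; the reward value and the reset behavior are assumptions for illustration, not the project's exact implementation.

```csharp
using Unity.MLAgents;
using UnityEngine;

// Illustrative sketch of one "pizza slice" exploration trigger.
// The prism-shaped collider on this GameObject is marked as a trigger and
// placed at the flowers' Z-level; the reward value is an assumed placeholder.
public class ExplorationRewardZone : MonoBehaviour
{
    const float ExplorationReward = 0.02f; // small intrinsic reward (assumed value)

    void OnTriggerEnter(Collider other)
    {
        // Reward the hummingbird the first time it enters this slice...
        var agent = other.GetComponentInParent<Agent>();
        if (agent != null)
        {
            agent.AddReward(ExplorationReward);

            // ...then disable the zone so it cannot be collected again this episode.
            // The training-area reset logic would re-enable all zones when a new
            // episode begins.
            gameObject.SetActive(false);
        }
    }
}
```

Disabling each slice after its first visit also bounds the total exploration bonus available per episode, which helps keep the intrinsic signal from overwhelming the extrinsic nectar reward.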
The key innovation was rewarding the agent for exploring the Z-level where flowers were located, even before directly interacting with the flowers themselves. This encouraged the agent to stay in productive areas and explore within those areas, potentially discovering more efficient paths to nectar collection.
I trained both the standard extrinsic-only reward model and the enhanced intrinsic+extrinsic reward model using the Unity ML-Agents framework with a Python backend. The training process was managed through an Anaconda environment, and a custom configuration file specified the neural network architecture and hyperparameters. TensorBoard was used to visualize and compare performance metrics between the two approaches over time.
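The actual configuration file is not reproduced here, but a trainer config for a setup like this typically looks something like the following; the behavior name, hyperparameter values, and schema details are illustrative and depend on the ML-Agents version used.

```yaml
behaviors:
  Hummingbird:            # assumed behavior name
    trainer_type: ppo
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 3.0e-4
    network_settings:
      hidden_units: 256
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 5.0e6
    summary_freq: 10000
```

Each run would then be launched from the Anaconda environment with `mlagents-learn` and a distinct run ID, and TensorBoard pointed at the results directory to compare the cumulative-reward curves of the two approaches.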
The location-based intrinsic reward system demonstrated faster convergence to optimal behavior compared to the standard extrinsic-only reward approach. Agents trained with the combined reward system showed more efficient exploration patterns, spending more time in areas with potential rewards and less time in unproductive regions. This project highlights the importance of reward shaping in reinforcement learning and demonstrates how well-designed intrinsic rewards can guide agents toward desirable behaviors even before they achieve the primary goal.