Final Report

Video

AcastaCrafterProgressVideo

Project Summary

Our project aimed to develop an AI agent capable of autonomously navigating and performing tasks in a Minecraft environment, specifically focusing on the “MineRLTreechop-v0” environment. Initially, we intended to train an agent to obtain diamonds in a complex, underground setting. However, due to various technical challenges, including compatibility issues with HPC3 and the complexity of the environment, we shifted our focus to mastering the simpler task of chopping trees.

The primary objective became training an agent to efficiently locate and chop down trees in the “MineRLTreechop-v0” environment. The environment provides only visual input (64x64 pixel images) and requires the agent to learn sequences of actions to navigate, interact, and obtain logs.

The challenge lies in the sparse reward structure and the complexity of visual navigation: the agent must learn to interpret visual cues, decide how to move and interact, and persist through long sequences of actions before receiving a single reward. Learning such behavior from pixel-level data and sparse rewards is non-trivial, which makes this a suitable problem for AI/ML methods.

Approaches

Baseline Approach: Random Actions

Initially, we established a baseline by implementing an agent that took random actions. This approach provided a benchmark for comparing the performance of our trained models. The random agent’s performance was predictably poor, as it rarely succeeded in locating or chopping down trees, highlighting the difficulty of the task without learning.
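
A minimal sketch of that baseline is shown below. It assumes the standard MineRL Gym registration and the classic Gym step API; our actual script may have differed in details such as episode handling.

```python
import gym
import minerl  # importing minerl registers the MineRL environments with gym

# Roll out one episode with uniformly random actions.
env = gym.make("MineRLTreechop-v0")
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()       # random dict action each step
    obs, reward, done, info = env.step(action)
    total_reward += reward                    # Treechop gives +1 reward per log
print(f"Logs collected by the random agent: {total_reward:.0f}")
env.close()
```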

Proposed Approach: Behavior Cloning and Proximal Policy Optimization (PPO)

Our main approach involved a two-stage training process: behavior cloning (BC) followed by Proximal Policy Optimization (PPO).

  1. Behavior Cloning (BC):
    • We utilized the MineRL dataset, which includes recordings of human players interacting with the environment.
    • We implemented a data preprocessing script (gen_pretrain_data.py) to convert the recorded human player actions into a discrete action space compatible with our agent (see the discretization sketch after this list).
    • We used the preprocessed dataset to train a model using behavior cloning, aiming to initialize the agent with human-like behavior.
    • The BC pre-training was done with stable-baselines and TensorFlow 1.x.
  2. Proximal Policy Optimization (PPO):
    • After the BC stage, we fine-tuned the model using PPO, a reinforcement learning algorithm.
    • We implemented custom wrappers (wrappers.py) to shape the observation and action spaces into forms suitable for our agent (see the wrapper sketch after this list).
    • We used stable-baselines3 (PPO) with PyTorch and TensorFlow 2.x for the RL training (see the training sketch after this list).
    • We experimented with different hyperparameters, including learning rates and batch sizes, to optimize performance.
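
Below is a minimal sketch of the kind of shaping done in wrappers.py. The class names, the particular seven-action table, and the 10-degree camera step are illustrative assumptions rather than our exact implementation.

```python
import gym
import numpy as np

class DiscreteActionWrapper(gym.Wrapper):
    """Expose a small discrete action set on top of MineRL's dict action space."""

    def __init__(self, env, camera_angle=10.0):
        super().__init__(env)
        # Each entry overrides a few keys of the all-zero ("noop") dict action.
        self._actions = [
            {"forward": 1},                                  # walk forward
            {"forward": 1, "jump": 1},                       # jump while moving forward
            {"attack": 1},                                   # chop the block in view
            {"camera": np.array([0.0, camera_angle])},       # turn right
            {"camera": np.array([0.0, -camera_angle])},      # turn left
            {"camera": np.array([camera_angle, 0.0])},       # tilt camera one way
            {"camera": np.array([-camera_angle, 0.0])},      # tilt camera the other way
        ]
        self.action_space = gym.spaces.Discrete(len(self._actions))

    def step(self, action_idx):
        action = self.env.action_space.noop()  # MineRL dict spaces provide a noop() helper
        for key, value in self._actions[action_idx].items():
            action[key] = value
        return self.env.step(action)


class PovOnlyObservation(gym.ObservationWrapper):
    """Keep only the 64x64 RGB 'pov' image from MineRL's observation dict."""

    def __init__(self, env):
        super().__init__(env)
        self.observation_space = env.observation_space.spaces["pov"]

    def observation(self, obs):
        return obs["pov"]
```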
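
The core of gen_pretrain_data.py is mapping each recorded human action (a dict of movement, attack, and camera inputs) onto one of the discrete actions above. A hedged sketch of such a mapping follows; the threshold value and the priority order are assumptions for illustration.

```python
def human_action_to_index(action, camera_threshold=5.0):
    """Map one frame of a human action dict from the MineRL dataset onto an
    index of the discrete action table in DiscreteActionWrapper (illustrative)."""
    pitch, yaw = action["camera"]
    # Camera movement takes priority, since aiming at a tree matters most.
    if abs(yaw) > camera_threshold:
        return 3 if yaw > 0 else 4     # turn right / left
    if abs(pitch) > camera_threshold:
        return 5 if pitch > 0 else 6   # tilt one way / the other
    if action["attack"]:
        return 2                       # chop
    if action["forward"] and action["jump"]:
        return 1                       # jump forward
    return 0                           # default: walk forward
```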
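
Finally, a minimal sketch of the PPO stage with stable-baselines3, assuming the wrappers above; the hyperparameter values shown are placeholders rather than the settings we actually tuned.

```python
import gym
import minerl
from stable_baselines3 import PPO

# Wrap the environment so it exposes image observations and discrete actions.
env = PovOnlyObservation(DiscreteActionWrapper(gym.make("MineRLTreechop-v0")))

model = PPO(
    "CnnPolicy",            # CNN policy for the 64x64 pixel observations
    env,
    learning_rate=2.5e-4,   # placeholder; we swept learning rates
    n_steps=512,
    batch_size=64,          # placeholder; we swept batch sizes
    verbose=1,
)
model.learn(total_timesteps=2_000_000)
model.save("ppo_treechop")
```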

Challenges and Insights

Evaluation

We evaluated our agent’s performance using both quantitative and qualitative methods.

Quantitative Evaluation

Minecraft

Our best agent collected more than 20 logs over 2,000,000 steps, while the random baseline agent collected none.
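
These counts correspond to summed environment reward, since MineRLTreechop-v0 gives +1 reward per log collected. A hedged sketch of a single evaluation rollout, reusing the wrappers and saved model from the training sketch above, is shown below.

```python
import gym
import minerl
from stable_baselines3 import PPO

env = PovOnlyObservation(DiscreteActionWrapper(gym.make("MineRLTreechop-v0")))
model = PPO.load("ppo_treechop")

obs = env.reset()
done = False
logs_collected = 0.0
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(int(action))
    logs_collected += reward   # +1 reward per log in MineRLTreechop-v0
print(f"Logs collected this episode: {logs_collected:.0f}")
env.close()
```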

Qualitative Evaluation

References

AI Tool Usage