OmniParser/omnitool/readme.md
2025-02-05 11:11:08 -08:00


OmniTool

Control a Windows 11 VM with OmniParser + your vision model of choice.

Highlights:

  1. OmniParser V2 is 60% faster than V1 and now understands a wide variety of OS, app, and in-app icons!
  2. OmniBox uses 50% less disk space than other Windows VMs for agent testing, while providing the same computer use API.
  3. OmniTool supports the following vision models out of the box: OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL), and Anthropic Computer Use.

Overview

There are three components:

omniparserserver: FastAPI server running OmniParser V2.
omnibox: A Windows 11 VM running in a Docker container.
gradio: UI to provide commands and watch reasoning + execution on OmniBox.

Notes:

  1. Although OmniParser V2 can run on a CPU, it is split out as its own server so that you can run it on a GPU machine for speed.
  2. The OmniBox Windows 11 VM Docker container depends on KVM, so it only runs quickly on Windows and Linux. It can run on a CPU machine (no GPU needed).
  3. The Gradio UI can also run on a CPU machine. We suggest running omnibox and gradio on the same CPU machine and omniparserserver on a GPU server.
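
For the split deployment suggested in note 3, the Gradio launch command simply points at two different hosts. The following is a minimal sketch, assuming omniparserserver is reachable on port 8000 of the GPU machine and OmniBox on port 8006 of the local machine (the ports used in the Gradio start command in the Setup section). The helper `gradio_command` and the hostname `gpu-server.example` are hypothetical and exist only for illustration.

```python
import shlex


def gradio_command(omniparser_host: str, windows_host: str = "localhost") -> list[str]:
    """Build the argv for launching the Gradio UI against the two backends.

    Ports 8006 (OmniBox) and 8000 (omniparserserver) are the ones used in this
    README's setup command; adjust them if your servers listen elsewhere.
    """
    return [
        "python", "app.py",
        "--windows_host_url", f"{windows_host}:8006",
        "--omniparser_server_url", f"{omniparser_host}:8000",
    ]


if __name__ == "__main__":
    # "gpu-server.example" is a placeholder hostname, not a real machine.
    print(shlex.join(gradio_command("gpu-server.example")))
```

Run the printed command from the gradio directory with the conda environment activated. Whether the GPU machine accepts connections from other hosts depends on how omniparserserver is started there, which this sketch does not cover.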

Setup

  1. omniparserserver:

    a. If you already have a conda environment for OmniParser, you can use it and skip to step g. Otherwise, follow the steps below to create one.

    b. Ensure conda is installed with conda --version or install from the Anaconda website

    c. Navigate to the root of the repo with cd OmniParser

    d. Create a conda python environment with conda create -n "omni" python==3.12

    e. Set the python environment to be used with conda activate omni

    f. Install the dependencies with pip install -r requirements.txt

    g. Continue from here if you already had the conda environment.

    h. Ensure the weights are downloaded into the weights folder. If not, download them with: for folder in icon_caption_florence icon_detect icon_detect_v1_5; do huggingface-cli download microsoft/OmniParser --local-dir weights/ --repo-type model --include "$folder/*"; done

    i. Navigate to the server directory with cd OmniParser/omnitool/omniparserserver

    j. Start the server with python -m omniparserserver

  2. omnibox:

    a. Install Docker Desktop

    b. Visit Microsoft Evaluation Center, accept the Terms of Service, and download a Windows 11 Enterprise Evaluation (90-day trial, English, United States) ISO file [~6GB]. Rename the file to custom.iso and copy it to the directory OmniParser/omnitool/omnibox/vm/win11iso

    c. Navigate to the VM management script directory with cd OmniParser/omnitool/omnibox/scripts

    d. Build the docker container [400MB] and install the ISO to a storage folder [20GB] with ./manage_vm.sh create

    e. After the VM is created for the first time, its state is saved in vm/win11storage. You can then manage the VM with ./manage_vm.sh start and ./manage_vm.sh stop. To delete the VM, run ./manage_vm.sh delete and then delete the OmniParser/omnitool/omnibox/vm/win11storage directory.

  3. gradio:

    a. Navigate to the gradio directory with cd OmniParser/omnitool/gradio

    b. Ensure you have activated the conda python environment with conda activate omni

    c. Start the server with python app.py --windows_host_url localhost:8006 --omniparser_server_url localhost:8000

    d. Open the URL shown in the terminal output, set your API key, and start playing with the AI agent!
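
Before launching the Gradio UI, it can help to confirm that the pieces from the steps above are in place. The following is a rough preflight sketch, not part of OmniTool itself. It assumes the weights live in `OmniParser/weights` (the `--local-dir` used in the download command when run from the repo root), that the ISO sits at `vm/win11iso/custom.iso` under the omnibox directory, and that the two servers listen on the ports from the Gradio start command. The function names are hypothetical.

```python
import socket
from pathlib import Path

# Folder names taken from the weights download command in this README.
WEIGHT_FOLDERS = ("icon_caption_florence", "icon_detect", "icon_detect_v1_5")


def missing_weight_folders(weights_dir: Path) -> list[str]:
    """Return the expected weight folders that are absent or empty."""
    missing = []
    for name in WEIGHT_FOLDERS:
        folder = weights_dir / name
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing


def iso_present(omnibox_dir: Path) -> bool:
    """Check that the renamed Windows ISO is where the setup steps place it."""
    return (omnibox_dir / "vm" / "win11iso" / "custom.iso").is_file()


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed: the script is run from the directory that contains the repo.
    repo = Path("OmniParser")
    print("missing weights:", missing_weight_folders(repo / "weights"))
    print("custom.iso present:", iso_present(repo / "omnitool" / "omnibox"))
    print("omniparserserver (8000):", port_open("localhost", 8000))
    print("OmniBox (8006):", port_open("localhost", 8006))
```

An open port only shows that something is listening there; it does not confirm that the VM has finished booting or that the parser has loaded its models.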