Ai_Assistant/SETUP_GUIDE.md
2026-05-24 13:31:30 +02:00

Setup Guide

Prerequisites

Before starting, ensure you have the following installed:

  • Python 3.10
  • VS Code
  • Node.js
  • GPT-SoVITS


1. Project Setup

Create Virtual Environment

  1. Open VS Code

  2. File → New Window

  3. File → Open Folder (select your project directory)

  4. Press Ctrl+Shift+P to open the command palette

  5. Type Python: Create Environment and select it

    Note: If you don't see this option, install the Python extension:

    • Go to the Extensions sidebar (Ctrl+Shift+X)
    • Search for "Python" and install it
    • Close and reopen VS Code, then try again
  6. Select Venv and choose Python 3.10

  7. Uncheck "Install dependencies from requirements.txt" (we'll do this manually)

  8. Click OK

Install Dependencies

  1. Open a new terminal: Terminal → New Terminal

  2. Verify your virtual environment is active (you should see .venv in the prompt):

    (.venv) F:\your_project_path>
    
  3. Install dependencies using uv (faster) or pip:

    # Option 1: Using uv (recommended - faster)
    pip install uv
    uv pip install -r requirements.txt
    
    # Option 2: Using pip
    pip install -r requirements.txt
    

    This should take 30 seconds to 1 minute depending on your system.
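
If you are unsure whether the terminal is really using the virtual environment, a small check like the one below can confirm it. This is a minimal sketch, not part of the project: the function names are hypothetical, and it only relies on the standard `sys` module.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def is_expected_python(version_info=sys.version_info, major=3, minor=10) -> bool:
    """Return True when the interpreter matches the Python 3.10 used in this guide."""
    return (version_info[0], version_info[1]) == (major, minor)


print("virtual environment active:", in_virtualenv())
print("Python 3.10:", is_expected_python())
```

Run it from the same terminal you used for the install. If the first line prints False, reactivate the environment before installing dependencies.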


2. API Configuration

Create .env File

Create a .env file in the root directory with the following content:

OPENAI_API_KEY="sk-proj-YOUR_API_KEY"
GROQ_API_KEY="YOUR_GROQ_API_KEY"

Get API Keys

  1. OpenAI API Key:

    • Sign up at OpenAI Platform
    • Add $5 credit (should last 1-2 months for typical usage)
    • Copy your API key to the .env file

    Note: You can adapt the code to use a local AI model instead, but the streaming code does not support this yet. Local model support is planned.

  2. Groq API Key (Free):

    • Sign up at Groq Console
    • Copy your API key to the .env file
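
Before starting the servers, it can help to confirm that both keys are present and that the placeholders have been replaced. The sketch below parses the .env file with the standard library only. It assumes the simple KEY="value" layout shown above, and the helper names (`parse_env`, `missing_keys`) are hypothetical, not part of the project.

```python
from pathlib import Path

REQUIRED_KEYS = ("OPENAI_API_KEY", "GROQ_API_KEY")


def parse_env(text: str) -> dict:
    """Parse simple KEY="value" lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict) -> list:
    """Return required keys that are absent, empty, or still placeholders."""
    return [
        key for key in REQUIRED_KEYS
        if not values.get(key) or "YOUR_" in values[key]
    ]


env_path = Path(".env")
if env_path.exists():
    found = parse_env(env_path.read_text(encoding="utf-8"))
    print("missing or placeholder keys:", missing_keys(found))
else:
    print(".env not found in", Path.cwd())
```

Run it from the project root. An empty list means both keys are set to something other than the template values. It does not verify that the keys are valid with OpenAI or Groq.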

3. Configuration

Character Configuration

There are two main configuration files (A and B) and two scene configuration files (C and D):

A. character_config.yaml

  • Set the AI prompt
  • Configure ASR (Automatic Speech Recognition) context
  • Add reference audio sample (must be 3-10 seconds long)
  • Enter the text spoken in the audio file
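
The 3-10 second limit on the reference audio is easy to check before you edit the config. The sketch below assumes the sample is an uncompressed WAV file, which is an assumption on our part (other formats would need a different reader). The function names and the example path are hypothetical.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the length of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        # Duration = number of audio frames / frames per second.
        return wav_file.getnframes() / wav_file.getframerate()


def is_valid_reference_length(seconds: float, low: float = 3.0, high: float = 10.0) -> bool:
    """Return True when the duration falls inside the 3-10 second window."""
    return low <= seconds <= high


# Example usage (replace the path with your own sample):
# seconds = wav_duration_seconds("reference.wav")
# print(f"{seconds:.2f} s, ok: {is_valid_reference_length(seconds)}")
```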

B. client/config.js

  • Change the 3D model (VRM_PATH)
  • Adjust mouth audio threshold (MOUTH_THRESHOLD)
  • Place model files in client/models/ directory
  • Update the filename in config
  • Important: Model must be in VRM 1.0 format (export setting in VRoid Studio)

This file also holds the VR_CONFIG block. The most useful setting there is dollyPosition: [x, y, z]. If your avatar looks too small, too far, or too close when you put on the headset, edit dollyPosition to move the VR rig: Y raises or lowers your virtual eye level, and Z sets the distance from the avatar. Other tunables in the same block:

  • touchRadius: proximity needed to trigger a touch
  • triggerTouchRadius: extended range while the trigger is held
  • touchCooldown
  • hapticIntensity

C. client/scene/roomConfig.js

Toggles the room/environment around the avatar. enabled: false runs with no room (default). Set enabled: true and edit url / position / rotation / scale to load a GLB. There's a Japanese classroom example pre-filled at the bottom — copy those values over the active block to switch to it. fixDepth: true is a workaround for rooms that z-fight.

D. client/scene/objectsConfig.js

Drops static GLB props into the scene. Each entry needs name (unique key the AI can reference), url (path under client/backgrounds/glb/), and a transform. Two commented-out examples (fish, mattress) show the shape.

Where to find rooms and props: Sketchfab has a huge library of free GLB downloads. Drop the file into client/backgrounds/glb/ and reference it in roomConfig.js or objectsConfig.js.


4. Starting the Servers

Option A: Automatic Start

  1. Edit start_server.bat
  2. Change the following line to match your GPT-SoVITS installation path:
    set SOVITS_PATH=D:\PyProjects\GPT-SoVITS-v3lora-20250228\GPT-SoVITS-v3lora-20250228
    
  3. Run the script:
    • In terminal: start_server.bat
    • Or double-click the file in File Explorer
  4. Do not close any of the terminal windows that open

Option B: Manual Start

If automatic start doesn't work:

  1. Start the Python server:

    cd server
    python server.py
    
  2. Start the animation server (open a second terminal):

    cd client
    npx vite
    
  3. Open your browser and go to: http://localhost:5173

    You should see a 3D model floating on screen.
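
If the page stays blank, it is worth checking that both servers are actually listening. The sketch below tries a TCP connection to each port: 5173 is the Vite address used above, and 8001 is the port given for server/server.py in Section 8. The helper name `port_open` is hypothetical, and a successful connection only shows that something is listening, not that it is the right service.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


SERVICES = {"Python bridge (server.py)": 8001, "Vite dev server": 5173}

for name, port in SERVICES.items():
    status = "up" if port_open("localhost", port) else "not reachable"
    print(f"{name} on port {port}: {status}")
```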


5. Running the Chat

  1. Run the main chat script:

    cd server
    python main_chat_v9.py
    

    This is the current voice loop. It starts recording when it detects speech, streams the LLM response, generates TTS for each chunk, and plays the chunks back in order. It also runs a background click dispatcher that responds verbally when the avatar is touched (see Section 8).

    MCP tool calls (optional): If you have an MCP server running and a config at ~/MCP_functions/mcp_config.json (or the path set in env var MCP_CONFIG_PATH), the script automatically picks it up and lets the model interleave speech and tool calls. Without that file it just streams text — no extra setup needed.

  2. Troubleshooting: If you encounter issues, run the setup check script:

    cd server
    python check_setup.py
    

    The check now includes click-interaction and walk-to tests in addition to the LLM / mic / TTS / VRM checks.
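
The MCP config lookup described in step 1 can be sketched roughly as follows. This is not the project's code: the function name is hypothetical, and the sketch assumes that MCP_CONFIG_PATH, when set, takes precedence over the default path, which the guide does not state explicitly.

```python
import os
from pathlib import Path

DEFAULT_MCP_CONFIG = Path.home() / "MCP_functions" / "mcp_config.json"


def resolve_mcp_config(environ=None):
    """Return the MCP config path if the file exists, else None."""
    environ = os.environ if environ is None else environ
    override = environ.get("MCP_CONFIG_PATH")
    # Assumption: the environment variable wins over the default location.
    candidate = Path(override).expanduser() if override else DEFAULT_MCP_CONFIG
    return candidate if candidate.is_file() else None


config_path = resolve_mcp_config()
if config_path:
    print("MCP config found:", config_path)
else:
    print("No MCP config found; the chat would stream text only.")
```

Running this before main_chat_v9.py shows which of the two modes to expect.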


6. VR Support

The client supports WebXR through the Vite dev server.

  1. Connect your Quest to the PC:
    • Link Cable: plug the headset in, accept the "Allow access" prompt inside the headset, and switch to Quest Link from the Quest menu.
    • Air Link: make sure both devices are on the same network and pair them through the Oculus PC app.
  2. Start the servers (Section 4) and refresh the client at http://localhost:5173 inside the headset's browser, or click the Enter VR button on the desktop client while the headset is active.
  3. If the avatar looks too small, too far, or too low, edit VR_CONFIG.dollyPosition in client/config.js and refresh.

Touch / hand interactions

In VR you can reach out and touch the avatar. Hand proximity within touchRadius triggers a click; squeezing the trigger extends the range to triggerTouchRadius. Each touch:

  • Plays a quick reaction sound + animation on the client (handled by the /send_click_interaction endpoint).
  • Buffers a [the user touched your <region>] action that the chat loop's background dispatcher picks up and forwards to the LLM, so the avatar also reacts verbally.

Tunables live in VR_CONFIG (touchCooldown, hapticIntensity, hapticDuration).
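
To see how these settings interact, the touch decision can be modelled as a distance check plus a cooldown. The real logic lives in the JavaScript client. The Python sketch below is only an illustration, and its class, function name, default values, and units (metres and seconds) are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class TouchConfig:
    # Hypothetical values; the real ones live in VR_CONFIG in client/config.js.
    touch_radius: float = 0.15
    trigger_touch_radius: float = 0.4
    touch_cooldown: float = 1.0  # seconds between accepted touches


def should_touch(hand, target, trigger_held, now, last_touch, cfg):
    """Return True when a hand position should register as a touch."""
    if now - last_touch < cfg.touch_cooldown:
        return False  # still cooling down from the previous touch
    # Holding the trigger swaps in the larger radius.
    radius = cfg.trigger_touch_radius if trigger_held else cfg.touch_radius
    return math.dist(hand, target) <= radius


cfg = TouchConfig()
print(should_touch((0, 0, 0.3), (0, 0, 0), False, 10.0, 0.0, cfg))
print(should_touch((0, 0, 0.3), (0, 0, 0), True, 10.0, 0.0, cfg))
```

With these example values, a hand 0.3 units away is out of range normally but in range while the trigger is held. Raising the cooldown reduces how often repeated touches reach the LLM.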


7. Customizing the Scene

The world around the avatar is built from two simple config files:

  • client/scene/roomConfig.js — one room/environment at a time. Set enabled: true and point url at a GLB under client/backgrounds/glb/.
  • client/scene/objectsConfig.js — list of static props loaded into the scene. Each entry needs name, url, position, rotation, scale.

Sketchfab is the easiest place to grab free GLB rooms and props. Download → drop into client/backgrounds/glb/ → register in the relevant config → refresh the browser.
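
When registering a new file, a quick listing of the GLB folder helps avoid typos in the url field. The sketch below is a hypothetical helper, not part of the project. It assumes you run it from the project root so that the relative path resolves.

```python
from pathlib import Path


def list_glb_files(folder):
    """Return the sorted names of .glb files found directly in folder."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        path.name for path in root.iterdir()
        if path.is_file() and path.suffix.lower() == ".glb"
    )


for name in list_glb_files("client/backgrounds/glb"):
    print(name)
```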


8. Server Endpoints (for reference / scripting)

The Python bridge (server/server.py, port 8001) exposes a small REST + WebSocket API the client and chat loop both use. You can hit any of these from curl / Postman to drive the avatar manually.

| Endpoint | What it does |
|---|---|
| POST /talk | Play an audio file with lip sync + expression |
| POST /animate | Trigger a Mixamo (.fbx) or VRMA (.vrma) animation. Set animate_type to start_mixamo, start_vrma, or auto to detect from extension |
| POST /animate_and_talk | Combined VRMA + audio with optional delay |
| POST /set_state | Switch the head/eye micro-state: idle, listening, thinking, talking |
| POST /walk_to | Walk the avatar to {x, y, z} at a given speed |
| POST /stop_movement | Cancel walking, return to idle |
| POST /teleport_to | Instant move to {x, y, z} (no walk anim) |
| POST /set_movement_speed | Adjust walking speed on the fly |
| POST /load_movement_animation | Load walk/idle GLB for the movement system (anim_type: "walk" or "idle") |
| POST /send_click_interaction | Touch reaction (called by the client when you click/touch the avatar in VR or with the mouse) |
| GET /pop_pending_actions | Drains buffered click actions; used by main_chat_v9.py to fold touches into the LLM prompt |
| POST /vr/position, GET /vr/position | Push / read the VR headset position |
| WS /ws | WebSocket the client subscribes to for all broadcasts above |

server/process/vrm_func/vrm_ping.py and vrm_states_ping.py are thin Python wrappers around the most-used POSTs if you want to drive the avatar from your own scripts. test_vr_positions.py shows a walk_to example.
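
If you prefer not to use the wrappers, a request can be built with the standard library alone. The sketch below is an assumption-heavy illustration: it assumes the endpoints accept JSON bodies, and the field names "speed" and "state" are guesses, so check server/server.py for the actual schema. The helper names `build_post` and `send` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001"  # port given for server/server.py above


def build_post(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the bridge endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 5.0) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Assumed payload shapes; verify the field names against server.py.
walk = build_post("/walk_to", {"x": 1.0, "y": 0.0, "z": -2.0, "speed": 1.0})
state = build_post("/set_state", {"state": "listening"})

# Uncomment once server.py is running:
# print(send(walk))
# print(send(state))
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything reaches the avatar.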


9. Customization

Facial Expressions

Currently the model's expression defaults to relaxed. You can wire in your own emotion classifier or set it manually.

To change the expression, edit the chunk loop in server/main_chat_v9.py:

for item, item_type in stream_with_functions(messages):
    if item_type == "text":
        text_chunk = item
        tts_text = clean_llm_output(text_chunk)

        # Option 1: plug in an emotion classifier
        # emotion = get_emotion(text_chunk, None, None)
        # expression = map_emotion_to_expression(emotion)

        # Option 2: set manually (current default)
        expression = "relaxed"

Supported VRM 1.0 expressions:

  • happy
  • angry
  • sad
  • relaxed
  • surprised
  • neutral
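
If you go with Option 1, the classifier's labels need to be mapped onto the six expressions above. The project's own map_emotion_to_expression is not shown in this guide, so the version below is only a sketch of what such a function might look like. The emotion labels in the table are hypothetical, and the fallback to "relaxed" mirrors the current default.

```python
SUPPORTED_EXPRESSIONS = {"happy", "angry", "sad", "relaxed", "surprised", "neutral"}

# Hypothetical classifier labels mapped to VRM 1.0 expressions.
EMOTION_TO_EXPRESSION = {
    "joy": "happy",
    "anger": "angry",
    "sadness": "sad",
    "surprise": "surprised",
    "calm": "relaxed",
}


def map_emotion_to_expression(emotion, default="relaxed"):
    """Map a classifier label to a supported expression, else the default."""
    if not emotion:
        return default
    label = emotion.strip().lower()
    if label in SUPPORTED_EXPRESSIONS:
        return label  # already a valid expression name
    return EMOTION_TO_EXPRESSION.get(label, default)


print(map_emotion_to_expression("Joy"))
print(map_emotion_to_expression("confusion"))
```

Unknown labels fall back to the default rather than raising, so an unexpected classifier output never sends an unsupported expression name to the client.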

Summary

  1. Install prerequisites (Python 3.10, VS Code, Node.js, GPT-SoVITS)
  2. Create virtual environment and install dependencies
  3. Configure API keys in .env
  4. Customize character_config.yaml, client/config.js, and the client/scene/* configs
  5. Start servers (automatic or manual)
  6. Run server/main_chat_v9.py
  7. (Optional) Plug in VR via Link / Air Link, tweak VR_CONFIG
  8. (Optional) Drop in rooms / props from Sketchfab
  9. (Optional) Customize facial expressions

For issues, run server/check_setup.py to diagnose problems.