Setup Guide
Prerequisites
Before starting, ensure you have the following installed:
- Python 3.10 - Download from Microsoft Store or python.org
- VS Code - Download here
- Node.js and npm - Download here (includes npx)
- GPT-SoVITS - One-click installer
1. Project Setup
Create Virtual Environment
- Open VS Code
- File → New Window
- File → Open Folder (select your project directory)
- Press Ctrl+Shift+P to open the command palette
- Type Python: Create Environment and select it

  Note: If you don't see this option, install the Python extension:
  - Go to the Extensions sidebar (Ctrl+Shift+X)
  - Search for "Python" and install it
  - Close and reopen VS Code, then try again
- Select Venv and choose Python 3.10
- Uncheck "Install dependencies from requirements.txt" (we'll do this manually)
- Click OK
Install Dependencies
- Open a new terminal: Terminal → New Terminal
- Verify your virtual environment is active (you should see .venv in the prompt):

      (.venv) F:\your_project_path>

- Install dependencies using uv (faster) or pip:

      # Option 1: Using uv (recommended - faster)
      pip install uv
      uv pip install -r requirements.txt

      # Option 2: Using pip
      pip install -r requirements.txt

  This should take 30 seconds to 1 minute depending on your system.
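If the prompt prefix is hard to spot, the same check can be made from Python itself. The following is a minimal sketch, not part of the project; the function names are hypothetical. It relies on the standard behaviour that sys.prefix and sys.base_prefix differ inside a venv.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when this interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def python_is_310() -> bool:
    """Return True when the interpreter is Python 3.10.x, as the guide expects."""
    return sys.version_info[:2] == (3, 10)


print("virtual environment active:", in_virtualenv())
print("python 3.10:", python_is_310())
```

Run it from the VS Code terminal before installing dependencies; if the first line prints False, the packages would be installed into the global interpreter instead of .venv.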
2. API Configuration
Create .env File
Create a .env file in the root directory with the following content:
OPENAI_API_KEY="sk-proj-YOUR_API_KEY"
GROQ_API_KEY="YOUR_GROQ_API_KEY"
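A quick way to confirm that both keys were filled in is to parse the file and look for leftover placeholders. This is a minimal sketch using only the standard library; how the project itself loads .env is not shown in this guide, and parse_env, missing_keys, and check_env_file are hypothetical helper names.

```python
from pathlib import Path

REQUIRED_KEYS = ("OPENAI_API_KEY", "GROQ_API_KEY")
# Placeholder values copied from the template above.
PLACEHOLDERS = {"sk-proj-YOUR_API_KEY", "YOUR_GROQ_API_KEY"}


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY="value" lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return required keys that are absent, empty, or still placeholders."""
    return [
        key
        for key in REQUIRED_KEYS
        if not values.get(key) or values[key] in PLACEHOLDERS
    ]


def check_env_file(path: str = ".env") -> list[str]:
    """Return the keys that still need attention in the given .env file."""
    env_path = Path(path)
    if not env_path.is_file():
        return list(REQUIRED_KEYS)
    return missing_keys(parse_env(env_path.read_text(encoding="utf-8")))
```

Calling check_env_file() from the project root returns an empty list once both keys are set.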
Get API Keys
- OpenAI API Key:
  - Sign up at OpenAI Platform
  - Add $5 credit (should last 1-2 months for typical usage)
  - Copy your API key to the .env file

  Note: You can customize this to use a local AI model if preferred (the streaming code doesn't support this yet, but local model support is planned).
- Groq API Key (Free):
  - Sign up at Groq Console
  - Copy your API key to the .env file
3. Configuration
Character Configuration
There are four main configuration files:
A. character_config.yaml
- Set the AI prompt
- Configure ASR (Automatic Speech Recognition) context
- Add reference audio sample (must be 3-10 seconds long)
- Enter the text spoken in the audio file
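The 3-10 second limit on the reference audio is easy to check before editing the YAML. The sketch below assumes the sample is an uncompressed PCM WAV file, which is what Python's built-in wave module can read; other formats would need a different library. The function names are hypothetical.

```python
import wave

MIN_SECONDS = 3.0
MAX_SECONDS = 10.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        # Duration is the frame count divided by the sample rate.
        return wav_file.getnframes() / wav_file.getframerate()


def is_valid_reference(path: str) -> bool:
    """Return True when the clip is within the 3-10 second window."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

For example, is_valid_reference("my_reference.wav") returns False for a 2-second clip, which tells you to record a longer sample before pointing character_config.yaml at it. The filename here is only a placeholder.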
B. client/config.js
- Change the 3D model (VRM_PATH)
- Adjust mouth audio threshold (MOUTH_THRESHOLD)
- Place model files in the client/models/ directory
- Update the filename in config
- Important: Model must be in VRM 1.0 format (export setting in VRoid Studio)
This file also holds the VR_CONFIG block. The most useful knob there is dollyPosition: [x, y, z] — if your avatar looks too small (or too far/too close) when you put on the headset, edit dollyPosition to move the VR rig. Y raises/lowers your virtual eye level, Z is distance from the avatar. Other tunables in the same block: touchRadius (proximity needed to trigger a touch), triggerTouchRadius (extended range while the trigger is held), touchCooldown, and haptic intensity.
C. client/scene/roomConfig.js
Toggles the room/environment around the avatar. enabled: false runs with no room (default). Set enabled: true and edit url / position / rotation / scale to load a GLB. There's a Japanese classroom example pre-filled at the bottom — copy those values over the active block to switch to it. fixDepth: true is a workaround for rooms that z-fight.
D. client/scene/objectsConfig.js
Drops static GLB props into the scene. Each entry needs name (unique key the AI can reference), url (path under client/backgrounds/glb/), and a transform. Two commented-out examples (fish, mattress) show the shape.
Where to find rooms and props: Sketchfab has a huge library of free GLB downloads. Drop the file into client/backgrounds/glb/ and reference it in roomConfig.js or objectsConfig.js.
4. Starting the Servers
Option A: Automatic Start (Recommended)
- Edit start_server.bat
- Change the following line to match your GPT-SoVITS installation path:

      set SOVITS_PATH=D:\PyProjects\GPT-SoVITS-v3lora-20250228\GPT-SoVITS-v3lora-20250228

- Run the script:
  - In terminal: start_server.bat
  - Or double-click the file in File Explorer
- Do not close any of the terminal windows that open
Option B: Manual Start
If automatic start doesn't work:
- Start the Python server:

      cd server
      python server.py

- Start the animation server (open a second terminal):

      cd client
      npx vite

- Open your browser and go to: http://localhost:5173
You should see a 3D model floating on screen.
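If the page stays blank, it helps to know which of the two servers is actually listening. The sketch below probes the ports mentioned in this guide (5173 for the Vite client, 8001 for the Python bridge described in Section 8). It is a hypothetical helper, not part of the project, and it only checks that something accepts TCP connections on those ports.

```python
import socket

# Ports taken from this guide; change them if your setup differs.
SERVERS = {"Python bridge": 8001, "Vite client": 5173}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True when a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def server_status(host: str = "localhost") -> dict[str, bool]:
    """Map each server name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in SERVERS.items()}
```

Calling server_status() with both servers running should report True for each entry; a False points at the terminal window to inspect.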
5. Running the Chat
- Run the main chat script:

      cd server
      python main_chat_v9.py

  This is the current voice loop. It records on speech, streams the LLM response, generates TTS per chunk, and plays it back in order. It also runs a background click dispatcher that responds verbally when the avatar is touched (see Section 8).

  MCP tool calls (optional): If you have an MCP server running and a config at ~/MCP_functions/mcp_config.json (or the path set in env var MCP_CONFIG_PATH), the script automatically picks it up and lets the model interleave speech and tool calls. Without that file it just streams text, so no extra setup is needed.
- Troubleshooting: If you encounter issues, run the setup check script:

      cd server
      python check_setup.py

  The check now includes click-interaction and walk-to tests in addition to the LLM / mic / TTS / VRM checks.
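The optional MCP config lookup described above can be pictured roughly as follows. This is a sketch under assumptions, not the code in main_chat_v9.py: the function names are hypothetical, and it assumes the MCP_CONFIG_PATH environment variable takes precedence over the default location when both exist.

```python
import json
import os
from pathlib import Path

# Default location named in this guide.
DEFAULT_MCP_CONFIG = Path.home() / "MCP_functions" / "mcp_config.json"


def resolve_mcp_config_path() -> Path:
    """Return the override path from MCP_CONFIG_PATH, else the default."""
    override = os.environ.get("MCP_CONFIG_PATH")
    # Assumption: a non-empty env var wins over the default path.
    return Path(override).expanduser() if override else DEFAULT_MCP_CONFIG


def load_mcp_config():
    """Return the parsed JSON config, or None when no file is present."""
    path = resolve_mcp_config_path()
    if not path.is_file():
        # No config file: the chat loop would just stream plain text.
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

Returning None for a missing file mirrors the behaviour described above, where the absence of a config is not an error.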
6. VR Support
The client supports WebXR through the Vite dev server.
Quest setup (Air Link or Link Cable)
- Connect your Quest to the PC:
- Link Cable: plug the headset in, accept the "Allow access" prompt inside the headset, and switch to Quest Link from the Quest menu.
- Air Link: make sure both devices are on the same network and pair them through the Oculus PC app.
- Start the servers (Section 4) and refresh the client at http://localhost:5173 inside the headset's browser, or click the Enter VR button on the desktop client while the headset is active.
- If the avatar looks too small, too far, or too low, edit VR_CONFIG.dollyPosition in client/config.js and refresh.
Touch / hand interactions
In VR you can reach out and touch the avatar. Hand proximity within touchRadius triggers a click; squeezing the trigger extends the range to triggerTouchRadius. Each touch:
- Plays a quick reaction sound + animation on the client (handled by the /send_click_interaction endpoint).
- Buffers a [the user touched your <region>] action that the chat loop's background dispatcher picks up and forwards to the LLM, so the avatar also reacts verbally.
Tunables live in VR_CONFIG (touchCooldown, hapticIntensity, hapticDuration).
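To make the interplay of touchRadius, triggerTouchRadius, and touchCooldown concrete, here is a small Python model of the logic. The real detection runs in the JavaScript client and may differ; the class names, default values, and units below are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class TouchConfig:
    # Hypothetical values; the real ones live in VR_CONFIG in client/config.js.
    touch_radius: float = 0.15
    trigger_touch_radius: float = 0.40
    touch_cooldown: float = 1.0  # seconds between accepted touches


class TouchDetector:
    def __init__(self, config: TouchConfig) -> None:
        self.config = config
        self._last_touch = None  # timestamp of the last accepted touch

    def check(self, hand, target, trigger_held: bool, now: float) -> bool:
        """Return True when this frame counts as a new touch."""
        # Holding the trigger widens the reach.
        radius = (
            self.config.trigger_touch_radius
            if trigger_held
            else self.config.touch_radius
        )
        if math.dist(hand, target) > radius:
            return False
        # Ignore touches that arrive before the cooldown has elapsed.
        if (
            self._last_touch is not None
            and now - self._last_touch < self.config.touch_cooldown
        ):
            return False
        self._last_touch = now
        return True


def touch_action(region: str) -> str:
    """Format the buffered action text in the shape described above."""
    return f"[the user touched your {region}]"
```

The cooldown keeps a hand resting on the avatar from firing a reaction on every frame, which would otherwise flood the dispatcher with duplicate actions.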
7. Customizing the Scene
The world around the avatar is built from two simple config files:
- client/scene/roomConfig.js: one room/environment at a time. Set enabled: true and point url at a GLB under client/backgrounds/glb/.
- client/scene/objectsConfig.js: list of static props loaded into the scene. Each entry needs name, url, position, rotation, scale.
Sketchfab is the easiest place to grab free GLB rooms and props. Download → drop into client/backgrounds/glb/ → register in the relevant config → refresh the browser.
8. Server Endpoints (for reference / scripting)
The Python bridge (server/server.py, port 8001) exposes a small REST + WebSocket API the client and chat loop both use. You can hit any of these from curl / Postman to drive the avatar manually.
| Endpoint | What it does |
|---|---|
| POST /talk | Play an audio file with lip sync + expression |
| POST /animate | Trigger a Mixamo (.fbx) or VRMA (.vrma) animation. Set animate_type to start_mixamo, start_vrma, or auto to detect from extension |
| POST /animate_and_talk | Combined VRMA + audio with optional delay |
| POST /set_state | Switch the head/eye micro-state: idle, listening, thinking, talking |
| POST /walk_to | Walk the avatar to {x, y, z} at a given speed |
| POST /stop_movement | Cancel walking, return to idle |
| POST /teleport_to | Instant move to {x, y, z} (no walk anim) |
| POST /set_movement_speed | Adjust walking speed on the fly |
| POST /load_movement_animation | Load walk/idle GLB for the movement system (anim_type: "walk" or "idle") |
| POST /send_click_interaction | Touch reaction (called by the client when you click/touch the avatar in VR or with the mouse) |
| GET /pop_pending_actions | Drains buffered click actions; used by main_chat_v9.py to fold touches into the LLM prompt |
| POST /vr/position / GET /vr/position | Push / read the VR headset position |
| WS /ws | WebSocket the client subscribes to for all broadcasts above |
server/process/vrm_func/vrm_ping.py and vrm_states_ping.py are thin Python wrappers around the most-used POSTs if you want to drive the avatar from your own scripts. test_vr_positions.py shows a walk_to example.
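For scripting without the bundled wrappers, the endpoints can be called with the standard library alone. The sketch below is illustrative: the base URL uses port 8001 from this section, but the JSON field names (state, x, y, z, speed) are assumptions that should be checked against server/server.py or vrm_ping.py before use. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001"
VALID_STATES = ("idle", "listening", "thinking", "talking")


def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the bridge endpoints."""
    return urllib.request.Request(
        url=f"{BASE_URL}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def set_state_request(state: str) -> urllib.request.Request:
    """Request for POST /set_state; the 'state' key is an assumed field name."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown state: {state!r}")
    return build_request("/set_state", {"state": state})


def walk_to_request(x: float, y: float, z: float, speed: float) -> urllib.request.Request:
    """Request for POST /walk_to; the payload keys are assumed field names."""
    return build_request("/walk_to", {"x": x, "y": y, "z": z, "speed": speed})


def send(request: urllib.request.Request, timeout: float = 5.0) -> str:
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the Python server running, send(set_state_request("thinking")) would attempt the call. Building the request separately from sending it makes the payload easy to inspect first.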
9. Customization
Facial Expressions
Currently the model's expression defaults to relaxed. You can wire in your own emotion classifier or set it manually.
To change the expression, edit the chunk loop in server/main_chat_v9.py:
    for item, item_type in stream_with_functions(messages):
        if item_type == "text":
            text_chunk = item
            tts_text = clean_llm_output(text_chunk)

            # Option 1: plug in an emotion classifier
            # emotion = get_emotion(text_chunk, None, None)
            # expression = map_emotion_to_expression(emotion)

            # Option 2: set manually (current default)
            expression = "relaxed"
Supported VRM 1.0 expressions:
- happy
- angry
- sad
- relaxed
- surprised
- neutral
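If you go the classifier route, the classifier's labels need to be translated into one of the supported expression names. The following is a minimal sketch of such a mapping. The name map_emotion_to_expression appears in the commented-out code in main_chat_v9.py, but its real signature and behaviour are not documented here, so this version and its emotion labels (joy, anger, and so on) are assumptions.

```python
SUPPORTED_EXPRESSIONS = ("happy", "angry", "sad", "relaxed", "surprised", "neutral")
DEFAULT_EXPRESSION = "relaxed"

# Hypothetical classifier labels mapped onto VRM 1.0 expression names.
EMOTION_TO_EXPRESSION = {
    "joy": "happy",
    "anger": "angry",
    "sadness": "sad",
    "surprise": "surprised",
    "calm": "relaxed",
}


def map_emotion_to_expression(emotion) -> str:
    """Translate a classifier label into a supported expression name."""
    if not emotion:
        return DEFAULT_EXPRESSION
    label = str(emotion).strip().lower()
    # Labels that already match an expression name pass straight through.
    if label in SUPPORTED_EXPRESSIONS:
        return label
    # Unknown labels fall back to the current default.
    return EMOTION_TO_EXPRESSION.get(label, DEFAULT_EXPRESSION)
```

Falling back to relaxed for unknown labels keeps the avatar's behaviour identical to the current default whenever the classifier returns something unexpected.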
Summary
- ✅ Install prerequisites (Python 3.10, VS Code, Node.js, GPT-SoVITS)
- ✅ Create virtual environment and install dependencies
- ✅ Configure API keys in .env
- ✅ Customize character_config.yaml, client/config.js, and the client/scene/* configs
- ✅ Start servers (automatic or manual)
- ✅ Run server/main_chat_v9.py
- ✅ (Optional) Plug in VR via Link / Air Link, tweak VR_CONFIG
- ✅ (Optional) Drop in rooms / props from Sketchfab
- ✅ (Optional) Customize facial expressions
For issues, run server/check_setup.py to diagnose problems.