Troubleshooting & Local Developer Guide
LM Studio provides a local model API server that is compatible with the OpenAI REST API. When developing browser-based applications that talk to local APIs, security settings and endpoint URLs are the most common points of friction. This guide covers CORS configuration, the two API endpoint layers, and the settings that affect GPU memory when loading a model.
1. Understanding CORS Blocks
By default, web browsers block pages loaded from one origin (such as this web page) from making HTTP requests to another origin (like your local machine at http://localhost:1234). This security feature is called the Same-Origin Policy. To allow browser-based testing tools to communicate with LM Studio, the server must send Cross-Origin Resource Sharing (CORS) headers in its responses.
- LM Studio GUI: Click the Developer tab on the left panel, scroll down to Server Settings, and switch "Enable CORS" to active. This instructs the server to append Access-Control-Allow-Origin: * headers to all responses.
- LM Studio CLI: If running the server via command line, launch it with the cors argument: lms server start --cors
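The origin comparison and the header check described above can be sketched in TypeScript. This is a simplified illustration, not the browser's actual algorithm: the function names sameOrigin and corsAllows are hypothetical, and only the wildcard and exact-match cases for non-credentialed requests are covered.

```typescript
// Two URLs share an origin only when scheme, host, and port all match.
function sameOrigin(a: string, b: string): boolean {
  return new URL(a).origin === new URL(b).origin;
}

// Simplified check of an Access-Control-Allow-Origin header value.
// Returns true when a page served from `pageOrigin` would be allowed to
// read the response (non-credentialed requests only).
function corsAllows(allowOrigin: string | null, pageOrigin: string): boolean {
  if (allowOrigin === null) {
    return false; // header missing: the browser blocks the cross-origin read
  }
  const value = allowOrigin.trim();
  return value === "*" || value === pageOrigin;
}

// A page on another origin talking to the local server is cross-origin.
// "https://example.com" is a placeholder page origin for illustration.
const pageUrl = "https://example.com/tool";
const serverUrl = "http://localhost:1234/v1/models";

console.log(sameOrigin(pageUrl, serverUrl));             // false
console.log(corsAllows(null, "https://example.com"));    // false
console.log(corsAllows("*", "https://example.com"));     // true
```

Note that http://localhost:1234 and http://127.0.0.1:1234 count as different origins under this comparison, even though both point at the same machine.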
2. Model Loading vs. OpenAI Compatibility
LM Studio features two layers of API endpoints:
- OpenAI-Compatible Endpoint (/v1): Supports standard routes like /v1/chat/completions and /v1/models. This is great for plugging LM Studio into existing code written for OpenAI. However, it requires a model to be pre-loaded in the LM Studio GUI before requests will succeed.
- Native Endpoint (/api/v1): Provides granular control. Using the native endpoints, this tool can check which models are currently loaded, request specific models to load with defined context sizes, and unload them to free up GPU memory.
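To make the difference between the two layers concrete, here is a minimal TypeScript sketch that builds one request for each. The helper names, the HttpRequest shape, and the model identifier "my-local-model" are hypothetical. The chat body follows the standard OpenAI chat completions format; the field names in the native load body (model, context_length) are assumptions for illustration and should be checked against the LM Studio documentation.

```typescript
// Plain description of an HTTP request, independent of any HTTP client.
interface HttpRequest {
  url: string;
  method: "GET" | "POST";
  headers: Record<string, string>;
  body?: string;
}

// OpenAI-compatible layer: POST {baseUrl}/v1/chat/completions
function buildChatRequest(baseUrl: string, model: string, userText: string): HttpRequest {
  return {
    url: `${baseUrl}/v1/chat/completions`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: userText }],
    }),
  };
}

// Native layer: POST {baseUrl}/api/v1/models/load
// ASSUMPTION: the body field names below are illustrative, not confirmed.
function buildLoadRequest(baseUrl: string, model: string, contextLength: number): HttpRequest {
  return {
    url: `${baseUrl}/api/v1/models/load`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, context_length: contextLength }),
  };
}

const base = "http://localhost:1234";
const chatReq = buildChatRequest(base, "my-local-model", "Hello");
const loadReq = buildLoadRequest(base, "my-local-model", 4096);

console.log(chatReq.url); // http://localhost:1234/v1/chat/completions
console.log(loadReq.url); // http://localhost:1234/api/v1/models/load
```

In a browser, either object could then be handed to fetch, for example fetch(req.url, { method: req.method, headers: req.headers, body: req.body }).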
3. GPU VRAM Allocation during Model Loading
When loading a model via POST /api/v1/models/load, the configurations specified in the payload affect how much GPU memory is claimed:
- Context Length: A larger context window creates a larger KV cache, which scales linearly with the sequence length. If you experience Out-Of-Memory (OOM) failures when loading, try reducing the context length value (e.g., from 8192 to 2048).
- Flash Attention: Speeds up attention calculations and reduces the VRAM needed during attention computation. Toggle this on if your GPU engine supports it.
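The linear relationship between context length and KV cache size can be estimated with a short calculation. The sketch below uses the common estimate for a transformer with an unquantized cache (keys plus values, per layer, per KV head). The KvCacheShape type and the example dimensions (32 layers, 8 KV heads, head dimension 128, 2 bytes per element) are hypothetical values chosen for illustration, and the memory a runtime actually allocates may differ.

```typescript
// Model dimensions that determine the KV cache size.
interface KvCacheShape {
  layers: number;          // number of transformer layers
  kvHeads: number;         // number of key/value heads per layer
  headDim: number;         // dimension of each head
  bytesPerElement: number; // 2 for a 16-bit cache
}

// Estimated KV cache size in bytes for a given context length.
// The leading factor 2 accounts for storing both keys and values.
function kvCacheBytes(shape: KvCacheShape, contextLength: number): number {
  return (
    2 *
    shape.layers *
    shape.kvHeads *
    shape.headDim *
    shape.bytesPerElement *
    contextLength
  );
}

// Hypothetical model shape, for illustration only.
const exampleShape: KvCacheShape = {
  layers: 32,
  kvHeads: 8,
  headDim: 128,
  bytesPerElement: 2,
};

const MiB = 1024 * 1024;
console.log(kvCacheBytes(exampleShape, 8192) / MiB); // 1024
console.log(kvCacheBytes(exampleShape, 2048) / MiB); // 256
```

Under these assumptions, dropping the context length from 8192 to 2048 cuts the estimated cache from 1024 MiB to 256 MiB, a fourfold reduction that matches the fourfold reduction in context length.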