How to run a local LLM with LM Studio inside Wappler?

Hi,

I’d like to integrate a local Large Language Model running in LM Studio with Wappler’s built-in AI features (AI Tools / AI Chat).

Instead of using OpenAI or other cloud providers, I want Wappler to send prompts directly to a model hosted locally via LM Studio’s API.

Has anyone done this before?

  • Which API endpoint from LM Studio should be used in Wappler’s AI provider settings?
  • Do I need to configure a custom AI Provider in Wappler for this?
  • Any tips for handling authentication, model selection, or streaming responses?
  • Are there CORS or HTTPS issues when connecting Wappler to a local LLM server?

If you have experience connecting Wappler to a local LLM, please share the setup steps.

Thanks!

1 Like

In your Wappler options under AI set Chat Provider to Custom.

Then set the Base URL to the address where the LM Studio server is running, e.g. http://localhost:1234/v1.
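If you want to sanity-check the LM Studio server outside Wappler first, here is a minimal sketch using the openai Python package against LM Studio's OpenAI-compatible endpoint. It assumes the default port 1234 and that a model is already loaded; the model key below is a placeholder, and the API key can be any non-empty string since the local server does not check it.

from openai import OpenAI

# Point the standard OpenAI client at LM Studio's local, OpenAI-compatible server.
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # placeholder; the local server ignores it
)

# List the models the server currently exposes.
for m in client.models.list().data:
    print(m.id)

# Simple chat completion; "qwen2.5-7b-instruct" is a placeholder model key.
response = client.chat.completions.create(
    model="qwen2.5-7b-instruct",
    messages=[{"role": "user", "content": "Say hello from LM Studio."}],
)
print(response.choices[0].message.content)

# Streaming uses the same protocol as OpenAI's hosted API.
stream = client.chat.completions.create(
    model="qwen2.5-7b-instruct",
    messages=[{"role": "user", "content": "Stream a short haiku."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")

Since the server speaks the plain OpenAI protocol over HTTP on localhost, no extra authentication or HTTPS setup is normally needed beyond the Custom provider and the Base URL.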

LM Studio as a Local LLM API Server | LM Studio Docs
OpenAI Compatibility API | LM Studio Docs

2 Likes

Thanks @patrick :star_struck:

Hi Patrick, look at this! Which model do you recommend using in Wappler for a MacBook M4 Pro with 24 GB of RAM?

You can increase the context length in LM Studio for the Qwen models (up to one million tokens in Qwen2.5-7B-Instruct-1M / 14B-Instruct-1M, and 128K tokens in Qwen2.5-7B-Instruct). A quick search should reveal how to do this, @AdrianoLuiz.

Can You Increase the Context Length?

1. Model-dependent support

  • Some Qwen2.5-Instruct models (like the standard Instruct variants) typically support up to 32K tokens natively.
  • However, the Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M models support up to 1 million tokens, a huge jump, achieved with methods such as Dual Chunk Attention and length extrapolation. (Hugging Face, Bind AI IDE, Qwen, Simon Willison's Weblog)

2. Enabling Extended Context via YaRN

  • For Qwen2.5-7B-Instruct, you can extend the context window from 32K to 128K tokens by enabling YaRN (a RoPE scaling technique). (GitHub)
  • However, this currently uses static YaRN (available via vLLM): while it does extend the context window, it may slightly reduce performance on short-context tasks. (GitHub)
  • Dynamic YaRN (which adjusts based on input length) is currently only available via the Alibaba Cloud ModelStudio API. (GitHub)
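For reference, the recipe behind the GitHub link above comes from Qwen's own documentation: static YaRN is enabled by adding a rope_scaling block to the model's config.json before serving it with a framework such as vLLM. A minimal sketch, assuming a local copy of the model and the keys as published for Qwen2.5 (verify against the current model card):

import json
from pathlib import Path

# Hypothetical local path to the downloaded model; adjust to your setup.
config_path = Path("Qwen2.5-7B-Instruct/config.json")
config = json.loads(config_path.read_text())

# Static YaRN settings as described in the Qwen2.5 model card: 32K x 4 = ~128K tokens.
config["rope_scaling"] = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

config_path.write_text(json.dumps(config, indent=2))

Note that LM Studio itself runs GGUF/MLX builds, where the closest knobs are the rope and context settings covered in the next point.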

3. LM Studio capabilities

  • LM Studio allows customizing context length when loading a model. For example:
import lmstudio as lms
model = lms.llm("qwen2.5-7b-instruct", config={
    "contextLength": 8192,
    # other settings...
})
  • You can set contextLength, ropeFrequencyBase, ropeFrequencyScale, and other advanced parameters. (LM Studio docs)
  • However, some users have encountered errors when trying to load models with very large context lengths, even if the model claims to support them, depending on your hardware constraints. (GitHub)

Summary Table

| Model / Setting | Native Context Support | Extended Support |
| --- | --- | --- |
| Qwen2.5-7B-Instruct | 32K tokens | Up to 128K via static YaRN |
| Qwen2.5-7B-Instruct-1M / 14B-Instruct-1M | Up to 1M tokens | Fully supported via DCA, etc. |
| LM Studio configuration | Depends on hardware | Customizable via contextLength, ropeFrequencyScale, etc. |

What You Can Do

  1. Check the model's context length in LM Studio using model.getContextLength() before running. (LM Studio docs)
  2. If you're using a 1M-capable model, you can indeed set contextLength up to 1,000,000 in your load config, assuming your GPU/VRAM can handle it (e.g., 120 GB+ for 7B-Instruct-1M, 320 GB+ for 14B-Instruct-1M). (Simon Willison's Weblog, Bind AI IDE)
  3. If you're on Qwen2.5-7B-Instruct (standard) and want to push beyond 32K tokens, you can enable YaRN via the model config, with caution: this may impact short-context performance. (GitHub)
  4. Test different configurations and monitor memory consumption. If loading fails at very high values, lower the context length until it loads reliably; a rough load-and-fall-back sketch is shown below.
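A rough sketch of steps 1 and 4 with the lmstudio Python package (method and parameter names follow the LM Studio Python SDK docs at the time of writing and should be treated as assumptions to verify against the current SDK):

import lmstudio as lms

# Try progressively smaller context lengths until the model loads on this machine.
# "qwen2.5-7b-instruct" is a placeholder model key; use the key shown in LM Studio.
for context_length in (131072, 65536, 32768, 16384, 8192):
    try:
        model = lms.llm("qwen2.5-7b-instruct", config={"contextLength": context_length})
    except Exception as error:  # assumed to surface out-of-memory / load failures
        print(f"Loading with a {context_length}-token context failed: {error}")
        continue
    print("Loaded with context length:", model.get_context_length())
    break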
1 Like

thanks

1 Like