How to run a local LLM with LM Studio inside Wappler?

Hi,

I’d like to integrate a local Large Language Model running in LM Studio with Wappler’s built-in AI features (AI Tools / AI Chat).

Instead of using OpenAI or other cloud providers, I want Wappler to send prompts directly to a model hosted locally via LM Studio’s API.

Has anyone done this before?

  • Which API endpoint from LM Studio should be used in Wappler’s AI provider settings?
  • Do I need to configure a custom AI Provider in Wappler for this?
  • Any tips for handling authentication, model selection, or streaming responses?
  • Are there CORS or HTTPS issues when connecting Wappler to a local LLM server?

If you have experience connecting Wappler to a local LLM, please share the setup steps.

Thanks!

1 Like

In your Wappler options under AI set Chat Provider to Custom.

Then set the Base URL to the address where the LM Studio server is running, e.g. http://localhost:1234/v1.
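If you want to sanity-check the LM Studio server outside Wappler first, here is a minimal sketch using the openai Python package against LM Studio's OpenAI-compatible endpoint. It assumes the default port 1234 and that a model is already loaded; the model key below is a placeholder, and the API key can be any non-empty string since the local server does not check it.

from openai import OpenAI

# Point the standard OpenAI client at LM Studio's local, OpenAI-compatible server.
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # placeholder; the local server ignores it
)

# List the models the server currently exposes.
for m in client.models.list().data:
    print(m.id)

# Simple chat completion; "qwen2.5-7b-instruct" is a placeholder model key.
response = client.chat.completions.create(
    model="qwen2.5-7b-instruct",
    messages=[{"role": "user", "content": "Say hello from LM Studio."}],
)
print(response.choices[0].message.content)

# Streaming uses the same protocol as OpenAI's hosted API.
stream = client.chat.completions.create(
    model="qwen2.5-7b-instruct",
    messages=[{"role": "user", "content": "Stream a short haiku."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")

Since the server speaks the plain OpenAI protocol over HTTP on localhost, no extra authentication or HTTPS setup is normally needed beyond the Custom provider and the Base URL.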

LM Studio as a Local LLM API Server | LM Studio Docs
OpenAI Compatibility API | LM Studio Docs

2 Likes

Thanks @patrick :star_struck:

Hi Patrick, look at this! Which model do you recommend using in Wappler for a MacBook M4 Pro with 24 GB of RAM?

You can increase the context length in LM Studio for the Qwen models (up to one million tokens in Qwen2.5-7B-Instruct-1M / 14B-Instruct-1M, and 128K tokens in Qwen2.5-7B-Instruct). A quick search should reveal how to do this, @AdrianoLuiz.

Can You Increase the Context Length?

1. Model-dependent support

  • Some Qwen2.5-Instruct models (like the standard Instruct variants) typically support up to 32K tokens natively.
  • However, the Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M models support up to 1 million tokens, a huge jump, achieved with methods such as Dual Chunk Attention and length extrapolation. (Hugging Face, Bind AI IDE, Qwen, Simon Willison's Weblog)

2. Enabling Extended Context via YaRN

  • For Qwen2.5-7B-Instruct, you can extend the context window from 32K to 128K tokens by enabling YaRN (a RoPE scaling technique). (GitHub)
  • However, this currently uses static YaRN (available via vLLM): while it does extend the context window, it may slightly reduce performance on short-context tasks. (GitHub)
  • Dynamic YaRN (which adjusts based on input length) is currently only available via the Alibaba Cloud ModelStudio API. (GitHub)
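For reference, the recipe behind the GitHub link above comes from Qwen's own documentation: static YaRN is enabled by adding a rope_scaling block to the model's config.json before serving it with a framework such as vLLM. A minimal sketch, assuming a local copy of the model and the keys as published for Qwen2.5 (verify against the current model card):

import json
from pathlib import Path

# Hypothetical local path to the downloaded model; adjust to your setup.
config_path = Path("Qwen2.5-7B-Instruct/config.json")
config = json.loads(config_path.read_text())

# Static YaRN settings as described in the Qwen2.5 model card: 32K x 4 = ~128K tokens.
config["rope_scaling"] = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

config_path.write_text(json.dumps(config, indent=2))

Note that LM Studio itself runs GGUF/MLX builds, where the closest knobs are the rope and context settings covered in the next point.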

3. LM Studio capabilities

  • LM Studio allows customizing context length when loading a model. For example:
import lmstudio as lms
model = lms.llm("qwen2.5-7b-instruct", config={
    "contextLength": 8192,
    # other settings...
})
  • You can set contextLength, ropeFrequencyBase, ropeFrequencyScale, and other advanced parameters. (LM Studio docs)
  • However, some users have encountered errors when trying to load models with very large context lengths, even if the model claims to support them, depending on your hardware constraints. (GitHub)

Summary Table

| Model / Setting | Native Context Support | Extended Support |
| --- | --- | --- |
| Qwen2.5-7B-Instruct | 32K tokens | Up to 128K via static YaRN |
| Qwen2.5-7B-Instruct-1M / 14B-Instruct-1M | Up to 1M tokens | Fully supported via DCA, etc. |
| LM Studio configuration | Depends on hardware | Customizable via contextLength, ropeFrequencyScale, etc. |

What You Can Do

  1. Check the model's context length in LM Studio using model.getContextLength() before running. (LM Studio docs)
  2. If you're using a 1M-capable model, you can indeed set contextLength up to 1,000,000 in your load config, assuming your GPU/VRAM can handle it (e.g., 120 GB+ for 7B-Instruct-1M, 320 GB+ for 14B-Instruct-1M). (Simon Willison's Weblog, Bind AI IDE)
  3. If you're on Qwen2.5-7B-Instruct (standard) and want to push beyond 32K tokens, you can enable YaRN via the model config, with caution: this may impact short-context performance. (GitHub)
  4. Test different configurations and monitor memory consumption. If loading fails at very high values, lower the context length until it loads reliably; a rough load-and-fall-back sketch is shown below.
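A rough sketch of steps 1 and 4 with the lmstudio Python package (method and parameter names follow the LM Studio Python SDK docs at the time of writing and should be treated as assumptions to verify against the current SDK):

import lmstudio as lms

# Try progressively smaller context lengths until the model loads on this machine.
# "qwen2.5-7b-instruct" is a placeholder model key; use the key shown in LM Studio.
for context_length in (131072, 65536, 32768, 16384, 8192):
    try:
        model = lms.llm("qwen2.5-7b-instruct", config={"contextLength": context_length})
    except Exception as error:  # assumed to surface out-of-memory / load failures
        print(f"Loading with a {context_length}-token context failed: {error}")
        continue
    print("Loaded with context length:", model.get_context_length())
    break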
1 Like

thanks

1 Like