Open Source · ComfyUI Custom Node
$ vllm-prompt-node --version 1.0.0

LOCAL
LLM
NODE

A ComfyUI node that connects to your local vLLM server and generates Stable Diffusion prompts on the fly. Wildcard expansion, prefix tags, live preview. No cloud. No API keys. Runs entirely on your hardware.

Installation

01 · Clone into custom_nodes
cd ComfyUI/custom_nodes
git clone https://github.com/OATH-Studio/comfy-vLLM
02 · Install the dependency
pip install requests
03 · Start vLLM
vllm serve ./models/Qwen2.5-3B \
  --host 0.0.0.0 \
  --port 8765 \
  --served-model-name Qwen2.5-3B
04 · Restart ComfyUI
# Find the node under utils/llm
# Wire: vLLM Prompt → CLIPTextEncode → KSampler

Workflow Connection

vLLM Prompt Node → CLIPTextEncode [positive] → KSampler

The single combined_prompt output is already assembled as prefix, generated_text — wire it straight into CLIPTextEncode.
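As a sketch, the assembly amounts to joining the non-empty parts with a comma (`combine_prompt` is a hypothetical name, not necessarily the node's actual function):

```python
def combine_prompt(prefix: str, generated_text: str) -> str:
    """Assemble the combined_prompt string: prefix first, then model output."""
    # Skip empty parts so a blank prefix doesn't leave a dangling comma.
    parts = [p.strip() for p in (prefix, generated_text) if p and p.strip()]
    return ", ".join(parts)
```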

Features

01 · Auto Model Detection

Queries /v1/models at runtime. No model name to configure — swap models in vLLM and the node picks it up automatically.
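A minimal sketch of that detection, assuming the standard OpenAI-compatible /v1/models response shape (`first_model_id` and `detect_model` are hypothetical names):

```python
def first_model_id(payload: dict) -> str:
    """Pick the first model id from a /v1/models response body."""
    data = payload.get("data", [])
    if not data:
        raise RuntimeError("vLLM reported no served models")
    return data[0]["id"]

def detect_model(host: str = "localhost", port: int = 8765) -> str:
    """Ask the running vLLM server which model it is serving."""
    import requests  # the node's single dependency
    resp = requests.get(f"http://{host}:{port}/v1/models", timeout=5)
    resp.raise_for_status()
    return first_model_id(resp.json())
```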

02 · Wildcard Expansion

Use {red|blue|green} syntax anywhere in your prompt. Multiple wildcards resolve independently before hitting the model.
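The resolution can be sketched with Python's re module, matching innermost groups first so nested wildcards also resolve (`expand_wildcards` is a hypothetical name, not the node's actual function):

```python
import random
import re

# Matches the innermost {a|b|c} group: no nested braces inside the match.
_WILDCARD = re.compile(r"\{([^{}]*)\}")

def expand_wildcards(text: str, rng=None) -> str:
    """Resolve every {a|b|c} choice before the prompt is sent to vLLM."""
    rng = rng or random.Random()
    while True:
        m = _WILDCARD.search(text)
        if m is None:
            return text
        choice = rng.choice(m.group(1).split("|")).strip()
        text = text[: m.start()] + choice + text[m.end():]
```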

03 · Prefix Tags

A dedicated prefix field for quality anchors like "masterpiece, best quality". Prepended to output, never sent to the model.

04 · Live Preview

The node face shows a three-section breakdown: prefix, raw generated text, and the final combined string after each run.

05 · Retry Logic

Configurable retry count handles empty responses, timeouts, and transient vLLM errors without failing the whole workflow.
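A sketch of such a retry loop, treating empty strings and exceptions the same way (`generate_with_retries` is a hypothetical helper, not the node's actual code):

```python
import time

def generate_with_retries(call, retries: int = 3, backoff: float = 1.0) -> str:
    """Invoke call() up to `retries` times, retrying on errors and empty output."""
    last_err = None
    for attempt in range(retries):
        try:
            text = call()
            if text and text.strip():
                return text.strip()
            last_err = ValueError("empty response")
        except Exception as err:  # timeouts, connection errors, transient vLLM failures
            last_err = err
        time.sleep(backoff * (attempt + 1))  # simple linear backoff between attempts
    raise RuntimeError(f"no usable response after {retries} attempts") from last_err
```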

06 · Completions Format

Uses structured completion prompts with stop sequences so the model returns tags only — no preamble, no sign-off, no filler.

Wildcard Syntax

Wildcards are resolved before the prompt reaches vLLM. Each run picks a different combination, so you can generate a wide variety of prompts from a single template.

Input Prompt

A {red|blue|green} dragon,
{breathing fire into the sky|
 coiled around a mountain peak in a storm|
 diving into a glowing ocean abyss|
 rearing up against a blood moon}

After Expansion → vLLM

A blue dragon, diving
into a glowing ocean abyss

Combined Output

masterpiece, best quality, highres,
blue dragon, deep ocean, bioluminescent
glow, ancient scales, cinematic
underwater light rays, epic fantasy

Node Inputs

Input         Type     Default         Description
prompt        STRING   (none)          Instruction sent to vLLM. Supports {wild|card} syntax.
prefix        STRING   masterpiece…    Quality tags prepended to output. Not sent to the model.
host          STRING   localhost       vLLM server host.
port          INT      8765            vLLM server port.
max_tokens    INT      128             Maximum tokens to generate.
temperature   FLOAT    0.7             Sampling temperature. Lower = more consistent.
retries       INT      3               Retry attempts on empty or failed responses.
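These inputs map onto a /v1/completions request roughly like this (a sketch; `build_payload` and `request_completion` are hypothetical names):

```python
def build_payload(prompt: str, model: str, max_tokens: int = 128,
                  temperature: float = 0.7) -> dict:
    """JSON body for POST /v1/completions, mirroring the node inputs above."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": ["\n", "###", "Input:"],
    }

def request_completion(host: str, port: int, payload: dict,
                       timeout: float = 30.0) -> str:
    """Send the payload to vLLM and return the generated text."""
    import requests  # the node's single dependency
    resp = requests.post(f"http://{host}:{port}/v1/completions",
                         json=payload, timeout=timeout)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]
```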

Model Guide

Tested with the Qwen2.5 family. The node works with any model vLLM can serve, but instruction-following quality varies significantly with size.

Qwen2.5-0.5B   Too small — unreliable instruction following
Qwen2.5-1.5B   Usable — occasional filler
Qwen2.5-3B     Recommended — clean output, reliable format
Qwen2.5-32B    Best quality — overkill for most workflows

Completion Prompt Format

### Stable Diffusion prompt tags (comma separated, no sentences):
Input: <your expanded prompt>
Output: ← model continues here, stops at first newline

Stop sequences [\n, ###, Input:] prevent the model from running past a single line. If output is still conversational, lower temperature to 0.3–0.5 or upgrade to a 3B+ model.
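Building that completion prompt is a simple template fill; this sketch assumes the wildcards have already been expanded (`build_completion_prompt` is a hypothetical name):

```python
def build_completion_prompt(expanded_prompt: str) -> str:
    """Wrap the expanded prompt in the completion template; the model
    continues after 'Output:' and the stop sequences cut it off at one line."""
    return (
        "### Stable Diffusion prompt tags (comma separated, no sentences):\n"
        f"Input: {expanded_prompt}\n"
        "Output:"
    )
```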

FAQ

Do I need an API key or internet connection?

No. Everything runs locally. The node connects to your vLLM instance over localhost. No data leaves your machine.

Which vLLM version is required?

vLLM 0.4 or later. The node uses the standard OpenAI-compatible /v1/completions endpoint which has been stable since 0.4.

Why not use the chat completions endpoint?

The completions endpoint with a structured prompt and stop sequences gives more predictable tag-only output from small models. Chat completions work well with larger models, but small models tend to drift into conversational replies when prompted that way.

Can I run multiple models?

The node queries /v1/models and uses the first result. If you run multiple vLLM instances on different ports, add multiple nodes to your workflow — one per port.

Why does the model still return conversational responses?

The model is likely too small. Qwen2.5-0.5B and 1.5B struggle with strict format following. Try 3B or larger, and drop temperature to 0.3–0.5.

Does the node cache responses?

No. IS_CHANGED returns NaN so ComfyUI re-runs the node every execution. Each generation calls vLLM fresh. If you want deterministic output, lower temperature toward 0.
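The trick works because NaN never compares equal to itself, so ComfyUI's change check fires on every run. A minimal sketch (the class name is hypothetical; the real node's name may differ):

```python
import math

class VLLMPromptNode:  # hypothetical class name
    @classmethod
    def IS_CHANGED(cls, *args, **kwargs):
        # NaN != NaN, so ComfyUI always sees a "changed" value and
        # re-executes the node instead of returning a cached result.
        return math.nan
```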

OATH Studio

Need Custom
AI Tooling?

This node is a small example of what we build. We design and develop custom AI pipelines, local inference tooling, ComfyUI integrations, and production workflows for studios and independent creators who want control over their stack.

  • Local LLM integration and prompt engineering
  • Custom ComfyUI nodes and workflow automation
  • vLLM / Ollama deployment and optimisation
  • End-to-end AI image and video pipelines
  • On-premise — your hardware, your data
Get In Touch

This project uses

Runtime            vLLM 0.4+
Endpoint           /v1/completions
Model detection    /v1/models
Stop sequences     \n · ### · Input:
Wildcard engine    Python re · recursive
ComfyUI hooks      IS_CHANGED · OUTPUT_NODE
Dependencies       requests

Free · Open Source · MIT License

YOUR MODEL.
YOUR PROMPTS.

Built by OATH Studio. We make open tools for AI artists and studios, and take on custom development work for teams who need something specific. No cloud dependencies. No subscriptions.