Generate Content

WiseRouter provides two primary endpoints for generation: completions and chat completions.

Both are OpenAI-compatible and powered by vLLM.

POST /completions

Best for simple generation tasks. Takes a single text prompt and continues it.

POST /chat/completions

Better for conversational interactions. Supports multiple messages with roles (user/assistant) and maintains conversation context.

📝 Examples

Completion Example

curl -X POST "https://api.wiserouter.ai/v1/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_API_KEY" \
  -d '{
    "prompt": "Write a poem about coding in Python:",
    "max_tokens": 400,
    "temperature": 0.7,
    "repetition_penalty": 1.1
  }'

Response:

{ "id": "cmpl-2h38kn309nvb1", "object": "text_completion", "created": 1703262150, "model": "redacted", "choices": [{ "text": "In lines of indented grace,\nPython slithers into place.\nWith modules, functions, and loops galore,\nBuilding programs we can't ignore.\n\nSimple syntax, readable and clear,\nMaking complex tasks appear near...", "index": 0, "logprobs": null, "finish_reason": "length" }], "usage": { "prompt_tokens": 8, "completion_tokens": 400, "total_tokens": 408 } }

Chat Completion Example

curl -X POST "https://api.wiserouter.ai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_API_KEY" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a helpful coding assistant." },
      { "role": "user", "content": "Write a Python function to calculate Fibonacci numbers." }
    ],
    "temperature": 0.7,
    "max_tokens": 400,
    "repetition_penalty": 1.1
  }'

Response:

{ "id": "chatcmpl-20osd348shx1", "object": "chat.completion", "created": 1703262150, "model": "redacted", "choices": [{ "message": { "role": "assistant", "content": "Here's a Python function to calculate Fibonacci numbers recursively:\n\ndef fibonacci(n):\nif n <= 1:\nreturn n\nreturn fibonacci(n-1) + fibonacci(n-2)\n\n# Example usage:\nfor i in range(10):\nprint(f'fibonacci({i}) = {fibonacci(i)}')" }, "logprobs": null, "finish_reason": "stop" }], "usage": { "prompt_tokens": 33, "completion_tokens": 89, "total_tokens": 122 } }

Note: to prevent reverse engineering, the model field in responses is redacted by default.
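
Because the messages array carries the conversation, a follow-up turn simply resends the history with the assistant's previous reply appended. A minimal sketch of the next request in the exchange above (the follow-up question and the truncated assistant content are illustrative):

curl -X POST "https://api.wiserouter.ai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_API_KEY" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a helpful coding assistant." },
      { "role": "user", "content": "Write a Python function to calculate Fibonacci numbers." },
      { "role": "assistant", "content": "Here is a Python function to calculate Fibonacci numbers recursively: ..." },
      { "role": "user", "content": "Can you rewrite it iteratively?" }
    ],
    "temperature": 0.7,
    "max_tokens": 400
  }'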

⚙️ Request Parameters

Required Parameters for /completions:

prompt

The prompt to generate completions for.

Required Parameters for /chat/completions:

messages

Array of message objects containing:

  • role: "system", "user", or "assistant"
  • content: message text

Optional Parameters for both:

router_preference

Set a routing preference for model selection:

  • balanced (default) - An optimal balance between powerful and efficient models
  • powerful - Favors more powerful models for best results but uses more credits
  • efficient - Prioritizes lighter models when possible to minimize credit usage
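
For example, a request that prioritizes credit efficiency adds the parameter alongside the usual fields (the prompt here is illustrative):

curl -X POST "https://api.wiserouter.ai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_API_KEY" \
  -d '{
    "router_preference": "efficient",
    "messages": [
      { "role": "user", "content": "Summarize in one sentence: the quick brown fox jumps over the lazy dog." }
    ],
    "max_tokens": 60
  }'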

max_tokens

Maximum number of completion tokens to generate per output sequence. Defaults to 16.

temperature

Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling. Defaults to 1.0.

repetition_penalty

Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values greater than 1 encourage the model to use new tokens, while values less than 1 encourage the model to repeat tokens. Defaults to 1.0.

top_p

Float that controls the cumulative probability of the top tokens to consider. Must be in [0, 1]. Set to 1 to consider all tokens. Defaults to 1.0.
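
As a sketch of how these sampling parameters combine, the request below aims for a focused longer completion: low temperature, a slightly trimmed nucleus via top_p, and a mild repetition_penalty (the values are illustrative starting points, not tuned recommendations):

curl -X POST "https://api.wiserouter.ai/v1/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_API_KEY" \
  -d '{
    "prompt": "List three practical uses of Python decorators:",
    "max_tokens": 300,
    "temperature": 0.3,
    "top_p": 0.9,
    "repetition_penalty": 1.05
  }'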

stream

Whether to stream back partial progress. Defaults to false.
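
When stream is true, OpenAI-compatible servers typically deliver partial results as server-sent events (a series of data: chunks ending with data: [DONE]); check the response if you need to confirm the exact framing. A minimal sketch:

# -N disables curl's output buffering so streamed chunks print as they arrive
curl -N -X POST "https://api.wiserouter.ai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_API_KEY" \
  -d '{
    "messages": [
      { "role": "user", "content": "Explain recursion in one paragraph." }
    ],
    "stream": true,
    "max_tokens": 200
  }'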

Check out the vLLM documentation for a full list of sampling parameters.

💡 Tips for Best Results

  • Use lower temperature values for more factual and focused responses
  • Use higher temperature values for more creative and varied outputs (a contrasting sketch follows this list)
  • Use repetition_penalty to reduce repetitive text in longer generations
  • Be specific and clear in your prompts - the better the context you provide, the better the results
  • While models can handle various context lengths, shorter, focused contexts often yield better performance. Consider breaking complex tasks into smaller chunks when possible
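
To make the first two tips concrete, here are two requests that differ mainly in temperature: a low value for a factual lookup and a higher value for creative writing (the values and prompts are illustrative):

# Factual and focused: low temperature
curl -X POST "https://api.wiserouter.ai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_API_KEY" \
  -d '{
    "messages": [{ "role": "user", "content": "In what year was Python 3.0 released?" }],
    "temperature": 0.2,
    "max_tokens": 50
  }'

# Creative and varied: higher temperature
curl -X POST "https://api.wiserouter.ai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_API_KEY" \
  -d '{
    "messages": [{ "role": "user", "content": "Write a limerick about debugging." }],
    "temperature": 1.0,
    "max_tokens": 120
  }'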