Givon AI API models

One contract for every model: { type, model, input }. Each model has its own input schema, token price, and ready-to-use cURL, Python, and JS snippets. Generation runs asynchronously.
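As a rough illustration of the contract above, the sketch below assembles a request body of the shape { type, model, input } in Python. The `type` value "video" and the `prompt` and `duration` keys inside `input` are assumptions made for illustration; each model defines its own input schema. The endpoint and authentication are left out because this page does not specify them.

```python
import json


def build_request(task_type: str, model: str, model_input: dict) -> dict:
    """Assemble a request body in the { type, model, input } shape."""
    if not model:
        raise ValueError("model must be a non-empty model id")
    # Copy the input so later changes to the caller's dict do not leak in.
    return {"type": task_type, "model": model, "input": dict(model_input)}


# Hypothetical values: "video", "prompt", and "duration" are not confirmed
# by the catalog; check the model's own input schema before sending.
payload = build_request(
    "video",
    "kling-3.0",
    {"prompt": "a fox running through snow", "duration": 5},
)
body = json.dumps(payload)  # JSON string ready to send as a request body
```

Because generation runs asynchronously, a real client would presumably submit this body and then retrieve the result in a separate step; how that step works is not described here.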

Video (24 models)

Gemini Omni Flash: from 20 tokens
gemini-omni-video (Google)

Google multimodal model that builds video from text, images, and video, and can edit an existing clip conversationally. Use it when you need to transform footage or mix inputs rather than generate a shot from scratch. Native audio, up to 4K.

Grok Imagine Video: from 1 token/s
grok-imagine-video (xAI)

Fast short-form video with native synced audio and strong prompt following. It can continue from the last frame, making scene stitching easier. 480p/720p.

Grok Imagine Video 1.5: from 2 tokens/s
grok-imagine-video-1.5 (xAI)

xAI image-to-video: animates a single source frame with native audio and strong prompt following, with clips up to 15 seconds. Ranked #1 on the image-to-video arena.

Hailuo 2.3: from 8 tokens
hailuo-2.3 (MiniMax)

Best-in-class facial acting, emotions, and micro-expressions, plus believable body-motion physics. Use it for emotional face-focused shots.

HappyHorse 1.0: from 6 tokens/s
happyhorse-1.0 (Alibaba)

Alibaba's top video model: produces a clip with synced audio and lip sync in one pass. Use it for cinematic multi-shot clips with a prepared voiceover; inputs can be text, a frame, references, or a source video to edit. 720p/1080p.

HeyGen Avatar IV: 3 tokens/s
heygen-photo-avatar (HeyGen)

Talking avatar from a single photo: the model reads vocal tone and rhythm, then builds lifelike expressions and hand gestures. Sync from text or an existing voiceover.

Kling 2.6: from 3 tokens/s
kling-2.6 (Kling)

Native audio in a single pass: speech, ambience, and effects are generated directly in-frame without separate dubbing. Use it for budget clips and talking heads when multi-scene control is not needed.

Kling 2.6 Motion: 8 tokens/s
kling-2.6-motion (Kling)

Affordable motion-control: transfers movement from a video reference to your character. Use it for simpler motion when 3.0-tier precision is not required.

Kling 3.0: from 3 tokens/s
kling-3.0 (Kling)

Kling flagship: up to 15 seconds and 4K, stable character identity across scenes, multi-scene direction, and native multilingual audio.

Kling 3.0 Motion: 7 tokens/s
kling-3.0-motion (Kling)

Transfers recorded movement, dance, or gestures from a video sample to your full-body character while locking face identity and capturing complex motion. Use it when choreography fidelity and appearance consistency matter.

Kling 3.0 Omni: from 6.4 tokens/s
kling-3.0-omni (Kling)

Multi-scene video with native audio: transfers a character's appearance and voice from a video sample into new scenes. Note that audio must be disabled when a video sample is used. Use it for coherent narratives with one hero.

Kling Avatar 2.0: from 6 tokens/s
kling-digital-human (Kling)

Animates a person from a photo with voiceover: lip sync, natural expressions, and gestures. Useful when you need a speaking or singing presenter from one portrait.

Kling Lip-Sync: 6 tokens/s
kling-lip-sync (Kling)

Retargets lip motion in an existing video to a new audio track. Use it when the video is already shot and you only need dubbing, localization, or speech replacement.

Kling O1: from 5.5 tokens/s
kling-o1 (Kling)

Combines up to 7 angles of one subject through Elements and keeps its appearance strictly consistent through the entire clip. Use it for character turnarounds, recurring heroes, and product demos.

Seedance 2.0: from 4.5 tokens/s
seedance-2.0 (BytePlus)

Follows director-style commands such as angle, camera motion, and shot changes through text, with audio generated in one pass. Use it for cinematic reference-guided shots up to 1080p.

Seedance 2.0 Fast: from 3.5 tokens/s
seedance-2.0-fast (BytePlus)

The same cinematic feel and camera control, but noticeably faster for iterations and volume. Native audio and references, up to 720p.

Seedance 2.0 Fast Relaxed: from 4.5 tokens/s
seedance-2.0-fast-relaxed (BytePlus)

Fast mode with less strict moderation for iterating on complex reference scenes with images, video, and audio. Native audio, up to 720p.

Seedance 2.0 Relaxed: from 5.5 tokens/s
seedance-2.0-relaxed (BytePlus)

Less strict moderation mode for Seedance 2.0, useful when the standard check blocks complex character and reference scenes with images, video, or audio. Native audio and clips up to 1080p.

SwitchX Video: from 9 tokens/s
switchx-video (Beeble)

Changes the background, object, or lighting in existing footage from text, one reference, and an optional mask while preserving the subject, shape, motion, and expressions. Duration comes from the source video; output is 720p or 1080p.

Veo 3.1: from 14 tokens
veo-3.1 (Google)

Google's flagship model for premium cinematic shots: up to 4K video with native synced audio including dialogue, sound effects, and ambience out of the box. Up to 3 references keep character and style stable.

Veo 3.1 Fast: from 2 tokens/s
veo-3.1-fast (Google)

The same sharpness up to 4K and native audio as the flagship, but noticeably faster and cheaper. A workhorse for iterations and most production tasks.

Veo 3.1 Lite: from 3 tokens/s
veo-3.1-lite (Google)

The most affordable Veo tier: up to 1080p (no 4K), with native audio that can be toggled on or off. Use it for high-volume social content when 4K is unnecessary.

Wan 2.7 R2V: from 6 tokens/s
wan-2.7-r2v (Alibaba)

Uses up to 5 references, including images, video, or audio, to lock hero appearance and voice across shots for episodic content with consistent characters.

Wan 2.7 Video: from 6 tokens/s
wan-2.7-video (Alibaba)

Video generation and editing in one engine: from text, from a photo, with a target final frame, or by editing an existing clip from a description. Up to 1080p.
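To make the per-second pricing concrete, here is a minimal sketch that estimates the lowest possible token cost of a clip from its duration. The rates are copied from the "from N tokens/s" labels above for a handful of models; since those are floor prices, the real cost may be higher depending on resolution, audio, or other options. The function name and the assumption that cost scales linearly with duration are illustrative, not confirmed by this page. Models priced in plain tokens rather than tokens/s (for example Veo 3.1) are left out.

```python
# Floor rates in tokens per second, taken from the listing above.
FLOOR_RATE_PER_SECOND = {
    "grok-imagine-video": 1,
    "veo-3.1-fast": 2,
    "kling-3.0": 3,
    "seedance-2.0": 4.5,
    "kling-3.0-omni": 6.4,
}


def min_token_cost(model: str, seconds: float) -> float:
    """Lower-bound token cost, assuming cost = floor rate x duration."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if model not in FLOOR_RATE_PER_SECOND:
        raise KeyError(f"no per-second floor rate recorded for {model!r}")
    return FLOOR_RATE_PER_SECOND[model] * seconds


# A 15-second Kling 3.0 clip at the floor rate of 3 tokens/s.
estimate = min_token_cost("kling-3.0", 15)  # 45 tokens at minimum
```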

Images (11 models)

ChatGPT Images 2.0: from 1 token
gpt-image-2 (OpenAI)

OpenAI image model that reasons about composition: excellent text rendering across dozens of languages and close instruction following. Use it for infographics, slides, multilingual posters, and full-image edits in 1K, 2K, or 4K.

Grok Imagine: 2 tokens
grok-imagine (xAI)

Base xAI image tier: generate and edit full images from text without masks, and compose from several references. Use it for quick concepts and conversational edits when Pro-level precision is not required.

Grok Imagine Pro: 2 tokens
grok-imagine-pro (xAI)

Higher tier of Grok Imagine: more detail, cleaner in-frame text, and stronger composition control from detailed prompts. Use it when the base tier is not sharp enough.

Nano Banana: 2 tokens
nano-banana (Google)

Entry tier in Google's image family: the most affordable 1K image generation. Conversational editing and reference blending make it useful for volume work and quick drafts.

Nano Banana 2: from 3 tokens
nano-banana-2 (Google)

Near-flagship Google quality at Flash speed: up to 4K, clean text, and consistent characters from references. Use it when you want Pro-level output faster and cheaper.

Nano Banana Pro: from 6 tokens
nano-banana-pro (Google)

Google's flagship image model: maximum detail and the sharpest in-frame text in the family. Use it for complex brand scenes from up to 8 references and multi-object compositions up to 4K.

Seedream 4.5: from 2 tokens
seedream-4.5 (BytePlus)

Cinematic lighting and stable character identity across generations. Use it for product catalogs, character sheets, and reference-guided edits; a reliable workhorse with 2K/4K output and up to 14 references.

Seedream 5: from 2 tokens
seedream-5 (BytePlus)

Reasons over complex prompts and can search the web, assembling multi-object scenes and topical visuals. Supports example-based reference edits and output up to 3K.

SwitchX Image: from 6 tokens
switchx-image (Beeble)

Beeble relighting and compositing: transfers an object, background, or light from text, one reference, and an optional mask onto the source photo with physically consistent lighting instead of generating from scratch. Output is 720p or 1080p.

Wan 2.7 Image: from 3 tokens
wan-2.7-image (Alibaba)

Portrait-first image model: control facial features, makeup, and hairstyle through references. Use it for avatars, beauty assets, and consistent character series up to 2K.

Wan 2.7 Image Pro: from 6 tokens
wan-2.7-image-pro (Alibaba)

Wan's 4K tier with prompt reasoning: follows complex multi-step instructions and in-frame text more accurately, including tables and formulas. Use it for demanding deliverables such as posters and packaging.
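Several descriptions above state how many references a model accepts. As a small sketch, a client could check the count before submitting a job. The limits below are copied from those descriptions; the function name is hypothetical, and models with no stated limit are simply allowed through, which is an assumption rather than documented behavior.

```python
# Maximum reference counts as stated in the model descriptions above.
MAX_REFERENCES = {
    "switchx-video": 1,
    "switchx-image": 1,
    "veo-3.1": 3,
    "wan-2.7-r2v": 5,
    "nano-banana-pro": 8,
    "seedream-4.5": 14,
}


def check_reference_count(model: str, count: int) -> bool:
    """Return True if `count` references fit the listed limit for `model`.

    Models without a listed limit return True (an assumption).
    """
    if count < 0:
        raise ValueError("count must not be negative")
    limit = MAX_REFERENCES.get(model)
    return limit is None or count <= limit


ok = check_reference_count("seedream-4.5", 14)      # within the limit of 14
too_many = check_reference_count("veo-3.1", 4)      # exceeds the limit of 3
```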