Givon AI API models
One contract for every model: { type, model, input }. Each model has its own input schema, token price, and ready-to-use cURL, Python, and JS snippets. Generation runs asynchronously.
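As a minimal sketch of the shared contract, the snippet below builds a `{ type, model, input }` payload and polls for an asynchronous result. Only the three top-level keys come from the description above. The `"video"` type value, the `prompt` input field, the `status` key with its `"succeeded"` and `"failed"` values, and the `fetch_status` callable are hypothetical placeholders; the real fields are defined by each model's input schema.

```python
import json
import time
from typing import Any, Callable, Dict


def build_request(type_: str, model: str, input_: Dict[str, Any]) -> Dict[str, Any]:
    """Build the common request body: {type, model, input}."""
    if not type_ or not model:
        raise ValueError("type and model are required")
    # Copy the input so later changes by the caller do not alter the payload.
    return {"type": type_, "model": model, "input": dict(input_)}


def wait_for_result(
    fetch_status: Callable[[], Dict[str, Any]],
    interval: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the job reaches a terminal state.

    fetch_status is a caller-supplied function (hypothetical) that returns
    the current job state as a dict. The "status" key and its values are
    assumptions, not the documented response format.
    """
    for _ in range(max_attempts):
        state = fetch_status()
        if state.get("status") in ("succeeded", "failed"):
            return state
        sleep(interval)
    raise TimeoutError("generation did not finish within the polling budget")


if __name__ == "__main__":
    # "prompt" is an assumed input field; check the model's own schema.
    payload = build_request("video", "veo-3.1-fast", {"prompt": "A sunrise over the sea"})
    print(json.dumps(payload, indent=2))
```

Passing the status lookup in as a callable keeps the sketch independent of any particular endpoint or HTTP client, so it can be wired to whichever snippet (cURL, Python, or JS) the model page provides.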
Video (24)
gemini-omni-video (Google): Google multimodal model that builds video from text, images, and video, and can edit an existing clip conversationally. Use it when you need to transform footage or mix inputs rather than generate a shot from scratch. Native audio, up to 4K.
grok-imagine-video (xAI): Fast short-form video with native synced audio and strong prompt following. It can continue from the last frame, making scene stitching easier. 480p/720p.
grok-imagine-video-1.5 (xAI): xAI image-to-video: animates a single source frame with native audio and strong prompt following, with clips up to 15 seconds. Ranked #1 on the image-to-video arena.
hailuo-2.3 (MiniMax): Best-in-class facial acting, emotions, and micro-expressions, plus believable body-motion physics. Use it for emotional face-focused shots.
happyhorse-1.0 (Alibaba): Alibaba's top video model: produces a clip with synced audio and lip sync in one pass. Use it for cinematic multi-shots with prepared voiceover, from text, a frame, references, or source-video edits. 720p/1080p.
heygen-photo-avatar (HeyGen): Talking avatar from a single photo: the model reads vocal tone and rhythm, then builds lifelike expressions and hand gestures. Sync from text or an existing voiceover.
kling-2.6 (Kling): Native audio in a single pass: speech, ambience, and effects are generated directly in-frame without separate dubbing. Use it for budget clips and talking heads when multi-scene control is not needed.
kling-2.6-motion (Kling): Affordable motion-control: transfers movement from a video reference to your character. Use it for simpler motion when 3.0-tier precision is not required.
kling-3.0 (Kling): Kling flagship: up to 15 seconds and 4K, stable character identity across scenes, multi-scene direction, and native multilingual audio.
kling-3.0-motion (Kling): Transfers recorded movement, dance, or gestures from a video sample to your full-body character while locking face identity and capturing complex motion. Use it when choreography fidelity and appearance consistency matter.
kling-3.0-omni (Kling): Multi-scene video with native audio: transfers a character's appearance and voice from a video sample into new scenes, though audio must be disabled when that video sample is used. Use it for coherent narratives with one hero.
kling-digital-human (Kling): Animates a person from a photo with voiceover: lip sync, natural expressions, and gestures. Useful when you need a speaking or singing presenter from one portrait.
kling-lip-sync (Kling): Retargets lip motion in an existing video to a new audio track. Use it when the video is already shot and you only need dubbing, localization, or speech replacement.
kling-o1 (Kling): Combines up to 7 angles of one subject through Elements and keeps its appearance strictly consistent through the entire clip. Use it for character turnarounds, recurring heroes, and product demos.
seedance-2.0 (BytePlus): Follows director-style commands such as angle, camera motion, and shot changes through text, with audio generated in one pass. Use it for cinematic reference-guided shots up to 1080p.
seedance-2.0-fast (BytePlus): The same cinematic feel and camera control, but noticeably faster for iterations and volume. Native audio and references, up to 720p.
seedance-2.0-fast-relaxed (BytePlus): Fast mode with less strict moderation for iterating on complex reference scenes with images, video, and audio. Native audio, up to 720p.
seedance-2.0-relaxed (BytePlus): Less strict moderation mode for Seedance 2.0, useful when the standard check blocks complex character and reference scenes with images, video, or audio. Native audio and clips up to 1080p.
switchx-video (Beeble): Changes the background, object, or lighting in existing footage from text, one reference, and an optional mask while preserving the subject, shape, motion, and expressions. Duration comes from the source video; output is 720p or 1080p.
veo-3.1 (Google): Google's flagship model for premium cinematic shots: up to 4K video with native synced audio including dialogue, sound effects, and ambience out of the box. Up to 3 references keep character and style stable.
veo-3.1-fast (Google): The same sharpness up to 4K and native audio as the flagship, but noticeably faster and cheaper. A workhorse for iterations and most production tasks.
veo-3.1-lite (Google): The most affordable Veo tier: up to 1080p without 4K and native audio that can be toggled. Use it for high-volume social content when 4K is unnecessary.
wan-2.7-r2v (Alibaba): Uses up to 5 references, including images, video, or audio, to lock hero appearance and voice across shots for episodic content with consistent characters.
wan-2.7-video (Alibaba): Video generation and editing in one engine: from text, from a photo, with a target final frame, or by editing an existing clip from a description. Up to 1080p.
Images (11)
gpt-image-2 (OpenAI): OpenAI image model that reasons about composition: excellent text rendering across dozens of languages and close instruction following. Use it for infographics, slides, multilingual posters, and full-image edits in 1K, 2K, or 4K.
grok-imagine (xAI): Base xAI image tier: generate and edit full images from text without masks, and compose from several references. Use it for quick concepts and conversational edits when Pro-level precision is not required.
grok-imagine-pro (xAI): Higher tier of Grok Imagine: more detail, cleaner in-frame text, and stronger composition control from detailed prompts. Use it when the base tier is not sharp enough.
nano-banana (Google): Entry tier in Google's image family: the most affordable 1K image generation. Dialog editing and reference blending make it useful for volume work and quick drafts.
nano-banana-2 (Google): Near-flagship Google quality at Flash speed: up to 4K, clean text, and consistent characters from references. Use it when you want Pro-level output faster and cheaper.
nano-banana-pro (Google): Google's flagship image model: maximum detail and the sharpest in-frame text in the family. Use it for complex brand scenes from up to 8 references and multi-object compositions up to 4K.
seedream-4.5 (BytePlus): Cinematic lighting and stable character identity across generations. Use it for product catalogs, character sheets, and reference-guided edits; a reliable workhorse with 2K/4K output and up to 14 references.
seedream-5 (BytePlus): Reasons over complex prompts and can search the web, assembling multi-object scenes and topical visuals. Supports example-based reference edits and output up to 3K.
switchx-image (Beeble): Beeble relighting and compositing: transfers an object, background, or light from text, one reference, and an optional mask onto the source photo with physically consistent lighting instead of generating from scratch. 720p and 1080p.
wan-2.7-image (Alibaba): Portrait-first image model: control facial features, makeup, and hairstyle through references. Use it for avatars, beauty assets, and consistent character series up to 2K.
wan-2.7-image-pro (Alibaba): Wan's 4K tier with prompt reasoning: follows complex multi-step instructions and in-frame text more accurately, including tables and formulas. Use it for demanding deliverables such as posters and packaging.