Description
The Vision Interface Asset (IA) is an off-the-shelf component of Intuiface that enables you to submit an image to the OpenAI GPT Large Language Model (LLM) (specifically, GPT-4 Turbo) and then ask questions about its contents. GPT sends its answer to the running experience. This IA uses the OpenAI Chat Completion API, accessing the same LLM used by ChatGPT.
Uses of the Vision IA include making your experiences more accessible to the visually impaired or crafting detailed descriptions of images on the fly for your interactive exhibition or catalog. You could even use the Whisper IA to capture audience questions about an image in your experience.
Adding the Vision IA into your experience
The Vision Interface Asset can be added to any experience using the "Add an interface asset" option located within Composer's Interface Assets panel.
Configuring the Vision IA
To use the Vision Interface Asset, you must assign it an OpenAI API key. An API key is acquired by creating an OpenAI account and then purchasing tokens. (Each GPT prompt and response consumes tokens, and those tokens have a cost.)
Once you have an account and purchased tokens, you can find your key on the OpenAI API Key page.
This key should be pasted into the "OpenAI API key" property for the Vision Interface Asset:
Creating and sending a prompt
To send an image and a question about that image, call the Vision IA's "Analyze image" action, which has five parameters:
- Detail
  Controls how the model processes the image and generates its textual understanding. It can have three possible values:
  - auto (the default): looks at the image input size and decides whether to use the low or high setting
  - low: a low-res 512px x 512px version of the image is uploaded. This enables the API to return faster responses and consume fewer input tokens for use cases that do not require high detail.
  - high: starts with the low-res image and then analyzes detailed crops of 512px squares derived from the original image. It takes longer to process but yields more accurate, detailed results.
- Maximum Tokens
  A limit you can place on the length of the response. See OpenAI's definition of tokens, which can be summarized as "a piece of a word". An OpenAI rule of thumb is that 100 tokens represent about 75 words, so this IA's default of 20 tokens permits a response of roughly 15 words.
- Image
  Can contain 1) an image URL, 2) a local image file path, or 3) a base64 string. The latter is a technique for representing binary data as text.
- Prompt
  A question about the contents of the specified image.
- Model
  The GPT model version you'd like to use. This defaults to gpt-4-vision-preview, but we suggest changing the value to gpt-4-turbo, as the default is intended only for testing purposes.
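For readers curious about what "Analyze image" does behind the scenes, the sketch below builds the kind of Chat Completions request body these parameters map onto. It is a hypothetical illustration, not the IA's actual implementation; the function name `build_vision_payload` is invented here, and the `data:image/jpeg` prefix assumes a JPEG file.

```python
import base64

def build_vision_payload(prompt, image, detail="auto", max_tokens=20, model="gpt-4-turbo"):
    """Build a Chat Completions request body like the one the Vision IA sends.

    `image` may be an http(s) URL, or a local file path that is converted
    to a base64 data URL (the third form the Image parameter accepts).
    """
    if image.startswith(("http://", "https://")):
        url = image
    else:
        with open(image, "rb") as f:
            encoded = base64.b64encode(f.read()).decode("ascii")
        url = f"data:image/jpeg;base64,{encoded}"  # assumes a JPEG file
    return {
        "model": model,
        "max_tokens": max_tokens,  # caps the response length, in tokens
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": url, "detail": detail}},
            ],
        }],
    }
```

Note how the Detail value travels alongside the image itself, while Maximum Tokens and Model sit at the top level of the request.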
Additional triggers and actions
Triggers
- Response received
  Raised when a response has been received following the use of the "Analyze image" action.
  - Read-only parameter: Response
    Contains the response.
- Error message received
  Raised when an error is returned instead of a response.
  - Read-only parameter: Error message
    Contains the error message.
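These two triggers roughly correspond to the two shapes a Chat Completions reply can take: a JSON body with a `choices` array on success, or one with an `error` object on failure. A minimal sketch of dispatching between them (the function name `dispatch_response` is invented for illustration):

```python
def dispatch_response(reply):
    """Mimic the Vision IA's two triggers: return ("error", message) for an
    API error body, or ("response", text) for a successful reply."""
    if "error" in reply:
        return ("error", reply["error"].get("message", "unknown error"))
    return ("response", reply["choices"][0]["message"]["content"])
```

In Composer you never parse this JSON yourself; the IA does it for you and exposes the result through the Response and Error message read-only parameters.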
Actions
- Set OpenAI API key
  Sets the OpenAI API key.