[V1.2.3] Add dedicated OCR and video analysis actions (#155)#196
Open
Collaborator
The current approach to video understanding can be improved. Of the providers we have set up, only Google's model/API supports video understanding (Qwen and Moonshot support it too, but we haven't added them as providers yet). Here is what you can do for now: if the user has a Google API key set up, use the Gemini API to perform video understanding; if no Google API key is set up, the agent can fall back to your current approach (which might be expensive and less effective). You can refer to the … I haven't tested the PR yet. Once this is improved, I will test it.
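For illustration, the selection logic requested above could look roughly like the sketch below. It is not the PR's actual code: `choose_video_backend`, `understand_video_sketch`, and the two backend callables are hypothetical names, and the key is passed in explicitly rather than read from any particular settings store.

```python
from typing import Callable, Optional

# A backend takes (video_path, prompt) and returns a text description.
# Both backends are hypothetical stand-ins for the real implementations.
VideoBackend = Callable[[str, str], str]


def choose_video_backend(
    google_api_key: Optional[str],
    gemini_backend: VideoBackend,
    keyframe_backend: VideoBackend,
) -> VideoBackend:
    """Pick the Gemini backend when a key is configured, else the fallback."""
    if google_api_key and google_api_key.strip():
        return gemini_backend
    # No usable key: fall back to the keyframe-based approach.
    return keyframe_backend


def understand_video_sketch(
    video_path: str,
    prompt: str,
    *,
    google_api_key: Optional[str],
    gemini_backend: VideoBackend,
    keyframe_backend: VideoBackend,
) -> str:
    """Dispatch a video question to whichever backend is available."""
    backend = choose_video_backend(google_api_key, gemini_backend, keyframe_backend)
    return backend(video_path, prompt)
```

Passing the backends in as callables keeps the choice testable without network access; the real action would presumably wire in the Gemini client and the keyframe path here.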
…d_video, OpenCV as fallback
…exceptions in describe_image_bytes
…r and understand_video
Collaborator
…model
- Merge generate_multimodal_multi_image into generate_multimodal (image_bytes_list param)
- Add json_mode param to describe_image_bytes; describe_image_ocr now a thin wrapper
- understand_video pulls model from get_vlm_model() with gemini-1.5-pro fallback
- Add test suites: gemini_client_multimodal, vlm_json_mode, ocr_wrapper, understand_video_model
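As a rough picture of two of the changes listed in this commit, the sketch below shows a thin OCR wrapper over a `json_mode`-aware describe function, and a model lookup that falls back to `gemini-1.5-pro`. The function names come from the commit message, but every signature, the `generate` callable, the `OCR_PROMPT` text, and the assumption that the OCR wrapper enables `json_mode` are hypothetical; the real code may differ.

```python
from typing import Callable, Optional

DEFAULT_VIDEO_MODEL = "gemini-1.5-pro"  # fallback named in the commit message
OCR_PROMPT = "Extract all visible text from this image."  # hypothetical prompt

# Hypothetical model call: (image_bytes, prompt, json_mode) -> response text.
GenerateFn = Callable[[bytes, str, bool], str]


def resolve_video_model(get_vlm_model: Callable[[], Optional[str]]) -> str:
    """Use the configured VLM model if there is one, else the default."""
    model = get_vlm_model()
    return model if model else DEFAULT_VIDEO_MODEL


def describe_image_bytes(
    image_bytes: bytes,
    prompt: str,
    *,
    generate: GenerateFn,
    json_mode: bool = False,
) -> str:
    """Describe an image; json_mode asks the model for a JSON response."""
    return generate(image_bytes, prompt, json_mode)


def describe_image_ocr(image_bytes: bytes, *, generate: GenerateFn) -> str:
    """Thin wrapper: same call path, fixed OCR prompt, json_mode assumed on."""
    return describe_image_bytes(
        image_bytes, OCR_PROMPT, generate=generate, json_mode=True
    )
```

Keeping the OCR path as a wrapper means there is one place that talks to the model, so error handling and JSON parsing do not need to be duplicated.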
Closes #155
What changed
- perform_ocr
- understand_video
- InternalActionInterface
- opencv-python-headless for keyframe extraction
- from __future__ import annotations
Verification
Ran:
PYTHONPATH=. python3 -m pytest tests/test_step1_vlm_interface.py tests/test_step2_internal_action_interface.py -v --tb=short
PYTHONPATH=. python3 -m pytest tests/test_step3_perform_ocr_action.py -v --tb=short
PYTHONPATH=. python3 -m pytest tests/test_step4_understand_video_action.py -v --tb=short
Result: