
[V1.2.3] Add dedicated OCR and video analysis actions (#155)#196

Open
AlanAAG wants to merge 6 commits into dev from feature/ocr-video-actions

Conversation


AlanAAG (Collaborator) commented Apr 15, 2026

Closes #155

What changed

  • Added dedicated OCR support via perform_ocr
  • Added video analysis support via understand_video
  • Extended VLM interface for OCR and multi-frame video understanding
  • Added bridge methods in InternalActionInterface
  • Added action-layer implementations for OCR and video
  • Added tests for VLM, bridge, OCR action, and video action
  • Added opencv-python-headless for keyframe extraction
  • Added Python 3.9 compatibility fixes with from __future__ import annotations
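To illustrate the last bullet: on Python 3.9, the X | Y union syntax raises a TypeError when evaluated in annotations, and from __future__ import annotations avoids that by storing annotations as strings (PEP 563). A minimal sketch, with a hypothetical signature that is not the PR's actual perform_ocr interface:

```python
from __future__ import annotations  # PEP 563: annotations become lazy strings

# Hypothetical signature for illustration only; the real perform_ocr action
# may take different parameters. Without the future import, "list[str] | None"
# would raise TypeError at definition time on Python 3.9, since runtime
# support for the | union syntax only arrived in 3.10.
def perform_ocr(image_bytes: bytes, languages: list[str] | None = None) -> str:
    return "ocr:%d:%s" % (len(image_bytes), languages or [])
```

With the future import, the annotation is never evaluated, so the same source runs unchanged on 3.9 and later.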

Verification

Ran:

  • PYTHONPATH=. python3 -m pytest tests/test_step1_vlm_interface.py tests/test_step2_internal_action_interface.py -v --tb=short
  • PYTHONPATH=. python3 -m pytest tests/test_step3_perform_ocr_action.py -v --tb=short
  • PYTHONPATH=. python3 -m pytest tests/test_step4_understand_video_action.py -v --tb=short

Result:

  • 59/59 tests passing

AlanAAG requested review from ahmad-ajmal and zfoong on Apr 15, 2026 at 09:32

zfoong (Collaborator) commented Apr 16, 2026

The current approach for video understanding can be improved. Currently, only Google's model/API supports video understanding (Qwen and Moonshot do too, but we haven't set them up as providers yet).

Here is what you can do for now: if the user has a Google API key set up, use the Gemini API to perform video understanding; if not, the agent can fall back to your current approach (which may be more expensive and less effective). You can refer to the generate_image action to see how this is done.
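That routing could be sketched roughly as follows; the helper names here are placeholders standing in for the real provider calls, not the repository's actual functions:

```python
import os

# Hypothetical stand-ins for the real implementations; names are assumptions.
def _understand_video_gemini(video_path: str, prompt: str) -> str:
    return f"gemini:{prompt}"        # native Gemini video understanding

def _describe_keyframes_with_vlm(video_path: str, prompt: str) -> str:
    return f"keyframes:{prompt}"     # opencv keyframe extraction + per-frame VLM

def understand_video(video_path: str, prompt: str) -> str:
    """Prefer Gemini's native video understanding when a Google API key is
    configured; otherwise fall back to the keyframe-based approach."""
    if os.environ.get("GOOGLE_API_KEY"):
        return _understand_video_gemini(video_path, prompt)
    return _describe_keyframes_with_vlm(video_path, prompt)
```

The key check mirrors how provider selection is gated on available credentials elsewhere in the codebase (e.g. the generate_image action mentioned above).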

I haven't tested the PR yet. Once this is improved, I will test the PR.

ahmad-ajmal (Collaborator) commented:

  1. generate_multimodal_multi_image looks like it could be folded into generate_multimodal; the only difference is that it accepts a list of images instead of one. Could we update generate_multimodal to handle both cases? That way we avoid the duplicated payload/token logic.
  2. describe_image_ocr seems to repeat most of describe_image_bytes (provider routing, token counting, cleanup). Would it work to call describe_image_bytes with the OCR system prompt and json_mode=False?
  3. The model is hardcoded as "gemini-1.5-pro" in understand_video.py, should this pull from config (e.g. get_vlm_model()) to stay in sync with the rest of the VLM settings?
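The first suggestion could look roughly like this; the parameter names (image_bytes, image_bytes_list) and return shape are assumptions for illustration, not the actual interface:

```python
from typing import List, Optional

# Sketch of review point 1: a single entry point that normalizes one image
# and a list of images, so the payload/token logic exists only once.
def generate_multimodal(prompt: str,
                        image_bytes: Optional[bytes] = None,
                        image_bytes_list: Optional[List[bytes]] = None) -> dict:
    images = list(image_bytes_list or [])
    if image_bytes is not None:
        images.insert(0, image_bytes)
    # Shared payload construction; provider-specific encoding omitted.
    parts = [{"text": prompt}] + [{"inline_image": img} for img in images]
    return {"parts": parts, "image_count": len(images)}
```

Single-image callers keep working unchanged, while multi-image callers pass image_bytes_list, so generate_multimodal_multi_image can be retired.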

…model

- Merge generate_multimodal_multi_image into generate_multimodal (image_bytes_list param)
- Add json_mode param to describe_image_bytes; describe_image_ocr now a thin wrapper
- understand_video pulls model from get_vlm_model() with gemini-1.5-pro fallback
- Add test suites: gemini_client_multimodal, vlm_json_mode, ocr_wrapper, understand_video_model
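The thin-wrapper and config-fallback changes listed above can be sketched as follows; the helper bodies, the OCR prompt wording, and the config lookup are assumptions, not the PR's actual code:

```python
import os

OCR_SYSTEM_PROMPT = "Extract all visible text from the image, verbatim."  # assumed wording

def get_vlm_model() -> str:
    # Hypothetical config accessor: prefer the configured VLM model,
    # falling back to gemini-1.5-pro as the commit message describes.
    return os.environ.get("VLM_MODEL") or "gemini-1.5-pro"

def describe_image_bytes(image_bytes: bytes, system_prompt: str = "",
                         json_mode: bool = True) -> str:
    # Provider routing / token counting / cleanup would live here, once.
    mode = "json" if json_mode else "text"
    return f"{mode}:{len(image_bytes)}:{system_prompt}"

def describe_image_ocr(image_bytes: bytes) -> str:
    # Thin wrapper: reuse describe_image_bytes with the OCR prompt and
    # plain-text output, per review point 2.
    return describe_image_bytes(image_bytes,
                                system_prompt=OCR_SYSTEM_PROMPT,
                                json_mode=False)
```

This keeps a single code path for provider routing while letting the OCR action differ only in prompt and output mode.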
