
[V1.2.3] Add dedicated OCR and video analysis actions (#155)#196

Open
AlanAAG wants to merge 6 commits into dev from feature/ocr-video-actions

Conversation


AlanAAG (Collaborator) commented Apr 15, 2026

Closes #155

What changed

  • Added dedicated OCR support via perform_ocr
  • Added video analysis support via understand_video
  • Extended VLM interface for OCR and multi-frame video understanding
  • Added bridge methods in InternalActionInterface
  • Added action-layer implementations for OCR and video
  • Added tests for VLM, bridge, OCR action, and video action
  • Added opencv-python-headless for keyframe extraction
  • Added Python 3.9 compatibility fixes with from __future__ import annotations
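To illustrate the last bullet: on Python 3.9, the X | Y union syntax raises a TypeError when evaluated in annotations, and from __future__ import annotations avoids that by storing annotations as strings (PEP 563). A minimal sketch, with a hypothetical signature that is not the PR's actual perform_ocr interface:

```python
from __future__ import annotations  # PEP 563: annotations become lazy strings

# Hypothetical signature for illustration only; the real perform_ocr action
# may take different parameters. Without the future import, "list[str] | None"
# would raise TypeError at definition time on Python 3.9, since runtime
# support for the | union syntax only arrived in 3.10.
def perform_ocr(image_bytes: bytes, languages: list[str] | None = None) -> str:
    return "ocr:%d:%s" % (len(image_bytes), languages or [])
```

With the future import, the annotation is never evaluated, so the same source runs unchanged on 3.9 and later.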

Verification

Ran:

  • PYTHONPATH=. python3 -m pytest tests/test_step1_vlm_interface.py tests/test_step2_internal_action_interface.py -v --tb=short
  • PYTHONPATH=. python3 -m pytest tests/test_step3_perform_ocr_action.py -v --tb=short
  • PYTHONPATH=. python3 -m pytest tests/test_step4_understand_video_action.py -v --tb=short

Result:

  • 59/59 tests passing

AlanAAG requested review from ahmad-ajmal and zfoong on Apr 15, 2026 at 09:32

zfoong (Collaborator) commented Apr 16, 2026

The current approach for video understanding can be improved. Currently, only Google's model/API supports video understanding (Qwen and Moonshot do too, but we haven't set them up as providers yet).

Here is what you can do for now: if the user has a Google API key set up, use the Gemini API to perform video understanding; if not, the agent can fall back to your current approach (which may be more expensive and less effective). You can refer to the generate_image action to see how this is done.
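That routing could be sketched roughly as follows; the helper names here are placeholders standing in for the real provider calls, not the repository's actual functions:

```python
import os

# Hypothetical stand-ins for the real implementations; names are assumptions.
def _understand_video_gemini(video_path: str, prompt: str) -> str:
    return f"gemini:{prompt}"        # native Gemini video understanding

def _describe_keyframes_with_vlm(video_path: str, prompt: str) -> str:
    return f"keyframes:{prompt}"     # opencv keyframe extraction + per-frame VLM

def understand_video(video_path: str, prompt: str) -> str:
    """Prefer Gemini's native video understanding when a Google API key is
    configured; otherwise fall back to the keyframe-based approach."""
    if os.environ.get("GOOGLE_API_KEY"):
        return _understand_video_gemini(video_path, prompt)
    return _describe_keyframes_with_vlm(video_path, prompt)
```

The key check mirrors how provider selection is gated on available credentials elsewhere in the codebase (e.g. the generate_image action mentioned above).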

I haven't tested the PR yet. Once this is improved, I will test the PR.

ahmad-ajmal (Collaborator) commented:

  1. generate_multimodal_multi_image looks like it could be folded into generate_multimodal; the only difference is that it accepts a list of images instead of one. Could we update generate_multimodal to handle both cases? That way we avoid the duplicated payload/token logic.
  2. describe_image_ocr seems to repeat most of describe_image_bytes (provider routing, token counting, cleanup). Would it work to call describe_image_bytes with the OCR system prompt and json_mode=False?
  3. The model is hardcoded as "gemini-1.5-pro" in understand_video.py, should this pull from config (e.g. get_vlm_model()) to stay in sync with the rest of the VLM settings?
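The first suggestion could look roughly like this; the parameter names (image_bytes, image_bytes_list) and return shape are assumptions for illustration, not the actual interface:

```python
from typing import List, Optional

# Sketch of review point 1: a single entry point that normalizes one image
# and a list of images, so the payload/token logic exists only once.
def generate_multimodal(prompt: str,
                        image_bytes: Optional[bytes] = None,
                        image_bytes_list: Optional[List[bytes]] = None) -> dict:
    images = list(image_bytes_list or [])
    if image_bytes is not None:
        images.insert(0, image_bytes)
    # Shared payload construction; provider-specific encoding omitted.
    parts = [{"text": prompt}] + [{"inline_image": img} for img in images]
    return {"parts": parts, "image_count": len(images)}
```

Single-image callers keep working unchanged, while multi-image callers pass image_bytes_list, so generate_multimodal_multi_image can be retired.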

…model

- Merge generate_multimodal_multi_image into generate_multimodal (image_bytes_list param)
- Add json_mode param to describe_image_bytes; describe_image_ocr now a thin wrapper
- understand_video pulls model from get_vlm_model() with gemini-1.5-pro fallback
- Add test suites: gemini_client_multimodal, vlm_json_mode, ocr_wrapper, understand_video_model
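The thin-wrapper and config-fallback changes listed above can be sketched as follows; the helper bodies, the OCR prompt wording, and the config lookup are assumptions, not the PR's actual code:

```python
import os

OCR_SYSTEM_PROMPT = "Extract all visible text from the image, verbatim."  # assumed wording

def get_vlm_model() -> str:
    # Hypothetical config accessor: prefer the configured VLM model,
    # falling back to gemini-1.5-pro as the commit message describes.
    return os.environ.get("VLM_MODEL") or "gemini-1.5-pro"

def describe_image_bytes(image_bytes: bytes, system_prompt: str = "",
                         json_mode: bool = True) -> str:
    # Provider routing / token counting / cleanup would live here, once.
    mode = "json" if json_mode else "text"
    return f"{mode}:{len(image_bytes)}:{system_prompt}"

def describe_image_ocr(image_bytes: bytes) -> str:
    # Thin wrapper: reuse describe_image_bytes with the OCR prompt and
    # plain-text output, per review point 2.
    return describe_image_bytes(image_bytes,
                                system_prompt=OCR_SYSTEM_PROMPT,
                                json_mode=False)
```

This keeps a single code path for provider routing while letting the OCR action differ only in prompt and output mode.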
