feat: explore local computer-use agent backend for university workflows #17

Description

@SebastianBoehler

Idea

tue-api-wrapper currently exposes semantic study-system operations through unofficial API and parsing routes. For authenticated university systems where scraping or private endpoint use is brittle or questionable, explore a local computer-use agent backend that keeps the same high-level interface but performs actions through the user's own browser or native UI.

Example user intents:

  • "Search IMA for this course and show whether it conflicts with my schedule."
  • "Build my next-semester timetable from these modules."
  • "Prepare course registration for this seminar, then ask me before submitting."

Working assumption

Do not implement this as a generic screen-clicking bot. Treat computer use as one possible local backend behind existing typed contracts:

  • keep Python/API/server contracts as the semantic layer
  • add a local UI-agent executor only where direct APIs are unavailable or inappropriate
  • prefer deterministic browser/DOM/accessibility state over blind vision coordinates
  • use screenshots for verification and fallback, not as the primary source of truth when structured state exists
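
One way to picture "computer use as one backend behind typed contracts" is the minimal sketch below. Every name in it (`CourseHit`, `CourseBackend`, `DirectParserBackend`, `UiAgentBackend`, the `read_structured_state` callable) is hypothetical and chosen for illustration only; the real tue-api-wrapper contracts may look different.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class CourseHit:
    # Hypothetical normalized result; the existing JSON contract may differ.
    title: str
    time: str
    location: str


class CourseBackend(Protocol):
    """Semantic contract that every backend has to satisfy."""

    def search_courses(self, query: str) -> list[CourseHit]: ...


class DirectParserBackend:
    """Placeholder for the existing wrapper/parser integration."""

    def search_courses(self, query: str) -> list[CourseHit]:
        raise NotImplementedError("wire up the existing parser here")


class UiAgentBackend:
    """Placeholder for a local UI-agent executor.

    `read_structured_state` is an assumed callable that returns rows parsed
    from DOM or accessibility state, not from screenshot coordinates.
    """

    def __init__(self, read_structured_state: Callable[[str], list[dict]]):
        self._read = read_structured_state

    def search_courses(self, query: str) -> list[CourseHit]:
        return [CourseHit(**row) for row in self._read(query)]


def search_courses(backend: CourseBackend, query: str) -> list[CourseHit]:
    # Callers depend only on the contract, never on how the backend acts.
    return backend.search_courses(query)
```

With this shape, swapping the executor would not change what callers of `search_courses` see.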

Relevant prior art

  • farzaa/clicky: public repo appears to be the older open-source version. It uses ScreenCaptureKit screenshots, voice input, Claude vision / Anthropic computer-use-style coordinate grounding, and cursor pointing. The latest commercial Clicky seems to be closed source, so the public repo should not be treated as the full current implementation.
  • jasonkneen/openclicky: closer to the current native-control pattern. It uses native macOS APIs such as ScreenCaptureKit, CGWindowList, CGEvent, Accessibility APIs, local Codex integration, and computer-use backend selection.
  • trycua/cua / cua-driver: strongest reference for background macOS control. It exposes MCP/CLI tools for app/window discovery, window screenshots, AX trees, element-index actions, and pid/window-targeted input without stealing focus.
  • OpenAI computer use / Anthropic computer use: the model loop can suggest actions from screenshots, but the app still needs a reliable local driver to execute those actions and verify their results.

Proposed architecture sketch

  1. Keep the existing user-facing operations as typed contracts, for example search_courses, get_schedule, prepare_registration, submit_registration.
  2. Add a backend selection layer:
    • direct wrapper/parser backend where current integrations are reliable
    • browser automation backend for web portals where DOM state is accessible
    • native computer-use backend for portals or apps that require local authenticated UI interaction
  3. Run all credentialed flows locally:
    • use the user's browser session, local sidecar, or local app runtime
    • do not route student credentials through hosted services
    • avoid the hosted Cloud Run deployment for authenticated university actions
  4. Build portal-specific workflow recipes:
    • IMA/course search
    • ALMA/registration-like flows if applicable
    • timetable extraction and conflict checking
    • confirmation-gated submit flows
  5. Record structured evidence for every step:
    • current URL/app/window target
    • parsed DOM or AX state when available
    • screenshot path or hash for verification
    • extracted fields and confidence
    • explicit user confirmation for write actions
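
Steps 2 and 5 above could be sketched roughly as follows. This is an illustration under assumptions, not a proposed final design: `BackendKind`, `PortalCapabilities`, `select_backend`, `StepEvidence`, and `hash_screenshot` are hypothetical names, and the selection order simply mirrors the order of the bullets in step 2.

```python
import hashlib
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class BackendKind(Enum):
    DIRECT = "direct"    # existing wrapper/parser
    BROWSER = "browser"  # browser automation with DOM access
    NATIVE = "native"    # native computer-use driver


@dataclass(frozen=True)
class PortalCapabilities:
    # Assumed per-portal flags; how they get determined is left open.
    direct_integration_reliable: bool
    dom_accessible: bool


def select_backend(caps: PortalCapabilities) -> BackendKind:
    """Prefer the most structured backend that is available."""
    if caps.direct_integration_reliable:
        return BackendKind.DIRECT
    if caps.dom_accessible:
        return BackendKind.BROWSER
    return BackendKind.NATIVE


@dataclass
class StepEvidence:
    target: str                          # current URL, app, or window
    structured_state: Optional[dict]     # parsed DOM or AX state, if any
    screenshot_sha256: Optional[str]     # hash of the verification screenshot
    extracted: dict = field(default_factory=dict)
    confidence: Optional[float] = None
    user_confirmed: bool = False         # only meaningful for write actions


def hash_screenshot(png_bytes: bytes) -> str:
    """Return a hex SHA-256 digest so the log can reference a screenshot."""
    return hashlib.sha256(png_bytes).hexdigest()
```

Storing a hash rather than the image keeps the action log small while still letting a stored screenshot be matched to its step later.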

First milestone

Build a read-only spike, not a registration bot:

  • command or API route: "search a course in the university portal and return normalized schedule data"
  • user must already be logged in locally, or the workflow pauses for manual login
  • no mock data and no hidden fallback data
  • output should reuse existing JSON contracts where possible
  • verify by comparing extracted fields against visible UI state

Candidate success check:

  • Given a known course query, the local agent opens or reuses the portal session, searches the course, extracts title/time/location/instructor where visible, and returns structured JSON plus evidence of the source screen.
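
The "compare extracted fields against visible UI state" check could start out as simple as the sketch below. It assumes the visible text has already been obtained from DOM or accessibility state; `verify_against_visible_text` and its whitespace and case normalization are illustrative choices, not an existing API.

```python
def _normalize(text: str) -> str:
    # Collapse whitespace and ignore case so layout differences do not matter.
    return " ".join(text.split()).casefold()


def verify_against_visible_text(
    extracted: dict[str, str], visible_text: str
) -> dict[str, bool]:
    """Report, per extracted field, whether its value appears on screen.

    Fields with empty values are skipped, since nothing was extracted for them.
    """
    haystack = _normalize(visible_text)
    return {
        name: _normalize(value) in haystack
        for name, value in extracted.items()
        if value
    }
```

Any field that maps to `False` would be a reason to stop and surface the mismatch rather than return the data.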

Safety and product constraints

  • Require explicit confirmation before any write action: enroll, unregister, submit, send, delete, or modify.
  • Surface portal errors verbatim and stop; do not invent fallback data.
  • Make the action log inspectable before submission.
  • Keep credentialed sessions local to the user's machine.
  • Treat UI automation as potentially subject to university terms. This is not automatically safer than private endpoint access; it should be user-initiated, local, auditable, and limited in scope.
  • Prefer official APIs or documented export routes whenever they exist.
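
The confirmation rule in the first bullet could be enforced at a single choke point, roughly as in this sketch. `PlannedAction`, `ConfirmationRequired`, `run_action`, and the `confirm` / `execute` callables are hypothetical names; the set of write actions is taken from the list above.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Action kinds that change state in the portal, as listed in the constraints.
WRITE_ACTIONS = frozenset(
    {"enroll", "unregister", "submit", "send", "delete", "modify"}
)


class ConfirmationRequired(Exception):
    """Raised when a write action was not explicitly confirmed."""


@dataclass(frozen=True)
class PlannedAction:
    kind: str          # e.g. "search" or "submit"
    description: str   # human-readable text shown to the user


def run_action(
    action: PlannedAction,
    execute: Callable[[PlannedAction], Any],
    confirm: Callable[[PlannedAction], bool],
) -> Any:
    """Execute reads directly; execute writes only after confirmation."""
    if action.kind in WRITE_ACTIONS and not confirm(action):
        raise ConfirmationRequired(f"not confirmed: {action.description}")
    return execute(action)
```

Routing every executor call through one function like this makes it harder for a backend to submit something without the user having seen it.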

Open questions

  • Which portal should be the first target: IMA, ALMA, Moodle/Campus, or timetable export?
  • Is Cua Driver acceptable as an optional local dependency, or should the first spike use Playwright/browser automation only?
  • Should this live behind the existing API server, the Electron sidecar, the CLI, or a separate local-only agent command?
  • What is the minimum consent UX before registration-type actions?
