CatchMe: The Missing Layer
of Personal Agents
March 2026
Data Intelligence Lab, The University of Hong Kong
AI agents are growing remarkably capable. They write code, orchestrate workflows, browse the web, and manage files, all through increasingly powerful tool-use and skill-composition frameworks. Yet amid this rapid progress, a fundamental gap remains: agents know almost nothing about the person they serve.
An agent can call a hundred APIs, but it cannot answer "What was I working on before lunch?" or "When did I last read that arXiv paper?"; it simply has no memory of the user's daily digital life. Skills, tools, and harness architectures can augment what an agent does, but understanding who it works for requires something these systems cannot provide: knowledge of the user's habits, context, and unspoken intent.
We argue that personal memory is the missing infrastructure layer for truly adaptive AI agents. Without it, agents adapt to tasks; with it, agents adapt to people.
The Gap in Current Memory Systems
Agents Operate in CLI; Users Live in GUI
The resurgence of CLI-based agent interaction is quietly reshaping the boundary between human and machine. As agents increasingly operate through terminal commands, code execution, and API calls, the graphical interface is no longer a shared workspace; it is becoming exclusively the user's domain. The tabs you browse, the documents you read, the paragraphs you highlight — this is context that no agent can observe through its own tool-use. And yet it is precisely this context that determines what the user cares about, what they are working on, and what they need help with. Personal GUI activity is, in a real sense, the last mile of information that agents cannot reach on their own.
Statistics Without Semantics
One family of existing approaches focuses on logging which applications you use and for how long. These systems produce useful time-tracking dashboards, but they capture no meaning. You can see that you spent three hours in your code editor — but not what you were editing, why you switched to the browser, or what you found there. The data is flat, aggregated, and disconnected from the semantic fabric of your day. It tells you where your time went, but not what happened.
Data Without Structure
Another family goes to the opposite extreme: continuous screen recording that captures everything on screen, then relies on vector-based retrieval, embedding OCR'd frames or text chunks and matching them against queries by similarity. This approach inherits fundamental limitations: semantic similarity is not the same as relevance; fixed-size chunking breaks contextual integrity; and the infrastructure cost of continuous recording, embedding, and indexing is substantial. You get a haystack of data, but the needle-finding mechanism is blunt.
The Flat-Stream Assumption
Beneath these differences, both paradigms share a deeper structural deficit: they treat the user's digital activity as a flat stream, a chronological sequence of events or frames with no inherent organization. But human computer use is not flat. It has sessions that begin and end, tasks that nest within tasks, focus shifts that carry meaning, and a natural hierarchy of applications, windows, and actions. When you switch from reading a paper to checking Slack and then return, that return carries context — context that a flat log discards entirely. A meaningful memory system should preserve this structure, not flatten it away.
Our thesis: personal memory needs structure, not similarity. Instead of embedding everything into a vector space, we organize raw events into a navigable hierarchy that an LLM can reason over, much like a reader skims a table of contents before diving into the relevant chapter.
CatchMe: Design Overview
CatchMe is an always-on, fully local memory system that runs quietly in the background, captures what you see and do, and organizes everything into a hierarchical activity tree that both humans and AI agents can query in natural language. The design rests on three pillars.
A Living Activity Tree
Rather than logging events as a flat sequence, CatchMe structures your day into a five-level tree (Day → Session → App → Location → Action) that mirrors the natural rhythm of how people use computers. Sessions are split by idle gaps; within each session, activity is grouped by application, then by the specific tab, file, or window you were focused on, and finally by the individual actions (typing, clicking, scrolling) you performed there.
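The nesting described above can be sketched as a small routine that folds a raw event stream into Day → Session → App → Location levels, splitting sessions on idle gaps. The field names, the dict-of-dicts shape, and the 300-second threshold are illustrative assumptions, not CatchMe's actual schema.

```python
import time

def build_tree(events, idle_gap=300):
    """Nest events into Day -> Session -> App -> Location -> action lists.
    A gap longer than `idle_gap` seconds opens a new session."""
    tree = {}
    last_ts, session_id = None, -1
    for ev in sorted(events, key=lambda e: e["ts"]):
        day = time.strftime("%Y-%m-%d", time.localtime(ev["ts"]))
        if last_ts is None or ev["ts"] - last_ts > idle_gap:
            session_id += 1  # idle gap detected: start a new session
        last_ts = ev["ts"]
        (tree.setdefault(day, {})
             .setdefault(session_id, {})
             .setdefault(ev["app"], {})
             .setdefault(ev["location"], [])
             .append(ev["action"]))
    return tree
```

Grouping is purely positional here; richer policies (e.g. merging a return to the same tab) layer on top of the same skeleton.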
The tree's structural hierarchy is built by deterministic rules (window switches, idle timeouts, temporal clustering) with no LLM involved in deciding where a node belongs. But the content of each node is not raw event data: as events accumulate within a node, an LLM summarizes them into concise natural-language descriptions. These summaries propagate bottom-up: actions are summarized first, then parent locations, then sessions, so that every level of the tree carries semantically meaningful information, not just structural labels.
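The bottom-up propagation amounts to a post-order traversal in which each node's summary is produced from its children's summaries. A minimal sketch, with a plain function standing in for the LLM call and an assumed `label`/`children` node shape:

```python
def summarize_bottom_up(node, summarize):
    """Post-order walk: leaves are summarized first, then each parent's
    summary is built from its children's summaries. `summarize` stands
    in for the LLM call."""
    child_summaries = [summarize_bottom_up(c, summarize)
                       for c in node.get("children", [])]
    node["summary"] = summarize(node["label"], child_summaries)
    return node["summary"]
```

Because the walk is strictly post-order, every level of the tree ends up carrying semantic content without any node ever summarizing raw events it has not yet seen condensed.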
Crucially, this entire process is live and incremental. The tree does not get rebuilt from scratch at the end of the day. As you work, new events continuously flow in; the organizer extends the current session or opens a new one in real time; and once a node is "closed" (e.g., you switched away from an app), it is immediately enqueued for LLM summarization. The tree is always growing, always being enriched, a living document of your day that stays current minute by minute.
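The close-then-enqueue behavior can be sketched as a tiny organizer: a window switch closes the currently open App node and queues it for summarization, while new events keep extending whatever is open. Class and field names are assumptions for illustration, not CatchMe's actual API.

```python
from collections import deque

class Organizer:
    """Minimal live organizer: switching apps closes the open node and
    enqueues it for background LLM summarization."""
    def __init__(self):
        self.open_app = None
        self.summary_queue = deque()  # closed nodes awaiting summarization

    def ingest(self, event):
        if self.open_app and event["app"] != self.open_app["name"]:
            self.summary_queue.append(self.open_app)  # node closed
            self.open_app = None
        if self.open_app is None:
            self.open_app = {"name": event["app"], "events": []}
        self.open_app["events"].append(event)
```

A background worker would drain `summary_queue`, so summarization latency never blocks event capture.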
Fig 1. A snapshot of the activity tree. Gray nodes are completed and summarized; the orange dashed branch is the currently active session, growing as new events arrive.
Tree Reasoning for Fragmented Personal Data
Personal activity data is inherently fragmented (short bursts of typing, scattered clicks, fleeting tab switches) and buried in noise. Embedding-based retrieval struggles here: the chunks are too small and too heterogeneous to produce meaningful similarity signals. A keystroke sequence and the screenshot it relates to live in entirely different modalities; a 3-second glance at Slack has no semantic overlap with the deep-work session it interrupted.
CatchMe takes a different path: LLM-based reasoning over the tree structure. The hierarchy gives the LLM a reliable navigation scaffold, showing it which day, which session, which app to look in, while the LLM's reasoning filters out irrelevant noise within each branch, extracting only the details that matter. The tree ensures the search direction is right; the LLM ensures the signal-to-noise ratio is high.
This is a test-time approach: there is no pre-built knowledge graph or embedding index. The LLM reasons on demand, scoping by time, browsing the tree's table of contents, drilling into promising branches, and inspecting raw evidence (text or screenshots via vision) only when needed. If the context is insufficient, it iterates. This makes the system robust to the messiness of real-world activity data in a way that static retrieval cannot match.
Critically, the LLM is not limited to a single path through the tree. At every level (days, sessions, apps, locations) it can select multiple nodes simultaneously. When it reaches a Location, the system can read the file content directly from disk and inspect underlying action evidence: keystrokes, mouse events, and screenshots via vision. This multi-branch, multi-level traversal enables cross-temporal retrieval: evidence scattered across different days and different applications is collected in a single reasoning pass and synthesized into one coherent answer.
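The multi-branch, multi-level traversal reduces to a frontier loop in which the LLM may keep several children at every level and evidence is collected from all surviving leaves in one pass. Here `llm_select` and `inspect` are stand-ins for the model's branch choices and for reading raw evidence; neither is CatchMe's real interface.

```python
def retrieve(root, query, llm_select, inspect):
    """Breadth-wise traversal: at each level the LLM keeps any number of
    branches; all reached leaves contribute evidence to one answer."""
    frontier = [root]
    evidence = []
    while frontier:
        next_frontier = []
        for node in frontier:
            if not node.get("children"):       # leaf: collect raw evidence
                evidence.append(inspect(node))
            else:                              # interior: LLM picks children
                next_frontier.extend(llm_select(query, node["children"]))
        frontier = next_frontier
    return evidence
```

Because the frontier is a list rather than a single node, evidence scattered across different days and applications falls out of the same loop with no special cross-temporal machinery.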
Fig 2. Retrieval walkthrough for "What did I change in retrieve.py this week?" The LLM scopes to relevant days (steps 1–2), drills to matching Cursor locations (3–4), and inspects file content and action evidence across two days to produce a single answer (5).
Plugging into the Agent Ecosystem
Modern agent frameworks are rich with capabilities. Tools let agents execute shell commands, browse the web, and manipulate files. Skills package reusable workflows: code generation, search, browser automation. Knowledge bases provide RAG over documents, embeddings, and conversation history. Together, these give an agent a powerful repertoire of what it can do and what it can know.
But there is a category of context that none of these provide: who the agent is working for. No tool can tell the agent what you were reading ten minutes ago. No skill can retrieve which files you had open this morning. No knowledge base contains the fleeting Slack message you glanced at before switching to your editor. This personal, real-time, GUI-level context — the texture of your actual digital life — is invisible to every standard agent capability.
CatchMe fills this gap. It is an agent-native CLI tool designed to plug directly into any framework (Cursor, Claude Code, OpenClaw, or any other harness). The integration surface is a single command that any agent can invoke as a skill. When a user sends a request, the agent calls CatchMe to retrieve personal context — what the user was doing, reading, and writing — and folds that understanding into its reasoning. The agent doesn't need to understand trees or retrieval pipelines; it asks a question in natural language and gets a grounded answer.
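From the agent's side, the integration is just a subprocess call. A sketch of such a wrapper, where the command name `catchme` and its single-argument shape are assumptions about the CLI surface:

```python
import subprocess

def personal_context(question, binary="catchme"):
    """Shell out to the memory CLI and return its natural-language answer.
    The binary name and argument shape are assumed, not the real interface."""
    result = subprocess.run(
        [binary, question], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

An agent harness would register this as a skill and call it with questions like "What was the user editing this morning?", folding the returned text into its prompt.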
Wherever there are people, there is GUI interaction; and wherever there is GUI interaction, there is personal context waiting to be understood. By fitting into the agent loop as a lightweight plugin, CatchMe makes it possible for any workflow, any application, any agent to draw on that context — transforming agents that adapt to tasks into agents that adapt to people.
Fig 3. CatchMe in the agent stack. Existing capabilities (tools, skills, knowledge) sit inside the framework; CatchMe sits below, passively capturing GUI activity and feeding personal context upward via a single CLI call.
Bridging the Gap
Mouse-Event-Driven Screenshots
Continuous screen recording is expensive and generates enormous amounts of redundant data. CatchMe takes a different approach: screenshots are triggered by user actions. Every mouse click captures a full-screen image annotated with a crosshair, and every scroll session captures start and end frames. Each screenshot is stored as a full-screen overview and a zoomed detail crop around the action point. This makes every captured frame meaningful: it corresponds to a moment the user actively engaged with the screen, not an idle background frame.
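The zoomed detail crop is a small geometry computation: center a fixed-size box on the click point and clamp it to the screen. The 400-pixel crop size is an illustrative choice, not a documented CatchMe parameter.

```python
def detail_crop(x, y, screen_w, screen_h, size=400):
    """Return (left, top, right, bottom) for a size x size crop centered
    on the action point, clamped so it never leaves the screen."""
    half = size // 2
    left = min(max(x - half, 0), screen_w - size)
    top = min(max(y - half, 0), screen_h - size)
    return (left, top, left + size, top + size)
```

The full-screen overview and this crop together give the summarizer both context and legible detail from a single capture event.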
Time-Interval Clustering for Noise Reduction
Raw events (individual keystrokes, mouse movements, clipboard changes) are noisy and voluminous. CatchMe clusters them into coherent action units based on temporal proximity: events that occur close together within the same location are grouped into a single Action node. LLM summarization operates on these clusters as the minimal unit, not on individual events. This dramatically reduces both noise and cost: the system summarizes meaningful activity chunks, not raw keystroke logs.
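Temporal clustering of this kind is a one-pass grouping: an event joins the current cluster if it arrives within the gap threshold, and otherwise starts a new Action unit. The 2-second gap is an assumed value for illustration.

```python
ACTION_GAP_S = 2.0  # hypothetical gap that separates action units

def cluster_actions(events, gap=ACTION_GAP_S):
    """Group (timestamp, payload) events into Action units by temporal
    proximity; summarization later runs per cluster, not per event."""
    clusters = []
    for ts, payload in sorted(events):
        if clusters and ts - clusters[-1][-1][0] <= gap:
            clusters[-1].append((ts, payload))  # extend the current unit
        else:
            clusters.append([(ts, payload)])    # gap exceeded: new unit
    return clusters
```

With clusters as the minimal summarization unit, a burst of fifty keystrokes costs one LLM call instead of fifty.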
Preserving GUI Semantics in Tree Structure
The tree hierarchy naturally encodes relationships that flat event streams lose: which application was being used, within which file or URL, during which work session. The App → Location → Action nesting captures the topology of GUI use (the fact that you had three tabs open in Chrome, were editing two files in Cursor, and briefly checked Slack in between) as first-class structural information, not metadata to be reconstructed at query time.
Brief Capture for Transient Context Switches
Users frequently switch away from their primary window for a few seconds: checking a notification, glancing at a reference, copying a value. CatchMe's organizer detects these short-lived focus events and records them as briefs attached to the parent window span, rather than promoting them to full tree nodes. This preserves the relationship between a fleeting glance and the main task without polluting the tree structure with noise.
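The brief-versus-full distinction can be sketched as a single threshold decision: short focus spans are folded into the preceding full span rather than becoming nodes. The 5-second cutoff and the list-based span shape are illustrative assumptions.

```python
BRIEF_MAX_S = 5.0  # hypothetical cutoff for a "brief" focus event

def attach_briefs(spans):
    """Fold short-lived (app, duration) focus spans into the preceding
    full span as briefs instead of promoting them to tree nodes."""
    full = []
    for app, duration in spans:
        if duration < BRIEF_MAX_S and full:
            full[-1][2].append(app)            # record as a brief
        else:
            full.append([app, duration, []])   # promote to a full node
    return full
```

The fleeting Slack glance thus stays queryable ("what interrupted me?") without inflating the tree.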
Rule-Based Time Filtering
Before any LLM call, the retrieval pipeline applies deterministic time-range filtering: temporal cues in the query ("this morning", "last Tuesday afternoon") are parsed and used to prune the tree down to only the relevant day and session nodes. This rule-based first pass eliminates the majority of the search space at zero cost, ensuring that the more expensive LLM reasoning operates on a tightly scoped context.
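A deterministic first pass of this kind maps a handful of temporal phrases to datetime ranges before any model is invoked. A minimal sketch; the phrase set and the morning boundaries (06:00–12:00) are illustrative assumptions, and a real parser covers far more phrasings.

```python
from datetime import datetime, time, timedelta

def parse_time_cue(cue, now):
    """Map a temporal phrase to a (start, end) range, or None if no cue
    is recognized, in which case the full tree remains in scope."""
    day = now.date()
    if cue == "this morning":
        return (datetime.combine(day, time(6)), datetime.combine(day, time(12)))
    if cue == "yesterday":
        y = day - timedelta(days=1)
        return (datetime.combine(y, time(0)), datetime.combine(day, time(0)))
    return None  # no deterministic match: fall back to LLM scoping
```

Any Day or Session node whose span falls outside the returned range is pruned before the LLM ever sees the tree.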
Test-Time Reasoning Instead of Pre-Built Indices
Unlike systems that construct a knowledge graph or embedding index at ingestion time, CatchMe defers all semantic reasoning to query time. The tree is structural scaffolding; meaning is extracted on demand by the LLM navigating it. This has a practical advantage: there is no stale index to maintain, no re-embedding when the model improves, and no upfront cost proportional to data volume. The system scales with query count, not data size.
Conclusion
The next generation of personal AI agents will be defined not by what tools they can use, but by how well they understand the person using them. CatchMe contributes to this vision by providing an always-on, lightweight, fully local memory layer that captures the user's digital footprint and organizes it into a structure that LLMs can reason over.
The core insight is that personal memory is a navigation problem, not a search problem. By structuring activity into a hierarchical tree and using LLM-based reasoning to traverse it, CatchMe achieves grounded, context-aware retrieval without vector databases, embedding pipelines, or pre-built knowledge graphs — all in ~200 MB of RAM, running entirely on the user's machine.
Your computer should remember everything. Now it can.
References
- HKUDS, "LightRAG: Simple and Fast Retrieval-Augmented Generation," 2024. github.com/HKUDS/LightRAG
- VectifyAI, "PageIndex: Vectorless, Reasoning-based RAG," 2025. github.com/VectifyAI/PageIndex
- Alibaba, "PageAgent: In-page GUI Agent for Web Interfaces," 2025. github.com/alibaba/page-agent
- Volcengine, "MineContext: Proactive Context-Aware AI Partner," 2025. github.com/volcengine/MineContext
- HKUDS, "nanobot: Ultra-Lightweight Personal AI Assistant," 2025. github.com/HKUDS/nanobot
- ActivityWatch Contributors, "ActivityWatch: Open-Source Automated Time Tracker," 2024. github.com/ActivityWatch/activitywatch
- Screenpipe Contributors, "Screenpipe: AI Memory for Your Screen," 2024. github.com/mediar-ai/screenpipe
- Yuka-friends, "Windrecorder: Personal Memory Search Engine for Windows," 2024. github.com/yuka-friends/Windrecorder
- Arkohut, "Pensieve: Privacy-First Passive Recording," 2024. github.com/arkohut/pensieve
- OpenRecall Contributors, "OpenRecall: Open-Source Digital Memory," 2024. github.com/openrecall/openrecall
- Gurgeh, "Selfspy: Daemon for Recording Computer Activity," GitHub. github.com/gurgeh/selfspy