Add basic page observation infrastructure #164
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -83,6 +83,8 @@ p + dl.props { margin-top: -0.5em; } | |
| <pre class="link-defaults"> | ||
| spec:html; type:dfn; | ||
| text:form-associated element | ||
| text:browsing context group set | ||
| text:unique internal value | ||
| </pre> | ||
|
|
||
| <h2 id="intro">Introduction</h2> | ||
|
|
@@ -439,6 +441,94 @@ The <dfn>synthesize a declarative JSON Schema object algorithm</dfn>, given a <{ | |
| } | ||
| </pre> | ||
|
|
||
| <h2 id="interaction-with-agents">Interaction with agents</h2> | ||
|
|
||
| <h3 id="event-loop">Event loop integration</h3> | ||
|
|
||
| A web site's functionality is exposed to [=agents=] as tools that live in a [=Document=]'s [=event | ||
| loop=] and are registered with the APIs in this specification. | ||
|
|
||
| The [=user agent=]'s [=browser agent=] runs [=in parallel=] to any [=event loops=] associated | ||
| with a {{ModelContext}} [=relevant global object=]. Steps running on the [=browser agent=] get | ||
| queued on its <dfn>AI agent queue</dfn>, which is the result of [=starting a new parallel queue=]. | ||
|
|
||
| Conversely, steps queued *from* the [=browser agent=] onto the [=event loop=] of a given | ||
|
domfarolino marked this conversation as resolved.
|
||
| {{ModelContext}} object (i.e., the "main thread" where JavaScript runs) are queued on its [=relevant | ||
| global object=]'s [=AI task source=]. | ||
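The direction of queuing described above can be illustrated with a small single-threaded model. This is only a sketch: `StepQueue`, `aiAgentQueue`, `aiTaskSource`, and `log` are hypothetical names chosen for illustration, and a real [=parallel queue=] runs its steps in parallel with the event loop, which this model does not attempt to reproduce. It only shows that browser-agent steps go through one queue and that work destined for the page is queued back onto the document's side.

```typescript
// A toy FIFO queue of steps. It is NOT a real parallel queue or task source;
// it only models ordering within a single thread.
type Steps = () => void;

class StepQueue {
  private pending: Steps[] = [];

  enqueue(steps: Steps): void {
    this.pending.push(steps);
  }

  // Runs queued steps in FIFO order until the queue is empty, including
  // any steps that were enqueued while draining.
  drain(): void {
    while (this.pending.length > 0) {
      const steps = this.pending.shift();
      if (steps) {
        steps();
      }
    }
  }
}

const log: string[] = [];

// Hypothetical stand-ins for the browser agent's AI agent queue and for
// tasks queued on a document's AI task source.
const aiAgentQueue = new StepQueue();
const aiTaskSource = new StepQueue();

aiAgentQueue.enqueue(() => {
  log.push("browser agent: step on the AI agent queue");
  // Work that must touch the page is queued back onto the event loop side.
  aiTaskSource.enqueue(() => {
    log.push("event loop: task from the AI task source");
  });
});

aiAgentQueue.drain();
aiTaskSource.drain();
```

After both queues drain, `log` holds the browser-agent entry followed by the event-loop entry, mirroring the two directions the text describes.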
|
|
||
| <h3 id="observations">Page observations</h3> | ||
|
|
||
| In-page [=agents=] implemented in JavaScript can "observe" the tools that a page offers by using the | ||
|
domfarolino marked this conversation as resolved.
|
||
| {{ModelContext}} APIs directly, and any other platform APIs to obtain necessary context about the | ||
| page in order to actuate it appropriately. | ||
|
|
||
| The [=browser agent=], on the other hand, does not run JavaScript on the page. Instead, it obtains a | ||
| view of the page's tools and any other relevant context by getting an [=observation=]. An | ||
| <dfn>observation</dfn> is an [=implementation-defined=] data structure containing at least a <dfn | ||
| for=observation>tool map</dfn>, which is a [=map=] whose [=map/keys=] are [=Document/unique ID=]s, | ||
| and whose [=map/values=] are [=lists=] of [=tool definitions=]. | ||
|
Collaborator
I would think this needs to be doubly keyed by (document_id, tool_name) at a minimum?
Collaborator
Author
That'd only be necessary if within a single Document, you could have multiple tools with the same name. But that shouldn't be possible, so keying on document ID means all of the tools associated with that document will be unique. Across multiple Documents, you can have duplicate tool names, but that should be taken care of. Does that sound right or am I missing something? |
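The keying question in this thread can be made concrete with a small sketch. It assumes, as the reply does, that tool names are unique within one document; `ToolDefinition`, `ObservationToolMap`, `duplicateNamesWithinDocument`, `findTool`, and the `doc-1`/`doc-2` IDs are all hypothetical and are not taken from the specification.

```typescript
// Hypothetical shapes; the specification leaves the observation
// implementation-defined beyond requiring a tool map.
interface ToolDefinition {
  name: string;
  description: string;
}

// Keys model a Document's unique ID; values are that document's tools.
type ObservationToolMap = Record<string, ToolDefinition[]>;

// Returns the names that appear more than once within one document's tools.
// An empty result means keying by document ID alone is unambiguous.
function duplicateNamesWithinDocument(tools: ToolDefinition[]): string[] {
  const counts: Record<string, number> = Object.create(null);
  for (const tool of tools) {
    counts[tool.name] = (counts[tool.name] || 0) + 1;
  }
  return Object.keys(counts).filter((name) => (counts[name] || 0) > 1);
}

// Resolves a (document ID, tool name) pair against the singly keyed map.
function findTool(
  map: ObservationToolMap,
  documentId: string,
  name: string
): ToolDefinition | undefined {
  const tools = map[documentId] || [];
  for (const tool of tools) {
    if (tool.name === name) {
      return tool;
    }
  }
  return undefined;
}

// Two documents may both offer a tool called "search" without colliding.
const exampleToolMap: ObservationToolMap = {
  "doc-1": [{ name: "search", description: "Search the top-level page" }],
  "doc-2": [{ name: "search", description: "Search inside an embedded frame" }],
};
```

Under that uniqueness assumption, a lookup by document ID followed by a scan for the name gives the same result a doubly keyed map would, so the second key adds nothing unless duplicate names within a document become possible.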
||
|
|
||
| Note: An [=observation=] is usually a "snapshot" distillation of a page being presented to the user, | ||
| along with any other state the [=user agent=] believes is relevant for the [=browser agent=]; this | ||
| often includes screenshots of the page, not just a DOM serialization. See [Annotated Page Content | ||
| (APC)](https://chromium.googlesource.com/chromium/src.git/+/main/third_party/blink/renderer/modules/content_extraction/readme.md) | ||
| in the Chromium project for an example of what might contribute to an observation. | ||
|
|
||
| <hr> | ||
|
|
||
| <div algorithm> | ||
| To <dfn>perform an observation</dfn> given a [=top-level traversable=] |traversable|, run these | ||
|
domfarolino marked this conversation as resolved.
|
||
| steps: | ||
|
|
||
| 1. [=Assert=]: This algorithm is running in the [=browser agent=]'s [=AI agent queue=]. | ||
|
|
||
| 1. [=Assert=]: |traversable|'s [=navigable/active document=] is [=Document/fully active=]. | ||
|
|
||
| 1. Let |observation| be a new [=observation=]. | ||
|
|
||
| 1. Let |flat descendants| be the [=Document/inclusive descendant navigables=] of |traversable|'s | ||
| [=navigable/active document=]. | ||
|
|
||
| 1. [=list/For each=] [=navigable=] |descendant| of |flat descendants|: | ||
|
|
||
| 1. Let |document| be |descendant|'s [=navigable/active document=]. | ||
|
|
||
| 1. Let |id| be |document|'s [=Document/unique ID=]. | ||
|
|
||
| 1. Set |observation|'s [=observation/tool map=][|id|] = |document|'s [=relevant global | ||
| object=]'s {{Navigator}}'s [=Navigator/modelContext=]'s [=ModelContext/internal context=]'s | ||
| [=model context/tool map=]'s [=map/values=], which are [=tool definitions=]. | ||
|
|
||
| 1. Perform any [=implementation-defined=] steps to add anything to |observation| that the [=user | ||
| agent=] might deem useful or necessary, besides just populating the [=observation/tool map=]. | ||
| This might include annotated screenshots of the page, parts of the accessibility tree, etc. | ||
|
|
||
| 1. Perform any [=implementation-defined=] steps with |observation| and the [=browser agent=], to | ||
| expose the |observation|'s [=observation/tool map=] to the [=browser agent=] in whatever way it | ||
| accepts. | ||
|
Collaborator
I expect this will also update the tool map exposed to other frames in listTools() once that is spec'ed out? Otherwise I am wondering how we will measure interop since the side effect is only exposed to the browser agent.
Collaborator
Author
Actually performing an observation will not have any web-observable effects. When I spec the Promise-returning So in short, this is not the map that |
||
|
|
||
| Note: Despite the name of this API (i.e., Web*MCP*), this specification does not prescribe the | ||
| format in which tools are exposed to the [=browser agent=]. Browsers are free to distill and | ||
| expose tools via Model Context Protocol, other proprietary "function calling" methods, or any | ||
| other way they deem appropriate. | ||
|
|
||
| Advisement: Implementations are expected to convey to the [=browser agent=] any relevant | ||
| security information associated with [=tool definitions=], such as the originating [=origin=], | ||
| among other things, so that the backing model has an idea of the different parties at play, and | ||
| can most safely carry out the end user's intent. | ||
|
|
||
| </div> | ||
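The tool-map portion of the algorithm above can be sketched as follows. This is a simplified model under stated assumptions, not an implementation of the specification: `ModelDocument`, `ModelNavigable`, `Observation`, `inclusiveDescendantNavigables`, and `performObservation` are hypothetical names, the traversal order is an assumption, the asserts are omitted, and both [=implementation-defined=] steps (adding extra context and exposing the result to the browser agent) are left out.

```typescript
// Hypothetical model types; none of these names come from the specification.
interface ToolDefinition {
  name: string;
  description: string;
}

interface ModelDocument {
  uniqueId: string; // stands in for the Document's unique ID
  tools: ToolDefinition[]; // stands in for the values of the model context's tool map
}

interface ModelNavigable {
  activeDocument: ModelDocument;
  children: ModelNavigable[];
}

interface Observation {
  toolMap: Record<string, ToolDefinition[]>;
}

// Collects a navigable and all of its descendants, parents before children.
function inclusiveDescendantNavigables(root: ModelNavigable): ModelNavigable[] {
  const result: ModelNavigable[] = [root];
  for (const child of root.children) {
    for (const nested of inclusiveDescendantNavigables(child)) {
      result.push(nested);
    }
  }
  return result;
}

// Builds the observation's tool map: one entry per document, keyed by its ID.
function performObservation(traversable: ModelNavigable): Observation {
  const observation: Observation = { toolMap: {} };
  for (const descendant of inclusiveDescendantNavigables(traversable)) {
    const doc = descendant.activeDocument;
    // Copying the list is a modeling choice so that later registrations
    // do not alter an observation that was already taken.
    observation.toolMap[doc.uniqueId] = doc.tools.slice();
  }
  return observation;
}

// Usage: a top-level page with one embedded frame, each offering one tool.
const frameNavigable: ModelNavigable = {
  activeDocument: {
    uniqueId: "doc-b",
    tools: [{ name: "pay", description: "Pay inside the embedded frame" }],
  },
  children: [],
};
const topLevelNavigable: ModelNavigable = {
  activeDocument: {
    uniqueId: "doc-a",
    tools: [{ name: "search", description: "Search the page" }],
  },
  children: [frameNavigable],
};
const exampleObservation = performObservation(topLevelNavigable);
```

Here `exampleObservation.toolMap` gets one entry for the top-level document and one for the frame's document, which is the shape the browser agent would then receive in whatever format the implementation chooses.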
|
|
||
| Each {{Document}} object has a <dfn for=Document>unique ID</dfn>, which is a [=unique internal | ||
| value=]. | ||
|
|
||
| The times at which a [=browser agent=] [=performs an observation=] are [=implementation-defined=]. | ||
|
Collaborator
We may need to be more prescriptive here; if tools are registered or unregistered, or frames enter or leave the frame tree, those should trigger a re-observation.
Collaborator
Author
I agree. My plan is this: when |
||
| A [=browser agent=] may [=parallel queue/enqueue steps=] to the [=AI agent queue=] to [=perform an | ||
| observation=] given any [=top-level browsing context=] in the [=user agent=]'s [=browsing context | ||
| group set=], at any time, although implementations typically reserve this operation for when the | ||
| user is interacting with a [=browser agent=] while web content is in view. | ||
|
|
||
|
|
||
| <h2 id="security-privacy">Security and privacy considerations</h2> | ||
|
|
||
| <!-- | ||
|
|
||
Since we're using the AI task source in the renderer/Document's event loop, I'd like to keep the browser side equivalent also prefixed with "AI", to mirror this. Is that OK?