UCSF Brain Health Registry

Introduction on how to use AI tools at UCSF

Pedro Pinheiro-Chagas, PhD

UCSF NeuroAI Lab
Department of Neurology, Fein Memory and Aging Center
Bakar Computational Health Sciences Institute
Center for Intelligent Imaging (ci2)
Weill Institute for Neurosciences

Slidesneuroailab.ucsf.edu/talks/intro-ai-tools

Scan for slides

Definition

What is agentic AI?

Agentic AI does not just reply with text. It breaks a goal into steps, uses real tools (your files, terminal, browser, data), acts, checks the result, and repeats until the task is done.

Assistant

Chatbot, like ChatGPT

You ask, it answers, one turn at a time
Returns text and suggestions
You copy the output and do the work yourself
Lives in the chat box, with no access to your systems
Best for drafting, explaining, brainstorming

Coworker

Agentic AI

You give a goal, it plans the steps
Takes actions: runs code, edits files, queries data, calls tools
Works inside your environment: repo, terminal, apps
Observes results and self-corrects in a loop
Carries the task end to end, then reports back

For real work, the shift is from AI that tells you how to AI that does it with you: fewer copy-pastes, more finished tasks.

What the data shows

The shift to agentic AI

Agentic coding is diffusing fast and broadly: Codex's share of output tokens climbs across job functions, not just engineering. Fig. 3, share of output tokens by worker type · Johnston, D., Holtz, D., Martin Richmond, A., Ong, C., Tambe, P., & Chatterji, A. (2026). The Shift to Agentic AI: Evidence from Codex. OpenAI. arXiv:2606.26959

**(a)** Organizational users, by inferred job title

What you can use today

AI resources at UCSF

Available now

ChatGPT Enterprise

Outlook integration: connect Outlook Calendar and Outlook Email
Read only for now (no creating events, no reading attachments)
Plus Projects, GPTs, Apps, and Deep Research
Limited credits, with a new weekly cap on the heaviest users
Want more? Associate your account with a speedtype

Real agentic

UCSF Versa API + Codex

OpenAI models via Microsoft Azure, delivered through UCSF Versa
Versa is UCSF's secure platform for PHI-safe inference
Needs a speedtype, includes about $200 / month of credit
Unlocks Codex, both the CLI and the desktop app
Truly agentic: reads and edits your local files
Runs in your environment or VM, multi-step tasks end to end

PHI-safe

Both ChatGPT Enterprise and the Versa API are cleared for PHI data. Inputs are not retained for model training.

The decade of AI Agents

LLMs excel at complex tasks: board exams, benchmarks (including diagnosis)

Huge potential to revolutionize clinical care, programmed for specific use cases

Massive jagged intelligence: gap between model strengths and weaknesses

Harness closes the gap (planning, context, verification)

Hallucinations are declining, but persist

Scientific and empirical grounding

No continual learning: model's knowledge lags current medical information

Access to updated information (e.g. FDA, clinical trials, Dx criteria, treatment)

Limited context window and sample inefficiency

Effectively compress information (vector databases, knowledge graphs)

LLMs lack stats metrics for predictions or classifications, limiting implementation and trust

Combine with traditional ML models (quantitative, interpretable features)

Most studies on genAI for diagnostics use simulated or journal-curated data

Need for real data and real-world use cases: clinical validation

A cautionary example

Polished, convincing, and wrong

An AI tool generated this graphical abstract of a new paper. It looked authoritative, but the numbers were wrong.

AI-generated · results incorrect AI-generated graphical summary with incorrect results

Hallucinations look real. The summary assigned values to the wrong groups and invented key results, yet looked polished enough to accept and share. As models improve, these errors get harder to spot.
Wrong model for the task. Image generators optimize for visual appeal, not quantitative fidelity. Use a language model to build code-based visuals (e.g. HTML) from the real numbers, trading polish for accuracy.
AI literacy in research. One interface hides many different models. Know which fits which task, where it fails, and treat every output as a draft to verify against the source.

Using AI agents

Two ways to run the same agent

Same model, same tools. What changes is the surface you work in: a terminal or a desktop app. Pick the one that fits how you work.

Terminal

Codex CLI

Lives in your terminal, right inside the repo
Scriptable and composable with your shell and tools
Runs anywhere your code does: laptop, SSH, server, VM
Fast and keyboard-driven, ideal for automation

Desktop app

Codex App

Point-and-click, no terminal required
Visual diffs and step-by-step review of every change
Lower barrier to entry, easy to start and supervise
Delegate tasks and watch them run in a clean UI

Same agent underneath. Your skills, config, and file access travel with you across both.

Plan mode

Plan first, then execute

For big or ambiguous tasks, ask the agent to plan before it acts. It explores read-only, proposes a step-by-step plan, and only changes files once you approve.

You

Help me organize this messy dataset. Dozens of spreadsheets are scattered across different folders, file names are not standardized, some sheets duplicate columns, and a few use different units. I want one clean, documented table.

Reach for plan mode when

the task spans many files or steps
the requirements are fuzzy
mistakes would be costly

Proposed plan read-only until you approve

1Inventory every spreadsheet across the folders: path, sheet, size, modified date.

2Profile headers and sample rows; flag duplicate columns, mixed units, and encodings.

3Propose a schema: one canonical column set with clear types and units.

4Normalize each sheet to the schema; standardize units, parse dates, drop duplicates.

5Merge and validate: combine into one table; reconcile row counts against the sources.

6Deliver a clean table, a data dictionary, and a short README of decisions.

Nothing changes until you approve. Edit or reorder any step, then the agent executes it.

Customizing the agent

Skills: teach the agent your way

A skill is a small markdown file that tells the agent how to do a specific task, with reusable instructions and helper scripts. You can even ask the agent to write the skill itself.

Name + whena description that tells the agent when to reach for it

Instructionsplain-English markdown: conventions, structure, do's and don'ts

Scriptsoptional helper code the agent can run on its own

Portablelives in your repo, version-controlled, shareable with the team

---
name: frontend_prototype
description: Polished single-file HTML
  in a clinical-blue editorial design system.
---

## Design tokens
--slate: #052049;   /* navy text */
--clay:  #1C75BC;   /* clinical blue */

## Slides / Decks
Full-screen scroll-snap, one idea per slide,
arrow-key navigation, fixed counter.

skills/frontend_prototype/SKILL.md

The agent can write the very skill that powers it. frontend_prototype was generated this way, and now it builds every slide you're seeing.

Engineering the agent

From prompt to harness engineering

Prompt engineering is about the words you send. Harness engineering is about the system around the model: the tools it can call, the loop it runs, and the services it can reach.

Tool definition

Give it actions

Three parts: the function, the schema the model sees, and a prompt telling it when to call.

# 1. the function (your code)
def search_patients(query):
    return db.search(query)

# 2. declared to the model
{"name": "search_patients",
 "description": "find patients",
 "parameters": {"query": "str"}}

# 3. tell the model to use it
system: "Call search_patients
to look up patient records."

Agentic loop

Let it iterate

The model plans, calls a tool, reads the result, and repeats until the goal is met.

Plan ↓ Call a tool ↓ Observe the result ↻ repeat until done

MCP servers

Connect the world

A standard way to plug in outside tools and data, so the agent reaches beyond the chat.

Files Database Web GitHub Calendar

The model is the engine; the harness makes it drive. Today's gains come from better tools, loops, and connectors, not just better prompts.

MCP servers

Really just connectors and plugins

MCP (Model Context Protocol) is an open standard for connecting an agent to outside tools and data. Each server is really a connector, or plugin: add one, and the agent gains a new capability, like apps for a phone or extensions for a browser.

Add a server, gain a capability. You have already used one: the Outlook connector in ChatGPT Enterprise.

Worked example

Screening research candidates

A real NeuroAI Lab tool: read every applicant CV, score it against our hiring criteria, and justify each decision, then post the ranked results back to our UCSF GitHub.

Read CVsPDF to text

→

Score vs criteriaLLM on UCSF Versa

→

Structured resultmet · confidence · evidence

→

Share resultsranked CSV + GitHub

Candidate	Academic	Clinical	Modality	Technical	Score
Candidate A	✓ high	✓ med	✓ high	✓ high	4/4
Candidate B	✓ high	✗ low	✓ med	✓ high	3/4
Candidate C	✓ med	✗ low	✓ med	✗ low	2/4

Evidence (Academic, Candidate A): “PhD in Neuroscience; 14 peer-reviewed publications, several in high-impact journals.” UCSF Versa · PHI-safe

The same building blocks. Criteria are the spec, the system prompt sets the rubric, a typed schema makes every result structured and explainable, and a loop scales it to every applicant.

A skill in action

Responding to reviewers

A skill for the work researchers dread: point-by-point reviewer and grant responses. It drafts one comment at a time, grounds every reply in the manuscript's tracked changes, and verifies each citation before using it.

What the skill enforces

One comment at a time, wait for approval
Quote the real tracked changes, never invent
Verify every citation online (APA)
Diplomatic tone: concede, then explain
No em dashes, no AI-tells
Notes to co-authors for open items

Reviewer 2 · Comment 2

“The subgroup analyses seem underpowered. Please justify the sample sizes.”

We thank the reviewer for this helpful point. We agree these analyses are exploratory, now say so explicitly, and report 95% confidence intervals throughout. We added the following to the Methods:

“Subgroup analyses were prespecified but exploratory; given the limited subgroup sizes, estimates are reported with 95% confidence intervals and interpreted with caution.”

Stops here. Waits for your approval before Comment 3.

A skill captures how you work. Your format, your voice, and your integrity rules, applied the same way every time.

Paperclip + Codex

Search papers from inside the agent

Paperclip is a biomedical literature connector and CLI. Once installed, Codex can search papers, regulatory documents, and clinical trials, read the source text, and bring back citations instead of relying on memory.

1 · Install once

Local CLI + Codex skill

Run the Paperclip installer, sign in, then install the project skill and select Codex.

curl -fsSL https://paperclip.gxl.ai/install.sh | bash
paperclip install  # select Codex

.agents/skills/paperclip/SKILL.md

2 · Ask naturally

Use /paperclip in Codex

Start a fresh Codex session in the project, then mention the skill in the prompt. The agent loads the Paperclip docs and chooses the right commands.

You

using /paperclip, find papers about Alzheimer's and pTau217 in Black populations

searchread full textextract linescite

3 · Agent loop

From query to evidence

Codex can iterate over the corpus, inspect the papers it finds, and summarize only what is supported by sources.

1Search PMC, preprints, FDA, or trials.

2Read abstracts, methods, results, and figures.

3Return a concise answer with line-level citations.

Same Codex workflow, larger library. Paperclip turns biomedical literature into a tool the agent can call, inspect, and cite.

MCP in practice

Your Jira, straight from Codex

Remember MCP servers, the connectors from earlier? Here is one, live. Point Codex at Atlassian Jira and it can read your board: ask for your open tickets in plain English and get them back with working links.

1 Add the connector

# ~/.codex/config.toml
[mcp_servers.atlassian]
command = "npx"
args = ["-y", "@aashari/mcp-server-atlassian-jira"]

[mcp_servers.atlassian.env]
ATLASSIAN_SITE_NAME  = "your-site"
ATLASSIAN_USER_EMAIL = "you@ucsf.edu"
ATLASSIAN_API_TOKEN  = "ATATT3x••••••••"  # from id.atlassian.com

Drop the block in, then restart Codex.

2 Add a skill for consistency

---
name: jira
description: Always link every ticket,
  show its status, confirm before writes.
---
One format, every time:
[KEY](…/browse/KEY) · summary · status

skills/jira/SKILL.md

3 Just ask, in plain English

Codex querying Jira through the Atlassian MCP server: one prompt returns 12 open tickets, each with a link

One prompt → 12 open tickets, each linking straight to Jira.

This is an MCP connector at work. No exports and no custom API code. The same pattern from earlier, now pointed at Jira. Swap in GitHub, Confluence, or a database the same way.

Wrapping up

Pick one task. Start this week.

The tools are already here and they are PHI-safe. The fastest way to learn is to point one at a piece of real work and watch what it does.

What to remember

Agentic AI works with you. It plans, acts, checks, and repeats, instead of just answering.
Use UCSF's PHI-safe tools. ChatGPT Enterprise and the Versa API keep data protected and out of training.
Treat every output as a draft. Polished does not mean correct. Verify against the source.
The harness beats the prompt. Tools, loops, and connectors are where the real gains come from.
Skills capture how you work. Encode your process once, reuse it the same way every time.

Start this week

Request access to ChatGPT Enterprise and UCSF Versa.
Pick one repetitive task you already do often.
Ask the agent to plan first, then approve before it runs.
Keep PHI on Enterprise or Versa only, never consumer ChatGPT.
Check the result against the original data before you trust it.

Thank you. Pedro Pinheiro-Chagas, PhD
UCSF NeuroAI Lab · Weill Institute for Neurosciences