To define the data infrastructure and informational boundaries for the AI agent. This includes identifying and structuring data sources, ensuring data quality and compliance, and explicitly stating what the agent should and should not attempt. A robust data strategy is foundational for agent reliability, accuracy, and trust.
Document all data sources needed for core agent functionality:
- Internal systems (CRM, support platforms, databases)
- External APIs and public datasets
- File/document repositories (PDFs, Google Drive, SharePoint)
Source: Gemini + Perplexity data readiness assessments
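A data-source inventory can be kept in code as well as in a spreadsheet. The sketch below is a minimal illustration only; the field names and the example entries (`customer_records`, `sales-ops`, etc.) are hypothetical placeholders, not part of any prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class DataSource:
    """One row of the data inventory: where the data lives and who owns it."""
    name: str
    category: str          # "internal" | "external" | "documents"
    system: str            # e.g. "CRM", "REST API", "SharePoint"
    owner: str             # team accountable for the source
    contains_pii: bool = False

# Hypothetical inventory entries for illustration.
inventory = [
    DataSource("customer_records", "internal", "CRM", "sales-ops", contains_pii=True),
    DataSource("support_tickets", "internal", "support platform", "cx-team", contains_pii=True),
    DataSource("product_docs", "documents", "SharePoint", "docs-team"),
]

def sources_with_pii(entries):
    """Filter the inventory to sources that will need privacy controls."""
    return [s.name for s in entries if s.contains_pii]
```

Tracking `contains_pii` per source up front makes the later compliance steps (consent handling, anonymization) a filter over this list rather than a separate discovery exercise.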
Assess the quality, availability, and compliance of each data source:
- Freshness and update frequency
- Access controls and authentication
- Sensitivity and compliance (GDPR, CCPA)
Tools: Schema validation, null value scans, privacy scoring
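Null-value scans and freshness checks like those above can be sketched in a few lines of plain Python; real pipelines would typically use a validation library or the listed tools, so treat this as an illustrative stand-in:

```python
from datetime import datetime, timedelta, timezone

def null_scan(records, required_fields):
    """Count missing or empty values per required field across a batch of records."""
    counts = {f: 0 for f in required_fields}
    for rec in records:
        for f in required_fields:
            if rec.get(f) in (None, ""):
                counts[f] += 1
    return counts

def is_fresh(last_updated, max_age_days):
    """True if the source was updated within the allowed window."""
    return datetime.now(timezone.utc) - last_updated <= timedelta(days=max_age_days)

# Hypothetical sample rows for illustration.
rows = [
    {"id": 1, "email": "a@example.com", "plan": "pro"},
    {"id": 2, "email": None, "plan": "free"},
    {"id": 3, "email": "", "plan": None},
]
```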
Plan how data will be ingested, transformed, and fed to the agent:
- ETL/ELT pipelines
- Chunking for RAG (text splitting, embedding)
- Live vs. batch updates
Tools: LlamaIndex, LangChain, Airbyte, Prefect
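Chunking for RAG can be illustrated with a simple sliding-window splitter. This is a simplified stand-in for the splitters LangChain and LlamaIndex provide (which also respect sentence and paragraph boundaries); the chunk size and overlap values are arbitrary examples:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows for embedding.

    Overlap keeps context that straddles a chunk boundary retrievable
    from at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```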
Decide how information will be structured and queried:
- Vector stores (Pinecone, FAISS, Weaviate)
- Hybrid search (semantic + keyword)
- Metadata tagging and filters
Critical for Retrieval-Augmented Generation (RAG)
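One common way to combine semantic and keyword retrieval is a weighted blend of the two scores. The sketch below assumes embeddings are already computed; the `alpha` weighting scheme is one simple convention, not the only hybrid-search design:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, text):
    """Fraction of query terms that appear in the document text."""
    terms = query.lower().split()
    hits = sum(1 for t in terms if t in text.lower())
    return hits / len(terms) if terms else 0.0

def hybrid_score(query, doc_text, query_vec, doc_vec, alpha=0.5):
    """Blend semantic and keyword relevance; alpha weights the semantic side."""
    return alpha * cosine(query_vec, doc_vec) + (1 - alpha) * keyword_score(query, doc_text)
```

Metadata filters would typically be applied before scoring, narrowing the candidate set to documents whose tags (source, date, department) match the query context.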
Define what the agent should NOT attempt:
- Unsupported data types (e.g., image/video if not configured)
- Domains outside scope
- Escalation paths for ambiguous or risky requests
Ensures safe, reliable, and explainable behavior
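The boundary rules above can be enforced with a small gate that runs before the agent attempts a task. The modality and domain sets here are hypothetical examples of a configured scope:

```python
# Hypothetical scope configuration for illustration.
SUPPORTED_MODALITIES = {"text"}                 # image/video not configured
SUPPORTED_DOMAINS = {"billing", "product"}      # everything else is out of scope

def check_boundaries(modality, domain):
    """Return an action for the request: 'handle', 'refuse', or 'escalate'."""
    if modality not in SUPPORTED_MODALITIES:
        return "refuse"        # unsupported data type, e.g. video
    if domain not in SUPPORTED_DOMAINS:
        return "escalate"      # out-of-scope or risky domain goes to a human
    return "handle"
```

Keeping the sets in configuration rather than scattered through prompt text makes the agent's capability boundaries auditable and explainable.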
Prepare for edge cases and system errors:
- What happens if data is missing?
- When does the agent escalate to a human?
- What fallback messages are used?
Source: Responsible AI and MLOps safety practices
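The three questions above map naturally onto a decision function. This is a minimal sketch: the confidence threshold and fallback message are placeholder values to be tuned per deployment:

```python
FALLBACK_MESSAGE = "I couldn't find that information. Routing you to a human agent."

def answer_with_fallback(retrieved_chunks, confidence=1.0, threshold=0.6):
    """Decide between answering, escalating, and a fallback when data is missing.

    - No retrieved data  -> fallback message (missing-data case)
    - Low confidence     -> escalate to a human
    - Otherwise          -> answer from the retrieved context
    """
    if not retrieved_chunks:
        return {"action": "fallback", "message": FALLBACK_MESSAGE}
    if confidence < threshold:
        return {"action": "escalate", "message": "Handing off to a human for review."}
    return {"action": "answer", "message": " ".join(retrieved_chunks)}
```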
Ensure data handling aligns with laws and internal policies:
- Consent handling, PII controls
- Anonymization, retention, audit logs
- Third-party data licensing review
Source: CCPA/GDPR templates + Microsoft Purview / AWS Lake Formation
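PII controls often include masking identifiers before text is logged or embedded. The regexes below are deliberately simple illustrations; production PII detection should rely on dedicated tooling such as Microsoft Purview or Amazon Macie rather than hand-rolled patterns:

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

def redact_pii(text):
    """Mask emails and phone numbers before text enters logs or a vector store."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

Running redaction at ingestion time, before chunking and embedding, keeps raw identifiers out of the index entirely, which simplifies retention and audit obligations downstream.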
- Data Quality Score (completeness, freshness, structure)
- RAG Retrieval Relevance Accuracy
- Compliance Audit Pass Rate
- Escalation Accuracy Rate (false negatives avoided)
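The Data Quality Score above can be computed as a weighted blend of its three components. The weights here are illustrative assumptions, not a prescribed formula:

```python
def data_quality_score(completeness, freshness, structure, weights=(0.4, 0.3, 0.3)):
    """Weighted 0-100 score from three 0-1 sub-metrics; weights are illustrative."""
    for v in (completeness, freshness, structure):
        if not 0.0 <= v <= 1.0:
            raise ValueError("sub-metrics must be in [0, 1]")
    wc, wf, ws = weights
    return round(100 * (wc * completeness + wf * freshness + ws * structure), 1)
```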
- Data Connectors: Airbyte, LlamaHub, Zapier, APIs
- Pipeline Tools: Prefect, Dagster, LangChain loaders
- Vector Stores: Pinecone, Weaviate, Chroma, FAISS
- Compliance: Microsoft Purview, Amazon Macie, Datadog
- Data Inventory Spreadsheet
- Readiness Scoring Matrix
- Chunking & Embedding Planner
- Capability Boundaries Matrix
- Failure Mode Worksheet
- Escalation Mapping Guide
- Governance Checklist
| Track | Flow | Outcome |
|---|---|---|
| Weekend Warrior | Use ready-made datasets or low-risk sources | Lightweight RAG setup with safe fallback rules |
| Startup | Custom pipeline for 1–2 internal systems + privacy checklist | Validated data inputs and scoped capabilities |
| Enterprise | Full governance model + layered indexing strategy | Secure, compliant, scalable knowledge foundation |
- LlamaIndex Docs
- LangChain Chunking & Loading
- FAISS & Vector DB Guides
- Microsoft Purview Governance
- GDPR/CCPA Compliance Resources
- Input: System architecture and integration targets from Module 4
- Feeds into: Module 6 (Interaction Design), Module 8 (Performance Evaluation), Module 9 (Deployment)