Agent Platform Training & Evaluation Pipeline
Kubeflow Pipelines v2 · Production‑grade MLOps on GCP
Production‑grade ML training, evaluation, gating, and conditional deployment pipeline on Google Agent Platform using Kubeflow Pipelines (KFP v2). Enforces model quality, tracks lineage, and deploys only validated models.
Project Summary
AI + MLOps + Cloud Platform Engineering
Category
AI/ML · MLOps · Platform Engineering
Industry
Cross‑industry Enterprise AI Platform
MLOps Focus
Training · Evaluation · Gating · Conditional Deploy
Key Technologies & Concepts
ML/AI platform primitives
Problem & Objective
Why this pipeline exists
Problem
Manual, notebook‑driven ML workflows lack reproducibility, governance, automated evaluation gates, and production discipline. There was no structured way to enforce model quality before deployment on GCP.
Objective
Build a production‑grade, automated ML training/evaluation pipeline on GCP that enforces quality gates, tracks lineage, and conditionally deploys models to Agent Platform endpoints using native MLOps primitives.
Solution & Architecture
Agent Platform native orchestration
Overview
Agent Platform Pipelines (KFP v2) orchestrates data preparation, model training (RandomForest), evaluation (ROC, confusion matrix, accuracy), quality gating, conditional deployment to Agent Platform Endpoints, and scheduled retraining.
The platform automates machine learning workflows from three primitives. Components are modular building blocks, each performing one specific task. The DSL (Domain Specific Language) connects components into a pipeline. Conditions add "if‑then" logic that is evaluated at run time, so a branch such as deployment executes only when an upstream output meets a threshold.
@dsl.component or @component · @dsl.pipeline · with dsl.Condition(accuracy > 0.8):
Skills & Technologies
ML/platform engineering stack
Primary (Advanced)
- MLOps Architecture
- Agent Platform Pipelines / KFP v2
- Cloud AI Platform Engineering
- Production ML Workflow Design
Secondary
- Kubeflow Pipelines SDK
- scikit‑learn · Agent Platform SDK
- GCS · IAM · Workload Identity
- GitHub Actions (CI trigger)
Languages & DevOps
Pipeline Execution & Governance
Conditional gates, lineage, scheduling
Execution
- Manual / CI trigger → Agent Platform Pipeline run
- KFP v2 components: data prep, training, evaluation, deploy
- Artifacts stored in GCS, metrics in Agent Platform Metadata
Governance
- Explicit evaluation gate (accuracy/ROC threshold)
- Conditional pipeline branch: deploy only if gate passes
- Model versioning in Agent Platform Model Registry
- IAM least‑privilege + Workload Identity Federation
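The evaluation gate described above reduces to a small predicate. The following is a minimal sketch, assuming the evaluation step emits a dictionary with `accuracy` and `roc_auc` keys; the key names, the `GateThresholds` class, and the ROC AUC threshold are hypothetical, and only the 0.8 accuracy threshold comes from the document.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class GateThresholds:
    min_accuracy: float = 0.8  # matches the dsl.Condition example
    min_roc_auc: float = 0.8   # hypothetical value for illustration


def passes_quality_gate(
    metrics: Dict[str, float],
    thresholds: GateThresholds = GateThresholds(),
) -> bool:
    """Return True only if every gated metric is present and above its threshold."""
    accuracy = metrics.get("accuracy")
    roc_auc = metrics.get("roc_auc")
    # A missing metric fails the gate instead of passing silently.
    if accuracy is None or roc_auc is None:
        return False
    return accuracy > thresholds.min_accuracy and roc_auc > thresholds.min_roc_auc
```

Treating a missing metric as a failure keeps the gate conservative: a broken evaluation step blocks deployment instead of letting an unvalidated model through.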
Challenges & Resolutions
Wiring KFP v2 components → Agent Platform Pipelines: used native KFP interfaces.
ROC/metrics logging: sanitized inputs for Agent Platform metrics APIs.
Conditional gates: pipeline condition with threshold check.
Model format for serving: packaged as Agent Platform-compatible artifact.
Notebook to production: refactored into pipeline components.
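The document does not say what the ROC sanitization involved. One plausible reading, assumed here, is that ROC arrays must be plain lists of finite floats before they are logged, since recent scikit‑learn versions set the first `roc_curve` threshold to infinity. The helper name `sanitize_roc` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sanitize_roc(
    fpr: Sequence[float],
    tpr: Sequence[float],
    thresholds: Sequence[float],
) -> Tuple[List[float], List[float], List[float]]:
    """Drop ROC points with non-finite values and return plain float lists."""
    if not (len(fpr) == len(tpr) == len(thresholds)):
        raise ValueError("fpr, tpr and thresholds must have the same length")

    clean_fpr: List[float] = []
    clean_tpr: List[float] = []
    clean_thr: List[float] = []
    for f, t, th in zip(fpr, tpr, thresholds):
        # float() also converts NumPy scalars to built-in floats.
        f, t, th = float(f), float(t), float(th)
        if math.isfinite(f) and math.isfinite(t) and math.isfinite(th):
            clean_fpr.append(f)
            clean_tpr.append(t)
            clean_thr.append(th)
    return clean_fpr, clean_tpr, clean_thr
```

The three returned lists could then be passed to a metrics call such as KFP's `ClassificationMetrics.log_roc_curve`, which expects lists of floats.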
GCP CI/CD · Architecture & YAML Mapping
Pipeline‑2 model training, evaluation, and governance constructs
| Architecture Block | GCP CI/CD / MLOps Construct (Pipeline‑2 – Modelling) | YAML / Pipeline Spec Mapping |
|---|---|---|
| Source Repository | GitHub (modeling / pipelines repo) | repository, workflow.checkout |
| Source Trigger | Manual / CI trigger (GitHub Actions or local notebook execution) | on.workflow_dispatch, on.push, notebook_runtime |
| CI Runner | GitHub Actions Linux Runner (ubuntu-latest), optional for CI-driven runs | jobs.pipeline.runs-on: ubuntu-latest |
| Build / Pipeline Execution | Agent Platform Pipelines (KFP v2: Data → Train → Evaluate → Condition) | pipelineSpec.root, pipelineInfo.name, deploymentSpec |
| Training Orchestration | Agent Platform Pipelines (KFP v2) | @dsl.pipeline, tasks.train |
| Data Processing | Agent Platform Pipeline Component (Pandas + Scikit‑Learn preprocessing) | @dsl.component, components.data-prep |
| Model Training | RandomForestClassifier training pipeline / managed training runtime | components.train.container.image, args, model_output |
| Model Evaluation | Pipeline component for ROC, Confusion Matrix, Accuracy | components.evaluate.outputs.metrics, classificationMetrics |
| Artifact Storage | Google Cloud Storage (datasets, model artifacts, metrics JSON) | pipeline_root: gs://..., artifact_uri, metrics_path |
| Container Registry | Artifact Registry (Agent Platform managed serving container) | image.repository, image.tag |
| Model Registry | Agent Platform Model Registry (governed model versions) | components.upload-model, model.display_name, version_aliases |
| Approval Gate | Pipeline Condition (metric threshold gate for deployment) | with dsl.Condition(accuracy > 0.8), threshold |
| Security & Auth | GCP Service Account + IAM (least privilege for pipelines) | service_account, roles/aiplatform.user, roles/storage.objectAdmin |
| Secrets / Config | Environment variables + GCP IAM, optionally Secret Manager | env.PROJECT_ID, env.REGION, env.BUCKET_URI, secretEnv |
| Monitoring & Logs | Agent Platform Pipelines UI + Cloud Logging | pipeline_job_name, logging.enabled |
| Lineage & Governance | Agent Platform Pipelines lineage + Model Registry versions | metadata, metrics, artifact.uri |
| Infrastructure Backend | Agent Platform Managed Pipelines (no separate IaC needed) | managed_pipeline: true, location |
Pipeline‑2 standardizes reproducible training workflows, centralized GCS artifacts, metric logging (ROC, confusion matrix, accuracy), and governed model versioning for controlled promotion toward deployment.
Complete Project Details
All content from the Pipeline‑2 PDF
Project Summary
- Project Name: AI‑GCP Pipeline‑2 – Agent Platform Training & Evaluation Pipeline
- One‑Line Description: Production‑grade ML training, evaluation, gating, and conditional deployment pipeline on Google Agent Platform using Kubeflow Pipelines (KFP v2).
- Category: AI + MLOps + Cloud Platform Engineering
- Industry: Cross‑industry (Enterprise AI Platform / MLOps Infrastructure)
- Domain: Machine Learning Platform Engineering / AI Model Lifecycle Automation
Key Words
- Agent Platform Pipelines (KFP v2 Orchestration)
- Kubeflow Pipelines SDK (Pipeline as Code)
- Agent Platform Training Jobs (Managed Training Runtime)
- Agent Platform Metadata Store (Lineage & Governance)
- Google Cloud Storage (Datasets, Models, Metrics Artifacts)
- Artifact Registry (Training / Inference Containers)
- Service Accounts & IAM (Least‑Privilege MLOps Security)
- Workload Identity Federation (GitHub → GCP Auth)
- Conditional Pipelines (Evaluation Gate → Deploy)
- Agent Platform Model Upload (Model Registry Equivalent)
- Agent Platform Endpoints (Online Inference Targets)
- Pipeline Scheduling (Agent Platform Pipeline Scheduler)
- Cloud Logging (Training / Pipeline Logs)
- ML Governance (Metadata, Metrics, Model Lineage)
Problem Solved
Manual, notebook‑driven ML workflows lack reproducibility, governance, automated evaluation gates, and production deployment discipline. There was no structured way to enforce model quality before deployment in GCP.
Primary Objective
Build a production‑grade, automated ML training and evaluation pipeline on GCP that enforces quality gates, tracks lineage, and conditionally deploys models to Agent Platform endpoints using platform‑native MLOps primitives.
Solution & Architecture
Implemented an Agent Platform Pipelines (KFP v2) based ML pipeline that performs data preparation, model training, evaluation (ROC, confusion matrix, accuracy), quality gating, conditional deployment to Agent Platform Endpoints, and scheduled retraining.
The platform automates machine learning workflows from three primitives. Components are modular building blocks, each performing one specific task. The DSL (Domain Specific Language) connects components into a pipeline. Conditions add if‑then logic that is evaluated at run time, so a branch such as deployment executes only when an upstream output meets a threshold.
- Representation:
@dsl.component or @component; @dsl.pipeline; with dsl.Condition(accuracy > 0.8):
- Cloud Platform: Google Cloud Platform (Agent Platform)
- Components: Agent Platform Pipelines, KFP v2 Components, managed training runtime, endpoints, GCS, Artifact Registry, Agent Platform Metadata Store, Service Accounts + IAM
- Reliability: managed training jobs, serverless orchestration, GCS persistence, idempotent re‑runnable steps, and conditional deployment gates
AI / DevOps Details
- Focus: Supervised ML training + MLOps automation (training, evaluation, gating, deployment)
- Implemented: RandomForestClassifier training pipeline; Data → Train → Evaluate → Gate → Deploy; ROC, confusion matrix, accuracy logging; conditional deployment logic; scheduled retraining pipelines
- CI/CD / Orchestration: GitHub Actions, Kubeflow Pipelines v2, Agent Platform Pipelines, optional Artifact Registry for containerized components
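The train and evaluate steps listed above can be sketched locally with scikit‑learn. This is an illustration under stated assumptions, not the project's component code: the synthetic dataset from `make_classification` stands in for the real data, and the function name, hyperparameters, split ratio, and metric keys are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split


def train_and_evaluate(seed: int = 42) -> dict:
    # Synthetic binary classification data as a stand-in for the real dataset.
    X, y = make_classification(n_samples=500, n_features=10, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )

    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of class 1
    fpr, tpr, thresholds = roc_curve(y_test, y_score)

    return {
        "accuracy": float(accuracy_score(y_test, y_pred)),
        "roc_auc": float(roc_auc_score(y_test, y_score)),
        "confusion_matrix": confusion_matrix(y_test, y_pred).tolist(),
        "roc_points": len(fpr),
    }


metrics = train_and_evaluate()
```

The returned dictionary holds the three metric families the pipeline logs (accuracy, confusion matrix, ROC), in a JSON‑serializable form that a downstream gate step could consume.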
Monitoring, Logging & Optimization
- Agent Platform Pipelines UI for observability
- Cloud Logging for job‑level logs
- Agent Platform Metadata Store for metrics + lineage
- Model KPI logging with accuracy thresholds for gating
Skills & Technologies Used
- Primary: MLOps Architecture, Agent Platform Pipelines / KFP v2, Cloud AI Platform Engineering, Production ML Workflow Design — Advanced
- Secondary: Kubeflow Pipelines SDK, scikit‑learn, Agent Platform SDK (Python), Google Cloud Storage, GitHub Actions
- Languages: Python (primary), YAML (configuration / pipeline specs where applicable)
- Cloud & DevOps: Google Agent Platform, GCS, Artifact Registry, IAM / Service Accounts, GitHub Actions, Workload Identity Federation
Challenges & Resolutions
- Wiring KFP v2 components correctly with Agent Platform Pipelines → used KFP v2 native component interfaces
- ROC / metrics logging compatibility → sanitized ROC inputs to satisfy metrics APIs
- Conditional deployment gates → implemented explicit evaluation gates with pipeline conditions
- Model artifact formats for serving → packaged models to match serving container expectations
- Notebook‑level code to production pipeline → converted notebook workflows into pipeline‑native components
GCP Production‑Grade Implementation Details
Architecture: Agent Platform Pipelines → Training → Evaluation → Conditional Deployment → Endpoints; artifact persistence in GCS; lineage in Agent Platform Metadata Store.
- High‑level flow: GitHub Trigger → Agent Platform Pipeline Execution → Data Prep Component → Training Component (Agent Platform Training) → Evaluation Component (ROC / Accuracy / Confusion Matrix) → Quality Gate → Conditional Deployment to Agent Platform Endpoint → Scheduled Re‑training
- Architecture implemented on GCP: Raw Data → Agent Platform Pipeline (Data Prep → Train → Evaluate → Gate) → Model Artifacts (GCS) → Agent Platform Model Registry → Approved Model for Deployment
- Top lane — Training & Evaluation Path: Raw Dataset (GCS / External Source) → Data Preparation Component → Custom Training Job → Model Evaluation → Quality Gate → Model Upload
- Bottom lane — Experiment Tracking & Lineage: Training & Evaluation Runs → Pipelines Lineage / Experiments → Metrics, Artifacts, Parameters stored in GCS → Governed Model Versioning in Model Registry
- The main project document provides a detailed view.
Assets & References
- GitHub / Repository Link: https://github.com/Rajesh-Arigala/vertex-ai-mlops-kfp2
- Notebook: Vertex_AI_kfp2_pipeline.ipynb
- Weblink: https://rajesharigala.com/mlops/ai4/ai4.2
- Proof Link: later
Study Material
- Public: official KFP documentation, YAML files for GCP, and the Python SDK; downloadable PDF if available
- Restricted: KFP‑specific files and Google Colab notebooks; downloadable PDF with access limited to authorised users
Pipeline‑2 Summary
Production‑grade model development and orchestration on Google Cloud using Agent Platform Pipelines (KFP v2) to automate data preparation, model training, evaluation, and quality gates. This layer standardizes reproducible training workflows, centralized artifact storage in GCS, metric logging into experiments, and governed model versioning via the Model Registry, enabling controlled promotion of validated models toward deployment.
Assets & References
Code, diagrams, study material
Repository
Full training/evaluation pipeline code, components, and deployment specs.
vertex-ai-mlops-kfp2
Study Material Resources
Official docs, restricted KFP guides, Colab notebooks