How do I assess an artificial intelligence development company for real-world deployments?
A real AI development partner can ship AI into production with measurable quality, security controls, monitoring, and a repeatable release process, not just a demo. The best way to assess this is to require concrete delivery artifacts up front: an evaluation plan and report, an MLOps runbook (deploy, monitor, rollback), and clear documentation of data and model limitations.
When this is the right approach
- You are deploying AI into a workflow where reliability matters (customer-facing features, operational decisions, regulated data).
- You expect change over time (new data, shifting behavior), so monitoring and updates are mandatory.
- You need risk management and governance proportional to impact, not “best effort” outputs.
When this isn’t the right approach
- You only need a short-lived concept to validate demand and you are explicitly OK discarding it.
- The work is deterministic and rules-based, where AI adds risk without meaningful upside.
- You cannot provide data access, stakeholder time, or an owner to operate the system after launch (production AI needs ongoing ownership).
Steps and checklist
- Start with a “production scope” one-pager
Require a written scope covering the workflow, downstream actions, failure impact, success metrics, and non-functional requirements (latency, uptime, auditability). A mature vendor will also map risks and controls using a framework like NIST AI RMF (and the GenAI profile if LLMs are involved).
- Make evaluation the first deliverable, not a phase-two task
Ask how they build the test set, what metrics they will report, and how they run regression testing when data, prompts, or models change. If they cannot explain evaluation clearly, expect prototype work.
- Verify MLOps maturity with evidence, not claims
Ask to see their standard pipeline and runbook: versioning, approvals, staged rollout, rollback, monitoring, and retraining or refresh triggers. Use an MLOps maturity lens to separate “manual scripts” from automated, repeatable delivery.
- Check secure delivery practices (for the whole product, not just the model)
Require secure SDLC practices (NIST SSDF is a strong baseline), plus app security requirements and verification for the product that will host the model.
- If LLMs are included, require GenAI-specific security and testing
Ask how they address prompt injection, data leakage, insecure output handling, and unsafe tool use. OWASP’s LLM Top 10 is a useful checklist for coverage.
- Run a short “proof of delivery” pilot before a large build
Best pilot outputs: data readiness assessment, evaluation harness, baseline results, and a production plan (deploy, monitor, rollback). A flashy UI demo is not the goal.
Proof you should ask for
Proof of production operations
- Monitoring dashboards and alert thresholds (quality drift, data quality, latency, cost).
- A real incident story and a runbook for incident response and recovery.
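When a vendor shows alert thresholds, it helps to ask how those thresholds are encoded and checked. The sketch below is one minimal way this could look in Python. The metric names and limits are hypothetical placeholders, not values from any particular vendor or tool; real thresholds would come from the vendor's runbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    direction: str  # "max": alert above the limit; "min": alert below it


# Hypothetical thresholds covering quality, data quality, latency, and cost.
THRESHOLDS = [
    Threshold("accuracy", 0.90, "min"),
    Threshold("null_rate", 0.02, "max"),
    Threshold("p95_latency_ms", 800.0, "max"),
    Threshold("cost_per_1k_requests_usd", 1.50, "max"),
]


def breached(thresholds, observed):
    """Return the names of metrics that are missing or outside their limit."""
    alerts = []
    for t in thresholds:
        value = observed.get(t.metric)
        if value is None:
            alerts.append(t.metric)  # a metric that stopped reporting is itself an alert
        elif t.direction == "max" and value > t.limit:
            alerts.append(t.metric)
        elif t.direction == "min" and value < t.limit:
            alerts.append(t.metric)
    return alerts


observed = {"accuracy": 0.87, "null_rate": 0.01, "p95_latency_ms": 950.0}
print(breached(THRESHOLDS, observed))
# ['accuracy', 'p95_latency_ms', 'cost_per_1k_requests_usd']
```

Treating a missing metric as a breach is a design choice in this sketch: a dashboard that silently stops reporting cost or quality is a common way for monitoring to fail unnoticed.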
Proof of evaluation discipline
- Written evaluation plan, sample evaluation report, and error analysis.
- Regression testing approach across releases.
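A regression testing approach can be as simple as comparing each release candidate against the last accepted baseline on a fixed test set. The following is a minimal sketch of that comparison, assuming every metric is "higher is better". The metric names, values, and the 0.01 tolerance are illustrative assumptions only.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return metrics where the candidate is worse than the baseline by more
    than `tolerance`. Assumes higher is better for every metric."""
    regressions = {}
    for name, base_value in baseline.items():
        cand_value = candidate.get(name)
        if cand_value is None:
            regressions[name] = None  # the metric was dropped from the report
            continue
        drop = base_value - cand_value
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions


# Hypothetical results from the same test set on two releases.
baseline = {"exact_match": 0.82, "groundedness": 0.91, "refusal_accuracy": 0.95}
candidate = {"exact_match": 0.84, "groundedness": 0.86, "refusal_accuracy": 0.945}

print(find_regressions(baseline, candidate))
# {'groundedness': 0.05}
```

A vendor with a real regression process should be able to show something equivalent running automatically whenever data, prompts, or models change, along with who approves a release when a regression is found.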
Proof of documentation and governance
- Dataset documentation (datasheets) and model documentation (model cards).
- Clear intended use and “do not use” boundaries.
Proof of secure delivery
- Secure SDLC mapping to SSDF practices.
- If GenAI, explicit coverage of OWASP LLM risks and mitigations.
Signals for enterprise readiness (when relevant)
- Ability to meet ISO/IEC 27001-style security management expectations and to supply evidence for SOC 2-style control categories.
Requirements
To assess vendors properly, you need:
- A business owner for outcomes and acceptance criteria, and a technical owner for architecture and risk sign-off.
- Representative data access (or a clear path to it), plus permission and retention rules.
- Agreement on human review points for higher-impact outputs.
Cost
Production readiness costs more than prototypes because you are paying for:
- Evaluation datasets, regression harnesses, and ongoing measurement.
- Automated pipelines, environments, monitoring, and incident processes.
- Security controls and verification for the model and the product around it.
A practical vendor comparison tactic: require pricing split into build + evaluation + operations (first 90 days, then steady-state).
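To see why the split matters, it can help to total each quote over the same horizon. The sketch below treats the first 90 days as three months and uses entirely made-up figures for two hypothetical vendors; the function name and parameters are assumptions for illustration.

```python
def total_cost(build, evaluation, ops_first_90_days, ops_monthly_steady, months=12):
    """Total cost over `months`, treating the first 90 days as three months."""
    if months < 3:
        raise ValueError("horizon must cover the first 90 days")
    return build + evaluation + ops_first_90_days + ops_monthly_steady * (months - 3)


# Made-up quotes for illustration only.
quotes = {
    "vendor_a": dict(build=120_000, evaluation=10_000,
                     ops_first_90_days=15_000, ops_monthly_steady=4_000),
    "vendor_b": dict(build=90_000, evaluation=30_000,
                     ops_first_90_days=30_000, ops_monthly_steady=6_000),
}

for name, quote in quotes.items():
    print(name, total_cost(**quote, months=12))
# vendor_a 181000
# vendor_b 204000
```

In this invented example the cheaper build quote is not the cheaper 12-month total, and the split also shows how much of each budget goes to evaluation, which a single lump-sum price would hide.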
Timeline
A production-minded plan usually includes:
- Discovery plus evaluation plan first.
- MVP that includes monitoring and rollback.
- Hardening and staged rollout with measurable quality gates.
If a proposal says “monitoring later,” timeline risk goes up and rework is likely.
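A "measurable quality gate" for a staged rollout can be made concrete with a small decision function. The sketch below is one possible shape, assuming traffic stages of 1, 10, 50, and 100 percent and two "higher is better" metrics. The stage sizes, metric names, and minimums are hypothetical.

```python
STAGES = [1, 10, 50, 100]  # percent of traffic per stage (hypothetical)

# Hypothetical minimums; higher is better for both metrics.
GATES = {"accuracy": 0.90, "task_success_rate": 0.95}


def next_action(current_stage, metrics, gates=GATES, stages=STAGES):
    """Decide what to do after observing metrics at the current stage.

    Returns (action, target_percent, failed_gates).
    """
    # A missing metric counts as a failed gate.
    failed = [name for name, minimum in gates.items()
              if metrics.get(name, float("-inf")) < minimum]
    if failed:
        return ("rollback", 0, failed)
    position = stages.index(current_stage)
    if position == len(stages) - 1:
        return ("done", current_stage, [])
    return ("promote", stages[position + 1], [])


print(next_action(10, {"accuracy": 0.93, "task_success_rate": 0.97}))
# ('promote', 50, [])
print(next_action(50, {"accuracy": 0.88, "task_success_rate": 0.97}))
# ('rollback', 0, ['accuracy'])
```

The point of the sketch is that promotion depends on monitored metrics, so a plan that defers monitoring has nothing to gate on.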
Risks
- Hidden technical debt: prototype-first ML often creates long-term maintenance cost.
- Drift: real-world inputs change, and quality degrades without monitoring and refresh triggers.
- Security failures: weak secure SDLC and weak app controls create avoidable vulnerabilities.
- GenAI-specific threats (if applicable): prompt injection, data leakage, and unsafe tool use.
- Governance gaps: unclear intended use leads to misuse and compliance exposure.
Alternatives
- Hire a fractional ML lead to define evaluation, MLOps, and governance, then use a smaller build team.
- Use managed ML platforms and focus spend on data quality, evaluation, and monitoring.
- Start with rules-based automation for hard constraints, and add ML only where variability makes rules unmaintainable.
Common mistakes and edge cases
Common mistakes
- Picking based on demo polish instead of evaluation and ops evidence.
- Accepting “we’ll add MLOps later.”
- No dataset or model documentation, so limitations stay implicit.
Edge cases to probe
- Low data volume or messy labels: what is the fallback and baseline?
- Rare events (fraud, safety issues): how do they sample and evaluate?
- Feedback loops: how do they prevent the system from learning from its own outputs over time?
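On the feedback-loop question, one answer a vendor might give is provenance tagging: every label records where it came from, and only human-verified labels are eligible for retraining. The sketch below illustrates that single idea. The `label_source` field and its values are hypothetical, and this is only one of several possible mitigations.

```python
def split_by_provenance(records):
    """Separate human-verified labels from everything else."""
    eligible, excluded = [], []
    for record in records:
        # Records with unknown provenance are excluded, not assumed safe.
        if record.get("label_source") == "human_review":
            eligible.append(record)
        else:
            excluded.append(record)
    return eligible, excluded


# Hypothetical records for illustration.
records = [
    {"id": 1, "label": "fraud", "label_source": "human_review"},
    {"id": 2, "label": "ok", "label_source": "model_prediction"},
    {"id": 3, "label": "ok"},
    {"id": 4, "label": "ok", "label_source": "human_review"},
]

eligible, excluded = split_by_provenance(records)
print([r["id"] for r in eligible])  # [1, 4]
print([r["id"] for r in excluded])  # [2, 3]
```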
FAQ
What’s the fastest way to spot a prototype-only vendor?
They cannot show an evaluation plan, a monitoring approach, or a rollback story.
What deliverables should be written into the contract?
Evaluation plan/report, regression testing plan, monitoring and alerting plan, model cards, dataset datasheets, and an operational runbook.
If the project includes LLMs, what extra proof matters?
A GenAI threat model and mitigations aligned to common LLM risks (OWASP LLM Top 10), plus tests for groundedness and safe refusals.
How do I compare two vendors fairly?
Give both the same representative dataset and acceptance criteria, then score evaluation rigor, operational plan, and documentation quality, not UI polish.
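Scoring is easier to keep fair when the weights are fixed before any proposal is read. A minimal sketch of a weighted rubric follows, assuming a 1 to 5 scale. The weights and scores are invented for illustration, and UI polish is left out of the rubric on purpose.

```python
# Hypothetical weights, agreed before reviewing proposals; they sum to 1.0.
WEIGHTS = {
    "evaluation_rigor": 0.40,
    "operational_plan": 0.35,
    "documentation_quality": 0.25,
}


def weighted_score(scores, weights=WEIGHTS):
    """Combine 1-5 criterion scores into one weighted score."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return round(sum(weights[name] * scores[name] for name in weights), 2)


# Invented scores for two hypothetical vendors.
vendor_a = {"evaluation_rigor": 4, "operational_plan": 3, "documentation_quality": 5}
vendor_b = {"evaluation_rigor": 3, "operational_plan": 5, "documentation_quality": 3}

print(weighted_score(vendor_a))  # 3.9
print(weighted_score(vendor_b))  # 3.7
```

Raising an error on a missing criterion keeps a vendor from being scored on a partial submission.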
Summary
- Assess AI vendors by production evidence: evaluation, pipelines, monitoring, incident readiness, and rollback, not demos.
- Require standard artifacts: model cards, dataset datasheets, evaluation reports, and an MLOps runbook.
- For GenAI work, insist on OWASP LLM risk coverage and NIST-aligned risk management for governance and safety.