Code Generation Evaluations
Programmatically test the quality of LLM-generated code using AVM
Objective: Validate LLM-generated code across diverse inputs by executing it in AVM’s secure, isolated environments.
An eval is a test harness that assesses code produced by an LLM. AVM enables you to generate code via an LLM, run it securely on a mesh of peer-operated nodes, and verify output correctness against expected results.
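At its core, an eval iterates over input/expected pairs and records whether the generated code reproduced each expected output. A minimal sketch of that shape in TypeScript (the names below are illustrative, not part of any AVM SDK):

```typescript
// Illustrative types only; not part of any AVM SDK.
interface EvalCase {
  input: string;     // raw input handed to the generated function
  expected: unknown; // output the harness asserts against
}

interface EvalOutcome {
  testCase: EvalCase;
  actual: unknown;   // what the generated code actually produced
  passed: boolean;   // true when actual deep-equals expected
}
```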
Scenario: Robust Code Validation
You need to ensure that functions transforming CSV to JSON handle edge cases and schema variations before deployment.
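Concretely, the eval's test cases might cover quoted fields, missing values, and header-only files. The cases below are hypothetical examples of such inputs, each paired with the output a correct transformation should produce:

```typescript
// Hypothetical edge cases for a csv_to_json transformation.
const csvCases = [
  {
    name: "quoted field containing a comma",
    csv: 'name,city\n"Doe, Jane",Berlin',
    expected: [{ name: "Doe, Jane", city: "Berlin" }],
  },
  {
    name: "missing trailing value",
    csv: "name,city\nAlice,",
    expected: [{ name: "Alice", city: "" }],
  },
  {
    name: "header row only",
    csv: "name,city",
    expected: [],
  },
];
```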
Solution: AVM-Powered Eval
- **Generate Code**: Prompt an LLM to produce transformation functions.
- **Execute Securely**: Use AVM's `runPython` tool to run untrusted code in sandboxed containers.
- **Assert Results**: Compare outputs against predefined JSON schemas in the same workflow, as shown in the example below.
Example (TypeScript)
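The sketch below wires the three steps together: it asks an LLM for a Python `csv_to_json` function, runs that untrusted code through AVM's `runPython` tool, and asserts the printed JSON against each expected result. The `generateCode` and `runPython` callbacks are placeholders for your LLM client and the actual AVM SDK call, whose signatures may differ.

```typescript
import assert from "node:assert/strict";

// Shape of one test case (mirrors the csvCases defined earlier).
interface CsvCase {
  name: string;
  csv: string;
  expected: unknown;
}

// Placeholders for external calls -- substitute your LLM client and the
// real AVM runPython invocation here.
type GenerateCode = (prompt: string) => Promise<string>; // returns Python source
type RunPython = (source: string) => Promise<string>;    // returns the script's stdout

export async function evalCsvToJson(
  cases: CsvCase[],
  generateCode: GenerateCode,
  runPython: RunPython,
): Promise<{ passed: number; failed: number }> {
  // 1. Generate the transformation function once per eval run.
  const fnSource = await generateCode(
    "Write a Python function csv_to_json(text: str) that parses CSV text " +
      "and returns a list of row dicts. Return only the code.",
  );

  let passed = 0;
  let failed = 0;

  for (const c of cases) {
    // 2. Execute the untrusted code in an AVM sandbox. A JSON string literal
    //    is also a valid Python string literal, so the CSV input can be
    //    embedded directly into the script.
    const script = [
      fnSource,
      "import json",
      `print(json.dumps(csv_to_json(${JSON.stringify(c.csv)})))`,
    ].join("\n");

    const stdout = await runPython(script);

    // 3. Assert the parsed output matches the expected JSON structure.
    try {
      assert.deepEqual(JSON.parse(stdout.trim()), c.expected);
      passed++;
    } catch {
      console.error(`FAIL: ${c.name}`);
      failed++;
    }
  }

  return { passed, failed };
}
```

Keeping generation, execution, and assertion in one function makes it straightforward to rerun the same harness against different prompts or models.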
Next Steps
- Expand to multi-language evals (JS, Rust).
- Integrate into CI/CD pipelines via AVM API.
- Benchmark multiple LLMs under identical test harnesses (see the sketch below).
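Building on the harness above, benchmarking several models is a loop over the same cases. The model names and `makeGenerator` factory are placeholders for however your LLM client selects a model.

```typescript
// Reuses evalCsvToJson, csvCases, GenerateCode, and RunPython from the
// example above. `makeGenerator` is a hypothetical factory that binds the
// LLM call to a specific model.
async function benchmarkModels(
  models: string[],
  makeGenerator: (model: string) => GenerateCode,
  runPython: RunPython,
): Promise<void> {
  for (const model of models) {
    const { passed, failed } = await evalCsvToJson(
      csvCases,
      makeGenerator(model),
      runPython,
    );
    console.log(`${model}: ${passed} passed, ${failed} failed`);
  }
}
```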