Previously, eval runs across modes used different dataset shuffles.
This change enables a strict apples-to-apples perf comparison of the
different khoj modes across the same (random) subset of questions by
using a dataset seed per workflow run to sample questions.
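
As a rough illustration of seeded sampling, here is a minimal sketch. The function name `sample_questions`, the placeholder dataset, and the seed value are assumptions made for illustration, not the actual eval script.

```python
import random


def sample_questions(questions, sample_size, seed):
    """Return the same pseudo-random subset every time the same seed is used."""
    # A local RNG keeps the sampling reproducible without touching global random state.
    rng = random.Random(seed)
    return rng.sample(questions, min(sample_size, len(questions)))


# Hypothetical dataset and seed; in the workflow one seed would be shared per run.
questions = [f"question-{i}" for i in range(1000)]
run_seed = 42

subsets = {
    mode: sample_questions(questions, 200, run_seed)
    for mode in ("general", "default", "research")
}

# Every mode is evaluated on an identical subset of questions.
assert subsets["general"] == subsets["default"] == subsets["research"]
```

Because the seed, not the mode, determines the subset, scores across modes are comparable within a workflow run.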
- CI runners don't have GPUs
- PyTorch-related NVIDIA CUDA packages are not required for testing,
evals or pre-commit checks.
- Avoiding these massive downloads should speed up workflow runs.
The github run_eval workflow sets OPENAI_BASE_URL to an empty string.
The ai model api created during initialization for openai models then
gets its base url set to an empty string rather than None or the actual
openai base url. As a result, llm calls go to the empty string base url
instead of the default openai api base url, and fail.
The fix is to map empty base urls to the actual openai api base url.
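
The mapping could look roughly like the sketch below. The helper name `resolve_base_url` is hypothetical and not taken from the khoj codebase. Treating whitespace-only values as empty is an extra assumption.

```python
import os

# Default base URL of the OpenAI API.
DEFAULT_OPENAI_BASE_URL = "https://api.openai.com/v1"


def resolve_base_url(raw):
    """Map None, empty or whitespace-only values to the default OpenAI base URL."""
    if raw is None or not raw.strip():
        return DEFAULT_OPENAI_BASE_URL
    return raw.strip()


# An unset or empty OPENAI_BASE_URL resolves to the default instead of "".
api_base_url = resolve_base_url(os.getenv("OPENAI_BASE_URL"))
```

A non-empty value, such as the URL of an openai compatible server, passes through unchanged.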
- Control auto read webpage via eval workflow. Prefix the env var with KHOJ_.
Default to false, as that is the default that will be used in prod
going forward.
- Set openai api key via input param in manual eval workflow runs
- Simplify evaluating other chat models available over openai
compatible api via eval workflow.
- Mask input api key as secret in workflow.
- Discard unnecessary null setting of env vars.
- Control randomization of samples in eval workflow.
If randomization is turned off, it'll take the first SAMPLE_SIZE
items from the eval dataset instead of a random collection of
SAMPLE_SIZE items.
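
A minimal sketch of the randomization toggle, assuming a boolean env var read by a hypothetical `env_flag` helper. The helper names, the set of truthy strings, and the `EXAMPLE_RANDOMIZE` variable name are illustrative assumptions. Only SAMPLE_SIZE comes from the text above.

```python
import os
import random


def env_flag(name, default=False):
    """Read a boolean env var; an unset or empty value falls back to the default."""
    value = os.getenv(name, "").strip().lower()
    if not value:
        return default
    return value in ("1", "true", "yes")


def pick_samples(dataset, sample_size, randomize, seed=None):
    """Pick a random subset when randomize is on, else the first sample_size items."""
    size = min(sample_size, len(dataset))
    if randomize:
        return random.Random(seed).sample(dataset, size)
    return dataset[:size]


# Hypothetical env var name; SAMPLE_SIZE defaults to 200 here for illustration only.
randomize = env_flag("EXAMPLE_RANDOMIZE", default=True)
sample_size = int(os.getenv("SAMPLE_SIZE") or 200)
```

With randomization off, `pick_samples(list(range(10)), 3, randomize=False)` returns `[0, 1, 2]`, the first three items in dataset order.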
Set env vars to empty if their condition is not met, so that:
- Terrarium (not e2b) is used as code sandbox on release triggered eval
- Internet is turned off for math500 eval
Khoj is reaching >94% in research mode on SimpleQA. When answers can be
researched online, the eval becomes too easy. The FRAMES eval does a
more thorough job of evaluating that use-case anyway.
- Specify E2B api key and template to use via env variables
- Try to load and use the e2b library when the E2B api key is set
- Fall back to trying the terrarium sandbox otherwise
- Enable more python packages in e2b sandbox like rdkit via custom e2b template
- Use Async E2B Sandbox
- Parallelize file IO with sandbox
- Add documentation on how to enable E2B as code sandbox instead of Terrarium
- Evaluate khoj on 200 random questions from each of the google frames and openai simpleqa benchmarks across *general*, *default* and *research* modes
- Run eval with Gemini 1.5 Flash as test giver and Gemini 1.5 Pro as test evaluator models
- Trigger eval workflow on release or manually
- Make dataset, khoj mode and sample size configurable when triggered via manual workflow
- Enable web search and webpage read tools during evaluation
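
The code sandbox selection described in the E2B items above (use e2b when its api key is set and the library loads, otherwise fall back to terrarium) might be sketched as follows. The env var name `E2B_API_KEY`, the module name `e2b_code_interpreter`, and the function `choose_sandbox` are assumptions for illustration. The sketch only decides which sandbox to use and does not call either one.

```python
import importlib.util
import os


def choose_sandbox(env=None):
    """Return "e2b" if an api key is set and the library is importable, else "terrarium"."""
    env = os.environ if env is None else env
    # Assumed env var name for the E2B api key.
    has_key = bool(env.get("E2B_API_KEY", "").strip())
    # Assumed module name for the E2B Python SDK; find_spec returns None if it is absent.
    has_lib = importlib.util.find_spec("e2b_code_interpreter") is not None
    if has_key and has_lib:
        return "e2b"
    return "terrarium"
```

An empty api key selects terrarium, which matches how the eval workflow blanks env vars to keep terrarium as the sandbox on release triggered runs.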