STEM/Coding Experts Needed: Build Research Tasks for AI Evaluation
I need help building realistic, terminal-based STEM research tasks used to evaluate frontier AI models (GPT, Gemini, etc.).
What you'll build:
A self-contained coding task that looks like real research work (analyzing datasets, running simulations, validating hypotheses, comparing methods), not a textbook problem.
Each submission must include:
instruction.md (workflow, inputs, outputs, success criteria)
Reproducible Docker environment with data
Oracle solution (solve.sh) that fully solves the task
Deterministic tests for verification
task.toml metadata
All packaged into one zip
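If it helps, here is a minimal sketch of a pre-submission check that confirms the zip contains the three files named above. It assumes the files may sit at any depth in the archive and matches on file name only; the helper names `missing_required` and `check_zip` are hypothetical, and the Docker environment and tests are not checked because their file names are not specified here.

```python
import zipfile
from pathlib import PurePosixPath

# File names taken from the submission list above; nothing else is assumed.
REQUIRED = {"instruction.md", "solve.sh", "task.toml"}


def missing_required(names):
    """Return the required file names absent from an iterable of archive paths.

    Assumption: a file counts as present at any depth, so only the final
    path component is compared.
    """
    present = {PurePosixPath(name).name for name in names}
    return sorted(REQUIRED - present)


def check_zip(path):
    """Open a zip archive and list which required files it lacks."""
    with zipfile.ZipFile(path) as archive:
        return missing_required(archive.namelist())
```

Calling `check_zip("my_task.zip")` (the archive name is a placeholder) returns an empty list when all three files are present, and otherwise the names still missing.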
Quality bar:
Multi-step, research-grade workflow
Hard enough that frontier models fail more than 80% of the time
Oracle passes local tests in 3 out of 3 runs
Objectively verifiable outputs
No LLM-generated content allowed
Who's a fit:
STEM background (biology, chemistry, physics, ML, data science, etc.) with strong Python and Docker skills.
Payout: $100 per accepted submission.