Join our team!
Coders/Software Engineers:
Harp: A highly sophisticated coding project that involves stumping AI models.
Code Human: This project puts expert annotators in the driver’s seat. They prompt language models to take real, agent-like actions inside an existing codebase, work that closely mirrors what the models are evaluated on in practice. For each task, two models attempt the same objective. The annotator compares the two executions, picks the stronger one, and explains clearly and precisely why it wins. If you are sharp with code, exercise sound judgment, and write clear reasoning, this is exactly the kind of work for you!
PR Writer w/ Feedback: Evaluate an AI model as a software engineer by having it implement a scoped task in a real git codebase, then iteratively reviewing and refining its work, as you would a PR, until it meets production standards. You assess not just correctness but engineering quality (design, tests, edge cases, commits, and review readiness) and provide structured comparative feedback on model performance.
Other:
Prism: Create prompts, with accompanying rubrics, that stump 2 of the 3 leading state-of-the-art (SOTA) models.
ATC Transcription: We’re looking for contributors with strong transcription skills; aviation experience is a plus.