The current state of the art on HumanEval is LDB (O1-mini, using seed programs from Reflexion); a full comparison spanning 142 papers with code is available.
In addition to the EvalPlus leaderboards, it is recommended to build a comprehensive picture of LLM coding ability from a diverse set of benchmarks and leaderboards.
Nov 28, 2023: It is a specialized model whose training data consists of 87% code and 13% natural language (textbooks).
HumanEval Coding Leaderboard: a comparison of pre-trained proprietary and open-source models for code generation.
The leaderboard for the HumanEval benchmark ranks models by the pass@1, pass@10, and pass@100 metrics, which estimate the probability that at least one of k sampled completions passes a problem's unit tests.
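For reference, pass@k is usually computed with the unbiased estimator from the HumanEval paper: sample n completions per problem, count the c that pass the unit tests, and estimate the chance that at least one of k draws passes. A minimal sketch (function name is illustrative):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for one problem.

    n: total completions sampled
    c: completions that pass all unit tests
    k: sample budget being scored
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For example, `pass_at_k(n=200, c=37, k=10)` gives the estimated pass@10 for one problem; leaderboard numbers average this estimate across all problems.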
Jun 18, 2024: We open-source all the artifacts of BigCodeBench, including the tasks, test cases, evaluation framework, and leaderboard.
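Under the hood, HumanEval-style evaluation frameworks typically assemble the task prompt, the model's completion, and the task's test suite into one program and run it in an isolated process with a timeout. The sketch below shows that pattern in simplified form; it follows the HumanEval task fields (prompt, test, entry_point) but is not BigCodeBench's actual implementation, and real harnesses add sandboxing, resource limits, and parallel execution.

```python
import multiprocessing

def _run(program: str, result) -> None:
    # Execute the assembled program; any uncaught exception means failure.
    try:
        exec(program, {"__name__": "__exec__"})
        result.append("passed")
    except BaseException as e:
        result.append(f"failed: {type(e).__name__}")

def check_correctness(prompt: str, completion: str, test: str,
                      entry_point: str, timeout: float = 3.0) -> str:
    # Concatenate prompt + completion + test suite, then call the checker
    # on the task's entry-point function, mirroring the HumanEval format.
    program = prompt + completion + "\n" + test + f"\ncheck({entry_point})\n"
    manager = multiprocessing.Manager()
    result = manager.list()
    proc = multiprocessing.Process(target=_run, args=(program, result))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        return "timed out"
    return result[0] if result else "failed: no result"
```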
Apr 30, 2024: We re-evaluated the accuracy of three agents claimed to occupy top spots on the HumanEval leaderboard: LDB, LATS, and Reflexion.
Related leaderboards and benchmarks: EvalPlus Leaderboard, Big Code Models Leaderboard, Chatbot Arena Leaderboard, CrossCodeEval, ClassEval, CRUXEval, Code Lingua, Evo-Eval, HumanEval.
The LeaderBoard is a demo for evaluating and comparing the performance of language models on code generation tasks.