Researchers from AMD and Johns Hopkins University have developed Agent Laboratory, an artificial intelligence framework that automates core aspects of the scientific research process. The system uses large language models to handle literature reviews, experimentation, and report writing, producing both code repositories and research documentation. The framework has demonstrated an 84% reduction in research costs compared to existing autonomous methods while maintaining research quality standards.
The system processes research ideas through a three-stage pipeline, with researchers providing feedback at each phase. In the initial phase, agents independently gather and analyze research papers. This transitions to a collaborative stage where agents plan experiments and prepare datasets.
The final phase automates the experimentation process and generates detailed research documentation. Testing across several language models showed that the framework produced its best results when powered by o1-preview. “Our generated machine learning code matched state-of-the-art performance benchmarks,” noted the research team in their findings. The researchers also confirmed that human oversight at each stage played a vital role in improving the quality of the final output.
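In outline, the workflow is three stages with a researcher checkpoint between them. The following is a minimal Python sketch of that pattern; every function name is invented for illustration and the stage bodies are stubs, not code from the Agent Laboratory repository.

```python
# Illustrative sketch of a three-phase research pipeline with human
# feedback checkpoints. All names are hypothetical placeholders.

def literature_review(idea: str) -> str:
    # Phase 1: agents gather and summarize relevant papers (stubbed here).
    return f"Summary of prior work related to: {idea}"

def plan_experiments(idea: str, review: str, feedback: str) -> str:
    # Phase 2: agents collaboratively plan experiments and prepare data.
    return f"Experiment plan for '{idea}' (feedback: {feedback or 'none'})"

def run_experiments_and_report(plan: str, feedback: str) -> str:
    # Phase 3: experiments are executed and a report is generated.
    return f"Report based on: {plan} (feedback: {feedback or 'none'})"

def human_checkpoint(stage: str, artifact: str) -> str:
    """Pause so a researcher can review a stage's output before continuing."""
    print(f"--- {stage} ---\n{artifact}\n")
    return input("Feedback (press Enter to approve): ")

def run_pipeline(idea: str) -> str:
    review = literature_review(idea)
    fb = human_checkpoint("Literature review", review)
    plan = plan_experiments(idea, review, fb)
    fb = human_checkpoint("Experiment plan", plan)
    return run_experiments_and_report(plan, fb)
```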
The framework integrates with established tools including arXiv for literature access, Hugging Face for model implementations, and Python for experimentation, along with LaTeX for documentation.
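As an illustration of the literature-access side, the sketch below queries arXiv with the third-party `arxiv` Python package; the choice of client and the query are assumptions for demonstration, not something the announcement specifies.

```python
# Hedged sketch: fetching candidate papers from arXiv during a literature
# review step. Requires the community package: pip install arxiv
import arxiv

def search_arxiv(query: str, max_results: int = 5) -> list[dict]:
    client = arxiv.Client()
    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.Relevance,
    )
    # Collect title, abstract, and link for each matching paper.
    return [
        {"title": r.title, "summary": r.summary, "url": r.entry_id}
        for r in client.results(search)
    ]

if __name__ == "__main__":
    for paper in search_arxiv("autonomous research agents large language models"):
        print(paper["title"], "-", paper["url"])
```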
“The modular design ensures compute flexibility, accommodating diverse resource availability while maintaining efficiency in generating high-quality research artifacts,” explains the development team.
Agent Laboratory implements MLE-Solver, a component that converts research directions into functional machine learning code through an iterative refinement process. The system maintains a collection of top-performing programs that continuously improve based on task instructions, command specifications, and accumulated knowledge.
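The refinement loop can be pictured roughly as follows. This is a simplified sketch of the idea rather than the MLE-Solver implementation: the model call and the evaluation step are stubbed out, and all names are illustrative.

```python
# Simplified sketch of iterative refinement over a pool of top programs:
# pick a program, ask a model for a revision, score it, keep the best few.
import heapq
import random

def propose_revision(program: str, instructions: str) -> str:
    # Placeholder for an LLM call that edits the program according to the
    # task instructions and accumulated notes.
    return program + f"\n# revised per: {instructions}"

def evaluate(program: str) -> float:
    # Placeholder for executing the program and measuring task performance.
    return random.random()

def refine(seed_program: str, instructions: str, steps: int = 20, pool_size: int = 3):
    # Min-heap of (score, program); the weakest entry is evicted first.
    pool = [(evaluate(seed_program), seed_program)]
    for _ in range(steps):
        _, parent = random.choice(pool)
        child = propose_revision(parent, instructions)
        heapq.heappush(pool, (evaluate(child), child))
        if len(pool) > pool_size:
            heapq.heappop(pool)  # drop the lowest-scoring program
    return max(pool)  # best (score, program) pair

best_score, best_program = refine("train_model()", "improve validation accuracy")
```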
Agent Laboratory researchers conducted a comprehensive evaluation of three language models – gpt-4o, o1-mini, and o1-preview – to assess their capabilities in autonomous research generation. The o1-preview model demonstrated superior performance in perceived usefulness (4.4/5) and report quality (3.4/5), while scoring 2.9/5 in experimental quality. O1-mini achieved the highest experimental quality rating at 3.2/5 and maintained steady performance across all metrics. The gpt-4o model recorded the lowest overall scores, with 2.6/5 in experimental quality, though maintaining a strong usefulness rating of 4.0/5.
The performance analysis of Agent Laboratory reveals gpt-4o as the most efficient model, completing workflows in 1165.4 seconds at $2.33 per run, compared to o1-mini’s 3616.8 seconds at $7.51 and o1-preview’s 6201.3 seconds at $13.10. While gpt-4o demonstrated 3-5x faster execution in experiments and report writing, all models maintained high reliability with success rates above 95%. Report writing emerged as the most resource-intensive phase, with o1-preview showing the highest cost at $9.58 per report.
“The most exciting part is running experiments: The core task here is handled by a component called mle-solver, which autonomously generates machine learning code, runs experiments, and iteratively refines code,” notes Muratcan Koylan.
“I just had o1 write a major cancer treatment project based on a very specific immunological approach. It created the full framework of the project in under a minute, with highly creative aims,” reports Derya Unutmaz, Professor and biomedical scientist.
The cost-efficiency aspects have particularly impressed data science professionals. Hazm Talab, a data scientist, observes: “Very impressive to see such significant cost reductions in research through the use of LLMs with the Agent Laboratory framework.”
Learn more about Agent Laboratory’s technical implementation, documentation, and source code on GitHub.