aviary.labbench
LAB-Bench environments implemented with aviary, allowing agents to perform question answering on scientific tasks.
Installation
To install the LAB-Bench environment, run:
pip install 'fhaviary[labbench]'
Usage
In labbench/env.py, you will find:
GradablePaperQAEnvironment: a PaperQA-backed environment that can grade answers given an evaluation function.
ImageQAEnvironment: a GradablePaperQAEnvironment subclass for QA where image(s) are pre-added.
And in labbench/task.py, you will find:
TextQATaskDataset: a task dataset designed to pull down FigQA, LitQA2, or TableQA from Hugging Face, and create one GradablePaperQAEnvironment per question.
ImageQATaskDataset: a task dataset that pairs with ImageQAEnvironment for FigQA or TableQA.
Here is an example of how to use them:
import os
from ldp.agent import SimpleAgent
from ldp.alg import Evaluator, EvaluatorConfig, MeanMetricsCallback
from paperqa import Settings
from aviary.env import TaskDataset
async def evaluate(folder_of_litqa_v2_papers: str | os.PathLike) -> None:
    settings = Settings(paper_directory=folder_of_litqa_v2_papers)
    dataset = TaskDataset.from_name("litqa2", settings=settings)
    metrics_callback = MeanMetricsCallback(eval_dataset=dataset)

    evaluator = Evaluator(
        config=EvaluatorConfig(batch_size=3),
        agent=SimpleAgent(),
        dataset=dataset,
        callbacks=[metrics_callback],
    )
    await evaluator.evaluate()
    print(metrics_callback.eval_means)
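The Evaluator handles batching and rollouts for you. If you want finer-grained control, you can also pull a single GradablePaperQAEnvironment out of the dataset and step it yourself. The following is a minimal sketch assuming the same LitQA2 setup as above; get_new_env_by_idx, reset, and step come from aviary's TaskDataset and Environment interfaces, and init_state/get_asv from ldp's SimpleAgent:
import os

from ldp.agent import SimpleAgent
from paperqa import Settings

from aviary.env import TaskDataset


async def rollout_single_question(folder_of_litqa_v2_papers: str | os.PathLike) -> None:
    settings = Settings(paper_directory=folder_of_litqa_v2_papers)
    dataset = TaskDataset.from_name("litqa2", settings=settings)

    # One environment per LitQA2 question; index 0 is the first question
    env = dataset.get_new_env_by_idx(0)
    agent = SimpleAgent()

    # Standard aviary rollout: reset yields the first observation and the tool set
    obs, tools = await env.reset()
    agent_state = await agent.init_state(tools=tools)
    done = False
    while not done:
        action, agent_state, _ = await agent.get_asv(agent_state, obs)
        obs, reward, done, truncated = await env.step(action.value)
The final reward reflects how the environment graded the agent's answer via its evaluation function, which is what MeanMetricsCallback aggregates in the example above.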
Image Question-Answer
This is an environment/dataset for giving PaperQA a Docs object containing the image(s) for one LAB-Bench question. It is designed as a comparison with zero-shotting the question to an LLM: instead of a single prompt, the image is put through the PaperQA agent loop.
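A rollout over the image questions follows the same pattern as the LitQA2 example above. In the sketch below, the registered dataset name "figqa" and the bare Settings() construction are assumptions; check the ImageQATaskDataset registration in labbench/task.py for the exact name and arguments:
from paperqa import Settings

from aviary.env import TaskDataset


async def peek_image_qa() -> None:
    # Assumption: "figqa" as the registered dataset name and a bare Settings()
    # with no paper_directory; confirm against labbench/task.py.
    dataset = TaskDataset.from_name("figqa", settings=Settings())

    # Each environment is an ImageQAEnvironment whose Docs object already
    # contains the question's image(s), so the PaperQA agent loop can use them.
    env = dataset.get_new_env_by_idx(0)
    obs, tools = await env.reset()
    # From here, stepping the environment (or running an Evaluator over the
    # dataset) is identical to the LitQA2 example above.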
References
[1] Skarlinski et al. Language agents achieve superhuman synthesis of scientific knowledge. arXiv:2409.13740, 2024.
[2] Laurent et al. LAB-Bench: Measuring Capabilities of Language Models for Biology Research. arXiv:2407.10362, 2024.