Generative AI-enhanced academic research assistants are transforming how research is conducted. By allowing users to pose research-related questions in natural language, these systems can generate structured and concise summaries supported by relevant references. However, hallucinations — unsupported claims introduced by large language models — remain a significant obstacle to fully trusting these automatically generated scientific answers.
SciHal ("Hallucination Detection for Scientific Content") invites participants to detect hallucinated claims in the answers to scientific questions generated by GenAI-powered research assistants.
The dataset comprises research-oriented questions sourced from subject matter experts, along with corresponding answers and references. The answers are produced by real-world retrieval-augmented generation (RAG) systems that index millions of published academic abstracts. Each answer is annotated to indicate whether it includes unsupported claims, i.e., claims not grounded in the provided references. Two levels of labeling will be provided: a coarse three-class scheme (entailment, neutral, contradiction) and a fine-grained scheme of 10+ categories. Teams are challenged to classify claims into the appropriate categories, with evaluation focused on the precision, recall, and F1 of detecting unsupported claims.
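To make the evaluation target concrete, the sketch below shows one plausible way to score the coarse three-class setting, assuming string labels and mapping any claim labeled other than "entailment" to the positive "unsupported" class; the label names, example data, and mapping are illustrative assumptions, not the official scoring script.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical gold and predicted coarse labels for a batch of claims.
# The actual dataset format and label strings may differ.
gold = ["entailment", "neutral", "contradiction", "entailment", "neutral"]
pred = ["entailment", "entailment", "contradiction", "entailment", "neutral"]

# Assumed mapping: a claim counts as "unsupported" whenever its label
# is not "entailment" (neutral or contradiction).
gold_binary = [label != "entailment" for label in gold]
pred_binary = [label != "entailment" for label in pred]

# Precision, recall, and F1 for the positive (unsupported) class.
precision, recall, f1, _ = precision_recall_fscore_support(
    gold_binary, pred_binary, average="binary"
)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

For the fine-grained scheme, the same call with `average=None` or `average="macro"` over the 10+ category labels would yield per-category or averaged scores, though the official metric details may differ.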