Shared Tasks: Call for Participation

Quick links

  • LongSumm: Generating Long Summaries for Scientific Documents
  • SCIVER: Verifying Scientific Claims with Evidence
  • 3C: Citation Context Classification

LongSumm 2021: The 2nd Shared Task on Generating Long Summaries for Scientific Documents

Most work on scientific document summarization focuses on generating relatively short, abstract-like summaries. While such a length constraint may be sufficient for summarizing news articles, it is far from sufficient for summarizing scientific work. In fact, such a short summary resembles an abstract more than a summary that aims to cover all the salient information conveyed in a given text. Writing such summaries requires expertise and a deep understanding of a scientific domain, as can be found in some researchers' blogs.

The LongSumm task leverages blog posts created by researchers in the NLP and machine learning communities, using these summaries as the reference summaries against which submissions are compared.

The corpus for this task includes a training set of 1,705 extractive summaries and around 700 abstractive summaries of NLP and machine learning papers. These are drawn from papers with video talks at associated conferences (TalkSumm; Lev et al., 2019) and from blogs created by NLP and ML researchers. In addition, we provide a test set of abstractive summaries. Each submission is judged against one reference (gold) summary using ROUGE and should not exceed 600 words.
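
As a rough illustration, a minimal Python sketch of enforcing the 600-word limit on a generated summary (whitespace tokenization is assumed here and only approximates the official word count):

    def truncate_summary(summary: str, max_words: int = 600) -> str:
        """Trim a generated summary to at most max_words whitespace-separated tokens."""
        words = summary.split()
        return " ".join(words[:max_words])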

This is the second year LongSumm is being hosted; the results from LongSumm @ SDP 2020 are reported on a public leaderboard. In 2021, the task will continue to expand by incorporating additional summaries.

Long Summary Task

The task is defined as follows:

  • Given: scientific papers; for a detailed description of the provided data, please see the LongSumm GitHub repository
  • Task: Generate abstractive and extractive summaries for scientific papers

Evaluation

The Long Summary Task will be scored using several ROUGE metrics that compare the system output against the gold standard summary. The intrinsic evaluation will use ROUGE-1, -2, -L, and skip-bigram metrics. In addition, a randomly selected subset of the summaries will undergo human evaluation.
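
For local development, an approximation of these scores can be computed with the rouge-score Python package (a sketch only; this is not the official evaluation pipeline):

    # pip install rouge-score
    from rouge_score import rouge_scorer

    # ROUGE-1, ROUGE-2 and ROUGE-L, mirroring the intrinsic evaluation described above.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

    def score_summary(gold_summary: str, system_summary: str) -> dict:
        """Return precision/recall/F1 for each ROUGE variant."""
        return scorer.score(gold_summary, system_summary)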

Corpus

The training data is composed of abstractive and extractive summaries. To download both datasets, and for further details, see the LongSumm GitHub repository.

Organizers

Guy Feigenblat, IBM Research AI

Michal Shmueli-Scheuer, IBM Research AI

Contact

Please contact shmueli@il.ibm.com and guyf@il.ibm.com with questions about this shared task.


SCIVER: Verifying Scientific Claims with Evidence

Due to the rapid growth of the scientific literature, it is difficult for scientists to stay up to date on the latest findings. This challenge is especially acute during pandemics, due to the risk of making decisions based on outdated or incomplete information. There is a need for AI systems that can help scientists cope with information overload and support scientific fact-checking and evidence synthesis.

In the SCIVER shared task, participants will build systems that:

  1. Take a scientific claim as input
  2. Identify all relevant abstracts in a large corpus
  3. Label them as Supporting or Refuting the claim
  4. Select sentences as evidence for the label

Here’s a live demo of what such a system could do.
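
To make the expected input and output concrete, here is a minimal sketch of such a pipeline; the token-overlap heuristic and the fixed SUPPORT label are toy placeholders, not a provided baseline:

    from typing import Dict, List

    def token_overlap(claim: str, sentence: str) -> int:
        """Toy relevance signal: number of shared lowercase tokens."""
        return len(set(claim.lower().split()) & set(sentence.lower().split()))

    def verify_claim(claim: str, corpus: Dict[str, List[str]]) -> dict:
        """Sketch of the four steps above; corpus maps doc_id -> abstract sentences."""
        evidence = {}
        for doc_id, sentences in corpus.items():              # step 2: find relevant abstracts
            rationale = [i for i, s in enumerate(sentences)
                         if token_overlap(claim, s) >= 3]     # step 4: pick evidence sentences
            if rationale:
                evidence[doc_id] = {"sentences": rationale,
                                    "label": "SUPPORT"}       # step 3: placeholder label
        return {"evidence": evidence}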

Registration and Participation details

To register, please send an email to the organizers at sciver-info@allenai.org with:

  • Team name
  • Participant (full) name(s)
  • Participant affiliation(s)
  • Email(s) for primary contact(s)

Details:

  • All data and submission portals are already publicly available; registration is simply to help us keep in contact. Please register before March 17, 2021.
  • Participants will make submissions to the public leaderboard, which allows one submission per week. Please make sure to finish your leaderboard submissions by March 22, 2021. The leaderboard will remain open afterwards, but our subsequent report on shared task findings will reflect results up to this point.
  • We invite all participants to submit papers to the SDP 2021 workshop for peer review and publication within the ACL Anthology.
    • Title and Abstract submission deadline is March 17, 2021 (11:59 UTC-12)
    • Paper submission deadline is March 22, 2021 (11:59 UTC-12)
    • Please make your submission on SoftConf.

Dataset

We will use the SciFact dataset of 1,409 expert-annotated biomedical claims verified against 5,183 abstracts from peer-reviewed publications. Download the full dataset here. You can also find baseline models and starter code on the GitHub repo. Find out more details from the EMNLP 2020 paper.

For each claim, we provide:

  • A list of abstracts from the corpus containing relevant evidence.
  • A label indicating whether each abstract Supports or Refutes the claim.
  • All evidence sets found in each abstract that justify the label. An evidence set is a collection of sentences that, taken together, verifies the claim. An evidence set may consist of a single sentence or multiple sentences.

An example of a claim paired with evidence from a single abstract is shown below.

            
           {
             "id": 52,
             "claim": "ALDH1 expression is associated with poorer prognosis for breast cancer primary tumors.",
             "evidence": {
                "11": [                     // 2 evidence sets in document 11 support the claim.
                   {"sentences": [0, 1],    // Sentences 0 and 1, taken together, support the claim.
                    "label": "SUPPORT"},
                   {"sentences": [11],      // Sentence 11, on its own, supports the claim.
                    "label": "SUPPORT"}
                ],
                "15": [                     // A single evidence set in document 15 supports the claim.
                   {"sentences": [4], 
                    "label": "SUPPORT"}
                ]
             },
             "cited_doc_ids": [11, 15]
           }
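
Assuming the claims are distributed as a JSON-lines file with one claim object per line (in the structure shown above), they can be inspected as follows; the file name is illustrative:

    import json

    # "claims_train.jsonl" is an illustrative file name; see the GitHub repo for the released files.
    with open("claims_train.jsonl") as f:
        claims = [json.loads(line) for line in f]

    for claim in claims:
        for doc_id, evidence_sets in claim.get("evidence", {}).items():
            for ev in evidence_sets:
                print(claim["id"], doc_id, ev["label"], ev["sentences"])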
            
           

Submission

We will use the SciFact public leaderboard as the official submission portal for the SciVER task. Please read the online instructions for how to make submissions.

The final evaluation will use a set of test claims whose relevant abstracts, labels, and evidence are withheld; the claims are verified against the same released corpus. For each claim, the system is expected to predict which abstracts contain relevant evidence. Each predicted abstract must be annotated with the following two pieces of information:

  • A label indicating whether the abstract Supports or Refutes the claim.
  • A list of evidence sentences from the abstract that justify the label. For simplicity, we only require predicting evidence sentences, not whole evidence sets.

An example prediction is shown below:

              
          {
              "id": 52,
              "evidence": {
                  "11": {
                      "sentences": [1, 11, 13],   // Predicted rationale sentences.
                      "label": "SUPPORT"          // Predicted label.
                  },
                  "16": {
                      "sentences": [18, 20],
                      "label": "REFUTES"
                  }
              }
          }              
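
Assuming the leaderboard accepts one such prediction object per line (JSON lines), a sketch for serializing predictions:

    import json

    predictions = [
        {"id": 52,
         "evidence": {"11": {"sentences": [1, 11, 13], "label": "SUPPORT"}}},
    ]

    # One JSON object per claim, one claim per line; the file name is illustrative.
    with open("predictions.jsonl", "w") as f:
        for pred in predictions:
            f.write(json.dumps(pred) + "\n")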
              
            

Evaluation

Two evaluation metrics will be used. For a full description, see Section 4 of the SciFact paper.

Abstract-level evaluation

Abstract-level evaluation is similar to the FEVER score, described in the FEVER paper (Thorne et al., 2018). A predicted abstract is Correct if:

  1. The predicted abstract is a relevant abstract.
  2. The abstract's predicted Support or Refute label matches its gold label.
  3. The abstract's predicted evidence sentences contain at least one full gold evidence set. Following the FEVER score, the number of predicted sentences is limited to 3.

We then compute the F1 score over all predicted abstracts.
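
A sketch of the abstract-level correctness check for a single predicted abstract (not the official scorer); gold_sets is the list of gold evidence sets for that abstract, or empty if the abstract is not relevant:

    def abstract_correct(pred_sentences, pred_label, gold_sets, gold_label) -> bool:
        """Criteria 1-3 above for one predicted abstract."""
        if not gold_sets:                              # 1. abstract must actually be relevant
            return False
        if pred_label != gold_label:                   # 2. predicted label must match the gold label
            return False
        predicted = set(pred_sentences[:3])            # at most 3 predicted sentences are counted
        return any(set(ev) <= predicted for ev in gold_sets)  # 3. cover at least one full gold set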

Sentence-level evaluation

Sentence-level evaluation scores the correctness of the individual predicted evidence sentences. A predicted sentence is Correct if:

  1. The abstract containing the sentence is labeled correctly as Support or Refute.
  2. The sentence is part of some gold evidence set.
  3. All other sentences in that same gold evidence set are also identified by the model as evidence sentences.

We then compute the F1 score over all predicted evidence sentences.
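
And a corresponding sketch of the sentence-level check for one predicted evidence sentence (again, not the official scorer):

    def sentence_correct(sentence, pred_sentences, pred_label, gold_sets, gold_label) -> bool:
        """Criteria 1-3 above for one predicted evidence sentence."""
        if pred_label != gold_label:                   # 1. the abstract's label must be correct
            return False
        for ev in gold_sets:
            if sentence in ev:                         # 2. sentence belongs to some gold evidence set
                return set(ev) <= set(pred_sentences)  # 3. the whole gold set must be predicted
        return False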

Here's a simple step-by-step example showing how these metrics are calculated.

Timeline

  • Train & public test set release – December 14, 2020 (registration opens)
  • Open online leaderboard for blind test submissions – January 26, 2021
  • All Title & Abstract submissions due (registration closes) – March 17, 2021 (23:59 UTC-12)
  • All paper submissions due – March 22, 2021 (23:59 UTC-12)
  • Notification of acceptance – April 15, 2021
  • Camera-ready papers due – April 26, 2021
  • Workshop – June 10, 2021

Organizers

Dave Wadden, University of Washington

Kyle Lo, Allen Institute for Artificial Intelligence (AI2)

Iz Beltagy, Allen Institute for Artificial Intelligence (AI2)

Anita de Waard, Elsevier, USA

Tirthankar Ghosal, Indian Institute of Technology Patna, India

Contact

If you have any questions about this shared task, please reach out via email to sciver-info@allenai.org or dwadden@cs.washington.edu and kylel@allenai.org.

References

  1. Wadden, D., Lin, S., Lo, K., Wang, L.L., Zuylen, M.V., Cohan, A., & Hajishirzi, H. "Fact or Fiction: Verifying Scientific Claims." EMNLP (2020).
  2. Thorne, J., Vlachos, A., Christodoulopoulos, C., & Mittal, A. "FEVER: a large-scale dataset for Fact Extraction and VERification." NAACL (2018).

3C Citation Context Classification

Recent years have witnessed a massive increase in the amount of scientific literature and research data published online, shedding light on advancements across many domains. The introduction of aggregator services like CORE [1] has enabled unprecedented levels of open access to scholarly publications. The availability of the full text of research documents makes it possible to extend bibliometric studies by identifying the context of citations [2]. This shared task, organized as part of SDP 2021, focuses on classifying citation contexts in research publications based on their influence and purpose.

Subtask A: A task for identifying the purpose of a citation. Multiclass classification of citations into one of six classes: Background, Uses, Compare_Contrast, Motivation, Extension, and Future.

Subtask B: A task for identifying the importance of a citation. Binary classification of citations into one of two classes: Incidental, and Influential.

Dataset

The participants will be provided with a labeled dataset of 3000 instances annotated using the ACT platform [3].

The dataset is provided in CSV format and contains the following fields:

  • Unique Identifier
  • COREID of Citing Paper
  • Citing Paper Title
  • Citing Paper Author
  • Cited Paper Title
  • Cited Paper Author
  • Citation Context
  • Citation Class Label
  • Citation Influence Label

Each citation context in the dataset contains an "#AUTHOR_TAG" label, which represents the citation that is being considered. All other fields in the dataset correspond to the values associated with the #AUTHOR_TAG. The possible values of the citation_class_label are:

  • 0 - BACKGROUND
  • 1 - COMPARES_CONTRASTS
  • 2 - EXTENSION
  • 3 - FUTURE
  • 4 - MOTIVATION
  • 5 - USES

and the possible values of citation_influence_label are:

  • 0 - INCIDENTAL
  • 1 - INFLUENTIAL

The following table shows a sample entry from the training dataset.

unique_id: 1998
core_id: 81605842
citing_title: Everolimus improves behavioral deficits in a patient with autism associated with tuberous sclerosis: a case report
citing_author: Ryouhei Ishii
cited_title: Learning disability and epilepsy in an epidemiological sample of individuals with tuberous sclerosis complex
cited_author: Joinson
citation_context: West syndrome (infantile spasms) is the common estc epileptic disorder, which is associated with more intellectual disability and a less favorable neurological outcome (#AUTHOR_TAG et al, 2003)
citation_class_label: 4
citation_influence_label: 1
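
To get started, a minimal sketch that loads the training CSV and fits a simple bag-of-words baseline for Subtask A; the file name is illustrative and the column names are assumed to match the sample entry above:

    # pip install pandas scikit-learn
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train = pd.read_csv("train.csv")                 # illustrative file name

    baseline = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),         # bag-of-words over the citation context
        LogisticRegression(max_iter=1000),           # handles the six Subtask A classes
    )
    baseline.fit(train["citation_context"], train["citation_class_label"])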

A sample training dataset can be downloaded by filling in the shared task registration form. The full training dataset will be released shortly via the Kaggle platform.

The ACL-ARC dataset [4], which is compatible with our ACT dataset, can be used by participants during the competition.

Evaluation

The evaluation will be conducted using the withheld test data containing 1000 instances. The evaluation metric used will be the F1-macro score.

$$\text{F1-macro} = \frac{1}{n} \sum_{i=1}^{n}\frac{2 \times P_i \times R_i}{P_i + R_i}$$

where $n$ is the number of classes and $P_i$ and $R_i$ are the precision and recall for class $i$.
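
Locally, the same metric can be computed with scikit-learn (a sketch; the official scoring runs on Kaggle):

    from sklearn.metrics import f1_score

    # Illustrative labels only; Subtask A uses classes 0-5, Subtask B uses 0/1.
    y_true = [0, 5, 1, 4, 0]
    y_pred = [0, 5, 2, 4, 1]

    macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
    print(round(macro_f1, 3))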

Team Registration

The shared task is hosted on the Kaggle platform. Please note that both subtasks will be hosted as separate competitions on Kaggle. Please make sure you sign in/register on Kaggle before opening the following links.

To participate in the 3C Shared Task:

  • For subtask A, please visit Kaggle citation purpose classification
  • For subtask B, please visit Kaggle citation influence classification

Submission Guidelines

Each team can participate in either subtask or in both. The submission files need to be in CSV format with the following fields:

  • For Subtask A: unique_id, citation_class_label
  • For Subtask B: unique_id, citation_influence_label

Upload your solutions via Kaggle.
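
A sketch of producing a Subtask A submission with the two required columns; the test file name and the constant prediction are placeholders:

    import pandas as pd

    test = pd.read_csv("test.csv")                     # illustrative file name
    submission = pd.DataFrame({
        "unique_id": test["unique_id"],
        "citation_class_label": [0] * len(test),       # 0 = BACKGROUND, placeholder prediction
    })
    submission.to_csv("submission.csv", index=False)   # two-column CSV, as required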

To submit your paper and code to the 3C shared task, please register here and use the 3C shared task submission link. While uploading the paper and code, please use the following naming convention:

[kaggle-team-name]_SDP2021_task_[A/B]

where A/B represents the subtask for which you are submitting. If you use the same approach for both subtasks, there is no need to write separate papers.

Timeline

  • Kaggle competition start date — February 26, 2021
  • Kaggle competition end date — April 30, 2021
  • Paper and code submission deadline — May 17, 2021 (23:59, UTC-12) (extended from the original May 10, 2021 deadline)
  • Notification of acceptance — May 25, 2021
  • Camera-ready papers due — June 3, 2021
  • Workshop — June 10, 2021

Organizers

Petr Knoth, Open University, UK

Suchetha N. Kunnath, Open University, UK

David Pride, Open University, UK

Kuansan Wang, Microsoft Research

Dasha Herrmannova, Oak Ridge National Laboratory

Contact

If you have any questions about this shared task, please contact david.pride@open.ac.uk and suchetha.nambanoor-kunnath@open.ac.uk.

References

  1. Knoth, Petr and Zdrahal, Zdenek. "CORE: three access levels to underpin open access." D-Lib Magazine 18.11/12 (2012): 1-13.
  2. Pride, David and Knoth, Petr. "Incidental or influential? — A decade of using text-mining for citation function classification." (2017).
  3. Pride, David, Knoth, Petr and Harag, Jozef. "ACT: An Annotation Platform for Citation Typing at Scale." 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL). IEEE, 2019.
  4. Jurgens, David, et al. "Measuring the evolution of a scientific field through citation frames." Transactions of the Association for Computational Linguistics 6 (2018): 391-406.


Contact: sdproc2021@googlegroups.com

Sign up for updates: https://groups.google.com/g/sdproc-updates

Follow us: https://twitter.com/SDProc
