Long Talks

Long talks are given 20 minutes to present, including questions. We suggest limiting your talk to 15 minutes and leaving 5 minutes for questions.

  1. Gene Ontology Causal Activity Models (GO-CAMs) for human biology
    Patrick Masson, Cristina Casals-Casas, Lionel Breuza, Marc Feuermann, Sylvain Poux, Pascale Gaudet, Alan Bridge, Paul D. Thomas, UniProt Consortium
    The Gene Ontology (GO) (http://geneontology.org/) provides a human and machine readable synthesis of knowledge of the molecular functions of gene products and the cellular components and biological processes in which those functions occur. Gene Ontology Causal Activity Models (GO-CAMs) (https://geneontology.cloud/home) assemble genes and their functions into causally linked activity flow models. GO-CAMs provide a human readable description of biological systems and a framework for computational systems biology approaches such as network-based ‘omics integration and analysis and graph-based machine learning. In this presentation, we describe efforts to capture knowledge of human biology using GO-CAMs, including human gene functions, microbiome gene functions, and the regulation and roles of small molecules in human systems. These efforts build on a number of foundational pillars - a large corpus of existing GO annotations for human proteins, a draft human functionome that incorporates annotations for human proteins and orthologs using phylogenetic approaches (see poster from Feuermann et al.), small molecule annotation in UniProt from the Rhea knowledgebase of biochemical reactions (www.rhea-db.org) and the mapping of Rhea and the GO in Rhea2GO, and the creation of draft GO-CAM models of human biology from human Reactome pathway models (www.reactome.org). Our approach to synthesize this knowledge in GO-CAM and enrich it with emerging knowledge from the literature has created over 250 GO-CAM models for human biology. These and other GO-CAM models from the Gene Ontology Consortium are freely available at http://noctua.geneontology.org/.
  2. Annotation of biologically relevant ligands in UniProtKB using ChEBI
    Elisabeth Coudert, Sebastien Gehant, Nicole Redaschi, Alan Bridge, The UniProt Consortium
    The UniProt Knowledgebase (UniProtKB, at www.uniprot.org) is a reference resource of protein sequences and functional annotation that covers over 200 million protein sequences from all branches of the tree of life. UniProtKB provides a wealth of information on protein sequences and their functions, including descriptions of the nature of biologically relevant ligands (also known as cognate ligands) such as substrates/products of enzymes, cofactors, activators and inhibitors, as well as their binding sites. UniProtKB captures this information through expert literature curation and from experimentally resolved protein structures in the Protein Data Bank (PDB/PDBe). Here we describe improvements to the representation of cognate ligands in binding site annotations in UniProtKB using the chemical ontology ChEBI (www.ebi.ac.uk/chebi/). In 2022, we performed a complete reannotation of all cognate ligand binding sites in UniProtKB, replacing textual descriptions of defined ligands with stable unique identifiers from the ChEBI ontology, which we now use as the reference vocabulary for all new ligand annotations. The last UniProt release includes about 800 unique cognate ligands from ChEBI, which feature in over 65 million binding site annotations and over 15 million protein sequence records. This work continues the standardization of small molecule annotation in UniProtKB, which also covers the use of Rhea (www.rhea-db.org) and ChEBI for the annotation of enzymes, transporters, and cofactors. This enhanced dataset will provide improved support for efforts to study and predict functionally relevant interactions between proteins and their cognate ligands using computational approaches. Users can access the dataset via the UniProt website, REST API, and SPARQL endpoint, which have been modified to support ligand searches using the chemical ontology and chemical structure data of ChEBI.
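    One plausible way to exercise the ligand search described above is through the UniProt SPARQL endpoint; the Python sketch below is illustrative only (the ChEBI identifier and the binding-site predicates are assumptions modelled on the public UniProt core vocabulary, not an official query from the abstract).
      # Sketch: retrieve UniProtKB proteins whose binding-site annotations reference a
      # given ChEBI ligand (CHEBI:60344, heme b, chosen only as an example). The predicate
      # names (up:annotation, up:Binding_Site_Annotation, up:ligand) follow the UniProt
      # core RDF vocabulary but should be verified against the current schema.
      import requests

      QUERY = """
      PREFIX up: <http://purl.uniprot.org/core/>
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX CHEBI: <http://purl.obolibrary.org/obo/CHEBI_>
      SELECT ?protein WHERE {
        ?protein a up:Protein ;
                 up:annotation ?site .
        ?site a up:Binding_Site_Annotation ;
              up:ligand ?ligand .
        ?ligand rdfs:subClassOf CHEBI:60344 .
      } LIMIT 10
      """

      response = requests.get("https://sparql.uniprot.org/sparql",
                              params={"query": QUERY},
                              headers={"Accept": "application/sparql-results+json"})
      response.raise_for_status()
      for row in response.json()["results"]["bindings"]:
          print(row["protein"]["value"])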
  3. Single Cell Expression Atlas and FlyBase - the Fly Cell Atlas Anatomograms – where data meets art
    Nancy George, Irene Papatheodorou, Anja Fullgrabe, Yalan Bi, Pedro Madrigal
    The Single Cell Expression Atlas knowledge base analyses curated, high quality gene expression data at the level of single cells. Data are then displayed through dimensionality reduction plots and heatmaps to show how cells cluster based on their expression profiles. However, in order to truly understand how expression profiles define cell populations and are altered by perturbations, expression profiles need to be linked to cell types and novel subpopulations identified. Thus, the single cell anatomogram project was born. Its aim is to display cell types realistically within their parent structures alongside cell types provided by the data submitter. Anatomograms are now available representing adult tissues for lung, placenta, kidney, pancreas and liver. The single cell anatomogram project is a diverse collaboration between scientific experts, curators, ontologists, artists, bioinformaticians and web developers to derive interactive images which show cells at single cell resolution within the wider context of the tissue. These images allow users to delve into lifelike representations of organ structures from macro-structures down to single cells within the tissue. Initially, artists work with curators and scientific experts, using real data such as immunohistochemistry and immunofluorescence images to create true-to-life structures and cell types 'in situ'. Cell shapes and detailed structures are layered on top of the tissue structure like a cake. Each shape is mapped either to an existing ontology term or a new term and relationships are created, in collaboration with ontologists from UBERON and Cell Ontology. This allows us to recreate the existing biological hierarchy from structure to cell type, e.g. kidney > nephron > loop of Henle > epithelial cell of loop of Henle. Once illustrations have been created, mapped to their ontology identifiers and the appropriate hierarchy defined, these constructs are then incorporated into the Single Cell Expression Atlas by our web development team. Developers ensure that the resulting images are interactive and users can select regions within a tissue to 'zoom into', down to the level of individual cells within sub-structures of that tissue. This is enabled by the standardised bioinformatics analysis pipelines that adequately merge data with curated metadata. Last, but not least, are the data and cell types inferred from expression profiles. Author-defined inferred cell types are mapped where possible to ontology terms, allowing us to identify the same cell types across multiple datasets. Once a user lands on these datasets, the anatomograms are displayed alongside a heatmap showing the top 5 genes per cell type associated with that dataset. When a cell type 'shape' is selected, it lights up to show that population, giving context to the data. Thus, anatomograms provide a new perspective on visualisation of multi-layered biological information at the single cell level.
  4. A 20 year perspective on FAIR and TRUST-worthy Human Disease Knowledge Representation.
    Lynn M Schriml, J. Allen Baron, Dustin Olley, Mike Schor, Lance Nickel
    Making human disease knowledge FAIR and TRUST-worthy is the hallmark of the Human Disease Ontology (DO) project. Strengthening the biocuration of human disease related data has driven development of this resource. Coordination of key biomedical data across large-scale biomedical resources strengthens the foundation of knowledge, supports the development of new resources and provides a venue for evolving data models to meet the demands of knowledge generation and discovery. The Human Disease Ontology began, as most ontology projects do, as a community resource to coordinate data within one or two research groups, growing over two decades into a highly utilized, international genomic resource. Serving as the nomenclature and classification standard for human diseases, the DO provides a stable, etiology-based structure integrating mechanistic drivers of human disease. In the past two decades the DO has grown from a collection of clinical vocabularies into an expertly curated resource of over 11,000 common and rare diseases linking disease concepts through more than 35,000 vocabulary cross mappings. The responsibility of becoming a community resource serving hundreds of biomedical, clinical, ontology and software resources involves the development of rigorous quality control protocols and a structured release cycle, while building trust and demonstrating reliability through expert curation of each disease term, definition and disease annotation. Expanding an ontology resource to meet evolving needs and coordinating the ever expanding disease knowledge corpus necessitate periodic reassessment and expansion of the DO's data model, hand-in-hand with a commitment to coordinated knowledge development. Here, we report on the significant changes in content, data modeling, infrastructure development, utilization of ML tools, usage and community outreach for the DO project.
  5. Leveraging crowdsourcing and curation prioritisation for maintenance of clinical gene panels
    Arina Puzriakova
    Genomics England’s PanelApp (https://panelapp.genomicsengland.co.uk/) is a knowledgebase which stores virtual gene panels relating to human conditions including rare diseases and cancer. It supports England’s NHS Genomic Medicine Service (GMS) by defining focused panels of genes with convincing evidence of disease causation. Such genes are deemed suitable for clinical genome interpretation, in turn enhancing diagnostic ability in a clinical setting. Support for gene involvement in a human disorder is derived from a combination of sources including published scientific literature and evidence from experts in the scientific and clinical communities, submitted as PanelApp gene reviews – a powerful integrated feature of the web interface. However, with the large volume of data relating to gene-disease relationships, maximising impact on scientific or clinical decision-making has become increasingly challenging. Manual curation remains integral for making final decisions on gene panel content, but due to the demands of maintaining nearly 200 panels, PanelApp curators have developed methods for prioritising gene-disease associations for assessment. External reviewer activity is extracted weekly, and reviews are automatically assigned categories of priority based on current and suggested gene rating classifications, ensuring evidence that may lead to changes to diagnostic-grade genes is assessed in a timely manner. Tracking tickets are created and assigned to curators based on disease specialities, enabling curators to become familiar with disease areas and specific panels. Tracking of reviews also enables concerted curation effort on the most frequently used panels, facilitating focused investigation with maximal downstream patient benefit. In parallel this highlights panels with minimal activity allowing curators to attribute this to limitations in disease knowledge or unmet community engagement needs. Active engagement of the PanelApp team with reviewer communities increases the valuable contributions from clinical specialists and aids delivery of results that are aligned with current clinical expertise. High quality panels have been shown to increase diagnostic yield in the NHS GMS, driving improved clinical utility for patients with rare diseases and cancer, and further emphasising the importance of a strategic approach for effective panel maintenance by curators.
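    To make the triage logic concrete, the sketch below shows one hypothetical way such prioritisation rules could be encoded, comparing a gene's current rating with the rating suggested by an external reviewer; the rating names and priority categories are illustrative and are not PanelApp's actual rules.
      # Hypothetical triage rule: reviews that could add or demote a diagnostic-grade
      # (green) gene are flagged as high priority for curator assessment.
      RATING_RANK = {"red": 1, "amber": 2, "green": 3}

      def review_priority(current_rating: str, suggested_rating: str) -> str:
          current = RATING_RANK[current_rating.lower()]
          suggested = RATING_RANK[suggested_rating.lower()]
          if suggested != current and 3 in (current, suggested):
              return "high"    # change involving a diagnostic-grade gene
          if suggested != current:
              return "medium"  # rating change proposed below diagnostic grade
          return "low"         # review agrees with the current classification

      print(review_priority("amber", "green"))  # -> high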
  6. Biocuration in DisProt, the manually curated database for intrinsically disordered proteins
    Maria Victoria Nugnes, Federica Quaglia, Luiggi Tenorio Ku, Maria Cristina Aspromonte, Damiano Piovesan, Silvio Tosatto
    DisProt (https://disprot.org/) is the major repository of manually curated annotations of intrinsically disordered proteins and regions from the literature. DisProt relies on both professional and community curators who are provided with a variety of material to support their curation process. These include a curation manual, interactive exercises and training sessions that evolved from virtual training sessions to eLearning courses available in English and Spanish, along with in-person training sessions. To provide comprehensive and standardized annotations, DisProt relies on the Gene Ontology (GO) and Evidence and Conclusion Ontology (ECO) and on the adoption of the Minimum Information About Disorder (MIADE) standard. Higher quality and consistency of annotations are provided by a robust reviewing process - entries are reviewed and validated by expert curators who support other curators throughout the process. DisProt curators - both community and professional ones - continuously check the literature for novel experimental evidence for proteins belonging to eukaryota, bacteria, viruses and archaea, including the model organisms represented on the DisProt home page. Finally, DisProt also focuses on thematic datasets, collections describing biological areas where IDPs play a crucial role, e.g. autophagy-related proteins and viral proteins. A new release of DisProt, covering technical and/or scientific content, is available every six months.
  7. GigaDB: Utilising ontologies to curate data publications
    Christopher Hunter, Chris Armit, Mary Ann Tuli, Yannan Fan
    GigaScience journal celebrated its 10th birthday last year, and GigaDB is technically one year older than the journal. In that time we have led the way in open data sharing and FAIRification of data. We aim to ensure every article published by GigaScience Press is fully reproducible and transparent. In order to do that, it is essential that all data units are openly available along with the manuscript. GigaDB provides a manually curated dataset to accompany each manuscript published in both the GigaScience and GigaByte (launched in 2020) journals; we collate the data files required for transparency and reproducibility and ensure data are deposited in appropriate public repositories, with links included in both the manuscript and the dataset. In addition, we provide assistance to authors in: (a) the curation of sample metadata, making use of ontologies where appropriate (e.g. ENVO, PATO, OBI, BTO, UBERON, etc.); (b) the curation of files being made available to ensure they are complete, appropriately formatted, well described, and tagged with both a file format and a data-type chosen from a controlled vocabulary based on the EDAM ontology; and (c) the inclusion of specific Subject tags to each dataset which are chosen from a slim set of SRAO terms. Moving forwards, we hope to enable better integration of our datasets with other external resources by exposing the ontology term usage in GigaDB as metatags on each dataset webpage to assist researchers creating knowledge graphs.
  8. Join the International Society for Biocuration Community
    Ruth Lovering, Rama Balakrishnan, Susan Bello, Parul Gupta, Robin Haw, Tarcisio Mendes de Farias, Sushma Naithani, Federica Quaglia, Raul Rodriguez-Esteban, Mary Ann Tuli, Nicole Vasilevsky
    When the International Society for Biocuration (ISB) was established in 2009, it set out to promote the field of biocuration and provide a forum for information exchange and networking through meetings and workshops. Our members are not only data curators but also ontologists and data and software developers, and are distributed across the globe, representing a wide range of geographical, national, linguistic, and cultural backgrounds. The ISB executive committee is composed of 11 members, and the society is now registered as a non-profit organisation. Over the past 13 years, the ISB has supported several workshops, recognised excellence in biocurators by awarding 8 career and 3 life-time achievement awards and supported 12 conferences, with online conferences for 2 years. While the ISB's support for professional networking and building collaboration is still relevant and of ongoing importance, we have broadened its scope by addressing issues concerning Equity, Diversity and Inclusion, training the workforce, career support and outreach to our community and users of curated data. More recently, we have introduced new awards and grants to support and encourage biocurators at all stages of their professional careers. We invite the biocuration community to join our efforts by registering with our mailing list and becoming an ISB member. At the 16th International Biocuration Conference, we will explain how you can benefit from ISB activities. Please visit our poster to find out more, ask questions and suggest new areas for the ISB to prioritise.
  9. Human-pathogen interaction networks: IMEx's approach to the contextual metadata of the experimental evidence.
    Kalpana Panneerselvam, Pablo Porras, Noemi del-Toro, Margaret Duesbury, Livia Perfetto, Luana Licata, Anjali Shrivastava, Eliot Ragueneau, Juan Jose Medina Reyes, Sandra Orchard, Henning Hermjakob
    Host-pathogen interaction maps offer a scaffold for understanding the biological processes behind the pathogenicity of the microbe and also help in identifying potential drug targets. Access to the experimental evidence and contextual metadata which influence the interaction outcome is critical for accurate interpretation of molecular mechanisms. The IMEx consortium (www.imexconsortium.org) is an international database collaboration that exists to record molecular interaction data from scientific literature and direct depositions and make it freely accessible to the public, using an Open Access, Open-Source model. IMEx curators use a detailed representation of experimental evidence in the scientific literature: details of the affinity tags used, the mutations influencing the interaction outcome, variable experimental conditions, chemical and biological inhibitors, agonists and antagonists and required post-translational modifications, all using standard ontologies and controlled vocabularies to add more value towards the interpretation of scientific experiments and the biology behind them. In the context of host-pathogen interactions, more than 33K binary interactions of human proteins with bacterial and viral proteins are captured in IMEx. Interactions of SARS-CoV-2 and SARS coronavirus with human proteins count more than 7500 binaries, and ~6300 binary interactions have been curated in IMEx for influenza-human PPIs. More than 4200 protein interactions of human proteins with the plague-causing bacterium Yersinia pestis are available in IMEx, and 3000 interactions between humans and the anthrax bacterium Bacillus anthracis have been curated. 8300 human-viral protein interactions with members of the Herpesvirus, Hepatitis and Papillomavirus families are also documented. Experimentally proven binding regions are available for more than 5000 interactions. More than 1000 interactions are shown to be affected by a mutated protein sequence, compared to the wild-type protein. IMEx's MI scoring system for molecular interactions, based on the available experimental data, is available to assess the confidence behind every interacting pair. The IMEx consortium has been generating contextual molecular networks to reflect the type of relationship between the partners and overlaying metadata for the users to analyse those factors influencing the interaction outcome. This openly available resource is an invaluable tool with immediate applications in the study of variation impact on the interactome, sub-networks generated by mutated partners, small molecules affecting the networks, interaction interfaces as drug targets, among other key questions.
  10. The Comprehensive Antibiotic Resistance Database - Curating the Global Resistome
    Brian Alcock, Amogelang Raphenya, Arman Edalatmand, Andrew McArthur
    The Comprehensive Antibiotic Resistance Database (CARD; card.mcmaster.ca) is an ontologically-driven knowledgebase and bioinformatics resource on the molecular biology and chemical components of antimicrobial resistance (AMR). This is achieved by integrating the Antibiotic Resistance Ontology (ARO) with the CARD Model Ontology (MO), which is used to organize AMR gene (ARG) sequences, resistance-conferring mutation data and bioinformatic parameters for in silico ARG detection by CARD’s Resistance Gene Identifier (RGI) software. To preserve integrity over time, the ARO is routinely updated by a biocuration team through, for example, the addition of novel AMR genes or gene variants or the revision of existing ontology branches for clarity, accuracy and/or computational efficiency. While manual curation of the literature and sequences is a cornerstone of CARD’s curation philosophy, the volume of AMR scientific literature renders this approach time-consuming and impractical. We therefore developed CARD*Shark, an algorithm and software for computer-assisted AMR literature triage. The current iteration, CARD*Shark 3, identifies and prioritizes literature for review through a machine-learning methodology, which is then reviewed by a CARD biocurator. CARD thereby integrates continuous curation from multiple approaches: computer-assisted literature triage, identification of errors and oversights through public feedback such as our GitHub repository (https://github.com/arpcard/amr_curation), and targeted curation within collaborative projects or other efforts of specific foci. To date, the ARO includes over 6500 terms which combine with 5000 ARG reference sequences and almost 2000 resistance-associated variants sourced from over 3000 publications to produce the current version of CARD. This manually curated data is used as a baseline for in silico prediction of resistomes and ARG prevalence for over 300 pathogens. Here we provide an overview of CARD’s design, curation methodology and overall content scope, and illustrate how a computer-assisted curation approach improves our efficacy and accuracy.
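    The abstract does not describe CARD*Shark 3's internals, but the general shape of machine-learning-based literature triage can be sketched as a supervised text classifier that ranks new abstracts by predicted relevance for a biocurator; the snippet below is a generic illustration (TF-IDF features with logistic regression), not the CARD*Shark implementation.
      # Generic literature-triage sketch: train on previously triaged abstracts
      # (1 = AMR-relevant, 0 = not relevant) and rank new abstracts by predicted relevance
      # so curators review the most promising papers first. The texts are toy examples.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import make_pipeline

      train_texts = [
          "Novel beta-lactamase variant confers carbapenem resistance in K. pneumoniae",
          "Ribosomal protection protein tet(M) detected in Enterococcus isolates",
          "Transcriptomic analysis of plant drought stress responses",
          "Deep learning model for protein secondary structure prediction",
      ]
      train_labels = [1, 1, 0, 0]

      model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
      model.fit(train_texts, train_labels)

      new_abstracts = ["Plasmid-borne mcr-1 gene mediates colistin resistance in E. coli"]
      for text, score in zip(new_abstracts, model.predict_proba(new_abstracts)[:, 1]):
          print(f"{score:.2f}  {text}")  # higher score = higher triage priority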
  11. Promoting the longevity of curated scientific resources through open code, open data, and public infrastructure
    Charles Tapley Hoyt, Benjamin M. Gyori
    Many model organism databases, pathway databases, ontologies, and other curated resources that support research in the life and natural sciences combine expert-curated data with surrounding software code and services. However, such resources are often maintained internally by members of a single institution and are therefore susceptible to fluctuations in funding, personnel, and institutional priorities. Too often, resources go out of date, are abandoned, or become inaccessible, for example, when a grant runs out or a key person moves on. Therefore, we need better solutions for creating resources that are less susceptible to such external factors and can continue to be used and maintained by the community that they serve. We propose a new model for the creation and maintenance of curated resources that promotes longevity through a combination of technical and social workflows, and a progressive governance model that supports and encourages community-driven curation. 1) The technical aspect of our model necessitates open data, open code, and open infrastructure. Both code and data are permissively licensed and kept together under public version control. This enables anyone to directly suggest improvements and updates. Further, automation is used for continuous integration (e.g., semi-automated curation, quality assurance) and continuous delivery (e.g., static website generation, export in multiple formats). 2) The social aspect of our model first prescribes the composition of training material, curation guidelines, contribution guidelines, and a community code of conduct that encourage and support potential community curators. Second, it requires the use of public tools for suggestions, questions, and discussion as well as social workflows like pull requests for the submission and review of changes. 3) The governance aspect of our model necessitates the division of responsibilities and authority (e.g., for reviewing/merging changes to the code/data) across multiple institutions, such that it is more robust to fluctuations in funding and personnel and can also be updated over time. It prescribes liberal attribution and acknowledgement of the individuals and institutions (both internal and external to the project) who contribute on a variety of levels (e.g., code, data, discussion, funding). More generally, our model requires that a minimal governance model is codified and instituted as early as possible in a project's lifetime. This talk will provide a perspective on how existing resources relate to our model, describe each of our model's aspects in more detail (illustrated through the Bioregistry resource), and provide a practical path towards both creating new sustainable resources as well as revitalizing existing ones.
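    As one concrete illustration of the continuous-integration component of this model, the sketch below shows a small, generic quality-assurance check that could run on every pull request against a version-controlled curated TSV file; the file name, column names and identifier patterns are hypothetical.
      # Generic CI quality-assurance sketch: fail the build if any identifier in a
      # (hypothetical) curated TSV does not match the pattern declared for its prefix.
      import csv
      import re
      import sys

      PATTERNS = {"chebi": re.compile(r"^CHEBI:\d+$"), "go": re.compile(r"^GO:\d{7}$")}

      def check(path: str) -> int:
          errors = 0
          with open(path, newline="") as handle:
              for line_no, row in enumerate(csv.DictReader(handle, delimiter="\t"), start=2):
                  pattern = PATTERNS.get(row["prefix"])
                  if pattern and not pattern.match(row["identifier"]):
                      print(f"{path}:{line_no}: malformed identifier {row['identifier']!r}")
                      errors += 1
          return errors

      if __name__ == "__main__":
          sys.exit(1 if check("curated_terms.tsv") else 0)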
  12. ACKnowledge: expanding community curation to include fact extraction using artificial intelligence
    Valerio Arnaboldi, Daniela Raciti, Kimberly Van Auken, Paul Sternberg
    Biological knowledgebases are a critical resource for researchers and accelerate scientific discoveries by providing manually curated, machine-readable data collections. However, the aggregation and manual curation of biological data is a labor-intensive process that relies almost entirely on professional biocurators. Two approaches have been advanced to help with this problem: i) a computational approach that is based on natural language processing (NLP), text mining (TM) and machine learning (ML); and ii) the engagement of researchers (community curation). However, neither of these approaches alone is sufficient to address the critical need for increased efficiency in the biocuration process. Our solution to these challenges is an NLP-enhanced community curation portal, Author Curation to Knowledgebase (ACKnowledge). The ACKnowledge system, currently implemented for the C. elegans literature, couples statistical methods and text mining algorithms to enhance community curation of research articles. Currently, the ACKnowledge system asks authors to verify five different types of entities (e.g. genes, variations) and fourteen different data types (e.g. phenotype, physical interaction) identified by our TM and ML approaches. We are now expanding the ACKnowledge system to incorporate ML models to identify sentences that describe specific types of experiments and then extract entities and concepts for more detailed author curation. Specifically, we are training classifiers to identify sentences that describe two types of experiments: 1) anatomical expression patterns and 2) protein kinase activity. To develop these models, we began by manually extracting and labeling sentences in previously curated papers. For each sentence, we note the presence or absence of specific features, e.g. gene names or synonyms, anatomy terms, or keywords such as 'phosphorylation', and label each sentence as 'positive' or 'negative' depending on whether it contains the specific data types analyzed, e.g. the sentence directly reports an experimental result or the sentence describes an experimental set up related to anatomical expression pattern or protein kinase activity. In addition, we note metadata such as the paper section in which a sentence is found. Using these sentences as training data, we will build a set of models to classify sentences based on their text similarity with sentences in the initial training set. We will transform each sentence into a vector using a pre-trained embedding model (BioSentVec - https://github.com/ncbi-nlp/BioSentVec) and will calculate their cosine similarity as a proxy of the semantic similarity of the sentences. We will then train a binary classifier based on a similarity threshold: if a sentence has cosine similarity with the average sentence in our training set that is higher than a certain threshold we classify the sentence as positive, otherwise as negative. If the performance of the sentence-level classifiers is satisfactory, we will apply the sentence classifier to papers and display the identified sentences, along with extracted entities and concepts, to authors in a table-like format where they can verify the suggested curatorial statement. Author validated statements will then be vetted by curators and integrated in knowledgebases including the Alliance of Genome Resources (https://www.alliancegenome.org).
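    The similarity-threshold classifier described above can be sketched roughly as follows; the embedding function stands in for BioSentVec and the threshold is arbitrary, so this illustrates the logic rather than the ACKnowledge pipeline itself.
      # Sketch of the sentence classifier described in the abstract: embed sentences,
      # average the positive training sentences into a centroid, and label a new
      # sentence positive if its cosine similarity to the centroid exceeds a threshold.
      # embed() is a placeholder for a BioSentVec model; the threshold would be tuned.
      import numpy as np

      def embed(sentence: str) -> np.ndarray:
          rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
          return rng.normal(size=700)  # BioSentVec vectors are 700-dimensional

      def cosine(a: np.ndarray, b: np.ndarray) -> float:
          return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

      positive_training = [
          "GFP reporter expression was observed in the intestine and body wall muscle.",
          "The kinase phosphorylated the substrate in vitro in an ATP-dependent manner.",
      ]
      centroid = np.mean([embed(s) for s in positive_training], axis=0)

      THRESHOLD = 0.5  # arbitrary; tuned on held-out data in a real system
      def classify(sentence: str) -> bool:
          return cosine(embed(sentence), centroid) >= THRESHOLD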
  13. KG-IDG: A FAIR Knowledge Graph for Illuminating the Druggable Genome
    J. Harry Caufield, Justin Reese, Tudor Oprea, Jeremy Yang, Chris Mungall
    Knowledge graphs (KGs) are representations of heterogeneous entities and their diverse relationships. Though KGs combine intuitive data models with massive data collections, their application to domain-spanning questions in biomedicine is constrained by the effort required to consistently bridge gaps between a massive array of frequently-updating resources. We sought to develop a KG in the context of the NIH-sponsored Illuminating the Druggable Genome (IDG) project with this challenge in mind. IDG research is focused on improving our understanding of the properties and functions of proteins that are currently unannotated within three commonly drug-targeted protein families: G-protein coupled receptors, ion channels, and protein kinases. Accordingly, we designed KG-IDG to integrate otherwise uncoordinated sources of drug vs. target information, e.g., drug properties and target interactions from the DrugCentral resource; protein target details from the Target Central Resource Database; diseases/phenotypes from MONDO, OMIM, Orphanet, and the HPO; along with several other data sources and ontologies. The KG-IDG graph supports intensive graph-based machine learning methods for inference of novel relationships between drugs, proteins, and diseases. KG-IDG incorporates several concrete design elements to ease its application to specific research questions. All contents are built upon the Biolink Model, a data model purpose-built for flexible representation of biological associations in property graphs. All graph versions are assembled using the Knowledge Graph Exchange (KGX) platform, with all components following a highly modular and configurable data ingest and transformation pipeline. The greatest contribution KG-IDG makes to data FAIRness, however, is its adherence to reproducibility and open-source principles: all code, transformed data, and graphs are publicly available. All graph products are distributed in easily human-readable KGX TSV format. We feel that this overall approach allows KG-IDG to serve as the foundation for future KG construction efforts while also yielding interoperable, machine readable data resources. All code for assembling KG-IDG is available at https://github.com/Knowledge-Graph-Hub/kg-idg.
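    Because all graph products are distributed as KGX TSV files, reuse can start from a few lines of code; the sketch below assumes the standard KGX column names (id and category for nodes; subject, predicate and object for edges) and hypothetical file names.
      # Sketch: load a KGX TSV distribution such as KG-IDG into NetworkX for analysis.
      import pandas as pd
      import networkx as nx

      nodes = pd.read_csv("kg-idg_nodes.tsv", sep="\t", dtype=str)   # hypothetical file names
      edges = pd.read_csv("kg-idg_edges.tsv", sep="\t", dtype=str)

      graph = nx.MultiDiGraph()
      for _, row in nodes.iterrows():
          graph.add_node(row["id"], category=row.get("category"))
      for _, row in edges.iterrows():
          graph.add_edge(row["subject"], row["object"], predicate=row["predicate"])

      print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")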
  14. The Knowledge Graph Development Kit
    David Osumi-Sutherland, Robert Court, Huseyin Kir, Ismail Ugur Bayindir, Clare Pilgrim, Shawn Zheng Kai Tan, Nicolas Matentzoglu
    The use of common biomedical ontologies to annotate data improves data findability, integration and reusability. Ontologies do this not only by providing a standard set of terms for annotation, but via the use of ontology structure to group data in biologically meaningful ways. One way to take advantage of this is via a knowledge graph in which ontologies and ontology semantics provide the glue that links annotated knowledge and data in well documented and transparently queryable ways, providing an extensible base for building APIs and applications and a potential input to machine learning. One barrier to fulfilling this potential is the lack of easily usable, standardised infrastructure for using ontologies and standard semantics to build and structure knowledge graphs in a form suitable for driving web applications. Triple stores can theoretically fulfil this role, but remain a niche technology and their standard query language (SPARQL) is not ideal for querying ontologies in OWL. Ensuring that web applications and APIs driven by knowledge graphs are easily usable by the biomedical community requires mechanisms to harness the power of semantics to label and categorise content in ways that are tailored to the application and the community. To overcome these barriers we built the Knowledge Graph Development Kit - a highly configurable containerised pipeline for integrating ontologies and curated information into easily queryable knowledge graphs. The pipeline consists of a triple-store integration layer that loads and integrates ontologies and curated content (TSV templates that the pipeline converts to RDF), and three front-end servers: an OWLERY server supporting OWL-EL queries across ontologies and knowledge graphs; a Neo4j server that supports graph queries and visualisations and provides an accessible knowledge graph representation; and a SOLR server that supports tuneable autosuggest with default settings that are optimised for ontology search and stores cached query results. A standard interconversion between OWL and Neo4j is central to this pipeline. It is optimised for human-readable Cypher queries and supports a highly configurable semantic tagging system built on the Neo4j label system. Semantic tagging is designed to harness the power of ontology and knowledge-graph semantics to add short, easily understandable, application-specific, human-readable tags to ontology terms and annotated content. These semantic tags can be used as badges to efficiently communicate with users in terms that make sense to them, as a mechanism for configuring autosuggest search, for faceted browsing and even for configuring a web application. We will present details of the pipelines and examples of their application in three different applications: Virtual Fly Brain, the Allen Brain Atlas Cell Type Explorer and the Cell Annotation Platform.
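    As a sketch of how an application might consume the semantic tags exposed as Neo4j labels, the snippet below queries such a knowledge graph with the official neo4j Python driver; the label "Neuron", the property names and the connection details are hypothetical rather than taken from the Knowledge Graph Development Kit itself.
      # Sketch: query a KGDK-style knowledge graph in Neo4j, filtering on a semantic tag
      # that the pipeline has materialised as a node label.
      from neo4j import GraphDatabase

      driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
      CYPHER = """
      MATCH (n:Neuron)
      RETURN n.label AS term, n.short_form AS curie
      LIMIT 25
      """
      with driver.session() as session:
          for record in session.run(CYPHER):
              print(record["curie"], record["term"])
      driver.close()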
  15. Machine Learning for Scalable Biocuration
    Jane Lomax, Rachael Huntley, Rebecca Foulger, Shirin Saverimuttu, Paola Roncaglia
    High-quality manual curation - the process of reviewing and annotating data manually - is critical for ensuring the reliability and validity of scientific data. However, manual curation is also time-consuming and expensive, so it is unlikely to be scalable to the large volume of data the life sciences generate. We have been experimenting with using machine learning models, such as BioBERT, to perform some biocuration tasks using a 'human-in-the-loop' approach. One task we commonly perform is the identification of entities of a particular type in a corpus of text, and we will discuss in this talk the benefits and limitations of ML for this task. In addition, we will expand on our work using ML models for de novo ontology building, synonym suggestion and relationship extraction. We propose that, in the future, biocurators' time might be well spent training, validating and retraining ML models to allow their valuable scientific input to be applied in a scalable way.
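    For the entity-identification task mentioned above, a minimal human-in-the-loop setup might run a pre-trained transformer NER model over text and hand the candidate mentions to a curator for review; the sketch below uses a general-domain NER checkpoint purely as a runnable stand-in for the biomedical models discussed in the talk.
      # Sketch of ML-assisted entity identification: a pre-trained NER model proposes
      # candidate entity mentions, which a curator then accepts or rejects.
      # "dslim/bert-base-NER" is a general-domain stand-in; a biomedical model such as a
      # BioBERT-based NER checkpoint would be used in practice.
      from transformers import pipeline

      ner = pipeline("token-classification", model="dslim/bert-base-NER",
                     aggregation_strategy="simple")

      text = "Researchers at EMBL-EBI annotated BRCA1 variants reported in London clinics."
      for entity in ner(text):
          # Each candidate is shown to a curator rather than being accepted automatically.
          print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))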

Short Talks

Short talks are given 10 minutes to present, including questions. We suggest limiting your talk to 7 minutes and leaving 3 minutes for questions. We recommend that your talk be no more than 5 or 6 slides, excluding the title, acknowledgements, and other housekeeping.

  1. Making expert curated knowledge graphs FAIR
    Jerven Bolleman, Alan Bridge, Nicole Redaschi
    To address the users' need to combine knowledge from different expert curated resources, the biocuration community is heavily invested in the standardization of knowledge with shared ontologies and, more recently, in the representation of data in the form of knowledge graphs (KGs). To easily integrate data from, or query data across, different KGs it is necessary to also standardize the form in which they are published. The W3C standards RDF/OWL and SPARQL were created to address this need and enable the creation of a Semantic Web. Here we describe the use of these standards to publish public SPARQL endpoints for resources such as UniProt (sparql.uniprot.org/sparql), Rhea (sparql.rhea-db.org) and SwissLipids (beta.sparql.swisslipids.org), as well as RDF distributions that allow private SPARQL endpoints on premises and in clouds (e.g. AWS, Oracle). At more than 110 billion distinct triples - RDF statements - UniProt is the largest freely available KG. UniProt and other SPARQL endpoints support complex analytical queries and inferences that go beyond simple queries, through graph-based machine learning and other approaches. They integrate - federate - expert curated knowledge of protein function with biological and biochemical data from other KGs available in RDF or OWL like the Gene Ontology (functions), Bgee (expression patterns), OMA (orthology), and IDSM (chemical structures). They also serve as APIs to enhance website data display and data mining capabilities - for example to select and enrich SwissBioPics images to visualize subcellular location data, or to perform chemical similarity and chemical substructure search with IDSM directly in Rhea.
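    The federation described above can be tried directly from a script; the sketch below sends a federated query from Python to the UniProt endpoint, pulling Rhea reaction equations for human enzymes via a SERVICE clause (the property paths are assumptions modelled on the published UniProt and Rhea RDF schemas and should be checked against the endpoint documentation).
      # Sketch: federated SPARQL query across sparql.uniprot.org and sparql.rhea-db.org.
      # Predicates such as up:catalyticActivity, up:catalyzedReaction and rh:equation are
      # assumptions to verify; taxon 9606 restricts the query to human proteins.
      import requests

      QUERY = """
      PREFIX up: <http://purl.uniprot.org/core/>
      PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
      PREFIX rh: <http://rdf.rhea-db.org/>
      SELECT ?protein ?equation WHERE {
        ?protein a up:Protein ;
                 up:organism taxon:9606 ;
                 up:annotation ?a .
        ?a up:catalyticActivity ?ca .
        ?ca up:catalyzedReaction ?reaction .
        SERVICE <https://sparql.rhea-db.org/sparql> {
          ?reaction rh:equation ?equation .
        }
      } LIMIT 10
      """

      r = requests.get("https://sparql.uniprot.org/sparql", params={"query": QUERY},
                       headers={"Accept": "application/sparql-results+json"})
      r.raise_for_status()
      for b in r.json()["results"]["bindings"]:
          print(b["protein"]["value"], "=>", b["equation"]["value"])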
  2. BioKC: a collaborative platform for curation and annotation of molecular interactions
    Carlos Vega, Marek Ostaszewski, Valentin Grouès, Marcio Acencio, Reinhard Schneider, Venkata Satagopam
    Curation of biomedical knowledge into a standardised and interoperable format is essential for studying complex biological processes. However, curation of relationships and interactions between biomolecules is a laborious manual process, especially when facing the ever-increasing growth of the domain literature. The demand for systems biology knowledge increases with new findings demonstrating elaborate relationships between multiple molecules, pathways and cells. This calls for novel collaborative tools and platforms that improve the quality and the output of the curation process. In particular, in the current systems biology environment, curation tools lack reviewing features and are not well suited for open, community-based curation workflows. An important concern is the complexity of the curation process and the limitations of the tools supporting it. Here, we present BioKC (Biological Knowledge Curation, https://biokc.pages.uni.lu), a web-based collaborative platform for the curation and annotation of biomedical knowledge following the standard data model from the Systems Biology Markup Language (SBML). BioKC allows building multi-molecular interactions from scratch, or based on text mining results and their annotation, supported by an intuitive and lightweight graphical user interface. Curated interactions, with their annotations and grounding evidence, called facts, can be versioned, reviewed by other curators and published under a stable identifier. The underlying SBML model ensures interoperability, allowing entire collections of such facts to be exported for later import into databases or used as source material in systems biology diagram construction. We believe BioKC is a useful tool for extracting and standardising biomedical knowledge.
  3. A complete draft human gene functionome from large-scale evolutionary modeling and experimental Gene Ontology annotations
    Marc Feuermann, Pascale Gaudet, Huaiyu Mi, Anushya Muruganujan, Dustin Ebert, Paul Denis Thomas, The GO Consortium
    Understanding the human functionome – the set of functions performed by the protein-coding genes of the human genome – has been a longstanding goal of biomedical research. The last two decades have seen substantial progress towards achieving this goal, with continued improvements in the annotation of the human genome sequence, and dramatic advances in the experimental characterization of human genes and their homologs from well-studied model organisms. Here, we describe the first attempt to create a complete, draft human functionome through a comprehensive synthesis of functional data obtained for human genes and their homologs in non-human model organisms. All relevant function information in the Gene Ontology knowledgebase has been synthesized using an evolutionary framework based on phylogenetic trees, creating curated models of function evolution for thousands of gene families, which are updated as new knowledge accumulates. Our draft human functionome specifies at least one functional characteristic for 80% of human protein-coding genes, each of which can be individually traced to experimental evidence in human and/or non-human model systems. Our analyses of these models and annotations provide insights into the nature of function evolution and the importance of gene duplication in this process, as well as a quantitative estimate of the contribution of studies in model organisms to our current understanding of human gene function. We expect that the evolutionary models and resulting GO annotations will be useful in numerous applications from gene set enrichment analysis to understanding genetic evolution.
  4. Predicting protein metal binding sites with artificial intelligence and machine learning in UniProt
    Rossana Zaru, Vishal Joshi, Sandra Orchard, Maria Martin
    Metal binding is essential for many protein functions. Metals can stabilise protein structure, be part of enzyme catalytic sites or regulate protein function in response to extra- or intracellular signals. Mutations affecting metal-binding residues often result in disease, highlighting the importance of identifying the amino acids involved in metal coordination in order to understand disease etiology and to design therapeutic drugs. The UniProt Knowledgebase (UniProtKB) collects and centralises functional information on proteins across a wide range of species. For each protein, we provide extensive annotation of sequence features. For example, for metal-binding proteins, UniProtKB specifies the specific amino acid residues that participate in metal binding sites and which metal is bound. Currently, around 16% of reviewed/Swiss-Prot proteins have annotations of metal binding site residues, which are identified from the literature or known structures from PDB. However, only 3% of unreviewed/TrEMBL entries have annotated metal binding sites, which are created by a variety of automated annotation methods. The difference in coverage between the reviewed/Swiss-Prot and unreviewed/TrEMBL entries suggests that there are many millions of missing metal binding site annotation predictions. To increase the coverage of unreviewed/TrEMBL entries, we decided to take the opportunity offered by the huge advances made by artificial intelligence and machine learning (AI/ML) in addressing protein biology, such as the prediction of 3D structures by AlphaFold or the prediction of names for uncharacterised proteins by Google's ProtNLM. We set a challenge for the AI/ML community to generate models to predict metal binding sites with the aim of identifying one or more software tools that are both accurate and scalable and that we can apply within the UniProt production environment. Here, we will provide an overview on how metal binding sites are identified and annotated in UniProtKB, discuss the challenges in annotating and predicting them and how we will evaluate the proposed AI/ML models.
  5. APICURON: standardizing attribution of biocuration activity to promote engagement
    Federica Quaglia, Damiano Piovesan, Silvio Tosatto, Adel Bouhraoua
    Biocuration plays a key role in making research data available to the scientific community in a standardized way. Despite its importance, the contribution and effort of biocurators is extremely difficult to attribute and quantify. APICURON (https://apicuron.org) is a web server that provides biological databases and organizations with a real-time automatic tracking system for biocuration activities. APICURON stores biocuration activities and calculates achievements (medals, badges) and leaderboards on the fly to allow an objective evaluation of the volume and quality of the contributions. Results are served through a public API and available through the APICURON website. The API and the website have been recently redesigned to improve database performance and user experience. Extensive documentation and guidelines have been published to help connecting resources improve their interoperability and expose curation activities in accordance with the FAIR principles. APICURON is already supported by ELIXIR and well connected with the International Society for Biocuration. A core group of early-adopter curation databases (DisProt, PED, Pfam, Rfam, IntAct, SABIO-RK, Reactome, PomBase, SILVA, BioModels) is connecting to APICURON. APICURON aims to promote engagement and certify biocuration CVs; to this end, it is already integrated with ORCID, automatically propagating badges and achievements to ORCID user profiles.
  6. Wikidata as a tool for biocuration of cell types
    Tiago Lubiana, Helder Nakaya
    The Human Cell Atlas and the boom of single-cell omics have put cell types at the center of modern biology. Various databases have become central to bioinformaticians, providing information about cell features (especially markers), which are core for labeling new datasets. Despite the centrality of cell types, the organization of information about these biological entities is still in its infancy. Unlike species and genes, there is no standard nomenclatural scheme for cell types, nor clear boundaries for cell type assignment. Though most datasets use only ambiguous natural language, the Cell Ontology has provided unique identifiers for cell types for the past two decades. It is re-used in large efforts such as the Human Cell Atlas and HuBMAP. The Cell Ontology, however, is relatively cumbersome to contribute to, requiring advanced skills in GitHub and ontology development. It currently provides identifiers for fewer than 2700 cell types. Wikidata - the all-purpose, open knowledge graph of the Wikimedia Foundation, gathering more than 100 million entities - is increasingly being used to integrate biomedical knowledge. It enables navigation and editing both from a visual interface and well-documented APIs. After efforts integrating data from Gene Ontology, Cellosaurus, Complex Portal, and beyond, its web-based SPARQL Query Service is a powerful tool for biomedical discovery. In this work, we describe a 3-year effort to explore Wikidata as a platform to represent information about cell types. Wikidata currently hosts identifiers for over 4600 cell types, with over 1000 cross-references to CL, 8400 marker genes, 500 links to Wikipedia pages, and 150 links to openly-licensed images, all queryable via SPARQL. Its simple, anyone-can-contribute infrastructure enables fast biocuration, improving coverage and providing a field laboratory for large-scale organization of information about cell types. Wikidata is at a mature stage for cell type information and ready to be harnessed by the Cell Ontology and bioinformatics workflows.
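    All of the statements mentioned above are retrievable through the Wikidata Query Service; the sketch below shows one plausible query for cell types that carry a Cell Ontology cross-reference (the item and property identifiers used, Q189118, P31 and P7963, are assumptions that should be verified on Wikidata).
      # Sketch: list Wikidata cell types with a Cell Ontology ID, via the public SPARQL endpoint.
      import requests

      QUERY = """
      SELECT ?cellType ?cellTypeLabel ?clId WHERE {
        ?cellType wdt:P31 wd:Q189118 ;      # instance of: cell type (assumed identifiers)
                  wdt:P7963 ?clId .         # Cell Ontology ID (assumed property)
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
      }
      LIMIT 20
      """

      r = requests.get("https://query.wikidata.org/sparql",
                       params={"query": QUERY, "format": "json"},
                       headers={"User-Agent": "cell-type-curation-example/0.1"})
      r.raise_for_status()
      for row in r.json()["results"]["bindings"]:
          print(row["clId"]["value"], row["cellTypeLabel"]["value"])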
  7. eMIND: Enabling automatic collection of protein variation impacts in Alzheimer’s disease from the literature
    Samir Gupta, Xihan Qin, Qinghua Wang, Julie Coward, Hongzhan Huang, Cathy Wu, K Vijay-Shanker, Cecilia Arighi
    Alzheimer's disease and related dementias (AD/ADRDs) are among the most common forms of dementia, and yet no effective treatments have been developed. To gain insight into the disease mechanism, capturing the connection of genetic variations to their impacts, at the disease and molecular levels, is essential. The scientific literature continues to be a main source for reporting experimental information about the impact of variants. Thus, the development of automatic methods to identify publications and extract the information from the unstructured text would facilitate collecting and organizing information for reuse. We developed eMIND, a deep learning-based text mining system that supports the automatic extraction of annotations of variants and their impacts in AD/ADRDs. In particular, we use this method to capture the impacts of protein-coding variants affecting a selected set of protein properties, such as protein activity/function, structure and post-translational modifications. A major hypothesis we are testing is that the structure and words used in statements that describe the impact of one entity on another entity or event/process are not specific to the two objects under consideration. Thus, a BERT model was fine-tuned using a training dataset with 8,245 positive and 11,496 negative impact relations derived from impact relations involving microRNAs. We conducted a preliminary evaluation of the efficacy of eMIND on a small manually annotated corpus (60 abstracts) consisting of variant impact relations from the AD/ADRDs literature, and obtained a recall of 0.84 and a precision of 0.94. The publications and the information extracted by eMIND are integrated into the UniProtKB computationally mapped bibliography to expand annotations on protein entries. eMIND's text-mined output is presented using controlled vocabularies and ontologies for variant, disease and impact along with the evidence sentences. Evaluation of eMIND on a larger test dataset is ongoing. A sample of annotated abstracts can be accessed at https://research.bioinformatics.udel.edu/itextmine/emind. Funding: This work has been funded by an NIA supplement grant to UniProt, 3U24HG007822-07S1, NIH/NHGRI: UniProt - Enhancing functional genomics data access for the Alzheimer's Disease (AD) and dementia-related protein research communities. Acknowledgements: We would like to acknowledge the UniProt Consortium (https://www.uniprot.org/help/uniprot_staff).
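    The classification step at the heart of this approach can be sketched as loading a BERT-style sequence classifier and scoring candidate sentences; the snippet below uses a generic base checkpoint whose classification head would still need to be fine-tuned on the labelled impact/non-impact sentences, so it illustrates the mechanics rather than reproducing eMIND.
      # Sketch: score a sentence for whether it expresses a variant-impact relation using a
      # BERT-style sequence classifier. The base model is a generic placeholder; in practice
      # the head is fine-tuned on labelled impact relations before use.
      import torch
      from transformers import AutoTokenizer, AutoModelForSequenceClassification

      MODEL = "bert-base-uncased"  # placeholder base checkpoint
      tokenizer = AutoTokenizer.from_pretrained(MODEL)
      model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

      sentence = "The A673T substitution in APP reduced beta-secretase cleavage."
      inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
      with torch.no_grad():
          logits = model(**inputs).logits
      p_impact = torch.softmax(logits, dim=-1)[0, 1].item()
      print(f"P(impact relation) = {p_impact:.2f}")  # meaningful only after fine-tuning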
  8. Assessing the Use of Supplementary Materials to Improve Genomic Variant Discovery
    Emilie Pasche, Anaïs Mottaz, Julien Gobeill, Pierre-André Michel, Déborah Caucheteur, Nona Naderi, Patrick Ruch
    The curation of genomic variants requires collecting evidence in variant knowledge bases, but also in the literature. However, some variants result in no match when searched in the scientific literature. Indeed, it has been reported that a significant subset of information related to genomic variants is not reported in the full text, but only in the supplementary material associated with a publication. In this study, we present an evaluation of the use of supplementary data to improve the retrieval of relevant scientific publications for variant curation. Our experiments show that searching supplementary data significantly increases the volume of documents retrieved for a variant, reducing by about 63% the number of variants for which no match is found in the scientific literature. Supplementary data thus represent a paramount source of information for curating variants of unknown significance and should receive more attention from global research infrastructures that maintain literature search engines.
  9. wwPDB Biocuration: Supporting Advances in Science and Technology
    Irina Persikova, Jasmine Y. Young, Ezra Peisach, Chenghua Shao, Zukang Feng, John D. Westbrook, Yuhe Liang, wwPDB Biocuration Team, Stephen K. Burley
    The Protein Data Bank (PDB) was established in 1971 as the first open-access digital data resource in biology, with just seven X-ray crystallographic structures of proteins. Today, the single global PDB archive houses more than 200,000 experimentally determined three-dimensional (3D) structures of biological macromolecules that are made freely available to millions of users worldwide with no limitations on usage. This information facilitates basic and applied research and education across the sciences, impacting fundamental biology, biomedicine, biotechnology, bioengineering, and energy sciences. The Worldwide Protein Data Bank (wwPDB, wwpdb.org) jointly manages the PDB, EMDB, and BMRB Core Archives and is committed to making data Findable, Accessible, Interoperable, and Reusable (FAIR). As the PDB grows, developments in structure determination methods and technologies can challenge how structures are represented and evaluated. wwPDB Biocurators work together with community experts to ensure the standards for deposition, biocuration and data quality assessment align with advances in this rapidly evolving field. The wwPDB deposition-biocuration-validation system, OneDep, is constantly enhanced with extended metadata, enumeration lists, and improved data checking. The PDBx/mmCIF dictionary, at the core of wwPDB data deposition, biocuration, and archiving, is regularly updated to provide controlled vocabulary and boundary ranges that reflect the current state of various experimental techniques and ensure accuracy and completeness of deposited metadata. wwPDB Biocurators promote community-driven development and usage of the PDBx/mmCIF dictionary. wwPDB continuously incorporates new and improved data assessment metrics to maintain state-of-the-art validation tools. wwPDB engages working groups of community experts to provide recommendations for improving the wwPDB data validation protocols. In this presentation we provide an overview of the wwPDB Biocurators' efforts in promoting enriched data dictionary development, improving data validation standards, and fostering community engagement in data standard setting to support advances in science.
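    As a small illustration of the controlled metadata that the PDBx/mmCIF dictionary governs, the sketch below reads a few standard items from an mmCIF file with Biopython; the file name is hypothetical and only a tiny subset of categories is shown.
      # Sketch: read selected PDBx/mmCIF metadata items from a downloaded entry file.
      from Bio.PDB.MMCIF2Dict import MMCIF2Dict

      mmcif = MMCIF2Dict("1abc.cif")  # hypothetical entry file fetched from the PDB
      print("Title:     ", mmcif.get("_struct.title"))
      print("Method:    ", mmcif.get("_exptl.method"))
      print("Resolution:", mmcif.get("_refine.ls_d_res_high", "n/a"))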
  10. Improved Insights into the SABIO-RK Database via Visualization
    Dorotea Dudaš, Ulrike Wittig, Maja Rey, Andreas Weidemann, Wolfgang Müller
    SABIO-RK is a manually curated database for biochemical reactions and their kinetics. After more than 15 years of data insertion into SABIO-RK, with more than 300,000 kinetic parameters extracted from about 7,500 publications, the database has now reached a qualitative and quantitative level which makes a visualization of the data interesting and worthwhile. The complex relationships between the multidimensional data are often difficult to follow, or even not represented, when using standard tabular views. Visualization is a natural and user-friendly way to quickly get an overview of the data and to detect clusters and outliers. Since the data entered in SABIO-RK is extracted from the original publications without evaluation of the correctness of the measurement or the quality of the biological or experimental setup, there is occasionally a high discrepancy in measured kinetic values from different publications. The visualization was implemented to help database users easily identify and filter the most prominent or probable values within the correct context. For that purpose we use a heat map, parallel coordinates and scatter plots to allow the interactive visual exploration of general entry-based information of biochemical reactions and specific kinetic parameter values. The usability and functionality of the visualization were reviewed by users, whose comments and requests were considered or implemented. The user feedback was generally positive, with a high learning curve.
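    A toy, static approximation of the parallel-coordinates view described above can be produced with pandas and matplotlib; the values below are invented rather than taken from SABIO-RK, and the production system uses interactive plots.
      # Toy sketch: parallel-coordinates view of kinetic parameters reported for one
      # reaction across several publications (invented values).
      import pandas as pd
      import matplotlib.pyplot as plt
      from pandas.plotting import parallel_coordinates

      data = pd.DataFrame({
          "publication": ["pub1", "pub2", "pub3", "pub4"],
          "Km_mM": [0.12, 0.35, 0.09, 1.40],
          "kcat_per_s": [55, 40, 62, 12],
          "pH": [7.4, 7.0, 7.5, 6.5],
          "temperature_C": [25, 30, 25, 37],
      })
      parallel_coordinates(data, class_column="publication", colormap="viridis")
      plt.title("Kinetic parameters across publications (toy data)")
      plt.tight_layout()
      plt.show()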
  11. Multiplexed scRNA-seq Experiments in Biocuration
    Yalan Bi, Nancy George, Irene Papatheodorou, Anja Fullgrabe, Silvie Fexova, Natassja Bush
    The ability to investigate the transcriptomes of cells at a single cell resolution has been a major advance in genome sequencing. Since the emergence of commercial single-cell RNA sequencing (scRNA-seq) platforms, this technology has been well developed and widely accepted by researchers. Thereafter, it has continued to evolve, becoming more prevalent and more complex, for example, allowing researchers to generate multiple library types from a single cell. Accordingly, to represent these advances during data archival, we must change our curation practices in order to describe these multiplexed scRNA-seq experiments accurately. One advance is the concept of spatial transcriptomic data, in which the gene expression profiles are linked to the location of the biological material from which the samples are collected. The most commonly used technology has a gridded 'capture area' on a slide onto which tissues are placed. Each point has a unique spatial barcode, which maps RNA transcriptomes back to that location. Prior to sequencing, an image is acquired to capture the biological sample. Later, the material is extracted, point by point, and sequenced including the 'spatial barcode' within the reads. This allows researchers to map the transcriptome from each point back to the original image of the tissue. Capturing this additional information therefore becomes essential for the accurate representation and reuse of these data. Another type of multiplexing methodology enables pooled samples per sequencing run by labelling individual samples with a molecular tag, a technology called feature barcoding. Here multiple library types, depending on the additional 'feature', are generated from individual cells. Therefore, to capture these data, we need to accurately represent these differential library constructions whilst maintaining the mapping back to the individual cells and biological sample(s). With the development of these scRNA-seq technologies, the functional genomics team at EMBL-EBI is continuously improving our curation standards to provide a well-structured and comprehensive sample annotation for our users and the wider scientific community. Here we share our experience in curating multiplexed scRNA-seq experiments and look forward to suggestions from the community on how to better support these and upcoming new technologies in the future.
  12. The landscape of microRNA interactions annotation in IMEx: analysis of three rare disorders as case study
    Simona Panni, Kalpana Panneerselvam, Pablo Porras, Margaret Duesbury, Livia Perfetto, Luana Licata, Henning Hermjakob, Sandra Orchard
    Expand the abstract
    Mammalian cells express thousands of ncRNA molecules that play a key role in the regulation of genes. In recent years, a huge amount of data on ncRNA interactions has been described in scientific papers and databases. Although a considerable effort has been made to annotate the available knowledge in public repositories and in a standardized representation to support subsequent data integration, there is still a significant discrepancy in how different resources capture and interpret data on the functional and physical associations of ncRNAs. Since 2002, the HUPO Proteomics Standards Initiative has provided a standardized annotation system for molecular interactions and has defined the minimal information requirements and the syntax of terms used to describe an interaction experiment. The IntAct team has now focused on the development of similar standards for the capture and annotation of microRNA networks (https://www.ebi.ac.uk/intact/documentation/user-guide#curation_manual, section 4.4.3). In the present project, we have focused on microRNAs which regulate genes associated with rare diseases. In particular, we have selected three disorders among those listed in the Genomics England PanelApp knowledgebase (https://panelapp.genomicsengland.co.uk/): early onset dementia, growth failure in early childhood, and mitochondrial disorders. All of them are associated with genes regulated by microRNAs. The knowledge about the RNAs, proteins or genes involved in each interaction was extracted from the literature and integrated with a detailed description of the cell types, tissues, experimental conditions and effects of mutagenesis, providing a computer-interpretable summary of the published data combined with the large body of protein interactions already gathered in the database. Furthermore, for each interaction, the binding sites of the microRNA are precisely mapped on a well-defined mRNA transcript of the target gene, where possible in line with the main transcript as indicated by GIFTS (https://www.ebi.ac.uk/gifts/). This information is crucial for conceiving and designing optimal microRNA mimics or inhibitors to interfere in vivo with a deregulated process. Over the past few years, several microRNA-based therapeutics have been developed, and some have entered phase II or III clinical trials. As these approaches become more feasible, high-quality, reliable networks of microRNA interactions are needed, for instance to help select the best target to be inhibited or manipulated and to predict potential secondary effects on off-targets.
  13. FAIR Wizard: Making the FAIRification process accessible
    Wei Kheng Teh, Fuqi Xu
    Expand the abstract
    FAIR principles are abstract in nature and harbour contextual complexities such as country-specific data protection laws and highly technical but broad guidelines. As such, FAIR principles are useful for creating guidelines for data providers and managers, but do not provide guidance on how to improve the ‘level of FAIRness’ of a project. The FAIR Wizard is an accessible and freely available tool that breaks down the abstract FAIR principles into specific practical actions, each supported with examples and a statement of its value. The FAIR Wizard recognises the contextual nature of applying FAIR principles and takes a case-by-case approach: it first assesses the project via a questionnaire to understand its current and desired level of FAIRness, then creates a pathway of actionable steps to move from the current to the desired level. The FAIR Wizard also provides support by linking to other FAIR resources, such as the FAIR Cookbook, a collection of examples and detailed ‘recipes’ for each specific step. This tool was developed collaboratively by EMBL-EBI and the FAIRplus consortium to assist data generators and data managers in increasing the overall FAIRness of their data. By reducing the technical complexity and abstract nature of applying FAIR principles, the FAIR Wizard aims to make the FAIRification process accessible to the wider community. The FAIR Wizard has additional plans for development based on community feedback, and has been used with the IMI and eLwazi projects in Spain and South Africa, respectively.
  14. Phenopackets for curated repository data over Beacon v2 Progenetix database
    Ziying Yang, Rahel Paloots, Hangjia Zhao, Michael Baudis
    Expand the abstract
    Many data repositories such as NCBI GEO contain a vast amount of annotations and metadata from human "omics" experiments, frequently in semi-structured documents accessible through the resource's API. Here, the use of phenopackets, a flexible schema which can represent clinical data for any kind of human disease, over a standardized API such as Beacon v2 will provide major improvements for data harmonization, FAIRification and the empowerment of federated data analysis strategies. Progenetix (progenetix.org) is a curated oncogenomic resource with a focus on copy number variation (CNV) profiling. It presently contains data for more than 140,000 hybridization- or NGS-based experiments derived from over 1,000 publications as well as from resources and projects such as GEO, TCGA or cBioPortal. All samples are annotated for biological and procedural metadata, e.g. their corresponding resource identifiers, publication ids, NCIt, UBERON and ICD-O codes and Cellosaurus ids, where applicable, and additionally for a core set of biological and clinical data. This curated data, with its reasonably large content of identifier-tagged, public repository-linked samples, provides an interesting test case for the representation of common "omics" metadata as phenopackets documents delivered over the Beacon v2 API. The Progenetix Beacon+ API recently introduced a "phenopackets" response format (PXF) in expansion of the Beacon v2 default model's entry types (a schematic record-level request is sketched below). For "record"-level granularity (i.e. document delivery upon a Beacon request), phenopackets are generated ad hoc from the Beacon default models through the bycon package driving the Progenetix API. Since the Beacon v2 schemas for biosamples and individuals have been designed to closely align with the Phenopackets v2 specifications, the necessary remappings are of limited complexity. Current efforts are aimed at the integration of emerging tools from the Phenopackets ecosystem, especially for compliance testing. We have selected example usage scenarios involving common data repositories with curated data represented in Progenetix. Based on these use cases, we tested the Beacon+ Phenopackets prototype for Phenopackets v2 compliance. For this purpose, we iteratively adjust the Beacon+ PXF implementation towards increasing compliance, implement a prototype of a service to extract Phenopackets from Beacon+ Phenopackets responses, and evaluate the implementation of alternative (i.e. non-Beacon) REST APIs for such Phenopackets responses. Based on our implementation of a Beacon API-based, PXF-formatted representation of repository-derived, curated genomic data and metadata, we propose a more general adoption of such a scenario. An extensive (multi-10k) demonstrator project extending the scope beyond Progenetix data types could showcase usage scenarios and directly support diverse analysis projects, with direct value for the wider "-omics" communities. A future scenario would include direct GA4GH standards integration (Phenopackets, Beacon, service info etc.) by resource providers, building on the demonstrated benefits from the demonstrator cases.
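    A hedged sketch, in Python, of requesting record-level, phenopacket-formatted biosamples from the Progenetix Beacon API; the endpoint path, the NCIt filter value and the parameter names follow general Beacon v2 conventions and are assumptions here, so they may differ from the live API.

```python
import requests

# Hypothetical query: phenopacket-formatted biosample records for an NCIt-coded cancer type.
params = {
    "filters": "NCIT:C3058",            # example NCIt code; chosen only for illustration
    "requestedGranularity": "record",   # ask for document-level delivery
    "requestedSchema": "phenopacket",   # assumed name of the PXF response format
    "limit": 5,
}
resp = requests.get("https://progenetix.org/beacon/biosamples/", params=params, timeout=60)
resp.raise_for_status()

data = resp.json()
for result_set in data.get("response", {}).get("resultSets", []):
    for record in result_set.get("results", []):
        print(record.get("id"))
```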
  15. Resolving code names to structures from the medicinal chemistry literature: not as FAIR as it should be
    Christopher Southan, Miguel Amaral
    Expand the abstract
    The practice of assigning code names (CNs) as the publicly declared identifiers for distinct lead compounds in drug discovery is widespread but remains problematic for biocuration. CNs are typically used on company web sites, press releases, abstracts, posters, slides, clinical trial records and journal articles. The most common approximate form is “XXX-123456”, with a letter prefix for the organization of origin and numbering from an implicit internal registration system. However, they are effectively non-standardized and may include single-letter codes, spaces, commas, suffixes, multiple hyphens and CNs too short to have any useful search specificity. It can also be challenging to resolve and extract the name-to-structure (n2s) relationship from the journal article, especially for image-only representations. Further challenges arise when CNs are blinded in press releases and clinical trial entries (i.e. there is no open n2s). This work had an initial focus on detecting and curating CNs from the Journal of Medicinal Chemistry. From ~2,000 PubMed abstracts, ~300 codes were identified that could be manually mapped to structures. We also developed an extended regular expression syntax to identify as many CNs as possible automatically from the abstract text alone (a simplified illustration is given below). However, extensive specificity tweaking was needed, including the compilation of false-positive blacklists corresponding in many cases to gene and cell line names in the abstracts. While many CNs had n2s matches in PubChem from various submitting sources, such as the Guide to Pharmacology, BindingDB and ChEMBL, others were novel. However, many lead structures remained difficult to map into databases because of trivial non-coded naming (e.g. compound 22b). Causes and amelioration of these curation and FAIRness issues for medicinal chemistry lead compounds will be outlined.
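    A simplified Python illustration of the kind of pattern matching described above; the regular expression and the false-positive blacklist are toy stand-ins for the much more elaborate extended syntax and curated blacklists used in this work.

```python
import re

# Two- to five-letter organization prefix, optional hyphen or space, 3-7 digits
# (approximate "XXX-123456" form); real CNs are far less regular than this.
CN_PATTERN = re.compile(r"\b([A-Z]{2,5})[- ]?(\d{3,7})\b")

# Toy false-positive blacklist: cell line names that match the same pattern.
BLACKLIST = {"HEK-293", "BT-474"}

def find_code_names(abstract):
    hits = ["{}-{}".format(prefix, digits) for prefix, digits in CN_PATTERN.findall(abstract)]
    return [h for h in hits if h not in BLACKLIST]

print(find_code_names("ABT-199 and GDC-0068 were profiled in HEK-293 and BT-474 cells."))
# ['ABT-199', 'GDC-0068']
```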
  16. The Human Microbiome Drug Metabolism (HMDM) Database
    Amogelang Raphenya, Michael Surette, Gerard Wright, Andrew McArthur
    Expand the abstract
    Human gastrointestinal (gut) bacteria have been shown to contribute to the metabolism of drugs in the gut since as far back as the 1900s. Although oral administration is the preferred route, drugs taken orally face many limitations, such as the inability to reach their target due to variable absorption rates, variable concentrations, high acid content, and the action of many digestive enzymes. The latter is most important, as the gut microbiome can modify drugs enzymatically. Drug metabolism is not limited to orally administered drugs: the microbiome also converts drug metabolites destined for excretion via the gut, including drug conjugates from the liver. The microbiome can regulate host gene expression and modulate xenobiotic absorption; conversely, xenobiotics can affect microbiome viability. To date, no resource systematically catalogs all enzymes involved in gut microbiome drug metabolism. Yet there is a need to understand this metabolism, as doing so can reduce time and resources during drug development by avoiding adverse reactions or treatment failure caused by the gut microbiome. In addition, there is no easy way to assess the prevalence of drug-metabolizing enzymes in the human gut. Knowing the frequency of drug-metabolizing genes, we can better prioritize enzymes that contribute to poor drug efficacy during drug development. We developed an ontology-centric database, termed the Human Microbiome Drug Metabolism (HMDM) database, to systematically catalog the enzymes, and their encoding gene sequences, involved in microbiome drug metabolism. The HMDM database is manually curated with bacterial enzymes reported in the literature with experimental data showing drug metabolism. We developed a prevalence module for the HMDM database to assess the frequency of drug-metabolizing enzymes in species commonly found in the human gut microbiome. The gut microbiome genomic data were obtained from the National Center for Biotechnology Information (NCBI) Datasets. The genomes were analyzed using newly developed enzyme detection software called the Drug Metabolising Enzyme (DME) tool, which predicts potential drug-metabolizing enzymes based on the curated enzymes in the HMDM database. β-glucuronidase genes are more common in this dataset than other enzyme classes, suggesting they are more prevalent. Some enzymes are present in only a few strains of the same species, such as tyrosine decarboxylase and the cardiac glycoside reductase operon. Because different gut microbes colonize each individual, responses to therapeutics are heterogeneous among individuals. The prevalence dataset is an important resource for identifying drug-metabolizing enzymes in different demographics, which will help personalize treatments and improve drug efficacy.
  17. Building a reference dataset of single-cell RNA-Seq data for training Machine-Learning algorithms
    Anne Niknejad, Vincent Gardeux, Fabrice David, David Wissel, Bart Deplancke, Marc Robinson-Rechavi, Mark Robinson, Frederic B. Bastian
    Expand the abstract
    Single-cell RNA-Seq (scRNA-Seq) data are being massively produced in many conditions and species. They allow the study of hundreds of cell types, in widely different contexts regarding, e.g., anatomical localization, developmental stage, or disease state. The characterization of the cell type of each cell is highly labor intensive and error prone, especially when annotating results from bead-based or nanowell technologies, where the cell type is not known a priori. This characterization usually involves clustering the cells based on their gene expression, identifying marker genes for each cluster, and manual identification by an expert of the cell type corresponding to these marker genes. Additionally, data accessibility for reanalyzing scRNA-Seq data is often poor, with, e.g., missing barcode information, no explicit relation between each cell and its annotated cluster, or free-text cell type annotations. Machine-learning (ML) methods are being used to annotate single-cell data and to facilitate the characterization of cell types. However, because of the lack of standardization of these data, it is challenging to train and evaluate algorithms across a variety of tissue and species contexts. This project aims to provide a reference dataset in D. melanogaster and to use it to train and benchmark several ML algorithms. It is a collaboration between Bgee (https://bgee.org/), specialized in transcriptomics data annotation, ASAP (https://asap.epfl.ch/), specialized in scRNA-Seq analysis pipeline standardization, and the Robinson Statistical Bioinformatics Group (https://robinsonlabuzh.github.io/), specialized in statistical methods for genomics. Several experiments have been re-annotated and standardized: the Fly Cell Atlas, plus all publicly available experiments using the “10x Genomics” technology. Cluster annotations are all standardized using reference ontologies (e.g., Cell Ontology, Uberon) and ontology post-composition methods, as well as the a priori information known before the clustering step. Cell barcode information, and the link from each cell to its cluster allowing cell type assignment, are checked and integrated. Information for all integrated experiments is released in a common format, as H5AD files (a minimal example of inspecting such a file is sketched below). Several systematic challenges have already been identified in the metadata: uncertainties about cell type assignment, incorrect cell type assignments, or differences in annotations depending on the clustering method used. We report and correct these errors and uncertainties, in order to provide a gold-standard reference dataset of FAIR and annotated scRNA-Seq data. We will present this reference dataset and the lessons learned, to address open questions about, e.g., the validity of using the same ML classifier in different tissues and conditions, or even in different species, or how to handle cell type assignment uncertainties. This dataset will allow researchers to evaluate and improve their own ML classification methods, and will provide a foundation for defining a common standard for scRNA-Seq FAIR data exchange.
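    A minimal sketch of inspecting one of the released H5AD files with the anndata package; the file name and the .obs column names ("cluster", "cell_type") are placeholders for illustration, not the actual names used in the release.

```python
import anndata as ad

# Hypothetical file name; cells are rows (indexed by barcode), genes are columns.
adata = ad.read_h5ad("fly_cell_atlas_10x_standardized.h5ad")

print(adata.shape)                                    # (n_cells, n_genes)
print(adata.obs_names[:5])                            # cell barcodes
print(adata.obs["cluster"].head())                    # cluster assignment per barcode
print(adata.obs["cell_type"].value_counts().head())   # ontology-standardized labels
```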
  18. scFAIR: Standardization and stewardship of single-cell metadata
    Frederic B. Bastian, Vincent Gardeux, Bart Deplancke, Marc Robinson-Rechavi
    Expand the abstract
    Single-cell functional genomics is bringing major insight into the life sciences. Single-cell data are rapidly increasing both in quantity and in diversity, but lack method and metadata standardization. While some large projects have clear standards of reporting, most publicly available datasets have partial or non-standardized metadata. This leads to multiple incompatible “standards” across datasets and limits reusability, which in turn makes it challenging to render these data useful to an increasing community of specialists and non-specialists. Therefore, there is a need for a centralized, standardized repository where researchers can collaboratively upload, annotate, or access single-cell metadata. There is also a need for standards in the way single-cell data are stored and annotated, especially for cell type and other associated information. Indeed, metadata is critical to the capacity to use these large and potentially very informative datasets. It includes the protocols, which constrain which transcripts were accessible or which normalizations are relevant, the association between barcodes and annotations, and the methods used to identify cell types. Existing ontologies and controlled vocabularies are not used systematically, even when information is reported. The scFAIR project aims to build a collaborative platform supporting and disseminating Open Research Data practices for the single-cell genomics community, including data stewardship, both for sharing datasets and their metadata. scFAIR is funded by the Swiss Confederation with the aim of anchoring existing Open Research Data practices and taking them to the next level. It is a collaboration between the labs developing Bgee (https://bgee.org/) and ASAP (https://asap.epfl.ch/). An important aspect is to provide data stewardship to help researchers make their data FAIR, rather than adding a new layer of under-used “standards”. In the first part of the project, we are gathering feedback and learning from existing practices in order to define a standard for single-cell data that can be widely adopted by researchers. The challenges identified include, for single-cell RNA-Seq data, the requirement to have access to barcode information in relation to cluster information, or to the analysis pipeline parameters needed to reproduce the clustering step. The second part of this project will be to develop a collaborative platform, implementing this standard, to improve single-cell data availability and reusability. An essential aspect is to obtain the involvement of the research community, to support them in the data submission process, notably by providing helpful information about errors identified in their metadata, and to disseminate the use of this single-cell FAIR practice. At this Biocuration 2023 conference, we would like to make researchers aware of this funded Open Research Data initiative and obtain broad involvement of the biocuration community. We believe scFAIR has strong potential to become a tool for biocurators. We will present the limitations already identified in existing metadata, and the solutions proposed so far.
  19. Unifying Protein Complex Curation across the Diversity of Species
    Sandra Orchard, Birgit Meldal, Helen Attrill, Giulia Antonazzo, Edith Wong, Henning Hermjakob
    Expand the abstract
    Proteins are essential for building cellular structures and act as the tools that make the cell function. However, proteins do not operate in isolation: they often form molecular machines in which several proteins bind together, and with other biomolecules, to act as a single entity called a molecular complex. This provides tremendous versatility and regulatory capacity, since by changing a single component of the complex, its function can be dramatically altered. Protein complexes often also form more stable structures than isolated proteins, and their formation creates new active sites as protein chains from different molecules assemble in close proximity. It is therefore of crucial importance to know the composition of complexes and to study them as discrete functional entities in order to truly understand how cellular processes work. The Complex Portal (www.ebi.ac.uk/complexportal) is an encyclopaedic database that collates and summarizes information on stable, macromolecular complexes of known function from the scientific literature through manual curation. Complex Portal curators have now completed a first draft of all the stable molecular complexes of the gut bacterium Escherichia coli and, through collaboration with the Saccharomyces Genome Database, also of the complexome of Saccharomyces cerevisiae. Work is ongoing to produce a reference set of human protein complexes and also, in association with FlyBase, for the model organism Drosophila melanogaster. Protein complex evolution can now be shown to occur through the gain and loss of subunits, and a better understanding of this process could improve predictions of, for example, the phenotypic effects of mutations and variants that cause changes of function or susceptibility to disease. We invite other data resources, active in the biocuration of other organisms or biological processes, to contribute to this collaborative effort and further increase the biodiversity of molecular machines described in the Complex Portal and, via import into UniProt and other resources, enhance our understanding of the inter-dependence of proteins within an organism.
  20. Providing Expanded Contextual Metadata for Biological Samples using Both Geographic and Taxonomic Factors
    Peter Woollard, Stephane Pesant, Josephine Burgin, Guy Cochrane
    Expand the abstract
    The European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena) is a long-standing database of record for nucleotide sequence data and associated metadata. The ENA has minimal required metadata standards for submitted records, balancing the needs of data generators/submitters with making the metadata as FAIR as possible for downstream users, though recommended standards are not always used to their full potential and details can be left out. There are nearly 200,000 marine samples within ENA alone, and as part of the BlueCloud project (https://blue-cloud.org/) a need was identified to enhance the available metadata for marine and freshwater samples. By utilising user-provided geographic metadata, we can assert additional contextual metadata to enhance the existing sample records. Approximately 17% of all ENA samples have GPS coordinates. We have used the GPS coordinates to determine additional metadata via computational geometry, for example the geographic political regions (e.g. countries and EEZs) and environment types (e.g. land and sea); a schematic point-in-polygon check is sketched below. These assertions were compared to the existing submitter metadata provided with these samples. Additionally, organism taxonomies were categorised by their likely marine or freshwater environment. The submitter, GPS and taxonomy insights were merged and compared. As expected, much of the time the metadata are in clear agreement; sometimes there are explainable differences, and occasionally differences that are harder to explain or understand. For the ENA and similar archives, submitter-entered data is the record, and so metadata cannot be changed substantively on the primary record without the approval of the data owners. The extra contextual metadata is being added to the ELIXIR Contextual Data Clearinghouse (see https://elixir-europe.org/internal-projects/commissioned-services/establishment-data-clearinghouse); the metadata will be programmatically available from https://www.ebi.ac.uk/ena/clearinghouse/api/. It will thus be straightforward to programmatically query the Clearinghouse and the ENA Portal APIs to more easily find, access and re-use marine and freshwater sample data. We outline our approaches and discuss our findings in more detail. Affiliation: European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, United Kingdom
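    A schematic Python sketch of asserting contextual metadata from GPS coordinates via a point-in-polygon test with shapely; the polygon is a toy rectangle, whereas the actual checks use published country/EEZ and land/sea boundary geometries.

```python
from shapely.geometry import Point, Polygon

# Toy "region" polygon defined as (longitude, latitude) pairs.
toy_region = Polygon([(-10.0, 45.0), (5.0, 45.0), (5.0, 55.0), (-10.0, 55.0)])

def assert_region(lat, lon):
    """Return a hypothetical region label for a sample's GPS coordinates."""
    return "inside toy region" if toy_region.contains(Point(lon, lat)) else "outside toy region"

# A sample reported at 50.0 N, 0.0 E falls inside the toy polygon.
print(assert_region(50.0, 0.0))
```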
  21. ICGC-ARGO Data Submission Workflow - Integration of data validation and submission to accelerate the data curation and improve the data quality
    Qian Xiang, Edmund Su, Hardeep Nahal-Bose, Robin Haw, Melanie Courtot
    Expand the abstract
    The International Cancer Genome Consortium Accelerating Research in Genomic Oncology (ICGC-ARGO) aims to uniformly analyze specimens from 100,000 patients with high-quality clinical data in order to address outstanding questions vital to the quest to defeat cancer. To achieve this ambitious goal, a critical task of the Data Coordination Center (DCC) is to enforce various data validation rules during data submission. This ensures that the received raw molecular data files and associated metadata are of high quality and conform to genomic data standards, which is extremely important for uniform data processing by a Regional Data Processing Center (RDPC) and for data release through the ICGC-ARGO data platform (https://platform.icgc-argo.org/) to the research community. ICGC-ARGO has established a robust metadata management and storage system (SONG) to easily track and manage genomic data files in a secure and validated environment against a flexible data model defined in JSON schema (a toy validation check in this style is sketched below). The schema consists of a core module to track primary patient identifiers, and dynamic modules for specific parts of the model which are defined based on any desired business rules. Since genomic and clinical data are usually submitted to different databases by different submitters according to different schedules, system sanity checks are applied early to avoid potential discrepancies between the genomic metadata in SONG and records in the clinical database. To ensure critical metrics of the genomic data files are adequately reflected in their associated metadata, a stand-alone command-line tool was implemented to conduct different levels of validation. Ultimately, the ICGC-ARGO Data Submission Workflow was designed by integrating the above sanity checks and validations with metadata generation, streamlining the genomic data/metadata submission process. The workflow is implemented in Nextflow and currently supports submissions of both local data and remote data stored in EGA archives for FASTQ, BAM and CRAM files. Here we share our experience of designing and integrating this workflow in the ICGC-ARGO platform. We expect it will play a critical role in easing data submission, accelerating the data curation process and improving data quality, which significantly impacts all downstream genome data analysis pipelines.
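    A toy Python sketch of the kind of schema-driven check described above, using the jsonschema library; the schema and field names are invented for illustration and are not the actual SONG data model or ICGC-ARGO dictionary.

```python
from jsonschema import validate, ValidationError

toy_schema = {
    "type": "object",
    "required": ["donor_id", "specimen_id", "file_type"],
    "properties": {
        "donor_id":    {"type": "string", "pattern": "^DO[0-9]+$"},
        "specimen_id": {"type": "string"},
        "file_type":   {"enum": ["FASTQ", "BAM", "CRAM"]},
    },
}

metadata = {"donor_id": "DO1234", "specimen_id": "SP5678", "file_type": "BAM"}

try:
    validate(instance=metadata, schema=toy_schema)
    print("metadata passes the toy schema check")
except ValidationError as err:
    print("validation failed:", err.message)
```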
  22. Towards making sense of glycan-mediated protein-protein interactions
    Catherine Hayes, Masaaki Shiota, Akihiro Fujita, Kiyoko Aoki-Kinoshita, Frédérique Lisacek
    Expand the abstract
    Research into glycobiology is fast becoming recognised as important in all fields of biology, having implications for inflammation, digestion, protection of the mucosal layer and the structural integrity of proteins, among others. With the increase in publications dedicated to solving carbohydrate structures, there is a need for well-curated, annotated and searchable databases to disseminate this work. Several groups now fulfil this role; among them, Glyco@Expasy (glycoproteome.expasy.org), GlyCosmos (www.glycosmos.org) and GlyGen (www.glygen.org) have come together to create the GlySpace Alliance (www.glyspace.org), a cross-country/continent alliance to aid the glycobiology community by sharing and collaborating on glycobioinformatic resources on a FAIR basis. Each resource brings its own expertise to the collaboration with global or pairwise initiatives. Lately, the Glyco@Expasy and GlyCosmos groups have been investing effort in linking data, through the adoption of the Resource Description Framework (RDF) and the establishment of RDF endpoints. In particular, the Glyco@Expasy side provides, among others, the GlySTreeM triple store of glycan structures (glyconnect.expasy.org/glystreem/sparql) described by the GlySTreeM ontology (Daponte et al., 2021). The GlyCosmos side has its own endpoint (ts.glycosmos.org/sparql) through which, among others, the GlyTouCan glycan structure repository (www.glytoucan.org) can be accessed. A key aspect of unifying resources is to make the journey between a glycoprotein (that presents one or more glycan structures) and a glycan-binding protein/lectin (that recognises one or more glycan structures) transparent and easily accessible. This sets the basis of functional glycobiology. To this end, the first step is to connect a glycan structure with its known ligand parts. This is done using GlycoQL, a GlySTreeM-based translator tool for substructure searching (Hayes et al., 2022). It allows the creation of glycan structure queries in SPARQL and was initially tested with federated queries across UniProt and GlyConnect, the curated and annotated resource of glycoproteins of Glyco@Expasy (thousands of structures). To test the robustness of the model, it was necessary to adapt the procedure to larger data sets, such as GlyTouCan (on the order of a hundred thousand entries). Collaborative work is geared towards integrating the GlySTreeM model with the rest of the GlyCosmos family of tools. By setting this up on the GlyTouCan repository, glycan structures can be identified by their GlyTouCan ID (a unique identifier), which will also facilitate the linking of multiple resources with those in the Linked Open Data universe (https://lod-cloud.net/). Examples of federated queries will be presented; a minimal example of programmatic access to the GlySTreeM endpoint is sketched below. Daponte, V., Hayes, C., Mariethoz, J., & Lisacek, F. (2021). Dealing with the Ambiguity of Glycan Substructure Search. Molecules, 27(1), 65. https://doi.org/10.3390/molecules27010065 Hayes, C., Daponte, V., Mariethoz, J., & Lisacek, F. (2022). This is GlycoQL. Bioinformatics, 38(Issue Supplement_2), ii162–ii167. https://doi.org/10.1093/bioinformatics/btac500
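    A minimal Python sketch of programmatic access to the GlySTreeM SPARQL endpoint named in the abstract, using SPARQLWrapper; because the GlySTreeM predicates are not assumed here, the query simply lists a few arbitrary triples rather than performing a real substructure search.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://glyconnect.expasy.org/glystreem/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o }
    LIMIT 5
""")

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])
```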
  23. Machine-assisted curation of molecular mechanisms using automated knowledge extraction and assembly
    Benjamin M. Gyori, Charles Tapley Hoyt
    Expand the abstract
    Curated resources such as protein-protein interaction databases and pathway databases require substantial human effort to maintain. An important bottleneck is the large body of published literature containing such information and the rate at which new publications appear. We present an approach that combines text mining and knowledge assembly to extend existing curated resources automatically using the INDRA system [1,2]. INDRA is an open-source Python library that integrates multiple text mining systems that can extract relations representing molecular mechanisms such as binding, phosphorylation and transcriptional regulation from publications at scale. Individual extractions providing evidence for each mechanism are then assembled from different text mining systems and publications. Assembly involves finding overlaps and redundancies in mechanisms extracted from published papers (using an ontology-guided approach) and using probability models to assess confidence and reduce machine reading errors. Beyond automated extraction, assembly and confidence assessment, INDRA also makes available a web-based interface to review assembled mechanisms along with supporting evidence and mark any errors to improve downstream usage. Automatically assembled knowledge extends and enriches curated resources in several ways: by (1) finding new relationships that have not yet been manually curated, (2) adding additional mechanistic detail to existing curated relationships and (3) finding additional evidence for existing curated relationships in new experimental contexts. We demonstrate all forms of extensions (1-3) using INDRA on human protein-protein interactions in BioGRID [3] and kinase-substrate annotations in PhosphoSitePlus [4]. We discuss characteristics of this machine-assisted curation workflow in terms of the number of assembled mechanisms that need to be reviewed to find correct new relationships (i.e., the curation yield) from text mining. Finally, we quantify the effect of these extensions on downstream data analysis [5]. Overall, this constitutes a reusable workflow for machine-assisted curation that can be applied to a broad range of resources to extend and enrich their content. [1] Gyori BM, Bachman JA, Subramanian K, Muhlich JL, Galescu L, Sorger PK. From word models to executable models of signaling networks using automated assembly. Molecular Systems Biology, 2017 13(11):954. https://doi.org/10.15252/msb.20177651 [2] Bachman JA, Gyori BM, Sorger PK Automated assembly of molecular mechanisms at scale from text mining and curated databases bioRxiv, 2022. https://doi.org/10.1101/2022.08.30.505688 [3] Oughtred R, Rust J, Chang C, Breitkreutz BJ, Stark C, Willems A, Boucher L, Leung G, Kolas N, Zhang F, Dolma S, Coulombe-Huntington J, Chatr-Aryamontri A, Dolinski K, Tyers M. The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci. 2021;30(1):187-200. https://doi.org/10.1002/pro.3978 [4] Hornbeck PV, Kornhauser JM, Latham V, Murray B, Nandhikonda V, Nord A, Skrzypek E, Wheeler T, Zhang B, Gnad F. 15 years of PhosphoSitePlus®: integrating post-translationally modified sites, disease variants and isoforms. Nucleic Acids Res. 2019;47(D1):D433-D441. https://doi.org/10.1093/nar/gky1159 [5] Bachman JA, Sorger PK, Gyori BM. Assembling a corpus of phosphoproteomic annotations using ProtMapper to normalize site information from databases and text mining bioRxiv, 2022. https://doi.org/10.1101/822668
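    A hedged Python sketch of text-based extraction and assembly with INDRA, assuming network access to the remote REACH reading service; the input sentence is made up, and real workflows process full publications at scale before comparison with resources such as BioGRID or PhosphoSitePlus.

```python
from indra.sources import reach
from indra.tools import assemble_corpus as ac

# Extract mechanistic Statements (e.g. Phosphorylation) from a short example text.
rp = reach.process_text("MAP2K1 phosphorylates MAPK1.")
stmts = rp.statements if rp is not None else []

# Assembly: normalize grounding and sites, then deduplicate and estimate belief.
stmts = ac.map_grounding(stmts)
stmts = ac.map_sequence(stmts)
stmts = ac.run_preassembly(stmts)

for stmt in stmts:
    print(stmt, "belief =", round(stmt.belief, 2))
```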
  24. A Framework for Assisting MeSH Vocabulary Development at the National Library of Medicine: Reliably Identifying Literature Containing New Chemical Substances
    Rezarta Islamaj, Nicholas Miliaras, Olga Printseva, Zhiyong Lu
    Expand the abstract
    Medical Subject Headings (MeSH) is a controlled and hierarchically organized vocabulary produced by the National Library of Medicine (NLM), used for indexing, cataloging, and searching of biomedical and health-related information in PubMed. It consists of more than 30,000 MeSH Headings (descriptors) organized hierarchically in 16 main branches of the MeSH tree, and more than 320,000 supplementary concept records (SCRs) which map to various headings or branches of the tree. Due to daily advances in biomedicine and related areas, an important goal of the biocuration scientists at NLM is to keep MeSH current. The Chemicals and Drugs tree is the largest main branch in MeSH and is also one of the most active areas of vocabulary development. New chemicals, new synonym terms for existing concepts, and new chemical groups comprised 62% of the new MeSH requests in 2022. From May to December 2022, NLM curators identified 1,140 chemical and protein terms that were not included in MeSH and, after review, created 592 new SCRs and added 129 new terms/synonyms. To respond to NLM curators' need for a tool to assist with rapid identification of new articles containing new chemical names, and building on our previous success in accurately identifying chemicals in the biomedical literature, we developed a framework to assist MeSH vocabulary development for chemicals. In March 2022, we started with a selected set of 200 PubMed articles in TeamTat, annotated with the NLM-Chem algorithm for chemicals, which five curators reviewed. Of those, 174 articles contained new chemical terminology not included in MeSH. These articles were used as the seed for classification training in LitSuggest, which considered all articles published since January 1, 2022, and identified 1,416 that contained chemicals, which were then tested for our topic of interest (new chemicals). From this set, 39 articles marked positive by LitSuggest were ported to TeamTat for manual review. Since March 2022, we have run this framework several times. Each time, we re-populate the training set in LitSuggest with articles of interest verified in TeamTat and retrain the classifier. We then run the classifier on newly published articles in PubMed, use the NLM-Chem algorithm to mark all chemicals in the LitSuggest-identified articles, and review them in TeamTat. As a result of this work, we have identified 453 new chemical entries submitted as chemical flags or MeSH requests. Of these, 333 were new chemical substances, 71 were new synonyms for existing terms, and 49 were new chemical groups. We expect this framework to significantly improve our productivity, and we aim to fully integrate it into the indexing production pipeline for new terminology suggestion.
  25. Medical Action Ontology (MAxO) development and tool implementation for the annotation of Rare Disease (RD)
    Leigh C Carmody, Michael Gargano, Nicole A Vasilevsky, Sabrina Toro, Lauren Chan, Hannah Blau, Xingmin A. Zhang, Monica C Munoz-Torres, Chris Mungall, Nicolas Matentzoglu, Melissa Haendel, Peter Robinson
    Expand the abstract
    A rare disease (RD) affects fewer than one in 2,000 individuals. Finding relevant clinical literature about strategies to manage RD patients is often difficult. Responding to this need, the Medical Action Ontology (MAxO) was developed to provide a structured vocabulary for medical procedures, interventions, therapies, and treatments for disease. While MAxO’s initial use case is to annotate RD, it can be broadly applied to common and infectious diseases. Currently, MAxO contains over 1,387 terms added by manual curation using the Protégé OWL editor and ROBOT templates, or by semi-automated curation using Dead Simple Design Patterns (DOSDP). Ontologies from the Open Biological and Biomedical Ontology Foundry (OBO Foundry) were used to axiomatize many terms and help give MAxO structure. Imported ontologies include the Uber-anatomy Ontology (UBERON), Ontology for Biomedical Investigations (OBI), Chemical Entities of Biological Interest (ChEBI), Food Ontology (FOODON), Human Phenotype Ontology (HP), and Protein Ontology (PRO). An annotation database is currently under development to capture the medical actions and treatments used for RD. Three types of annotations are being curated: (1) diagnosis annotations (MAxO-HP): these are diagnostic terms, such as laboratory tests or diagnostic imaging procedures, that are used to observe or measure phenotypic abnormalities. These annotations are disease agnostic, since the modality for assessing an abnormal phenotypic feature such as ‘Agenesis of the corpus callosum’ (HP:0001274) or ‘Hypertelorism’ (HP:0000316) does not depend on the underlying disease. (2) RD-associated phenotypes are also annotated with medical actions to capture how symptoms are treated in the context of a particular disease. For example, in Tumor Predisposition Syndrome, where patients are prone to lung adenocarcinomas, the medical recommendation is to avoid radiation, including avoiding chest X-rays and computerized tomography (CT), so as not to exacerbate the condition. Therefore, medical recommendation annotations (e.g. ‘avoid CT scans’ (MAXO:0010321) and ‘radiographic imaging avoidance’ (MAXO:0001127)) are specific to the phenotypes (e.g. ‘lung adenocarcinoma’ (HP:0030078)) associated with that particular disease (e.g. ‘BAP1-related tumor predisposition syndrome’, MONDO:0013692). (3) Disease-specific annotations (MAxO-RD): these are either curative treatments or treatments that alleviate disease-associated phenotypes by directly affecting the disease cause. For example, gene therapy (MAXO:0001001) directly targets the gene variant causing the RD and thereby affects all phenotypes associated with the disease. The POET website (https://poet.jax.org/) houses MAxO annotations. Once fully established, the medical and research community will be invited to contribute to the MAxO annotation database. While our initial efforts are focused on annotating RD, future annotations could be collected for common diseases. All community-added annotations will be verified before being published. All annotations will be available on the Human Phenotype Ontology website (hpo.jax.org). MAxO is open-source and freely available under a Creative Commons Attribution license (CC-BY 4.0) (https://github.com/monarch-initiative/MAxO; https://obofoundry.org/ontology/maxo).
  26. Community SARS-CoV-2 Curation Driven Emergent Experiences - Increased Curation Efficiency and Learned Lessons for the Future
    Marc Gillespie, Peter D'Eustachio, Robin Haw, Lisa Matthews, Andrea Senff-Ribeiro, Lincoln Stein, Guanming Wu, Henning Hermjakob, Cristoffer Sevilla, Marija Milacic, Veronica Shamovsky, Karen Rothfels, Ralf Stephan, Justin Cook
    Expand the abstract
    In April 2020 Reactome began a COVID-19 curation project. This project differed in multiple ways from Reactome's standard curation practices, but ultimately provided critical procedural insights. Reactome is an open-source, open-access, manually curated and peer-reviewed pathway database. Our goal is to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic and clinical research, genome analysis, modeling, systems biology and education. At the beginning of the COVID-19 pandemic the Reactome group joined the COVID-19 Disease Map community, a broad community-driven effort developing a COVID-19 Disease Map. This group combined the biocuration practices and philosophies of the Reactome, CellDesigner, and WikiPathways communities. The parallel curation of the viral lifecycle and host cell interactions provided each curation community with a unique opportunity to learn, compare, and adopt best curation practices. Shared data repositories, weekly meetings, and chat and knowledge-exchange software were present from the very beginning of this communal effort, and the software and mechanisms to compare each community's work only grew as the effort proceeded. Curation of a molecular pathway often starts with a set of expert assertions found in literature reviews covering a wide range of experiments and knowledge. For SARS-CoV-2 no such literature existed. Reactome used an orthoprojection approach, first curating the well-supported SARS-CoV-1 viral lifecycle, and then using Reactome's orthologous projection tools to create the associated pathways for SARS-CoV-2. This approach produced a logical scaffolding for SARS-CoV-2 curation, rapidly accelerating the curation of an emerging pathogen. The process of adding SARS-CoV-2 experimental support was accelerated, providing curators and other researchers with a global view of each new fact's impact on the model. Assessing the literature support for those curatorial assertions posed another challenge. The experimental work around SARS-CoV-2 was growing far faster than any one curator or community could manage. Pre-prints, which were never part of Reactome or most other molecular pathway curation efforts, were critical to this endeavor. What emerged was a pipeline that combined human expertise with automated computational paper identification. A literature triage system grew that connected curators and editors across the community, all building a central repository of papers tagged to specific molecular entities and steps in the pathway. As annotated models emerged, the communities provided rapid review and revision. Groups of curators provided an additional layer of review. These review groups were precisely positioned: building similar networks, using similar sets of literature support, and able to understand the nuances of the different data models that each community worked with. This process identified four areas: use of orthoprojection, literature triage, alignment of curatorial communities working in the same area but using different tools, and a new layer of curatorial community review. All of these steps would be enhanced using current computational approaches. Rather than look to replace manual curation, we should focus on using the best aspects of curation across communities and approaches.
  27. Curating species descriptions for the digital age
    Peter Uetz
    Expand the abstract
    About 2 million species of animals, plants, and microbes have been described. Excluding microbes, the vast majority of them have been described using plain text and some illustrations. While there are about 150 global species databases, now organized under the umbrella of the Catalogue of Life project (https://www.catalogueoflife.org), only some of them have compiled comprehensive data on species descriptions. Using the Reptile Database as an example (http://www.reptile-database.org), we present an attempt to collect historical and recent species descriptions, develop dictionaries and an ontology of morphological terms, and show initial results from text-mining this corpus. The database currently contains descriptions of about 8,000 (out of ~12,000) species, curated from ca. 5,000 papers and books, plus photos for 6,200 species. This dataset is complemented by links to various other data sources such as the NCBI Taxonomy (and thus GenBank), IUCN conservation data, range data, and, via DOIs, the source publications. The goal of this project is to extract information into tables that can be used for downstream applications and analyses, such as identification tools or phylogenetic and macro-ecological studies. The state of species descriptions across databases is summarized and related projects are presented, such as those by Plazi and the Biodiversity Community Integrated Knowledge Library (BiCIKL).
  28. Accessing UK Biobank-derived data through the AZ PheWAS portal, with reassigned phenotypic ICD10 codes.
    Jennifer Harrow, Karyn Mégy, Amanda O'Neil, Keren Carss, Quanli Wang, Gulum Alamgir, Shikta Das, Sebastian Wasilewski, Eleanor Wheeler, Katherine Smith, Slavé Petrovski
    Expand the abstract
    Early-stage incorporation of human genomic data into the assessment of drug targets has been shown to significantly increase drug pipeline success rates [1],[2]. Genome-wide association studies have uncovered thousands of common variants associated with human disease, but the contribution of rare variants to common disease remains relatively unexplored. The UK Biobank (UKB) contains detailed phenotypic data linked to medical records for approximately 500,000 participants, offering an unprecedented opportunity to evaluate the effect of rare variation on a broad collection of traits. We have studied the relationships between rare protein-coding variants and 17,361 binary and 1,419 quantitative phenotypes using exome sequencing data from 269,171 UK Biobank participants of European ancestry, and we have performed ancestry-specific and pan-ancestry collapsing analyses using exome sequencing data from 11,933 UKB participants of African, East Asian or South Asian ancestry. The results highlight a significant contribution of rare variants to common disease [3], and the summary statistics have been made publicly available through the AZ PheWAS interactive portal (http://azphewas.com/). Phenotypes originate from multiple sources and are encoded with different classifications, or different versions of the same classification. We have manually validated the existing NLP-derived ICD9-ICD10 mapping [4] and assigned ICD10 codes to the updated ICD10 terms and to unclassified UKB fields. This mapping will be integrated into the upcoming release of the AZ PheWAS Portal. We would like to engage with ontologists and the ISB community around the remapping of phenotypes, and to discuss the availability of community data dictionaries to facilitate re-mapping of further datasets.

Poster Presentations

There will be two back-to-back poster presentation sessions of 1 hour each. Posters should be at most 90 cm wide and 100-110 cm tall. Poster boards and materials for hanging posters will be provided. We suggest that posters include a QR code that viewers can scan, linking either to a downloadable version of the poster or to other relevant resources.

  1. Enhanced integration of UniProtKB/Swiss-Prot, ClinVar and PubMed
    Maria Livia Famiglietti, Anne Estreicher, Lionel Breuza, Alan Bridge, The Uniprot Consortium
    Expand the abstract
    UniProtKB/Swiss-Prot is a reference resource of protein sequences enriched with expert curated information on protein functions, interactions, variation and disease. It describes over 6,200 human diseases linked to over 4,800 protein coding genes and 32,000 disease-associated human variants. Information on variants is annotated by expert curators based on peer-reviewed published articles, identified either manually or thanks to text mining tools such as LitSuggest, a web-based system for identification and triage of relevant articles in PubMed. Our current work focuses on the annotation of clinical significance of variants using the ACMG guidelines and ClinGen tools, submission of interpretations to ClinVar, and annotation of their functional characterization, if available. Functional annotations are standardized using controlled terms from a range of ontologies, including Gene Ontology and VariO to provide UniProt users with machine-readable data. Taken together, this work will increase the coverage and usability of curated variant data in UniProtKB, and the utility of UniProt as a platform to integrate genome and variation data with the knowledge of protein function and disease.
  2. SciBite: A bespoke approach to facilitating FAIR practice in the Life Sciences Industry
    Rachael Huntley
    Expand the abstract
    Pharmaceutical and biotechnology companies have multiple information-rich documents in various formats and locations, making it difficult to find relevant information using simple search techniques and hampering efforts to make their data FAIR (Findable, Accessible, Interoperable and Reusable). We present an overview of the solutions that the technology company SciBite provides to customers in the life science space to assist in their journey towards FAIR data. SciBite's technology relies on the use of expertly curated ontologies and vocabularies. In addition to creating and updating these ontologies and vocabularies, our curators work with all teams within SciBite to contribute to product development and software testing, as well as with customers to provide bespoke solutions for their data management. We hope this overview will provide biocurators with an insight into a curator's role in an industry setting.
  3. Enzyme and transporter annotation in UniProtKB using Rhea and ChEBI
    Cristina Casals-Casas, Uniprot Consortium
    Expand the abstract
    The UniProt Knowledgebase (UniProtKB, at www.uniprot.org) is a reference resource of protein sequences and functional annotation. Here we describe a broad ranging biocuration effort, supported by state-of-the-art machine learning methods for literature triage, to describe enzyme and transporter chemistry in UniProtKB using Rhea, an expert curated knowledgebase of biochemical reactions (www.rhea-db.org) based on the ChEBI ontology of small molecules (www.ebi.ac.uk/chebi/). This work covers proteins from a broad range of taxonomic groups, including proteins from human, plants, fungi, and microbes, and both primary and secondary metabolites. It provides enhanced links and interoperability with other biological knowledge resources that use the ChEBI ontology and standard chemical structure descriptors, and improved support for applications such as metabolic modeling, metabolomics data analysis and integration, and efforts to predict enzyme function and biosynthetic and bioremediation pathways using advanced machine learning and other approaches.
  4. Assessing Resource Use: A Case Study with the Human Disease Ontology
    J. Allen Baron, Lynn Schriml
    Expand the abstract
    As a genomic resource provider, getting a handle on how your resource is utilized and being able to document the plethora of use cases is vital to demonstrating sustainability. Herein we describe a flexible workflow, built on readily available software, that the Disease Ontology (DO) project has used to transition to semi-automated methods for identifying uses of the ontology in the published literature. The novel R package DO.utils provides a small set of key functions to support our usage workflow in combination with Google Sheets. Use of this workflow has resulted in a three-fold increase in the number of identified publications that use the DO and has provided novel usage insights that offer new research directions and reveal a clearer picture of the DO's use and scientific impact. Our resource use assessment workflow and the supporting software are designed to be utilized by other genomic resources to achieve similar results.
  5. Genomic Standards Consortium tools for genomic data biocuration - 2023 update.
    Lynn Schriml, Chris Hunter, Ramona Walls, Pier Luigi Buttigieg, Anjanette Johnston, Tanja Barrett, Josie Burgin, Jasper Koehorst, Peter Woollard, Montana Smith, Bill Duncan, Mark Miller, Jimena Linares, Sujay Sanjeev Patil
    Expand the abstract
    The Genomic Standards Consortium’s (GSC, www.gensc.org) successful development and implementation of the Minimum Information about any (x) Sequence (MIxS) genomic metadata standards have established a community-based mechanism for sharing genomic and other sequence data through a common framework. The GSC, an international open-membership working body of over 500 researchers from 15 countries, promotes community-driven efforts for the reuse and analysis of contextual metadata describing the collected sample, the environment and/or the host, and the sequencing methodologies and technologies. Since 2005, the GSC has deployed genome, metagenome, marker gene, single amplified genome, metagenome-assembled genome and uncultivated virus genome checklists and a library of 23 MIxS environmental packages to enable standardized capture of environmental, human and host-associated study data. In 2022, the GSC relaunched its website using GitHub Pages, allowing a shared and distributed approach to website maintenance. We have also made our standards available in the GSC's GitHub repository (https://github.com/GenomicsStandardsConsortium/mixs/tree/main/mixs). The latest release, MIxS v6.0, includes six new environmental checklists: Agriculture Microbiome, Host-Parasite Microbiome, Food-animal and animal feed, Food-farm environment, Food-food production facility and Food-human foods. These and other specifications are managed using GitHub releases and LinkML schema tooling and are assigned globally unique URIs and unique MIxS IDs. MIxS is released in Excel, JSON-LD, OWL, and ShEx serializations. These standards capture expert knowledge, enable data reuse and integration and foster cross-study data comparisons, thus addressing the critical need for consistent (meta)data representation, data sharing and the promotion of interoperability. The GSC's suite of MIxS reporting guidelines has been supported for over a decade by the International Nucleotide Sequence Database Collaboration (INSDC) databases, namely NCBI GenBank and BioSample, EMBL-EBI ENA and BioSamples, and DDBJ, thus allowing for an enriched environmental and epidemiological description of sequenced samples. To date, over 1,793,966 NCBI BioSample records (compared to 450,000 in 2019) have been annotated with the GSC's MIxS standards. In the last year, the GSC has established two significant collaborations, with the National Microbiome Data Collaborative (NMDC) and with Biodiversity Information Standards (TDWG). Upon its launch, NMDC used MIxS v5 for the metadata terms that describe and identify biosamples. The NMDC and GSC actively collaborate to improve the MIxS representation. Feedback gathered from the NMDC Data Portal (https://data.microbiomedata.org/), users and subject matter experts has identified new metadata terms and updates to improve the MIxS schema, and has introduced LinkML as a method for managing MIxS, converting the MIxS Google Sheets to LinkML's schemasheets representation in order to make future versions of MIxS more computable and easier to maintain. To sustainably bridge the GSC standards to recent, biomolecular-focused extensions of TDWG's Darwin Core standard (https://dwc.tdwg.org), we have implemented technical semantic and syntactic mappings in the Simple Standard for Sharing Ontology Mappings (SSSOM); a minimal example of reading such a mapping set is sketched below. A memorandum of understanding has been formalized to govern this mapping's maintenance, providing users with an authoritative resource for interoperation.
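    A small Python sketch of reading an SSSOM mapping set with pandas; the file name is hypothetical, while subject_id, predicate_id and object_id are standard SSSOM column names and the '#'-prefixed lines hold the mapping set metadata.

```python
import pandas as pd

# Hypothetical file name for a MIxS-to-Darwin Core mapping set in SSSOM TSV format.
mappings = pd.read_csv("mixs_to_dwc.sssom.tsv", sep="\t", comment="#")

# Keep only the exact matches between terms of the two standards.
exact = mappings[mappings["predicate_id"] == "skos:exactMatch"]
print(exact[["subject_id", "object_id"]].head())
```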
  6. Biocuration at Rhea, the reaction knowledgebase
    Kristian B. Axelsen, Anne Morgat, Elisabeth Coudert, Lucila Aimo, Nevila Hyka-Nouspikel, Parit Bansal, Elisabeth Gasteiger, Arnaud Kerhornou, Teresa Batista Neto, Monica Pozzato, Marie-Claude Blatter, Nicole Redaschi, Alan Bridge
    Expand the abstract
    Rhea (www.rhea-db.org) is a FAIR resource of expert-curated biochemical and transport reactions described using the ChEBI ontology of small molecules (www.ebi.ac.uk/chebi/) and evidenced by peer-reviewed literature (https://pubmed.ncbi.nlm.nih.gov/). Since 2018, Rhea has been used for explicit annotation of enzymatic activities in UniProtKB (www.uniprot.org). It is also used as a reference for enzyme and transporter activity by the Gene Ontology (http://geneontology.org/) and Reactome (https://reactome.org/). Rhea covers biochemically characterized reactions from primary and secondary metabolism of a broad range of taxa, involving small molecules and the reactive groups of macromolecules. Curation priorities are to a large extent driven by reaction requests, which mainly come from UniProtKB. We also create reactions included in the IUBMB Enzyme Nomenclature, so that Rhea provides full coverage of the reactions described by EC numbers. In addition, we create newly characterized reactions of general interest, identified with the help of ML approaches such as LitSuggest. To enable the creation of reactions in Rhea, it is very often necessary for Rhea curators to submit the required compounds to ChEBI, making Rhea one of the primary sources of new ChEBI compounds. This poster describes the current content of the Rhea resource and demonstrates how chemicals, reactions and proteins can be linked across these complementary knowledge resources.
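    To illustrate how reactions and their descriptions can be retrieved programmatically, here is a minimal Python sketch against the public Rhea SPARQL endpoint; the endpoint URL and the rh: property names follow the published Rhea RDF examples and should be treated as assumptions to verify, not content of the abstract.

        # Minimal sketch: list a few Rhea reactions with their accessions and equations.
        # Endpoint URL and rh: property names are assumptions based on the public
        # Rhea RDF documentation; verify before relying on them.
        from SPARQLWrapper import SPARQLWrapper, JSON

        sparql = SPARQLWrapper("https://sparql.rhea-db.org/sparql")
        sparql.setQuery("""
            PREFIX rh: <http://rdf.rhea-db.org/>
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT ?accession ?equation WHERE {
              ?reaction rdfs:subClassOf rh:Reaction ;
                        rh:accession ?accession ;
                        rh:equation ?equation .
            } LIMIT 5
        """)
        sparql.setReturnFormat(JSON)
        for row in sparql.query().convert()["results"]["bindings"]:
            print(row["accession"]["value"], "|", row["equation"]["value"])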
  7. Protein tunnels database
    Anna Špačková, Karel Berka, Václav Bazgier
    Expand the abstract
    Channels in proteins play a significant role in the development of new drugs, so it is important to study these structures. MOLEonline (https://mole.upol.cz/) is a freely available tool for discovering tunnels in protein structures. The algorithm can detect tunnels, pores and channels connecting the interior of a protein to its surface. The resulting information can be stored in the ChannelsDB database (https://channelsdb.ncbr.muni.cz/). Because the results of the algorithm say nothing about biological importance, we would like to develop a new tool that can recognize it. We plan to base this tool on artificial intelligence combined with knowledge of biologically relevant channels, and thereby create a new ontology. This improvement could help with docking molecules into buried active sites and, more generally, with drug discovery.
  8. Machine learning for extraction of biochemical reactions from the scientific literature
    Blanca Cabrera Gil, Anne Morgat, Venkatesh Muthukrishnan, Elisabeth Coudert, Kristian Axelsen, Nicole Redaschi, Lucila Aimo, Alan Bridge
    Expand the abstract
    Rhea (www.rhea-db.org) is an expert-curated knowledgebase of biochemical reactions built on the chemical ontology ChEBI (www.ebi.ac.uk/chebi); it is the reference vocabulary for enzyme and transporter annotation in UniProtKB (www.uniprot.org) and an ELIXIR Core Data Resource. Rhea currently describes over 15,000 unique reactions and provides annotations for over 23 million proteins in UniProtKB in forms that are FAIR, but most knowledge of enzymes remains locked in the literature and is inaccessible to researchers. Machine learning methods provide a powerful tool to address this problem. Here we present work designed to accelerate the expert curation of Rhea by using Rhea itself to teach large language models the rules of chemistry, so that they learn to extract putative enzymatic reactions automatically from the literature. This showcases the power of expert-curated knowledgebases like Rhea to enable the development of machine learning applications.
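    As a purely illustrative sketch of how such an extraction model could be applied at inference time, the snippet below runs a seq2seq transformer over a sentence describing an enzymatic transformation; the checkpoint name is hypothetical and does not refer to any released model.

        # Hypothetical sketch: apply a fine-tuned seq2seq model that maps a sentence
        # to a candidate Rhea-style reaction equation. The checkpoint name is invented
        # for illustration only.
        from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

        MODEL = "example-org/rhea-reaction-extractor"  # hypothetical checkpoint

        tokenizer = AutoTokenizer.from_pretrained(MODEL)
        model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

        sentence = ("The purified enzyme catalysed the NADP+-dependent "
                    "oxidation of D-glucose 6-phosphate.")
        inputs = tokenizer(sentence, return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=64)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))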
  9. Application profile based RDF generation for FAIR data publishing
    Nishad Thalhath, Mitsuharu Nagamori, Tetsuo Sakaguchi
    Expand the abstract
    The Resource Description Framework (RDF) is a format for representing information on the Semantic Web, which allows for the publication of data that follows the FAIR (Findable, Accessible, Interoperable, Reusable) principles and can be encoded and expressed with interoperable metadata. RDF has the advantages of being flexible, extensible, and interoperable. Application profiles, also known as metadata application profiles, are a way of modeling and profiling data in RDF. These profiles combine terms from different namespaces and define how they should be used and optimized for a particular local application, along with constraints on their use to ensure the data is valid. Application profiles can promote interoperability between different metadata models and harmonize metadata practices among communities. To ensure the data is FAIR, it is important to define the semantic model of the data, which describes the meaning of entities and relationships in a clear, accurate, and actionable way for a computer to understand. Developing a proper semantic model can be challenging, even for experienced data modellers, and it is important to consider the specific domain and purpose for which the model is being created. Application profiles help ensure the semantic interoperability of the data they represent by providing an explanation of the data and its constraints, and can help FAIRify the data and improve its quality by providing validation schemas. The authors have developed the YAMA Mapping Language (YAMAML) as a tool for creating RDF from non-RDF data. It is based on the Yet Another Metadata Application Profiles (YAMA) format, which is derived from the Description Set Profiles (DSP) language for constructing application profiles. YAMAML is implemented using YAML, a popular data serialization format known for its human readability and compatibility with programming languages. As a variant of JSON, YAML can easily be converted to and from other data formats. YAMAML presents the elements of YAMA's application profiles in a streamlined markup language for mapping non-RDF data to RDF. While YAMAML is intended to generate RDF, it is not itself an RDF representation syntax. The authors have developed a specification and tooling to demonstrate its capabilities as a method for generating RDF. YAMAML can generate RDF from non-RDF data, generate application profiles for the data, and generate RDF validation scripts in Shape Expression Language (ShEx). The authors developed YAMAML with the idea that a proper semantic model can help transform non-FAIR data into linkable data, provide more 5-star open data, and improve its reusability and interoperability.
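    For readers unfamiliar with the general pattern that YAMAML automates, the following minimal Python sketch shows profile-driven RDF generation from tabular data with rdflib; it uses an ad hoc mapping dictionary as a stand-in for an application profile and does not reproduce YAMAML syntax.

        # Minimal sketch of application-profile-style RDF generation with rdflib.
        # The mapping below is an ad hoc stand-in for a real application profile;
        # it is not YAMAML syntax.
        from rdflib import Graph, Literal, Namespace, RDF

        SCHEMA = Namespace("http://schema.org/")
        EX = Namespace("http://example.org/person/")

        # Non-RDF source data (e.g. a row from a CSV file)
        rows = [{"id": "p1", "name": "Ada Lovelace", "birthDate": "1815-12-10"}]

        # A tiny "profile": which RDF property each column maps to
        profile = {"name": SCHEMA.name, "birthDate": SCHEMA.birthDate}

        g = Graph()
        for row in rows:
            subject = EX[row["id"]]
            g.add((subject, RDF.type, SCHEMA.Person))
            for column, prop in profile.items():
                g.add((subject, prop, Literal(row[column])))

        print(g.serialize(format="turtle"))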
  10. SwissBioPics – an interactive library of cell images for the visualization of subcellular location data
    Philippe Le Mercier, Jerven Bolleman, Edouard de Castro, Elisabeth Gasteiger, Andrea Auchincloss, Emmanuel Boutet, Lionel Breuza, Cristina Casals Casas, Anne Estreicher, Marc Feuermann, Damien Lieberherr, Catherine Rivoire, Ivo Pedruzzi, Nicole Redaschi, Alan Bridge
    Expand the abstract
    SwissBioPics is a freely available library of interactive high-resolution cell images designed for the visualization of subcellular location data that covers subcellular locations and cell types from all kingdoms of life. The images can be explored on the SwissBioPics website (www.swissbiopics.org) and used to display subcellular location annotations on other websites with our reusable web component (www.npmjs.com/package/%40swissprot/swissbiopics-visualizer). This web component, when provided with an NCBI taxonomy identifier and a list of subcellular location identifiers (UniProt or GO terms), will automatically select the appropriate image and highlight the given subcellular locations. Resources such as UniProt (www.uniprot.org) and Open Targets (www.opentargets.org/) have adopted SwissBioPics for visualization, and we regularly update the image library as knowledge in UniProt evolves. We hope other developers will adopt the SwissBioPics web component, and would welcome requests to expand the SwissBioPics image library and enhance programmatic access to it. SwissBioPics is freely available under a Creative Commons Attribution 4.0 license (CC BY 4.0).
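    The web component consumes an NCBI taxonomy identifier plus UniProt subcellular location (SL) or GO identifiers; the Python sketch below shows one way these inputs might be gathered from the UniProt REST API for a protein of interest. The JSON field names are recalled from the public API and should be double-checked.

        # Sketch: collect the inputs the SwissBioPics web component needs (taxon ID and
        # subcellular location identifiers) from the UniProt REST API. Field names are
        # assumptions based on the public API and may need adjustment.
        import requests

        acc = "P05067"  # arbitrary example accession (human APP)
        entry = requests.get(f"https://rest.uniprot.org/uniprotkb/{acc}.json").json()

        taxon_id = entry["organism"]["taxonId"]
        locations = [
            loc["location"]["id"]  # e.g. "SL-0090"
            for comment in entry.get("comments", [])
            if comment.get("commentType") == "SUBCELLULAR LOCATION"
            for loc in comment.get("subcellularLocations", [])
        ]
        print(taxon_id, locations)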
  11. Glycan biomarker curation for integration with publicly available biomarker and glycobiology resources
    Karina Martinez, Daniel Lyman, Jeet Vora, Nathan Edwards, Rene Ranzinger, Mike Tiemeyer, Raja Mazumder
    Expand the abstract
    Altered glycosylation is associated with almost all major human diseases and reflects changes in cellular status, making glycans a promising target in the search for accessible biomarkers that can indicate disease with high sensitivity and specificity. Despite the translational importance of glycans as biomarkers, there currently appears to be no curation effort that specifically attempts to consolidate and standardize this knowledge. The complexity of glycan structures and the heterogeneity of the data present unique challenges for curation efforts. In this exploratory study, we curated 30 glycan and glycosylation-related biomarkers from the literature. The curation effort captured glycoconjugates, panels and free glycans, which were then mapped to GlyTouCan, UniProtKB and GlycoMotif accessions. Motifs identified in this study include Type 2 LacNAc, sialyl Lewis x, and Tn antigen, all of which exhibit altered levels of expression associated with disease. Within the context of a larger curation effort including genes, proteins, metabolites, and cells, we developed a data model that accounts for the complex and nuanced nature of glycan biomarkers. The harmonization of glycan biomarker data facilitates integration with an existing biomarker data model and with the glycoinformatics resource GlyGen. The availability of a curated glycan biomarker dataset will present new opportunities for data mining and disease prediction modeling.
  12. Curating Somatic Variants in Haematological Cancers in COSMIC
    Rachel Lyne, Joanna Argasinska, Denise Carvalho-Silva, Charlotte Cole, Leonie Hodges, Alex Holmes, Amaia Sangrador-Vegas, Sari Ward
    Expand the abstract
    COSMIC, the Catalogue of Somatic Mutations In Cancer (http://cancer.sanger.ac.uk), is the world's largest source of expert manually curated somatic mutation information relating to human cancers. The most recent release of COSMIC (v97) has focussed on the curation of haematological cancers and includes data from whole genome studies, large next-generation sequencing panels and individual case reports. Haematological cancers are the fifth most common type of cancer in the world and account for 7% of all cancer deaths. They encompass a broad range of cancer types, including leukaemias, lymphomas, myelomas and myeloproliferative neoplasms. The haematological cancer focus in COSMIC v97 involved the curation of 76 publications, comprising 43 case studies, 21 research papers and 12 whole exome/genome sequencing studies, which resulted in 2,687 tumour samples with 24,356 novel variants added to the database. Nine new blood tumour types were also added to COSMIC, with seven of these being newly proposed for the National Cancer Institute cancer classification system. This brings the total number of unique forms of haematopoietic & lymphoid neoplasms in COSMIC to over 340. Furthermore, 16 COSMIC tumour types were re-mapped to more specific NCIT tumour types, increasing the precision and interoperability of the data. Research into blood cancers has escalated in recent years and survival rates are much higher than they were thirty years ago. The development of targeted therapies based on mutations in specific genes has played a large part in this success. However, poor publishing practices are hindering data aggregation and sharing, and ultimately are slowing down further development of personalised treatments in this field. From our PubMed search results, 28 publications were not curated because of poor quality or missing data. Haematological cancer publications often report large sample sets; however, it is often difficult or impossible to extract key data points. Such data should be presented in a format that lends itself easily to computational curation and re-use.
  13. Addressing the data challenge of emerging viral diseases: COVID and MPOX resources in ViralZone
    Edouard de Castro, Patrick Masson, Cristina Casals Casas, Arnaud Kerhornou, Chantal Hulo, Ivo Pedruzzi, Sylvain Poux, Nicole Redaschi, Alan Bridge, Philippe Le Mercier
    Expand the abstract
    The emergence of viruses in humans has accelerated, as has the need to monitor and control the resulting new viral diseases. Researchers and clinicians must have access to knowledge and data to conduct accurate research and to develop diagnostics, vaccines, and therapeutics. To address this need, dedicated resources for SARS-CoV-2 and monkeypox viruses have been developed in ViralZone. The resources provide curated data on the biology of each virus: genome, transcriptome, proteome, and replication cycle; known antiviral drugs; vaccines; and links to epidemiological data (Nextstrain). For SARS-CoV-2, there is a variant resource covering all major circulating variants. Reference sequences are important to the scientific community because they facilitate genome-based diagnostics, bioinformatics, and research by ensuring that all groups use a common sequence. These strains are selected in collaboration with NCBI, Nextstrain, and ViPR to harmonize key viral databases. Modulation of host biology by viruses is a major factor in viral disease. The major host-virus interactions are curated with supporting evidence in the ViralZone resources. For monkeypox virus, 13 of these interactions have been curated into GO-CAM models that can better describe the interactions and their impact on cell biology. We plan to continue to provide specific resources for any emerging or re-emerging viruses in the future.
  14. Rapid Development of Knowledge Bases using Prompt Engineering and GitHub Copilot
    Sierra Moxon, Chris Mungall, Harshad Hegde, Mark A. Miller, Nomi Harris, Sujay Patil, Tim Putman, Kevin Schaper, Justin Reese, J. Harry Caufield, Patrick Kalita, Harold Solbrig
    Expand the abstract
    Development of knowledge bases (KBs) and ontologies is a time-consuming and largely manual process, requiring a combination of subject matter expertise and professional curation training. Despite recent advances in fields such as natural language processing and deep learning, most trusted knowledge in such repositories is either manually entered, or generated by deductive reasoning or rule-based processes. There are a number of reasons why AI techniques are not yet mainstream in KB construction[5,6]. One of the main obstacles is the trustworthiness of predicted facts. While these tools often do well against common ranking metrics, a significant portion of predicted facts are wrong[3], and incorporating them would pollute the KB and result in the erosion of trust in that resource. Instead of using AI in isolation and forcing a review of its predicted facts by a subject matter expert, a better approach would be to draw from the strengths of each member of the partnership working in tandem. The expert has wide-ranging and deep domain knowledge, the ability to understand at a fundamental level natural language descriptions of phenomena (e.g. as described in the scientific literature), and the ability to reason about representations of these phenomena. In contrast, the AI has no such knowledge or understanding but does possess a phenomenal ability to pattern-match and consume vast troves of information at a superficial level. Increasingly, AIs can also generate plausible and potentially correct content. Here we present initial findings on content generation assistance via prompt engineering. We call this approach Knowledge Base Prompt Engineering (KBPE). In this approach, large foundational language models intended for assisting in software development are adapted to KB curation workflows through the use of schemas structured using the Linked Data Modeling Language (LinkML) framework[9]. We find that software-based prompt engineering tools (specifically PyCharm[7] and GitHub Copilot[8]) work surprisingly well for a subset of knowledge acquisition tasks, in particular for rote tasks involving the structuring of common knowledge. A particularly relevant finding is that this approach readily adapts to custom domain-specific schemas, and is easily primed by previously stated facts. Results are highly variable and dependent on multiple dynamic factors, but in some cases, a single prompt can generate hundreds of largely accurate and useful facts, representing speedups of orders of magnitude. The challenge here is to identify and discard true-looking facts that are inaccurate. Due to the wide range of knowledge bases, the time that is taken to construct them, and the challenges in evaluating them, we performed a qualitative assessment in order to broadly inform us of the general feasibility of knowledge-based prompt engineering. Our tests include a variety of assessments including lexical definition completion, fact completion, fact negation and validity checking, logical definition completion, classification of facts, and the auto-generation of data properties. Using gene ontology curation[1,2,4] as an example, we walk through our findings in this presentation.
  15. Capturing the experimental research history of signalling pathways in Drosophila melanogaster
    Giulia Antonazzo, Helen Attrill, Nicholas H. Brown, Flybase Consortium
    Expand the abstract
    FlyBase, the knowledgebase for Drosophila melanogaster, introduced a new signalling pathways resource in 2018. This resource systematically assembles the experimental knowledge on key signalling pathways, providing research evidence-based lists of core members and regulators, network visualizations of pathway component physical interactions, as well as integrating tools and data to aid bench research. Since its introduction, this resource has been continuously updated to integrate more pathways as well as new features, including thumbnail images that show a textbook pathway representation, and a graphical comparison of member gene functions. We describe the growth of the resource and demonstrate its potential and utility to the research community. We use the corpus of highly curated pathway data to analyse the knowledge landscape of Drosophila signalling pathway research and ask questions such as, which genes are most studied? How has the volume of research on certain pathways changed over the years? We also present analyses that make use of the curated pathway members lists, together with functional genomics data available elsewhere in FlyBase, to exemplify how the resource can contribute to characterizing the biological properties of signalling pathways in Drosophila.
  16. GlyGen: Computational and Informatics Resources for Glycoscience 
    Rene Ranzinger, Karina Martinez, Jeet Vora, Sujeet Kulkarni, Robel Kahsay, Nathan Edwards, Raja Mazumder, Michael Tiemeyer
    Expand the abstract
    Advancing our understanding of the roles that glycosylation plays in development and disease is frequently hindered by the diversity of the data that must be integrated to gain insight into these complex phenomena. GlyGen is an initiative with the goal of democratizing glycoscience research by developing and implementing a data repository that integrates diverse types of information, including glycan structures, glycan biosynthesis enzymes, glycoproteins, and genomic and proteomic knowledge. To achieve this integration, GlyGen has established international collaborations with database providers from different domains (including but not limited to EBI, NCBI, PDB, and GlyTouCan) and glycoscience researchers. Information from these resources and groups is standardized and cross-linked to allow queries across multiple domains. To facilitate easy access to this information, an intuitive, web-based interface (https://glygen.org) has been developed to visually represent the data. In addition to the browser-based interface, GlyGen also offers RESTful webservice-based APIs and a SPARQL endpoint, allowing programmatic access to integrated datasets. For each glycan and glycoprotein in the dataset, GlyGen provides a details page that displays information from the integrated resources in a concise representation. Individual details pages are interlinked with each other, allowing easy data exploration across multiple domains. For example, users can browse from the webpage of a glycosylated protein to the glycan structures that have been described to be attached to this protein, and from there to other proteins that carry the same glycan. All information accessed through GlyGen is linked back to the original data sources, allowing users to easily access and browse through information pages in these resources as well. The GlyGen portal itself provides multiple different search interfaces for users to find glycans and proteins based on their properties or annotations. The most advanced of these searches is the GlyGen Super Search, which visualizes the entire data model in one graph and enables users to find glycans and proteins by adding constraints to this graph. Beyond the data on glycans and proteins, GlyGen also provides multiple tools for studying glycosylation pathways, investigating relationships between glycans based on incomplete structures, or mapping between different ID namespaces. Our goal is to provide scientists with an easy way to access the complex information that describes the biology of glycans and glycoproteins. To schedule an individual demo of GlyGen or to add your data to GlyGen, contact Rene Ranzinger (rene@ccrc.uga.edu).
  17. Using Disease Focused Curation to Enhance Cross Species Translation of Phenotype Data
    Susan M. Bello, Yvonne M. Bradford, Monte Westerfield, Cynthia L Smith, The Mgi And Zfin Curation Teams
    Expand the abstract
    Using model organism phenotype data to improve human disease diagnosis and treatment is limited by difficulties in translating model organism phenotypes to patient signs and symptoms and by incomplete curation of model organism data. Ontologies such as the Human Phenotype Ontology (HPO), which is used to annotate KidsFirst data, the Mammalian Phenotype (MP) ontology, used for mouse phenotypes, and the Phenotype and Trait Ontology (PATO) with the Zebrafish Anatomy Ontology (ZFA), used for zebrafish phenotypes, have been developed to standardize the reporting of phenotype data in each species, but translating the data among species is not always straightforward without defined relations between these ontologies. Mouse Genome Informatics (MGI, www.informatics.jax.org) and the Zebrafish Information Network (ZFIN, www.zfin.org) have established a joint effort to help bridge the translation gap through focused curation of model organism research on diseases from the KidsFirst (kidsfirstdrc.org) data resource. Both groups identified relevant publications for models of scoliosis and cleft palate and annotated all phenotypes reported for these models. As part of the phenotype curation workflows, terms missing from the relevant ontologies were identified and added. After annotation, the set of mouse model phenotypes was extracted from MGI and mapped to HPO terms using the Simple Standard for Sharing Ontology Mappings (SSSOM). These mappings are available from the Mouse-Human Ontology Mapping Initiative GitHub repository (github.com/mapping-commons/mh_mapping_initiative). Mapped MP and HPO terms are also used to improve alignment with ZFIN annotations. We are collaborating with the KidsFirst team to incorporate these data into the KidsFirst data portal. The expanded annotations and curated mappings will both enrich the data available for model-to-patient translation and support the development of new methods to improve phenotype translations in general. Supported by NIH grant OD033657.
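    Because the MP-to-HPO mappings are distributed as SSSOM TSV files in the mapping-commons repository mentioned above, they can be consumed with ordinary tabular tooling; the Python sketch below assumes a hypothetical file name within that repository and shows only the standard SSSOM columns.

        # Sketch: read an SSSOM mapping set (MP -> HPO) with pandas and index it by
        # subject term. The file path is hypothetical; browse
        # github.com/mapping-commons/mh_mapping_initiative for the real artefacts.
        import pandas as pd

        url = ("https://raw.githubusercontent.com/mapping-commons/"
               "mh_mapping_initiative/master/mappings/mp_hp_mgi_all.sssom.tsv")

        # SSSOM files carry a '#'-prefixed YAML header before the TSV body.
        mappings = pd.read_csv(url, sep="\t", comment="#")
        by_mp = mappings.set_index("subject_id")[["predicate_id", "object_id"]]
        print(by_mp.head())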
  18. Toward a curated glyco-interactome knowledgebase for the biology community using CarbArrayART
    Yukie Akune, Sena Arpinar, René Ranzinger, Ten Feizi, Yan Liu
    Expand the abstract
    Glycans are chains of variously linked monosaccharides biosynthesized by glycosyltransferases. They occur as oligosaccharides and polysaccharides, and as parts of glycoconjugates such as glycoproteins and glycolipids. They participate in innumerable recognition systems in health (development, cell differentiation, signalling and immunomodulation) and in disease states (inflammatory, infectious and non-infectious including neoplasia). Glycan microarray technologies for sequence-defined glycans, first introduced in 2002 by Feizi and colleagues, have revolutionized the molecular dissection of specificities of glycan-protein interactions [1-3]. Many datasets have been published in almost 1700 scientific publications from groups using different glycan array platforms. However, data interpretation is not always straightforward, as differences between array platforms may give differing results with the same glycan-binding system. The glycans are variously derivatized using numerous linkers and chemistries so that they can be immobilized on microarray surfaces covalently or noncovalently. These parameters can have a pronounced effect on the microarray readouts, hence the need for curation and annotation in the interpretation of glycan array data. Software tools have facilitated data handling [4,5] and we have recently released an advanced and distributable software tool called Carbohydrate micro-Array Analysis and Reporting Tool (CarbArrayART, http://carbarrayart.org) for glycan array data processing, storage and management [6]. As part of GlyGen (https://www.glygen.org) [7], a data integration and dissemination project for carbohydrate- and glycoconjugate-related data, we have been involved in the planning and design of a much-needed public glycan array data repository. An extension of CarbArrayART is being developed to allow uploading and downloading data to and from the repository. A critical component of this submission system is the definition and implementation of a common format for glycan array data and associated metadata in accordance with the glycan array minimum information guidelines developed by the MIRAGE commission [8]. In this communication, we will share our progress in defining criteria for glycan microarray data curation. These include establishing glycan microarray metadata standards for describing glycan-binding samples, experimental conditions, glycan probes arrayed, and microarray data processed. Using these standards, we have curated published array data from the Glycosciences Laboratory for submission to the GlyGen glycan microarray repository. In the future, CarbArrayART will serve as the vehicle for data transfer between local databases and the public glycan array repository, not only for newly generated data but also for datasets from existing research publications. We will extend our criteria to define glycan 'recognition motif(s)' for each glycan-binding system. This will fill knowledge gaps in glycan-mediated molecular interactions in the wider biological landscape. 1. Fukui S, Feizi T, et al. Nat. Biotechnol. 20:1011-7 (2002) 2. Rillahan CD, Paulson JC. Annu. Rev. Biochem. 80:797-823 (2011) 3. Palma AS, Feizi T, et al. Curr. Opin. Chem. Biol. 18:87-94 (2014) 4. Stoll M, Feizi T. Proceedings of the Beilstein Symposium on Glyco-Bioinformatics. 123-140 (2009) 5. Mehta AY, Heimburg-Molinaro J, et al. Beilstein J. Org. Chem. 16:2260-2271 (2020) 6. Akune Y, Arpinar S, et al. Glycobiology. 32:552-555 (2022) 7. York WS, Mazumder R, et al. Glycobiology. 30:72-73 (2020) 8. Liu Y, McBride R, et al. Glycobiology. 27:280-284 (2017)
  19. Exploiting single-cell RNA sequencing data on FlyBase
    Damien Goutte-Gattat, Nancy George, Irene Papatheodorou, Nick Brown
    Expand the abstract
    Single-cell RNA sequencing has proved an invaluable tool in biomedical research. The ability to survey the transcriptome of individual cells offers many opportunities and has already paved the way to many discoveries in both basic and clinical research. For the fruit fly alone, nearly a hundred single-cell RNA sequencing datasets have already been published since the first reported use of the technique in fly laboratories in 2017, a number that is only expected to grow quickly in the coming years. This increasing amount of available single-cell transcriptomic data, including whole-organism single-cell transcriptomic atlases, creates a challenge for biological databases to integrate these data and make them easily accessible to their users. FlyBase is the Model Organism Database (MOD) for all data related to Drosophila melanogaster. It provides access to a wide range of scientific information either manually curated from the published literature or from high-throughput research projects. For single-cell RNA sequencing data, we aim to help fly researchers to: (i) discover the available Drosophila datasets; (ii) learn the most important information about a dataset of interest; and (iii) get a quick overview of the expression data from those datasets. To that end, we have set up a collaboration with the Single Cell Expression Atlas (SCEA), the EMBL-EBI resource for gene expression at the single cell level. FlyBase curators assist the EMBL-EBI data scientists in obtaining and annotating Drosophila single-cell RNA sequencing datasets; in return, the SCEA provides FlyBase with the processed data in a standardized format, allowing for easier ingestion into our database. We then exploit the ingested data to enrich our gene report pages with specific displays for single cell expression data, giving our users an immediate view of the cell types in which a given gene has been found to be expressed.
  20. Biocuration meets Deep Learning
    Gregory Butler
    Expand the abstract
    Machine learning (ML) for protein sequence analysis relies on well-curated sources such as Swiss-Prot to provide "gold standard" datasets to train, test and evaluate tools that classify unknown proteins. Most tools use supervised learning, where the labels (annotations) from curators are essential. The trained classifiers may become basic tools in regular use by curators, thus completing the circle from curation to ML to curation. The bias in our understanding of cell molecular biology is reflected in the resources, through no fault of the curators. The accumulation of knowledge is driven by many factors, including ease of lab work, available instruments, availability of funding, historical focus on model organisms, and an emphasis on the publication of "positive" results rather than "negative" results. A lack of negative results effectively means that ML is not performing the expected positive-negative discrimination of samples but rather positive-unknown discrimination. Furthermore, most proteins are not fully annotated, that is, not all roles of the protein are known, only those reported in the literature. This means that multi-label ML, the ideal where each of the roles is predicted, is rarely attempted. Indeed, proteins annotated with multiple roles are often removed from gold standard datasets for binary or multi-class learning. Classical ML is highly dependent on feature engineering (FE) to determine a good set of features (attributes) on which to base the classifier. FE is difficult as the feature space is virtually limitless. Deep learning (DL) offers tools to bypass FE: the models themselves learn (sub)features at each level of the DL architecture. The trade-off is that substantial computational resources and large datasets are required to train DL models. DL models, like many classical ML models, lack interpretability, so predictions cannot be explained to the end-user scientists. AlphaFold attracted wide coverage in the literature for its success in the CASP challenge in 2018. Training used 170,000 proteins with structures and took a few weeks using 100-200 GPUs. Significantly, this led to the AlphaFold Protein Structure Database, with predicted structures for the proteomes of model organisms. Beyond structure, computational biologists are utilizing protein language models (PLMs) from deep learning. A PLM is constructed by self-supervised learning, which requires no labels, though it does require training on a very large number of sequences and hence substantial computing resources. The ProtBERT-BFD PLM was trained on BFD with 2.5 billion sequences, which took days using 1024 TPUs (tensor processing units). Initial results are promising for subcellular localization. We are also obtaining state-of-the-art results applying PLMs to a broad range of classification tasks for membrane proteins. While interpretability remains a major obstacle, DL representations address the positive-unknown problem. Furthermore, so-called few-shot learning, which uses mappings between DL representations from different sources such as sequence, structure, annotations, and text descriptions, allows predictions in situations where there is only a small number of labelled examples (even zero examples). How DL will impact the work of curators remains to be seen.
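    As a concrete illustration of the PLM workflow described above, the sketch below extracts a fixed-length embedding for one protein sequence with a ProtBERT-style model from the HuggingFace hub; the checkpoint identifier is assumed from the public hub and should be verified, and the downstream classifier is omitted.

        # Sketch: embed a protein sequence with a ProtBERT-BFD-style language model.
        # The checkpoint name is assumed from the public HuggingFace hub; verify it.
        import torch
        from transformers import BertModel, BertTokenizer

        CHECKPOINT = "Rostlab/prot_bert_bfd"
        tokenizer = BertTokenizer.from_pretrained(CHECKPOINT, do_lower_case=False)
        model = BertModel.from_pretrained(CHECKPOINT)

        sequence = "M K T A Y I A K Q R"  # ProtBERT expects space-separated residues
        inputs = tokenizer(sequence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)

        embedding = hidden.mean(dim=1)  # simple mean-pooled per-protein embedding
        # 'embedding' can now feed a downstream classifier (e.g. subcellular localization).
        print(embedding.shape)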
  21. AlphaFold structure predictions help improve Pfam models and annotations - Usage in curation.
    Sara Chuguransky, Typhaine Paysan-Lafosse, Alex Bateman
    Expand the abstract
    AlphaFold DB is an online resource developed by DeepMind in collaboration with EMBL-EBI, based on the results of AlphaFold 2.0, an AI system that predicts highly accurate 3D protein structures from the amino acid sequence. The latest version of AlphaFold DB contains structure predictions for the human proteome and 47 other key organisms, covering most of the representative sequences from the UniRef90 data. The availability of experimental 3D structures is limited; therefore, AlphaFold predictions are very helpful in curation, allowing us to provide more informative and accurate annotations, especially for poorly characterised sequences. In Pfam, a protein classification database based at EMBL-EBI and widely used by the scientific community, we have historically used experimentally determined structures, when available, to refine domain boundaries and improve our models, protein coverage and annotations. In addition, we are able to find relationships with other Pfam entries and group them into superfamilies, which we call clans, given structural similarities that may imply a common origin. We have now started to use AlphaFold predictions to revisit existing families that have never had an experimentally determined structure, to correct their domain boundaries and find their evolutionary relationships. We also build new domains currently missing from Pfam based on these highly accurate structure predictions. This curation effort is supported by the AlphaFold Colab and Foldseek tools, which assist us in determining clan memberships. Here, we present some examples of improved annotations generated using these tools. The human protein ZSWIM3 currently has two domain annotations, a zinc finger domain (SWIM-type, residues 531-572) and a domain of unknown function (579-660). AlphaFold predicts the presence of 5 domains, which, according to the pLDDT score, are highly accurate. Therefore, we built new domains: PF21599, an N-terminal domain (1-104, CL0274); PF21056, an RNaseH-like domain (179-304, CL0219); and PF21600, a helical domain (312-437). For PF21599 and PF21056, we were able to find structural relationships with WRKY-like DNA-binding domains and RNaseH domains, respectively, using the AlphaFold Colab tool [3]. Similarly, we improved the domain boundaries for the C-terminal DUF (PF19286): it previously covered the 579-660 region of the protein, which, according to the structure prediction, was longer than it should be and partially overlapped the zinc finger domain. We adjusted these boundaries to cover the two α-helices. In another example, we refined the model of an uncharacterised protein from a fruit fly, A0A034WDA7, which is annotated as PF05444 - Protein of unknown function (DUF753). From the structure prediction, we can see that this protein actually consists of two identical domains. Based on this, we split this Pfam entry into the corresponding domains. We conclude that all these tools are very helpful not only for curation but also for research, as they help us to improve protein classification and provide better and more accurate predictions. This is particularly useful in cases for which functional or structural information is scarce.
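    Since the per-residue pLDDT confidence scores drive decisions such as the domain-boundary adjustments described above, a short Python sketch of reading them from an AlphaFold DB model file (where pLDDT is stored in the B-factor column) may be useful; the download URL pattern, model version and example accession are assumptions to verify against AlphaFold DB.

        # Sketch: fetch an AlphaFold DB model and read per-residue pLDDT values, which
        # are stored in the B-factor column. URL pattern and model version (v4) are
        # assumptions; check https://alphafold.ebi.ac.uk for the current scheme.
        import requests
        from Bio.PDB import PDBParser

        acc = "P12345"  # arbitrary example UniProt accession
        url = f"https://alphafold.ebi.ac.uk/files/AF-{acc}-F1-model_v4.pdb"
        with open("model.pdb", "wb") as fh:
            fh.write(requests.get(url).content)

        structure = PDBParser(QUIET=True).get_structure(acc, "model.pdb")
        plddt = [res["CA"].get_bfactor() for res in structure.get_residues() if "CA" in res]
        low_confidence = [i + 1 for i, score in enumerate(plddt) if score < 70]
        print(f"{len(plddt)} residues, {len(low_confidence)} below pLDDT 70")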
  22. Protein Structures and their cross-referencing in UniProt
    Nidhi Tyagi, Uniprot Consortium
    Expand the abstract
    Annotation of proteins from structure-based analyses is an integral component of the UniProt Knowledgebase (UniProtKB). There are nearly 200,000 experimentally determined 3-dimensional structures of proteins deposited in the Protein Data Bank. UniProt works closely with the Protein Data Bank in Europe (PDBe) to map these 3D structural entries to the corresponding UniProtKB entries based on comprehensive sequence- and structure-based analyses, to ensure that there is a UniProtKB record for each relevant PDB record, and to import additional data such as ligand-binding sites from PDB to UniProtKB. SIFTS (Structure Integration with Function, Taxonomy and Sequences), a collaboration between PDBe and UniProt, facilitates the link between the structural and sequence features of proteins by providing correspondence at the level of amino acid residues. A pipeline combining manual and automated processes for maintaining up-to-date cross-reference information has been developed and is run with every weekly PDB release. Various criteria are considered to cross-reference PDB and UniProtKB entries, such as (a) the degree of sequence identity (>90%), (b) an exact taxonomic match (at the level of species, subspecies and specific strains for lower organisms), (c) preferential mapping to a curated Swiss-Prot entry (if one exists), (d) mapping to proteins from a Reference/Complete proteome, or (e) mapping to the longest protein sequence expressed by the gene. Complex cases are inspected manually by a UniProt biocurator using a dedicated curation interface to ensure accurate cross-referencing. These cases include short peptides, chimeras, synthetic constructs and de novo designed polymers. The SIFTS initiative also provides up-to-date cross-referencing of structural entries to literature (PubMed), taxonomy (NCBI), the enzyme database (IntEnz), Gene Ontology annotations (GO), and protein family classification databases (InterPro, Pfam, SCOP and CATH). In addition to maintaining accurate mappings between UniProtKB and PDB, a pipeline has been developed to automatically import data from PDB to enhance the unreviewed records in UniProtKB/TrEMBL. This includes details of residues involved in the binding of biologically relevant molecules including substrates, nucleotides, metals, drugs, carbohydrates and post-translational modifications, which greatly improves the biological content of these records. UniProt has successfully completed the non-trivial and labour-intensive exercise of cross-referencing 187,997 PDB entries (647,078 polypeptide chains) to 60,599 UniProtKB entries (manual curation). Manual annotation of protein entries with 3D structures is given high priority, and such proteins are curated based on the relevant literature. UniProt also provides structural predictions through AlphaFold for various proteomes. All this work enables non-expert users to see protein entries in the light of relevant biological context, such as metabolic pathways, genetic information, molecular functions, conserved motifs, interactions, etc. Structural information in UniProtKB serves as a vital dataset for various academic and biomedical research projects.
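    The residue-level SIFTS correspondences described above are also exposed through PDBe web services; the Python sketch below queries one such endpoint for a PDB entry. The URL and response field names reflect the public PDBe API as remembered and should be verified before use.

        # Sketch: retrieve SIFTS PDB-to-UniProt mappings for one PDB entry via the
        # PDBe REST API. Endpoint path and response field names are assumptions
        # based on the public API documentation; verify before relying on them.
        import requests

        pdb_id = "1cbs"  # arbitrary example PDB entry
        url = f"https://www.ebi.ac.uk/pdbe/api/mappings/uniprot/{pdb_id}"
        payload = requests.get(url).json()

        for uniprot_acc, details in payload[pdb_id]["UniProt"].items():
            for mapping in details["mappings"]:
                print(uniprot_acc, mapping["chain_id"],
                      mapping["unp_start"], mapping["unp_end"])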
  23. YeastPathways at the Saccharomyces Genome Database: Transitioning to Noctua
    Suzi Aleksander, Dustin Ebert, Stacia Engel, Edith Wong, Paul Thomas, Mike Cherry, Sgd Project
    Expand the abstract
    The Saccharomyces Genome Database (SGD, http://www.yeastgenome.org) is the leading knowledgebase for Saccharomyces cerevisiae. SGD collects, organizes and presents biological information about the genes and proteins of the budding yeast, including information concerning metabolism and associated biochemical pathways. The yeast biochemical pathways were originally sourced from the YeastCyc Pathway/Genome Database, which uses the metabolic pathways from MetaCyc. YeastCyc Pathways were imported into SGD in 2002, then SGD biocurators edited them as necessary to make them specific to S. cerevisiae, using publications from the primary literature. This comprehensive curation ensured that only the reaction directions and pathways that are physiologically relevant to S. cerevisiae were included, and also provided written summaries for each pathway. Loading, editing, and maintaining YeastPathways displayed at SGD was accomplished using the Pathway Tools software. YeastPathways have now been available at SGD for over 20 years, with the last major content update in 2019. Recently, YeastPathways have been transitioned from a curation system that uses Pathway Tools software to one that uses the Gene Ontology curation platform Noctua, the interface SGD already uses to curate GO annotations. Each of the 220 existing pathways in YeastPathways has been converted from Pathway Tools BioPAX format into separate GO-Causal Activity Models (GO-CAMs), which are available as Turtle (.ttl) files. The GO-CAM structured framework allows multiple GO annotations to be linked, which is ideal for a metabolic pathway. Although many of the steps are automated, manual intervention was required throughout the process to complete the GO-CAMs and to verify inferences made by the conversion tools. Using the Noctua curation interface for biochemical pathways, without the need for external software, will streamline curation. Additionally, each model’s metadata is accessible to any interested party. GO-CAMs can be deconstructed into standard GO annotations, making the pathway information accessible for enrichment studies and other applications. YeastPathways as GO-CAMs will also be available through the Gene Ontology’s GO-CAM browser and the Alliance of Genome Resources, as well as any other resources that display GO-CAMs. This upgrade to the SGD’s curation process will also make YeastPathways more transparent and compliant with FAIR guidelines and TRUST principles.
  24. Adding semantics to data digitization: strengths and possibilities
    Pratibha Gour
    Expand the abstract
    Over the years, the life sciences have changed from a descriptive to a data-driven discipline wherein new knowledge is produced at an ever-increasing speed; thus, the list of research articles, databases and other knowledge resources keeps growing. These data become knowledge within a defined context only when the relations among various data elements are understood. Thus, the discovery of novel findings depends greatly on the integration, comparison and interpretation of these massive data sets. Both the volume and the variability of the data pose a challenge to its seamless integration. This problem is intensified in the case of low-throughput (better called gold standard) data published in research articles, as no uniform, structured formats are available for their storage and dissemination. These experimental data are usually represented as a gel image, autoradiograph, bar graph, etc. Moreover, experimental design is highly complex and diverse, such that even studies using the same experimental technique can have very different designs. The Manually Curated Database of Rice Proteins (MCDRP) addresses these issues by adopting data models based on various ontology terms or custom-made notations, which index the experimental data itself so that it becomes amenable to automated search. This semantic integration not only renders the experimental data suitable for computer-based analysis, such as rapid search and automated interpretation, but also provides it with a natural connectivity. Some of the most interesting correlations can be drawn by analyzing proteins that share a common ‘Trait’, ‘Biological Process’ or ‘Molecular Function’. Moreover, such semantic digitization facilitates searching of, and access to, extensive experimental data sets at a granular level. The data digitization formats used here are generic in nature and facilitate the digitization of almost every aspect of the experimental data, thereby providing a better understanding of any biological system. These models have been successfully used to digitize data from over 20,000 experiments spanning over 500 research articles on rice biology.
  25. Swiss Personalized Health Network: Making health data FAIR in Switzerland
    Deepak Unni, Sabine Österle, Katrin Crameri
    Expand the abstract
    The Swiss Personalized Health Network (SPHN) is a national initiative responsible for the development, implementation, and validation of coordinated data infrastructures that make health-related data FAIR (Findable, Accessible, Interoperable, Reusable) and available for research in Switzerland in a legally and ethically compliant manner. SPHN brings together stakeholders from various university hospitals and research institutions across Switzerland to enable the secondary use of health-related data, including but not limited to clinical routine data, omics and cohort data, for personalized health research. To that end, the SPHN initiative consists of several key elements. Firstly, SPHN provides an Interoperability Framework for the definition and harmonization of health data semantics. Therein, the various health-related concepts and attributes are defined, with meanings bound to internationally recognized terminologies like SNOMED CT and LOINC and, for certain attributes, with value sets defined from both international and local terminologies. The semantics are then translated into a formal representation, using the Resource Description Framework (RDF), RDF Schema (RDFS), and the Web Ontology Language (OWL), to create the SPHN RDF Schema. Secondly, SPHN provides an ecosystem that supports the generation, quality checking, dissemination, and analysis of health data. The system enables the translation of health data into RDF, provides a terminology service for access to external terminologies in RDF, and offers a schema template that supports projects in creating and working with their individual subsets of concepts and attributes. Since both the schema and the data are in RDF, the ecosystem relies heavily on other semantic web technologies like the SPARQL Protocol and RDF Query Language (SPARQL) and the Shapes Constraint Language (SHACL). For example, the ecosystem includes tools for performing quality checks and improving the quality of the data represented in RDF, such as the SPARQL Generator (SPARQLer) and the SHACL Generator (SHACLer). By building on top of well-established standards, the technical burden on hospitals and projects can be reduced. One such example is the SPHN Connector, a tool for connecting data providers to SPHN infrastructure and services, which provides a unified interface for ingesting data in various formats and handles conversion to SPHN-compliant RDF data and validation. On BioMedIT, Switzerland's secure trusted research environment for processing sensitive data, researchers find the tools and support to work with their sensitive graph data. Lastly, SPHN provides services that improve the discoverability of data, through the SPHN Federated Query System (FQS), and of metadata, by contributing Swiss cohort data to the international Maelstrom Catalogue, enabling researchers to explore and identify cohorts of interest and request access to the data. Since 2017, SPHN has contributed to the establishment of clinical data management platforms at the five Swiss university hospitals to make health data efficiently available for research, has supported researchers in the discovery and analysis of data, and has built the required infrastructures to bridge the gap between healthcare and data-driven research. Moving forward, SPHN is exploring approaches for the translation of data in SPHN into data models like OMOP, i2b2, and FHIR to increase interoperability with existing national and international communities.
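    To make the SHACL-based quality checking concrete, here is a minimal Python sketch validating an RDF data graph against a shapes graph with pySHACL; the file names are placeholders, and the snippet does not invoke the SPHN SHACLer or Connector themselves.

        # Sketch: SHACL validation of an RDF data graph, as used for RDF quality checks.
        # File names are placeholders; this does not use SPHN's own tooling.
        from rdflib import Graph
        from pyshacl import validate

        data_graph = Graph().parse("patient_data.ttl", format="turtle")
        shapes_graph = Graph().parse("sphn_shapes.ttl", format="turtle")

        conforms, _, report_text = validate(data_graph, shacl_graph=shapes_graph,
                                            inference="rdfs")
        print("Conforms:", conforms)
        if not conforms:
            print(report_text)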
  26. From Model Organisms To Human: Increasing Our Understanding Of Molecular Disease Mechanisms By Data Curation
    Yvonne Lussi, Elena Speretta, Kate Warner, Michele Magrane, Sandra Orchard, Uniprot Consortium
    Expand the abstract
    The UniProt Knowledgebase (UniProtKB) is a leading resource of protein information, providing the research community with a comprehensive, high-quality and freely accessible platform of protein sequences and functional information. Manual curation of a protein entry includes sequence analysis, annotation of functional information from the literature, and the identification of orthologs. For human entries, we also provide information on disease involvement, including the annotation of genetic variants associated with a human disease. The information is extracted from the scientific literature and from the OMIM (Online Mendelian Inheritance in Man) database. Identifying the underlying genetic variation and its functional consequences in a disease is essential for understanding disease mechanisms. Model organism research is integral to unraveling the molecular mechanisms of human disease. Therefore, information on human proteins involved in disease is supplemented with data from protein orthologs in model organisms, providing additional information based on mutagenesis assays and disruption phenotypes. With these efforts, we hope to improve our understanding of the association between genetic variation, its functional consequences on proteins, and disease development. Our ongoing efforts aim to compile available information on proteins associated with human diseases, including the annotation of protein variants and the curation of orthologs in model organisms, to help researchers better understand the relationship between protein function and disease. Understanding disease mechanisms and underlying molecular defects is crucial for the development of targeted therapy and new medicines.
  27. Beyond advanced search: literature search for biocuration
    Matt Jeffryes, Henning Hermjakob, Melissa Harrison
    Expand the abstract
    Biocurators are experts at identifying papers of interest within the vast scientific literature. Workflows for biocuration are very diverse, but most involve the use of literature search engines, such as PubMed, Europe PMC or Google Scholar. However, these search engines are designed for a general scientific audience. We have developed a ‘biocuration toolkit’ which incorporates search features specific to biocuration, with the flexibility to fit within a variety of biocuration workflows. An innovative feature is the capability to exclude the current content of a database from result sets, allowing biocurators to focus on increasing the coverage of a resource rather than adding additional evidence for already well-covered facts. We have developed our tool in collaboration with IntAct database biocurators. The IntAct database contains evidence for the interaction of pairs of molecules. The specific biocuration task on which IntAct wishes to improve is the curation of molecular interactions where at least one of the pair has few or no existing entries within the database, and particularly those where the molecule has entries in peer databases. Our tool allows biocurators to prioritise literature which mentions proteins that are not yet present in the IntAct database. However, this feature is designed flexibly to allow filtering based on other axes such as disease or organism mentions. As we gather feedback from biocurators, we intend to implement further filtering and prioritisation methods, which may be combined in a modular way by biocurators to direct their searches, alongside typical advanced search features such as boolean search terms and filtering by date and article type. By building on top of Europe PMC, we are also able to optionally support searching of preprint articles from a number of servers, including bioRxiv and medRxiv.
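    Because the toolkit builds on Europe PMC, the underlying search step can be approximated with the public Europe PMC REST API; the Python sketch below runs a query and drops hits whose PMIDs are already in a local set, standing in for the database-exclusion feature described above (the REST parameters follow the public documentation, and the exclusion logic is deliberately simplified).

        # Sketch: query Europe PMC and filter out papers already linked to curated
        # entries, approximating the "exclude current database content" idea.
        # The REST parameters follow the public Europe PMC API; verify before use.
        import requests

        EPMC = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
        params = {"query": "protein-protein interaction yeast two-hybrid",
                  "format": "json", "pageSize": 25}
        hits = requests.get(EPMC, params=params).json()["resultList"]["result"]

        already_curated_pmids = {"12345678"}  # placeholder: PMIDs already in the database
        new_hits = [h for h in hits if h.get("pmid") not in already_curated_pmids]

        for hit in new_hits[:5]:
            print(hit.get("pmid"), "-", hit["title"])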
  28. Curation and integration of single-cell RNA-Seq data for meta-analysis
    Jana Sponarova, Pavel Honsa, Klara Ruppova, Philip Zimmermann
    Expand the abstract
    The immune system is a complex system of various cell populations and soluble mediators, and its proper understanding requires detailed analysis. These efforts have been boosted by the development of single-cell RNA sequencing (scRNAseq). ScRNAseq is a powerful approach to understanding the molecular mechanisms of development and disease and to uncovering cellular heterogeneity in normal and diseased tissues, but it poses several challenges. As the amount of scRNAseq data in the public domain has grown substantially in the last few years, a platform is needed to enable meta-analysis. Moreover, multiple different scRNAseq protocols are in use, and the number of additional modalities co-analyzed with RNA is growing. Another challenging aspect of the analysis of scRNAseq data is correct cell type identification, which is especially important for studying the immune system, which consists of a wide range of cell types and states that can have very different functions. Here we present a custom pipeline that enables unified and standardized processing of single-cell transcriptomics data from various sources, utilizing different protocols (10x, Smart-Seq) and accompanied by additional modalities (VDJ-Seq, CITE-Seq, Perturb-Seq). The processing of every single study includes raw data mapping, standardized and strict quality control, data normalization, and integration. These steps are followed by cell clustering with subsequent cluster identification and description. The cell type annotation is then synchronized across all studies in the compendium. This approach makes the cell type annotation more accurate and enables compendium-wide meta-analysis. To keep accuracy yet gain efficiency in the annotation process, we have built multiple proprietary cell atlases and references. The pipeline outputs are enriched with sample-level information (e.g., patient-level data), and the data are integrated into a user-friendly analysis software, GENEVESTIGATOR, a high-performance visualization tool for gene expression data for downstream analysis at the single-cell as well as pseudo-bulk level. In summary, we have built a manually curated and globally normalized scRNAseq compendium mainly consisting of immune cells obtained from studies focused on immuno-oncology, autoimmune diseases, and other therapeutic areas. This deeply harmonized compendium represents an important asset for downstream ML and AI applications in pre-clinical biomarker discovery and validation.
  29. The Nebion Cell-type Ontology
    Anna Dostalova, Pavel Honsa, Jana Sponarova
    Expand the abstract
    The advent of single-cell profiling has accelerated our understanding of the cellular composition of the human body; however, it has also brought the challenge of correctly classifying and describing the various cell types and cell states. In an effort to consolidate the annotations featured in our biomarker discovery platform, GENEVESTIGATOR, we have built a comprehensive ontology of approximately 700 cell types. It has a simplified, easy-to-navigate tree structure, in which each cell type is present only once. The cell-type categories were compiled from several sources and structured by their developmental lineages, function, and phenotypes. The same ontology is used for annotating human, mouse, and rat studies. In combination with two other ontologies (“Tissue” and “Cell state”), it enables streamlined analyses of single-cell sequencing data from across different studies and research areas. The presented ontology helps to advance our understanding of the complexity of cell types.
  30. Curated information about single-cell RNA-seq protocols
    Sagane Joye, Anne Niknejad, Marc Robinson-Rechavi, Julien Wollbrett, Sebastien Moretti, Tarcisio Mendes De Farias, Marianna Tzivanopoulou, Frédéric Bastian
    Expand the abstract
    Single-cell RNA sequencing (scRNA-seq) technologies have dramatically revolutionized the field of transcriptomics, enabling researchers to study gene expression at the single cell level and uncover new insights in a variety of fields. However, the rapid pace of development in this field means that new scRNA-seq protocols are being developed and modified on an ongoing basis, each with its own unique characteristics and varying levels of performance, sensitivity, and precision. In addition to the technical differences between scRNA-seq protocols, the way in which the data is processed can also vary depending on the chosen method and on the study. This diversity of scRNA-seq protocols and the associated differences in processing requirements can make it difficult for researchers to determine the most suitable method for their specific needs and to correctly process the resulting data, and for databases such as Bgee to integrate these data. To help navigate this complex landscape, we have created an exhaustive table that includes essential information about 23 protocols, including isolation methods, exact barcode structure, target RNA types, transcript coverage, multiplexing capability, strand specificity, amplification and reverse transcriptase strategies. In addition, we provide brief guidelines for best practices to apply depending on the chosen technology. We hope that this summary will be a valuable resource for scientists looking to easily find and compare key information on the different scRNA-seq protocols, allowing them to make informed decisions about which method to use and how to correctly process their data.
  31. Collaborative Annotation of Proteins Relevant to the Adaptive Immune Response
    Randi Vita, Nina Blazeska, Hongzhan Huang, Daniel Marrama, Karen Ross, Cathy H Wu, Maria Martin, Bjoern Peters, Darren A Natale
    Expand the abstract
    The Immune Epitope Database (IEDB) is a freely available resource funded by the National Institute of Allergy and Infectious Diseases (NIAID) that has cataloged experimental data on the adaptive immune response to more than 1.5 million antibody and T cell epitopes studied in humans, non-human primates, and other animal species in the context of infectious disease, allergy, autoimmunity, and transplantation. Epitopes are the portions of an antigen, frequently a protein, that are recognized by antibodies and T cell receptors. Epitopes are key to understanding healthy and abnormal immune responses and are crucial in the development of vaccines and therapeutics. We have undertaken a collaboration to connect the epitope data in the IEDB to the rich functional annotation of proteins in UniProtKB, to the explicit representation of proteoforms in the Protein Ontology (PRO), and to the library of post-translational modifications (PTMs) available in iPTMnet. We are also combining IEDB data with resources specializing in protein-protein interactions, human genetic variation, diseases, and drugs. Integration of the IEDB’s immune epitope information with the wealth of additional biomedical data in humans and model organisms will enable novel opportunities for hypothesis generation and discovery. Researchers interested in human disease will be able to fully exploit knowledge derived from model organisms, improve disease models in non-human organisms, and identify potential cross-reactivities within and across organisms. This will enable novel queries of high interest to translational researchers: for example, which PTMs and/or genetic variants overlap with an epitope of interest, or whether a human epitope of interest is found in orthologous/homologous proteins in model organisms, or vice versa. Answers to such questions can provide insight into factors that affect auto-antigenicity or immune evasion by pathogens. Display of this information in the UniProt ProtVista environment and via the IEDB website will make it easily accessible to the large community of immunology and disease researchers. This collaborative effort among multiple major resources will overcome barriers to the consumption of IEDB data and enrich the knowledge available in UniProtKB, PRO, and iPTMnet, thereby supporting inquiry into the role of the immune system in human disease.
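    One of the example queries above (which PTMs or variants overlap an epitope of interest) reduces to an interval-overlap check on protein sequence coordinates; the sketch below shows the idea with made-up positions and a hypothetical accession.
```python
# Sketch of the overlap query described above: which PTM sites or variant
# positions fall within an epitope's region of a protein sequence.
# Coordinates are 1-based and the example values are made up.
from dataclasses import dataclass

@dataclass
class Epitope:
    protein: str
    start: int
    end: int

def overlapping_sites(epitope, sites):
    """Return annotated positions (e.g. PTMs, variants) inside the epitope."""
    return [
        (pos, label)
        for protein, pos, label in sites
        if protein == epitope.protein and epitope.start <= pos <= epitope.end
    ]

epitope = Epitope(protein="P12345", start=50, end=62)      # hypothetical accession
sites = [
    ("P12345", 55, "phosphoserine"),
    ("P12345", 120, "N-linked glycosylation"),
    ("P12345", 60, "variant p.Ala60Thr"),
]
print(overlapping_sites(epitope, sites))
# -> [(55, 'phosphoserine'), (60, 'variant p.Ala60Thr')]
```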
  32. Retrieval of expert-curated transcriptomics metadata through a user-friendly interface
    Anne Niknejad, Julien Wollbrett, Marc Robinson-Rechavi, Frederic B. Bastian
    Expand the abstract
    Transcriptomics data are of major importance to understand organism biology, and are being made available notably through primary repositories hosting their raw data, such as the Sequence Read Archive, European Nucleotide Archive, or DDBJ Sequence Read Archive. These repositories are essential for reproducible data science, but their usage is limited by the available metadata. These metadata are often free text as directly provided by the authors of an experiment, cannot be checked for inconsistencies because of the amount of data submitted daily, and cannot be queried using, e.g., ontology reasoning. Bgee (https://bgee.org/) is a database for retrieval and comparison of gene expression patterns and levels across multiple animal species, produced from multiple data types (bulk RNA-Seq, single-cell RNA-Seq, Affymetrix, in situ hybridization, and EST data) and from multiple datasets. The Bgee team has manually curated thousands of samples, to annotate each of them to ontologies, correct mistakes (often after direct contact with the authors of an experiment), and filter to keep only healthy, wild-type, high-quality data. For instance, after careful curation, half of the samples from the GTEx experiment were considered not to meet these healthy wild-type requirements, leading to about 6,000 of them being discarded from Bgee. Moreover, Bgee processes all curated samples to generate gene expression quantification, e.g., TPM values for each gene for bulk RNA-Seq, or CPM values for single-cell RNA-Seq data. Until recently, these annotations and processed expression values were available only through the Bioconductor R package BgeeDB (https://bioconductor.org/packages/BgeeDB/), with basic query capabilities. We have now built a user-friendly interface, on top of a JSON API, with advanced query capabilities. Thanks to this API and interface, it is possible to query data in Bgee using ontology reasoning, to retrieve, for instance, all samples annotated to “brain” including all of its substructures (e.g., “white matter”, “hypothalamus”). Precise annotations are provided regarding: anatomical localization and cell type; developmental and life stage; strain; and sex. Each of these condition parameters can be precisely tuned for advanced querying of transcriptomics data. Additionally, it is possible to filter for the presence of expression data for specific genes, experiments, or samples. The returned results include a precise description of each sample, e.g., for single-cell RNA-Seq data, the sequencer used, the sequenced transcript part, or the forward/reverse strand studied. At Biocuration 2023 we will present our precise criteria of annotation, the capabilities of this new search tool and API, and examples of applications for enhancing discoveries in the transcriptomics field. These query tools have been available since February 2023 at https://bgee.org/search/raw-data.
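    The ontology-reasoning query described above (retrieve all samples annotated to “brain” or any of its substructures) can be sketched as a transitive expansion over is-a/part-of edges followed by a sample filter; the toy hierarchy and sample annotations below are placeholders, not Bgee content or its API.
```python
# Sketch of the ontology-aware sample query described above: expand a term
# to all of its descendants, then keep samples annotated to any of them.
# The toy hierarchy and samples are placeholders, not Bgee data.
from collections import defaultdict, deque

edges = [                      # (child, parent) is-a / part-of edges
    ("white matter", "brain"),
    ("hypothalamus", "brain"),
    ("brain", "nervous system"),
    ("liver", "digestive system"),
]

children = defaultdict(list)
for child, parent in edges:
    children[parent].append(child)

def descendants(term):
    """Return the term plus all of its transitive descendants."""
    seen, queue = {term}, deque([term])
    while queue:
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

samples = {"sample_1": "hypothalamus", "sample_2": "liver", "sample_3": "brain"}
wanted = descendants("brain")
print([s for s, anat in samples.items() if anat in wanted])
# -> ['sample_1', 'sample_3']
```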
  33. Curation and Terminology Strategies of the Immune Epitope Database
    Randi Vita, James A. Overton, Hector Guzman-Orozco, Bjoern Peters
    Expand the abstract
    The Immune Epitope Database (IEDB) relies upon manual biocuration, automated curation support tools, and a novel Source of Terminology (SOT) system to accurately and efficiently curate published experimental data from the literature. We use a team of PhD curators to achieve consistent data entry via manual biocuration with the assistance of sophisticated automated tools. The biocurators enter data into our web application, which enforces a well-defined data structure. Our automated curation support tools enable automated query and classification of the literature, autocomplete functionality during the curation process, in-form validation as soon as data are entered, ontology-driven finder applications that retrieve appropriate ontology terms on the fly, and post-curation data calculations. We exploit the logical relationships between ontology terms to drive logical validation. For example, our immune exposure process model enforces that only specific in vivo processes can lead to specific diseases, which, in turn, can only be caused by specific pathogens. Our webform finder applications are ontology-driven, use abbreviations and common names provided by the ontology experts, and allow curators to browse terms within the context of a hierarchical tree. Additionally, we sometimes need to extend external ontologies and resources to fully represent the immunology literature, when we require the benefits of a pre-existing resource but the needed child terms are beyond its scope. For example, SARS-CoV-2 is a term found in NCBI; however, the many variants of SARS-CoV-2, which are very important to immunologists and therefore critical for our curators to capture, are not. Our SOT system was recently developed out of the need to manage the use of many changing external ontologies. It incorporates diverse community data standards, provides for custom “immunology friendly” term labels, and manages custom term requests as well as versioning. Interoperability is facilitated by a public API allowing users to resolve CURIEs for the terms utilized. Additionally, when we encounter terms that are not yet present in the public resources, we add temporary ontology terms, managed by our system, which are later replaced as new terms are created in the appropriate ontologies. We believe that our independent SOT system will be useful for other projects facing similar needs, regardless of which ontologies they are using.
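    The logical validation described above (only specific in vivo processes can lead to specific diseases, which in turn can only be caused by specific pathogens) can be sketched as a chain of allowed-value lookups; the rule content below is invented for illustration and is not the IEDB's actual immune exposure model.
```python
# Sketch of rule-based validation driven by an exposure-process model, as
# described above: each in vivo process constrains the allowed diseases,
# and each disease constrains the allowed pathogens. Rule content is
# illustrative only, not the IEDB's actual model.
ALLOWED_DISEASES = {
    "occurrence of infectious disease": {"influenza", "COVID-19"},
    "administration in vivo": {"none"},
}
ALLOWED_PATHOGENS = {
    "influenza": {"Influenza A virus"},
    "COVID-19": {"SARS-CoV-2"},
    "none": {"none"},
}

def validate(process, disease, pathogen):
    """Return a list of validation errors for one curated record."""
    errors = []
    if disease not in ALLOWED_DISEASES.get(process, set()):
        errors.append(f"disease '{disease}' not allowed for process '{process}'")
    elif pathogen not in ALLOWED_PATHOGENS.get(disease, set()):
        errors.append(f"pathogen '{pathogen}' not allowed for disease '{disease}'")
    return errors

print(validate("occurrence of infectious disease", "COVID-19", "SARS-CoV-2"))  # []
print(validate("administration in vivo", "COVID-19", "SARS-CoV-2"))            # 1 error
```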
  34. Mapping from virulence to interspecies interactions - an annotation review
    Antonia Lock, Sandra Orchard
    Expand the abstract
    Proteins in UniProt are automatically and manually annotated with keywords. Keywords may be mapped to various ontologies, resulting in the automatic association of proteins with relevant ontology terms. If a mapped ontology term is obsoleted, a manual review process may be required to establish whether a suitable alternative term exists. The keyword KW-0843 Virulence was mapped to the now obsolete Gene Ontology (GO) term GO:0009405 pathogenesis. This term was obsoleted from the GO as it was deemed out of scope for the ontology, and was superseded by terms pertaining to interspecies interactions. The loss of the mapping resulted in the loss of over 150,000 GO annotations. The ~150,000 proteins annotated with the keyword Virulence (4,144 manually annotated and 147,664 automatically annotated) were reviewed to assess whether a new virulence keyword-GO term mapping could be created.
  35. Treatment Response Ontology in GENEVESTIGATOR
    Eva Macuchova, Jana Sponarova, Alena Jiraskova, Iveta Mrizova
    Expand the abstract
    In clinical trials, evaluating the treatment outcome is an essential part of assessing the success of therapeutics development. Analyses of patients' responder versus non-responder status can reveal clinically useful correlative biomarkers of therapeutic resistance. Objective response measurement is only applicable if it is performed on the basis of generally accepted, validated, and consistent criteria. Some patient cohorts have detailed, multiple response measurements, while others have either no data or only sparse/ill-defined data. Response status is evaluated differently across clinical trials, and it may be based on: a) disease-specific and well-established classification systems (e.g., response evaluation by RECIST in solid tumors, IMWG response criteria in multiple myeloma, EULAR response criteria in rheumatoid arthritis), b) time-point assessments (e.g., progression-free survival (PFS), overall survival (OS)), or c) disease-specific and investigator-defined assessment criteria. To harmonize treatment response annotation in curated Genevestigator clinical datasets, we have built the Response Ontology, which adheres to the terminology of the National Cancer Institute (NCI) Thesaurus, the Common Terminology Criteria for Adverse Events (CTCAE), and other relevant resources. Development of the Response Ontology was preceded by a detailed analysis of the curated content in Genevestigator, including the identification of synonymous terms, followed by the unification of the curation rules. We will present the structure of the Response Ontology, the challenges of its development, and examples of its application in the curation of clinical data. The newly implemented Response Ontology allows Genevestigator users to identify and analyze treatment response status in cancer and autoimmune disease datasets. The architecture and nomenclature of the Nebion Response Ontology follow the FAIR data principles. Moreover, the application of the Response Ontology makes the curated data not only machine-readable but also machine-actionable.
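    Harmonizing heterogeneous response annotations against a single vocabulary, as described above, can be sketched as mapping study-specific labels to a shared set of terms and flagging unmapped labels for curator review; the mappings below are invented placeholders rather than actual Response Ontology content.
```python
# Sketch of harmonizing study-specific treatment-response labels to a single
# controlled vocabulary, as the Response Ontology does for curated datasets.
# The label-to-term mappings here are invented placeholders.
RESPONSE_SYNONYMS = {
    "CR": "Complete Response",
    "complete remission": "Complete Response",
    "PR": "Partial Response",
    "PD": "Progressive Disease",
    "progression": "Progressive Disease",
    "SD": "Stable Disease",
}

def harmonize(raw_label):
    """Map a raw study label to a unified response term (or flag it for review)."""
    return RESPONSE_SYNONYMS.get(raw_label.strip(), "UNMAPPED: needs curator review")

for raw in ["CR", "progression", "mixed response"]:
    print(raw, "->", harmonize(raw))
```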
  36. A general strategy for generating expert-guided, simplified views of ontologies
    Anita R. Caron, Josef Hardi, Ellen M. Quardokus, James P. Balhoff, Bradley Varner, Paola Roncaglia, Bruce W. Herr II, Shawn Zheng Kai Tan, Helen Parkinson, Mark A. Musen, Katy Börner, David Osumi-Sutherland
    Expand the abstract
    The use of common biomedical ontologies to annotate data within and across different communities improves data findability, integration and reusability. Ontologies do this not only by providing a standard set of terms for annotation, but via the use of ontology structure to group data in biologically meaningful ways. In order to meet the diverse requirements of users, and to conform to good engineering practices required for scalable development, biomedical ontologies inevitably become larger and more complex than the immediate requirements of individual communities and users. This complexity can often make ontologies daunting for non-experts, even with tooling that lowers the barriers to searching and browsing. We have developed a suite of tools that take advantage of Ubergraph (https://zenodo.org/record/7249759#.Y7QE4OzP31c) to solve this problem for users who start from a simple list of terms mapped to a source ontology or for users who have already arranged terms in a draft hierarchy in order to drive browsing in their tools. This latter starting point is common among developers of anatomical and cell type atlases. A view generation tool renders simple, tailored views of ontologies limited to a specified subset of classes and relationship types. These views accurately reflect the semantics of the source ontology, preserving its usefulness for grouping data in biologically meaningful ways. A hierarchy validation system validates these user-generated hierarchies against source ontologies, replacing unlabelled edges with formal ontology relationships which can be safely used to group content. A review of hierarchical relationships that do not validate against source ontologies provides potential corrections to hierarchies and source ontologies. A combination of validation and view generation can be used to generate ontology views based on the provided hierarchy. Here we describe the view generation and hierarchy validation tools and illustrate their use in generating views and validation reports for the HuBMAP Human Reference Atlas.
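    One simple way to picture view generation (keep only a specified subset of classes while preserving the semantics of the source hierarchy) is to rewire each kept class to its nearest kept ancestor; the toy single-parent ontology below is a placeholder, and the real tooling operates over Ubergraph and full source ontologies rather than an in-memory dict.
```python
# Sketch of the view-generation idea described above: given a source
# ontology hierarchy and a subset of classes to keep, connect each kept
# class to its nearest kept ancestor so the simplified view still respects
# the source semantics. Toy data only; not the actual tool.
PARENTS = {                      # child -> parent (single-parent toy example)
    "neuron": "neural cell",
    "neural cell": "cell",
    "hepatocyte": "epithelial cell",
    "epithelial cell": "cell",
}

def nearest_kept_ancestor(term, kept):
    node = PARENTS.get(term)
    while node is not None and node not in kept:
        node = PARENTS.get(node)
    return node

def simplified_view(kept):
    """Return (child, parent) edges of the view restricted to the kept classes."""
    return [(t, nearest_kept_ancestor(t, kept)) for t in sorted(kept)
            if nearest_kept_ancestor(t, kept) is not None]

print(simplified_view({"neuron", "hepatocyte", "cell"}))
# -> [('hepatocyte', 'cell'), ('neuron', 'cell')]
```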
  37. Expanding the Repertoire of Causal Relations for a Richer Description of Activity Flow in Gene Ontology Causal Activity Models (GO-CAMs)
    Kimberly Van Auken, Pascale Gaudet, David Hill, Tremayne Mushayahama, James Balhoff, Seth Carbon, Chris Mungall, Paul Thomas
    Expand the abstract
    The Gene Ontology (GO) is the de facto bioinformatics resource for systematic gene function description. For over 20 years, gene function description using GO involved associating genes or gene products with terms from each aspect of GO: biological process (BP), molecular function (MF), and cellular component (CC). The power and utility of GO annotations have recently been extended with the introduction of GO Causal Activity Models (GO-CAMs) [1]. GO-CAM is a structured framework in which molecular activities from the MF aspect are connected in a causal chain using relations from the Relations Ontology (RO). The molecular activities are contextualized with terms from the BP and CC aspects, as well as external ontologies, with the ultimate goal of fully describing a biological system. Currently, GO-CAMs can be browsed on the GO site, downloaded from GitHub, and select causal diagrams can be viewed on gene pages at the Alliance of Genome Resources, allowing biologists to see gene product relationships in the context of their activities. To enable GO curators to capture the richness of biological knowledge that can be represented in GO-CAMs, we: 1) expanded the repertoire of causal RO relations and 2) refined definitions of existing relations to ensure accurate and consistent modeling. We identified three main areas of classification: regulatory vs non-regulatory, direction of effect (i.e. positive or negative), and directness (i.e. temporal proximity of one activity to another). We refined the definition of regulatory relations in RO to include the restriction that they are specific, conditional effects, and made a distinction between direct and indirect regulation, with the former used to describe regulation with no intervening activities, e.g. a regulatory subunit of an enzyme, and the latter used to describe cases where an upstream activity controls a downstream activity but with intervening activities between them, e.g. a DNA-binding transcription factor activity and the activity of the gene product whose expression it controls. In addition, we introduced more specific non-regulatory causal relations to describe constitutive effects in which an upstream activity is required, but normally present and not rate-limiting, for execution of a downstream activity. We also added a relation to link an upstream activity that removes small molecule inputs to a downstream activity when the former activity is not considered regulatory. A series of simple selection criteria has been implemented in the interface of the GO-CAM curation tool, Noctua, to guide curators in using the expanded repertoire of causal relations and ensure consistency and efficiency in curation. Lastly, for further reference, we created documentation on each causal relation, with usage guidelines and links to example models, on the GO Consortium’s wiki. The new causal RO relations and Noctua curation interface will enhance GO’s representation of complex biological systems using the GO-CAM framework. [1] Thomas P.D. et al. (2019) Gene Ontology Causal Activity Modeling (GO-CAM) moves beyond GO annotations to structured descriptions of biological functions and systems. Nat. Genet., 51, 1429–1433.
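    The three classification axes described above (regulatory vs non-regulatory, direction of effect, directness) can be sketched as a small decision function that returns a causal relation label; the labels only approximate RO relation names, and this is an illustration of the curator-facing selection logic, not the Noctua implementation.
```python
# Sketch of the relation-selection criteria described above: the choice of
# causal relation follows from three answers (regulatory?, direction?,
# direct?). Relation labels approximate RO labels; illustration only.
def choose_causal_relation(regulatory, direction, direct):
    """regulatory: bool; direction: 'positive' or 'negative'; direct: bool."""
    if regulatory:
        prefix = "directly" if direct else "indirectly"
        return f"{prefix} {direction}ly regulates"
    # Non-regulatory (e.g. constitutive) causal effects
    return f"causally upstream of, {direction} effect"

print(choose_causal_relation(True, "positive", True))    # directly positively regulates
print(choose_causal_relation(True, "negative", False))   # indirectly negatively regulates
print(choose_causal_relation(False, "positive", True))   # causally upstream of, positive effect
```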
  38. Harnessing veterinary data to drive precision medicine: Expanding the Mondo Disease Ontology and Creating the Vertebrate Breed Ontology
    Sabrina Toro, Kathleen R. Mullen, Nicolas Matentzoglu, Imke Tammen, Frank W. Nicholas, Halie M. Rando, Nicole A. Vasilevsky, Christopher J. Mungall, Melissa Haendel
    Expand the abstract
    The wealth of information from research and health records can be leveraged to support advances in diagnostics and treatments. Existing computational tools, which rely on data standardization, integration, and comparison, have successfully supported human precision medicine. Though species-agnostic, these tools are currently optimized for the diagnosis and treatment of human patients, since the harmonized data mostly come from human health records and animal model databases. Including all non-human animal data would improve translational research and expand diagnostic support and treatment discovery for non-human animals. Data integration and comparison require standards, including ontologies. Current standards can be used for non-human animal data: NCBI Gene for genes, the Uber-anatomy ontology (Uberon) for anatomy, and the Unified Phenotype Ontology (uPheno) or a more species-specific ontology such as the Mammalian Phenotype Ontology (MP) for phenotypes. Here, we report the expansion of the Mondo Disease Ontology (Mondo) for use as a standard for non-human animal diseases, and the creation of a new Vertebrate Breed Ontology (VBO) as a standard for animal breeds. Mondo integrates multiple disease terminologies into a coherent logic-based ontology that provides precise semantic mappings between terms. It represents a hierarchical classification of over 20,000 diseases in humans and across species, covering a wide range of diseases including cancers, infectious diseases, and Mendelian disorders. We improved the coverage of non-human diseases in Mondo by adding new terms represented in veterinary electronic health records and animal databases. In addition, we added axioms to indicate the species affected by a disease, and whether a disease affects a single species or several species. Importantly, semantic links connect a non-human disease to its analogous counterpart in humans to support translational research. VBO was created to be a single source for data standardization and integration of all breed names. Breeds included in VBO currently cover livestock, cat, and dog breeds as determined, defined, and/or recognized by international organizations, communities, and/or experts. VBO includes information about breeds, such as common names and synonyms, breed recognition status, domestication status, breed identifiers/name codes, and references in other databases. Adopting Mondo as the source of disease terms and VBO as the source of breed names, together with other ontologies such as Uberon and uPheno or MP, in databases and veterinary electronic health records will improve data computability and consistency. This will enhance data interoperability, support data integration and comparison, and ultimately improve the current computational tools for both humans and other animals.
  39. Data Curation and management of public proteomics datasets in the PRIDE database
    Deepti Jaiswal Kundu, Shengbo Wang, Suresh Hewapathirana, Selvakumar Kamatchinathan, Chakradhar Bandla, Yasset Perez-Riverol, Juan Antonio Vizcaíno
    Expand the abstract
    Introduction
    The PRoteomics IDEntifications (PRIDE) database at the European Bioinformatics Institute is currently the world-leading repository of mass spectrometry (MS)-based proteomics data [1]. PRIDE is also one of the founding members of the global ProteomeXchange (PX) consortium [2] and, as of January 2023, contained around 31,500 datasets (~83% of all ProteomeXchange datasets). PRIDE is an ELIXIR core data resource, and ProteomeXchange has recently been named a Global Core Biodata Resource by the Global Biodata Coalition [3]. Thanks to the success of PRIDE and ProteomeXchange, the proteomics community is now widely embracing open data policies, a scenario opposite to the situation just a few years ago. PRIDE has therefore grown significantly in recent years (~500 datasets/month were submitted on average over the past two years). Our major challenge is to ensure a fast and efficient data submission process while ensuring that the data representation is correct.
    Data handling and curation
    For each submitted dataset, a validation pipeline is first run to ensure that the data comply with the PRIDE metadata requirements and that the files included in the dataset are correctly formatted. Issues of different types can often occur, so direct interaction between the PRIDE curation team and the users becomes critical. In the second step, the actual data submission takes place and dataset accession numbers are provided to the users. Finally, datasets are released when the corresponding paper is published. The stand-alone PX Submission tool is used by submitters in the data submission process. Improvements have been made in the last year to the submission functionality and to the handling of resubmissions and large datasets. It should also be noted that version 1.0 of the PRIDE data policy was released in May 2022 (https://www.ebi.ac.uk/pride/markdownpage/datapolicy).
    Conclusion
    The quantity, size, and complexity of the proteomics datasets submitted to PRIDE are rapidly increasing. The diversity of proteomics data makes a fully automated data deposition process very challenging, especially since data formats are complex and very heterogeneous. Curators therefore play a very active role in supporting the data submitters in the preparation and quality control of each PRIDE data submission.
    References
    [1] Perez-Riverol Y, Bai J, Bandla C, Hewapathirana S, García-Seisdedos D, Kamatchinathan S, Kundu D, Prakash A, Frericks-Zipper A, Eisenacher M, Walzer M, Wang S, Brazma A, Vizcaíno JA (2022). The PRIDE database resources in 2022: A hub for mass spectrometry-based proteomics evidence. Nucleic Acids Res 50(D1):D543-D552 (34723319). [2] Deutsch EW, Bandeira N, Sharma V, Perez-Riverol Y, Carver JJ, Kundu DJ, García-Seisdedos D, Jarnuczak AF, Hewapathirana S, Pullman BS, Wertz J, Sun Z, Kawano S, Okuda S, Watanabe Y, Hermjakob H, MacLean B, MacCoss MJ, Zhu Y, Ishihama Y, Vizcaíno JA (2020). The ProteomeXchange consortium in 2020: enabling ‘big data’ approaches in proteomics. Nucleic Acids Res 48(D1):D1145-D1152 (31686107). [3] Global Biodata Coalition (2022). Global Core Biodata Resources: Concept and Selection Process. https://doi.org/10.5281/zenodo.5845116
  40. ICGC ARGO Clinical Data Dictionary
    Hardeep Nahal-Bose, Peter Lichter, Ursula Weber, Melanie Courtot
    Expand the abstract
    The International Cancer Genome Consortium Accelerating Research in Genomic Oncology (ICGC ARGO) project is an international initiative to sequence germline and tumour genomes from 100,000 cancer patients, spanning 13 countries and 22 tumour types. ICGC ARGO will link genomic data to extensive clinical data from clinical trials and community cohorts, covering treatment and outcome data, lifestyle, environmental exposure, and family history of disease for a broad spectrum of cancers. The goal is to accelerate the translation of genomic information into the clinic to guide interventions, including diagnosis, treatment, early detection, and prevention. A significant challenge is harmonizing large sets of clinical data from many different tumour types and programs globally. ICGC ARGO has developed a clinical data dictionary to collect high-quality clinical information according to standardized terminologies. The ICGC ARGO Clinical Data Dictionary was developed by the Ontario Institute for Cancer Research (OICR) Data Coordination Centre and the ICGC ARGO Tissue & Clinical Annotation Working Group. It defines a minimal set of clinical fields that must be submitted by all ARGO affiliate programs and is available at https://docs.icgc-argo.org/dictionary. The ICGC ARGO Dictionary defines an event-based data model that captures relationships between different clinical events and enables longitudinal clinical data collection. It is based on common data elements (CDEs) and uses international standardized terminology wherever possible. The dictionary defines a data model consisting of 15 schemas in total. These include 6 core schemas: Sample Registration, Donor, Specimen, Primary Diagnosis, Treatment, and Follow up. There are also 5 schemas for submitting detailed treatment information regarding Chemotherapy, Immunotherapy, Surgery, Radiation, and Hormone Therapy. In addition, there are 4 optional schemas for collecting clinical variables encompassing exposure, family history of disease, biomarkers, and comorbidity. The data model consists of 67 core fields and 104 optional extended fields. Each clinical field is defined by a data tier and an attribute classification, reflecting the importance of the field in terms of clinical data completeness, and validation rules are enforced to ensure data integrity and correctness. Furthermore, the data model is interoperable with other data models such as mCODE/FHIR (Minimal Common Oncology Data Elements), and is being used by several funded projects, including the European-Canadian Cancer Network (EuCanCan) and the Marathon of Hope Cancer Centres Network (MOHCCN). The ICGC ARGO data dictionary is a comprehensive clinical data model that ensures interoperability with other data models and standards and will enable high-quality clinical data collection that will be linked to genomic data to help answer key clinical questions in cancer research.
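    A dictionary that couples required core fields with permissible values and validation rules can be sketched as a lightweight schema check; the field names and permissible values below are placeholders chosen for illustration, not the ARGO dictionary itself.
```python
# Sketch of dictionary-driven validation of a submitted clinical record:
# required core fields plus permissible-value checks. Field names and
# values are illustrative placeholders, not the actual ICGC ARGO dictionary.
DONOR_SCHEMA = {
    "required": ["submitter_donor_id", "vital_status", "primary_site"],
    "permissible": {"vital_status": {"Alive", "Deceased", "Unknown"}},
}

def validate_record(record, schema):
    """Return validation errors for one clinical record against a schema."""
    errors = [f"missing required field: {f}"
              for f in schema["required"] if f not in record]
    for field, allowed in schema["permissible"].items():
        if field in record and record[field] not in allowed:
            errors.append(f"invalid value for {field}: {record[field]!r}")
    return errors

record = {"submitter_donor_id": "DO-0001", "vital_status": "alive"}
print(validate_record(record, DONOR_SCHEMA))
# -> ['missing required field: primary_site', "invalid value for vital_status: 'alive'"]
```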
  41. MolMeDB - Molecules on Membranes Database
    Jakub Juračka, Kateřina Storchmannová, Dominik Martinát, Václav Bazgier, Jakub Galgonek, Karel Berka
    Expand the abstract
    Biological membranes are natural barriers of cells. Membranes play a key role in cell life and also in the pharmacokinetics of drug-like small molecules. There are several ways in which a small molecule can get through a membrane; passive diffusion and active or passive transport via membrane transporters are the most relevant. A huge amount of data is available about interactions between small molecules and membranes, as well as between small molecules and transporters. MolMeDB (https://molmedb.upol.cz/detail/intro) is a comprehensive and interactive database: it contains data from 52 different methods for 40 biological or artificial membranes and for 184 transporters. The data within MolMeDB are collected from scientific papers, from our in-house calculations (COSMOmic and PerMM), and by data mining from several databases. Data in MolMeDB are fully searchable and browsable by name, SMILES, membrane, method, transporter, or dataset, and we offer the collected data openly for further reuse. The data are now also available in RDF format and can be queried using a SPARQL endpoint (https://idsm.elixir-czech.cz/sparql/endpoint/molmedb). Federated queries using the endpoints of other databases are also possible. Recently, the database has been used to analyse the influence of different functional groups on molecule-membrane interactions.
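    The SPARQL endpoint mentioned above can be queried programmatically with any client that speaks the standard SPARQL protocol; the sketch below sends a deliberately generic query (any few triples), since the MolMeDB RDF schema itself is not described here, and consulting the MolMeDB documentation is needed for the actual classes and predicates.
```python
# Sketch of querying the MolMeDB SPARQL endpoint via the standard SPARQL
# protocol (GET with a 'query' parameter, JSON results). The query is
# deliberately generic; see the MolMeDB documentation for the real schema.
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://idsm.elixir-czech.cz/sparql/endpoint/molmedb"
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"

params = urllib.parse.urlencode({"query": QUERY})
request = urllib.request.Request(
    f"{ENDPOINT}?{params}",
    headers={"Accept": "application/sparql-results+json"},
)
with urllib.request.urlopen(request) as response:
    results = json.load(response)

for binding in results["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])
```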
  42. cancercelllines.org - a new curated resource for cancer cell line variants
    Rahel Paloots, Ellery Smith, Dimitris Giagkos, Kurt Stockinger, Michael Baudis
    Expand the abstract
    Cancer cell lines are important models for studying disease mechanisms and developing novel therapeutics. However, they are not always good representations of the disease, as they accumulate mutations over propagation. Additionally, due to human error, cancer cell lines can become misidentified or contaminated. Here, we have collected a set of cancer cell line variant data to facilitate the identification of suitable cell lines in research. Our dataset includes both structural and single nucleotide cancer cell line variants. The set of copy number variants (CNVs) originates mainly from Progenetix, a resource for cancer copy number variants. All available cell lines from Progenetix have been mapped to Cellosaurus (a cell line knowledge resource). In total, over 5,000 cancer cell line CNV profiles are available for over 2,000 distinct cell lines. Additionally, we have a curated set of annotated cancer cell line single nucleotide variants (SNVs) from ClinVar and a collection of known SNVs from CCLE. Curated variants include information about the pathogenicity of the variant as well as the clinical phenotype associated with it. Moreover, to obtain additional CNVs and associated metadata, we performed data mining using natural language processing tools. The results are displayed on an interactive CNV profile graph, a novel feature that allows the selection of a region of interest and shows publications associated with that area. Here, we introduce the features and data included in the database, which are publicly available and freely accessible.