CIKM2020 Keynote: Accelerating discovery science with an Internet of FAIR data and services

Accelerating Discovery Science
with an Internet of FAIR Data and Services
@micheldumontier::CIKM:2020-10-21
Michel Dumontier, Ph.D.
Distinguished Professor of Data Science
Director, Institute of Data Science

The world is awash with vast amounts of data

A common rejection module (CRM) for acute rejection across multiple organs identifies novel
therapeutics for organ transplantation
Khatri et al. JEM. 210 (11): 2205
DOI: 10.1084/jem.20122709
Main Findings:
1. CRM of 11 overexpressed genes predicted future injury to a graft
2. Mice treated with existing drugs against specific CRM genes extended graft survival
3. Retrospective EHR data analysis supports treatment prediction
Key Observations:
1. Meta-analysis offers a more reliable estimate of the magnitude of the effect
2. Data can be used to generate and support/dispute new hypotheses

However, significant effort is
still needed to find the right
dataset(s), make sense of them,
and use them for a new purpose

Our ability to reproduce landmark studies is surprisingly low:
39% (39/100) in psychology [1]
21% (14/67) in pharmacology [2]
11% (6/53) in cancer [3]
unsatisfactory in machine learning [4]
[1] doi:10.1038/nature.2015.17433
[2] doi:10.1038/nrd3439-c1
[3] doi:10.1038/483531a
[4] https://openreview.net/pdf?id=By4l2PbQ-
Most published research findings are false.
- John Ioannidis, Stanford University
PLoS Med 2005;2(8): e124.

What hope do we really have to realize
?

It’s time to completely rethink
how we perform research

Poor quality (meta)data → Reproducibility crisis → Translational failure

Broken windows theory:
visible signs of crime, anti-social behavior, and civil disorder create an
environment that encourages more serious crimes

Inadequate reusability theory:
Poor quality metadata and the inaccessibility of original research results
make it less likely to reproduce original work, resulting in an ineffective
translation of research into useful applications

It’s time to completely rethink
how we perform research
(and how we document and report it)

Lambin et al. Radiother Oncol. 2013. 109(1):159-64. doi: 10.1016/j.radonc.2013.07.007

Rethinking Publishing Scientific Research
Data Science. 2017 1(1-2):139-154. DOI: 10.3233/DS-170010
http://www.tkuhn.org/pub/sempub/

De-centralized knowledge graphs
Kuhn T., Chichester C., Krauthammer M., Dumontier M. (2015) Publishing
Without Publishers: A Decentralized Approach to Dissemination, Retrieval, and
Archiving of Data. In: Arenas M. et al. (eds) The Semantic Web - ISWC 2015.
ISWC 2015. Lecture Notes in Computer Science, vol 9366. Springer, Cham

We need a new social contract, supported
by legal and technological infrastructure
to make digital resources available in a
responsible manner

Human-machine collaboration
will be crucial to our future success

An international, bottom-up paradigm for
the discovery and reuse of digital content
for the machines that people use

http://www.nature.com/articles/sdata201618

FAIR in a nutshell
FAIR aims to enhance social and economic outcomes by facilitating the
discovery and reuse of digital resources through key requirements:
– unique identifiers to distinguish and retrieve all forms of digital content and
knowledge
– high quality (meta)data to enhance discovery of relevant digital resources
– use of common vocabularies to facilitate query and statistical analysis
– establishment of community standards to reduce the effort in data reuse
– detailed provenance to provide adequate context and to enable reproducibility
– registered in appropriate repositories to fulfill a promise to future content seekers
– simpler terms of use to clarify expectations and intensify innovation
– social and technological commitments to make data ready for intelligent applications

The lack of FAIR data costs the European Economy a minimum of €10.2bn per year
EC:DG R&I; PWC 2018 Report: Cost-benefit analysis for FAIR research data

Why Should *you* Go FAIR?
• Makes it easier for you to use your own data for a new purpose
• Makes it easier for other people to find, use and cite your
data, and for them to understand what you expect in return
• Makes it easier/possible for people to verify your work
• Ensures that the data are available in the future, especially as
you may not want the responsibility
• Satisfies the expectations around data management from
your institution, funding agency, journal, and peers

Let’s build and use the
Internet of FAIR data and services

FAIRification process
GO FAIR Fairification: https://www.go-fair.org/fair-principles/fairification-process/
FAIRplus FAIR cookbook: https://fairplus.github.io/cookbook-dev/intro.html
Utrecht FAIR: https://www.uu.nl/en/research/research-data-management/guides/how-to-make-your-data-fair
EC H2020 Guidelines: https://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf

http://w3id.org/AmIFAIR
Other schemes: https://fairassist.org

The Semantic Web
is a portal to the web of knowledge
standards for publishing, sharing and querying
facts, expert knowledge and services
scalable approach for the discovery
of independently constructed,
collaboratively described,
distributed knowledge
(in principle)

https://lod-cloud.net/

@micheldumontier::PLDN:2020-01-22
Success depends on quality of metadata
Search registries for relevant datasets

Metadata identifier
Resource identifier
Standardized, machine readable format
Use of community vocabularies
License?
Provenance?
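
The checklist above can be turned into a simple automated check. The following is a minimal sketch in Python, assuming a dataset description held as a plain dictionary. The field names (metadata_id, resource_id, format, vocabularies, license, provenance) and the example record are hypothetical; they are not taken from the HCLS dataset description or from any registry.

```python
# Hypothetical field names mirroring the checklist items above;
# they are illustrative and not drawn from any published standard.
REQUIRED_FIELDS = [
    "metadata_id",   # identifier of the metadata record itself
    "resource_id",   # identifier of the dataset being described
    "format",        # standardized, machine-readable format
    "vocabularies",  # community vocabularies used
    "license",       # terms of use
    "provenance",    # where the data came from and how it was produced
]


def missing_fields(record):
    """Return the checklist fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# A made-up dataset description that lacks a license and provenance.
example = {
    "metadata_id": "https://example.org/metadata/1",
    "resource_id": "https://example.org/dataset/1",
    "format": "text/turtle",
    "vocabularies": ["http://purl.org/dc/terms/"],
}

print(missing_fields(example))
```

A real assessment would check far more than the presence of fields, for example whether identifiers resolve and whether the license is machine readable.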

The standard (http://www.w3.org/TR/hcls-dataset/) is registered in FAIRsharing

Linked Data for the Life Sciences
Bio2RDF is an open source project that uses semantic web
technologies to make it easier to reuse biomedical data
• 30+ biomedical data sources
• 10B+ interlinked statements
• EBI, SIB, NCBI, DBCLS, NCBO, and many others produce this content
chemicals/drugs/formulations
genomes/genes/proteins, domains
interactions, complexes & pathways
animal models and phenotypes
disease, genetic markers, treatments
terminologies & publications
Alison Callahan, Jose Cruz-Toledo, Peter Ansell, Michel Dumontier:
Bio2RDF Release 2: Improved Coverage, Interoperability and
Provenance of Life Science Linked Data. ESWC 2013: 200-212

Query multiple databases on the biological web of data
Phenotypes of knock-out mouse models for the targets of a selected drug (Imatinib)
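
A cross-database question of this kind is a chain of joins over linked statements. The sketch below illustrates the idea in plain Python over a toy, in-memory list of triples. It is not Bio2RDF's actual query, schema, or data: every identifier, predicate name, and helper function here is invented for illustration, and a real query would be written in SPARQL against the published endpoints.

```python
# Toy triples standing in for three independently published sources.
# All identifiers and predicate names are invented for illustration.
TRIPLES = [
    ("drug:D1", "targets", "protein:P1"),
    ("drug:D1", "targets", "protein:P2"),
    ("protein:P1", "encoded_by", "gene:G1"),
    ("protein:P2", "encoded_by", "gene:G2"),
    ("model:M1", "knocks_out", "gene:G1"),
    ("model:M1", "has_phenotype", "phenotype:A"),
    ("model:M2", "knocks_out", "gene:G2"),
    ("model:M2", "has_phenotype", "phenotype:B"),
    ("model:M3", "knocks_out", "gene:G3"),
    ("model:M3", "has_phenotype", "phenotype:C"),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is a known triple."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is a known triple."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


def knockout_phenotypes(drug):
    """Follow drug -> target protein -> gene -> knock-out model -> phenotype."""
    phenotypes = set()
    for protein in objects(drug, "targets"):
        for gene in objects(protein, "encoded_by"):
            for model in subjects("knocks_out", gene):
                phenotypes.update(objects(model, "has_phenotype"))
    return sorted(phenotypes)


print(knockout_phenotypes("drug:D1"))
```

The join only works because the sources share identifiers for proteins and genes, which is the point of publishing them as linked data.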

Explore what we know, and formulate hypotheses about what we don’t,
by exploring a probabilistic semantic knowledge graph,
and validate them against pipelines for drug discovery
Finding melanoma drugs through a probabilistic knowledge graph.
PeerJ Computer Science. 2017. 3:e106 https://doi.org/10.7717/peerj-cs.106

Reproduce original research
AUC 0.91 across all therapeutic indications. Scripts not available. Feature tables available.
Result: AUC 0.83 … doesn’t match! (but now you can see what exactly we did)
Towards FAIR protocols and workflows: the OpenPREDICT use case. 2020. PeerJ Computer Science 6:e281
https://doi.org/10.7717/peerj-cs.281
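
For readers unfamiliar with the metric being compared, AUC (area under the ROC curve) can be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The following is a minimal sketch of that definition in Python. It is not the OpenPREDICT code, and the labels and scores are made-up values, not data from either study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = known indication, 0 = not) and predicted scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
print(round(roc_auc(labels, scores), 3))
```

Because the value depends on exactly which pairs are scored and how they are split for evaluation, two analyses of the same feature tables can report different AUCs unless the full protocol is shared.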

Explore disease pathophysiology and treatment

Mine distributed, access restricted FAIR datasets
in a privacy preserving manner
Maastricht Study + MUMC CBS
Goal is to learn high confidence determinants of health in a privacy preserving manner
over vertically partitioned data from the Maastricht Study and Statistics Netherlands.
The data are made available through FAIR data stations that provide access to
allowable subsets of data to authorized users of approved algorithms.
Establish a new social, legal, ethical and technological infrastructure for discovery
science in and across health and non-health settings, including scalable governance
and flexible consent to underpin the responsible use of Big Data.
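
The access rule described above (allowable subsets of data, authorized users, approved algorithms) can be sketched as three checks made before any computation runs. The Python below is a hypothetical illustration only: the DataStation class, its parameters, and the sample rows are assumptions made for this sketch, not the project's actual infrastructure, and it does not address vertical partitioning or provide any formal privacy guarantee.

```python
from statistics import mean


class DataStation:
    """Hypothetical data station that releases only approved aggregates."""

    def __init__(self, rows, allowed_columns, authorized_users, approved_algorithms):
        self._rows = rows                                # record-level data never leaves
        self._allowed_columns = set(allowed_columns)     # allowable subset of the data
        self._authorized_users = set(authorized_users)
        self._approved = dict(approved_algorithms)       # name -> function over values

    def run(self, user, algorithm, column):
        if user not in self._authorized_users:
            raise PermissionError("user is not authorized")
        if algorithm not in self._approved:
            raise PermissionError("algorithm is not approved")
        if column not in self._allowed_columns:
            raise PermissionError("column is not in the allowable subset")
        values = [row[column] for row in self._rows]
        return self._approved[algorithm](values)


# Made-up rows; only the "age" column may be queried, and only via "mean".
station = DataStation(
    rows=[{"age": 50, "income": 30}, {"age": 60, "income": 40}, {"age": 70, "income": 50}],
    allowed_columns=["age"],
    authorized_users=["alice"],
    approved_algorithms={"mean": mean},
)
print(station.run("alice", "mean", "age"))
```

The caller receives only the algorithm's result, while requests from other users, for other columns, or with unapproved algorithms are refused.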

FAIR data and services
to accelerate discovery science

Summary
FAIR represents a global initiative to enhance the discovery and reuse of all kinds of
digital resources. It is a work in progress and it needs you!
FAIR requires new social, legal, ethical, scientific and technological infrastructure:
– How does your research group or community make their data/findings FAIR?
– What support does your organization provide you?
– Are you making use of all the data and findings that you could?
– What is responsible data science and artificial intelligence?
Semantics, coupled with AI technologies, may enable humans, aided by intelligent
machine agents, to exploit the Internet of FAIR data and services, and hence to
accelerate discovery in biomedicine and in other disciplines.

Acknowledgements
FAIR
Dumontier Lab (Maastricht University, Stanford University, Carleton University)
MU: Seun Adekunle, Thales Bertaglia, Remzi Celebi, Yenisel Calana, Ricardo De Miranda Azevedo, Vincent Emonet, Lars Jacobs, Andreea Grigoriu,
Carlos Guerrero, Tim Hendriks, Massimiliano Grassi, Andine Havelange, Pedro Hernandez Serrano, Vikas Jaiman, Parveen Kumar, Lianne Ippel,
Alexander Malic, Helder Monteiro, Stefan Meier, Kody Moodley, Stuti Nayak, Hercules Panoutsopoulos, Linda Rieswijk, Carola Roubin, Nadine
Rouleaux, Claudia van open, Chang Sun, Johan van Soest, Binosha Weerarathna, Turgay Saba, Weiwei Wang, Jinzhou Yang, Amrapali Zaveri, Leto Peel,
Rohan Nanda, Visara Urovi, Andre Dekker, David Townend, Gijs van Dijck, Christopher Brewster
SU: Sandeep Ayyar, Remzi Celebi, Shima Dastgheib, Maulik Kamdar, David Odgers, Maryam Panahiazar, Amrapali Zaveri
CU: Alison Callahan, Jose Cruz-Toledo, Natalia Villanueva-Rosales

michel.dumontier@maastrichtuniversity.nl
Website: http://maastrichtuniversity.nl/ids
The mission of the Institute of Data Science at Maastricht University is to foster a
collaborative environment for multi-disciplinary data science research,
interdisciplinary training, and data-driven innovation.
We tackle key scientific, technical, social, legal and ethical issues that advance our
understanding across a variety of disciplines and strengthen our communities in the
face of these developments.
