Using Wikidata as an open, community-maintained database of biomedical knowledge
Andrew Su, Ph.D.
@andrewsu
http://sulab.org
July 23, 2017
#BOSC2017
Slides: slideshare.net/andrewsu
(open source tools for open data)
The Gene Wiki project, circa 2008
Huss, PLoS Biol, 2008
Data imported from structured databases
Summarized knowledge via crowdsourcing
Wikidata is to data as Wikipedia is to text
Provide a database of the world’s [biomedical] knowledge that anyone can edit
- Denny Vrandečić
Q13561329 (http://www.wikidata.org/wiki/Q13561329)
Subclass of (Property:P279): Protein (Q8054)
Regulates (Property:P128): Neural development (Q1345738)
Physically interacts with (Property:P129): VLDL receptor (Q1979313), Amyloid beta A4 (Q423510)
Decreased expression in (Property:P1910): Schizophrenia (Q41112), Bipolar disorder (Q131755)
https://www.wikidata.org/wiki/Special:EntityData/Q13561329.json
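The EntityData URL above returns the whole item as JSON. A minimal Python sketch of reading it, assuming the standard Wikibase JSON layout (`entities`, then the item ID, then `claims` keyed by property ID). The function names and the User-Agent string are illustrative and not taken from the talk.

```python
import json
from urllib.request import Request, urlopen

ENTITY_DATA_URL = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def fetch_entity(qid):
    """Download the JSON document for one item (requires network access)."""
    req = Request(
        ENTITY_DATA_URL.format(qid=qid),
        headers={"User-Agent": "entity-data-example/0.1"},  # illustrative value
    )
    with urlopen(req) as resp:
        return json.load(resp)["entities"][qid]


def item_valued_statements(entity):
    """Return (property_id, target_item_id) pairs for statements pointing at items."""
    pairs = []
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            snak = statement["mainsnak"]
            if snak.get("snaktype") != "value":
                continue  # "somevalue" / "novalue" statements have no datavalue
            datavalue = snak["datavalue"]
            if datavalue["type"] == "wikibase-entityid":
                pairs.append((prop, datavalue["value"]["id"]))
    return pairs
```

Calling `item_valued_statements(fetch_entity("Q13561329"))` would list pairs in the same property/item form as the identifiers shown on the slide.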
Qualifiers
References
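In the Wikibase JSON, qualifiers and references hang off each individual statement rather than off the item. The sketch below flattens one statement into a small dictionary. It assumes the standard layout (`mainsnak`, `qualifiers` keyed by property, `references` as a list of `snaks` maps), and the helper names are hypothetical.

```python
def snak_value(snak):
    """Return a compact value for one snak: an item ID, a raw value, or None."""
    if snak.get("snaktype") != "value":
        return None  # "somevalue" / "novalue" snaks carry no datavalue
    value = snak["datavalue"]["value"]
    if isinstance(value, dict) and "id" in value:
        return value["id"]  # entity-valued snak, e.g. "Q41112"
    return value


def snaks_by_property(snak_map):
    """Map each property ID to the list of compact values of its snaks."""
    return {prop: [snak_value(s) for s in snaks] for prop, snaks in snak_map.items()}


def describe_statement(statement):
    """Summarize one statement: main value, qualifiers, and references."""
    main = statement["mainsnak"]
    return {
        "property": main["property"],
        "value": snak_value(main),
        "qualifiers": snaks_by_property(statement.get("qualifiers", {})),
        "references": [
            snaks_by_property(ref.get("snaks", {}))
            for ref in statement.get("references", [])
        ],
    }
```

Keeping references as a list preserves the fact that a single statement can be backed by several independent sources.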
EMA
GWAS Central
PubChem
Simple data retrieval
39 genes
gene geneLabel gene geneLabel gene geneLabel gene geneLabel
Q5013317 COL22A1 Q18027370 IGSF3 Q18053559 CDHR3 Q14903974 SMAD3
Q14912759 SLC22A5 Q18045382 HPSE2 Q18045669 ATG3 Q18033889 IL1RL1
Q14914243 PSAP Q18048437 IL33 Q18035037 RAD50 Q17917202 ERBB4
Q14907990 SLC30A8 Q18051900 PYHIN1 Q18036984 FBXL7 Q18027836 IL6R
Q18025002 GAB1 Q17709208 ACO1 Q18033919 XPR1 Q18030185 NOTCH4
Q18035589 C6orf10 Q18027822 IL2RB Q15326496 RORA Q18030409 PDE4D
Q18054256 GSDMA Q18030364 PBX2 Q18042132 GSDMB Q18045645 IKZF4
Q18058487 C5orf56 Q18037773 ABI3BP Q18029145 MKLN1 Q18039979 KLHL5
Q18030785 PRKG1 Q18039623 CTNNA3 Q18036729 RAP1GAP2 Q18026947 HLA-DQA1
Q18033424 IL18R1 Q18046350 ZNF665 Q14878303 IL13
“Retrieve genes with GWAS association with asthma”
http://bit.ly/bosc2017_wikidata
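A retrieval like this can be expressed as a SPARQL query against the Wikidata Query Service. The Python sketch below is not the query behind the link above. It assumes the association is modeled as a "genetic association" statement (P2293) on the disease item, qualified with "determination method" (P459) set to a GWAS item (assumed Q1098876). All identifiers are parameters and should be checked on wikidata.org before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def gwas_genes_query(disease_qid, association_pid="P2293",
                     method_pid="P459", gwas_qid="Q1098876"):
    """Build a query for genes linked to a disease by a GWAS-qualified statement.

    The default property and item IDs are assumptions, not taken from the talk.
    """
    return f"""
SELECT DISTINCT ?gene ?geneLabel WHERE {{
  wd:{disease_qid} p:{association_pid} ?assoc .
  ?assoc ps:{association_pid} ?gene ;
         pq:{method_pid} wd:{gwas_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
""".strip()


def rows_from_results(data):
    """Turn SPARQL JSON results into (item_id, label) tuples."""
    return [
        (b["gene"]["value"].rsplit("/", 1)[-1], b["geneLabel"]["value"])
        for b in data["results"]["bindings"]
    ]


def run_query(query):
    """Send the query to the public endpoint (requires network access)."""
    url = SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
    req = Request(url, headers={"User-Agent": "sparql-example/0.1"})  # illustrative
    with urlopen(req) as resp:
        return rows_from_results(json.load(resp))
```

For asthma, the call would look like `run_query(gwas_genes_query("Q35869"))`, where Q35869 is assumed to be the asthma item.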
Data integration
“Retrieve genes with GWAS association with asthma and gene product is localized to membrane”
gene geneLabel gene geneLabel gene geneLabel gene geneLabel
Q14912759 SLC22A5 Q18027370 IGSF3 Q18035037 RAD50 Q18027836 IL6R
Q14914243 PSAP Q18033424 IL18R1 Q18033919 XPR1 Q18030409 PDE4D
Q14907990 SLC30A8 Q18045382 HPSE2 Q18042132 GSDMB Q18030185 NOTCH4
Q18035589 C6orf10 Q18027822 IL2RB Q18036729 RAP1GAP2 Q18026947 HLA-DQA1
Q18054256 GSDMA Q18053559 CDHR3 Q18033889 IL1RL1
Q18030785 PRKG1 Q14903974 SMAD3 Q17917202 ERBB4
22 genes
http://bit.ly/bosc2017_wikidata
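The integration step amounts to a join across statements that originate from different source databases: disease to gene, gene to gene product, gene product to cellular component. A hedged Python sketch of such a query builder follows. The property IDs for "encodes" (P688) and "cell component" (P681) and the association property (P2293) are assumptions, and the GWAS restriction is omitted to keep the join itself visible.

```python
def localized_genes_query(disease_qid, component_qid, association_pid="P2293",
                          encodes_pid="P688", component_pid="P681"):
    """Build a three-hop join: disease -> gene -> gene product -> cell component.

    All default property IDs are assumptions to be verified on wikidata.org.
    """
    return f"""
SELECT DISTINCT ?gene ?geneLabel WHERE {{
  wd:{disease_qid} wdt:{association_pid} ?gene .
  ?gene wdt:{encodes_pid} ?protein .
  ?protein wdt:{component_pid} wd:{component_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
""".strip()


# Hypothetical usage: Q35869 is assumed to be asthma, Q14349455 is assumed to be
# the cellular component "membrane".
query = localized_genes_query("Q35869", "Q14349455")
```

Because every hop shares a variable with the next (`?gene`, `?protein`), the query engine performs the integration that would otherwise require downloading and merging separate databases.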
Computing on provenance
“Retrieve genes with GWAS association with asthma and gene product is localized to membrane (non-IEA)”
gene geneLabel gene geneLabel gene geneLabel
Q14912759 SLC22A5 Q18045382 HPSE2 Q17917202 ERBB4
Q14914243 PSAP Q18027822 IL2RB Q18027836 IL6R
Q14907990 SLC30A8 Q14903974 SMAD3 Q18030409 PDE4D
Q18027370 IGSF3 Q18035037 RAD50 Q18030185 NOTCH4
Q18033424 IL18R1 Q18036729 RAP1GAP2 Q18026947 HLA-DQA1
15 genes
http://bit.ly/bosc2017_wikidata
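The effect of the provenance filter can be checked directly from the two result tables on these slides. The short Python sketch below takes the 22 membrane-localized genes and the 15 that remain after excluding IEA annotations, and lists the difference. Reading the dropped genes as "membrane annotation supported only by IEA evidence" is an interpretation, not something the slides state.

```python
# Gene symbols copied from the "Data integration" slide (22 genes).
localized_to_membrane = {
    "SLC22A5", "IGSF3", "RAD50", "IL6R", "PSAP", "IL18R1", "XPR1", "PDE4D",
    "SLC30A8", "HPSE2", "GSDMB", "NOTCH4", "C6orf10", "IL2RB", "RAP1GAP2",
    "HLA-DQA1", "GSDMA", "CDHR3", "IL1RL1", "PRKG1", "SMAD3", "ERBB4",
}

# Gene symbols copied from the "Computing on provenance" slide (15 genes).
non_iea = {
    "SLC22A5", "HPSE2", "ERBB4", "PSAP", "IL2RB", "IL6R", "SLC30A8", "SMAD3",
    "PDE4D", "IGSF3", "RAD50", "NOTCH4", "IL18R1", "RAP1GAP2", "HLA-DQA1",
}

# Genes that disappear once IEA-only annotations are excluded.
dropped = sorted(localized_to_membrane - non_iea)
print(len(dropped), dropped)
# 7 ['C6orf10', 'CDHR3', 'GSDMA', 'GSDMB', 'IL1RL1', 'PRKG1', 'XPR1']
```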
Leveraging the Disease Ontology structure
“Retrieve genes with GWAS association with any respiratory disease and gene product is localized to membrane (non-IEA)”
31 genes / 8 diseases
diseaseGALabel gene_counts geneList
asthma 15 SMAD3, RAP1GAP2, IL18R1, HPSE2, SLC30A8, SLC22A5, PSAP, ERBB4, HLA-DQA1, IGSF3, IL2RB, IL6R, NOTCH4, PDE4D, RAD50
chronic obstructive pulmonary disease 5 HLA-C, SFTPD, ANXA5, ANXA11, ATP2C2
lung cancer 3 TGM5, VTI1A, PHACTR2
interstitial lung disease 2 DSP, ATP11A
non-small-cell lung carcinoma 2 NALCN, DLST
nasopharynx carcinoma 2 ITGA9, TNFRSF19
adenocarcinoma of the lung 1 BTNL2
pulmonary emphysema 1 BICD1
http://bit.ly/bosc2017_wikidata
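Walking the disease hierarchy can be done with a SPARQL property path over "subclass of" (P279, shown earlier in the talk). The Python sketch below builds such a query and groups the resulting disease/gene pairs into the three columns of the table above. The association property P2293 is an assumption, `Q_RESPIRATORY_DISEASE` is a placeholder rather than a real item ID, and the membrane and non-IEA restrictions are left out for brevity.

```python
from collections import defaultdict


def disease_subtree_query(root_disease_qid, association_pid="P2293",
                          subclass_pid="P279"):
    """Build a query over a disease and all of its transitive subclasses."""
    return f"""
SELECT DISTINCT ?diseaseGA ?diseaseGALabel ?gene ?geneLabel WHERE {{
  ?diseaseGA wdt:{subclass_pid}* wd:{root_disease_qid} ;
             wdt:{association_pid} ?gene .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
""".strip()


def group_by_disease(rows):
    """Group (disease_label, gene_label) pairs into (disease, count, genes) rows,
    ordered by descending gene count and then by disease name."""
    grouped = defaultdict(set)
    for disease, gene in rows:
        grouped[disease].add(gene)
    table = [(d, len(genes), sorted(genes)) for d, genes in grouped.items()]
    return sorted(table, key=lambda row: (-row[1], row[0]))


# Placeholder root item; the real ID of the respiratory disease class is not
# given in the slides.
query = disease_subtree_query("Q_RESPIRATORY_DISEASE")
```

The `*` after the property makes the path match zero or more "subclass of" hops, so the root class itself and every descendant disease are included.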
Opportunistic integration
diseaseGALabel exposureLabel
lung cancer arsenic pentoxide exposure
lung cancer HN1 exposure
lung cancer mechlorethamine exposure
lung cancer HN3 exposure
asthma Phenacyl chloride exposure
pulmonary emphysema phosgene exposure
“Retrieve genes with GWAS association with any respiratory disease and gene product is localized to membrane (non-IEA) and show causative chemical hazards”
4 diseases / 6 chemical hazards
http://bit.ly/bosc2017_wikidata
Small data to big data
?
Chlambase.org for the Chlamydia research community
Community-specific knowledge
Genetic mutants, gene expression, host-pathogen interactions, orthologs, …
Domain-specific applications based on Wikidata
Chlambase
Open source
github.com/SuLab/GeneWikiCentral
github.com/SuLab/wikidataintegrator – python module for Wikidata
github.com/SuLab/scheduled-bots – bot automation framework
github.com/SuLab/WikiGenomes.org
github.com/SuLab/ChlamBase.org
github.com/SuLab/Genewiki-ShEx – data models
github.com/SuLab/wdbiothings – wrapper for BioThings APIs
Expert interfaces
License
Crowd volunteers and partners
Andra Waagmeester
Lynn Schriml
Elvira Mitraka
U. Maryland, Baltimore
Micelio
UBC
Paul Pavlidis
Ben Good
Greg Stupp
Sebastian Burgstaller
Tim Putman
Ginger Tsueng
Nuria Queralt Rosinach
bit.ly/genewikidata
sulab.org
Join us!
U. Washington
Kevin Hybiske
