WikiPathways is a biological pathway database that describes the interactions between biochemical entities in biological processes [1,2,3,4]. It can be downloaded and used in various formats, one of which is the Resource Description Framework (RDF) [5].
The WikiPathways SPARQL endpoint can be found at http://sparql.wikipathways.org/. SPARQL allows you to query much of the content of the WikiPathways data in a machine-readable way, which has been used, for example, in the Open PHACTS project [6,7].
This book discusses how SPARQL can be used to extract information, using numerous example queries.
The following query, for example, returns metadata about the data currently loaded in the public SPARQL endpoint at http://sparql.wikipathways.org:
SPARQL sparql/metadata.rq
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX pav: <http://purl.org/pav/>
# List each loaded dataset with its title, creation date, and license.
select distinct ?dataset (str(?titleLit) as ?title) ?date ?license where {
  ?dataset a void:Dataset ;
    dcterms:title ?titleLit ;
    dcterms:license ?license ;
    pav:createdOn ?date .
}
This gives the following output:
dataset | title | date | license |
http://data.wikipathways.org/20191210/rdf/ | WikiPathways RDF 20191210 | 2019-12-09T23:28:23.591Z | http://creativecommons.org/publicdomain/zero/1.0/ |
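To illustrate what querying the endpoint in a machine-readable way can look like, here is a minimal Python sketch that sends the metadata query over HTTP and flattens the result. It relies on the standard SPARQL 1.1 Protocol (a `query` parameter on a GET request) and the SPARQL JSON results format. The `/sparql` path in `ENDPOINT` and the helper names `build_request`, `parse_bindings`, and `run_query` are assumptions made for this example; the text above only gives http://sparql.wikipathways.org/ as the address.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the query service is served under /sparql on the public host.
ENDPOINT = "http://sparql.wikipathways.org/sparql"

METADATA_QUERY = """
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX pav: <http://purl.org/pav/>
select distinct ?dataset (str(?titleLit) as ?title) ?date ?license where {
  ?dataset a void:Dataset ;
    dcterms:title ?titleLit ;
    dcterms:license ?license ;
    pav:createdOn ?date .
}
"""


def build_request(endpoint, query):
    """Build a GET request that asks for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def parse_bindings(payload):
    """Turn a SPARQL JSON results document into a list of plain dicts."""
    variables = payload["head"]["vars"]
    rows = []
    for binding in payload["results"]["bindings"]:
        # Unbound variables are simply absent from a binding.
        rows.append({v: binding[v]["value"] for v in variables if v in binding})
    return rows


def run_query(endpoint, query):
    """Send the query and return the parsed rows (requires network access)."""
    with urllib.request.urlopen(build_request(endpoint, query), timeout=30) as response:
        return parse_bindings(json.load(response))
```

Calling `run_query(ENDPOINT, METADATA_QUERY)` would then return one dictionary per result row, with the keys `dataset`, `title`, `date`, and `license`, provided the endpoint is reachable at the assumed address.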
To give some idea of the content of the SPARQL endpoint, this section presents some overall statistics.
We can list the number of pathways for each species available in WikiPathways with this query:
SPARQL sparql/pathwayCountBySpecies.rq
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
# Count pathways per organism; the GROUP BY makes the aggregation explicit.
SELECT DISTINCT ?organism (str(?label) as ?name) (count(?pw) as ?pathwayCount)
WHERE {
  ?pw dc:title ?title ;
    wp:organism ?organism ;
    wp:organismName ?label .
}
GROUP BY ?organism ?label
ORDER BY DESC(?pathwayCount)
It shows us that there is a strong bias towards human pathways:
organism | name | pathwayCount |
http://purl.obolibrary.org/obo/NCBITaxon_9606 | Homo sapiens | 1044 |
http://purl.obolibrary.org/obo/NCBITaxon_9913 | Bos taurus | 274 |
http://purl.obolibrary.org/obo/NCBITaxon_10090 | Mus musculus | 194 |
http://purl.obolibrary.org/obo/NCBITaxon_10116 | Rattus norvegicus | 155 |
http://purl.obolibrary.org/obo/NCBITaxon_4932 | Saccharomyces cerevisiae | 115 |
http://purl.obolibrary.org/obo/NCBITaxon_7955 | Danio rerio | 83 |
http://purl.obolibrary.org/obo/NCBITaxon_6239 | Caenorhabditis elegans | 61 |
http://purl.obolibrary.org/obo/NCBITaxon_9598 | Pan troglodytes | 46 |
http://purl.obolibrary.org/obo/NCBITaxon_9615 | Canis familiaris | 44 |
http://purl.obolibrary.org/obo/NCBITaxon_9031 | Gallus gallus | 40 |
http://purl.obolibrary.org/obo/NCBITaxon_3702 | Arabidopsis thaliana | 31 |
http://purl.obolibrary.org/obo/NCBITaxon_7227 | Drosophila melanogaster | 30 |
http://purl.obolibrary.org/obo/NCBITaxon_7165 | Anopheles gambiae | 14 |
http://purl.obolibrary.org/obo/NCBITaxon_1773 | Mycobacterium tuberculosis | 12 |
http://purl.obolibrary.org/obo/NCBITaxon_4530 | Oryza sativa | 11 |
http://purl.obolibrary.org/obo/NCBITaxon_562 | Escherichia coli | 9 |
http://purl.obolibrary.org/obo/NCBITaxon_3694 | Populus trichocarpa | 5 |
http://purl.obolibrary.org/obo/NCBITaxon_9796 | Equus caballus | 5 |
http://purl.obolibrary.org/obo/NCBITaxon_4081 | Solanum lycopersicum | 4 |
http://purl.obolibrary.org/obo/NCBITaxon_4577 | Zea mays | 4 |
http://purl.obolibrary.org/obo/NCBITaxon_1423 | Bacillus subtilis | 2 |
http://purl.obolibrary.org/obo/NCBITaxon_5833 | Plasmodium falciparum | 1 |
http://purl.obolibrary.org/obo/NCBITaxon_5518 | Gibberella zeae | 1 |
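The bias towards human pathways can be put into numbers with a short calculation. The Python sketch below copies the counts from the table above and computes the human share; it assumes that each pathway is counted for exactly one species, so that the counts can simply be summed.

```python
# Pathway counts per species, copied from the table above.
pathway_counts = {
    "Homo sapiens": 1044, "Bos taurus": 274, "Mus musculus": 194,
    "Rattus norvegicus": 155, "Saccharomyces cerevisiae": 115,
    "Danio rerio": 83, "Caenorhabditis elegans": 61, "Pan troglodytes": 46,
    "Canis familiaris": 44, "Gallus gallus": 40, "Arabidopsis thaliana": 31,
    "Drosophila melanogaster": 30, "Anopheles gambiae": 14,
    "Mycobacterium tuberculosis": 12, "Oryza sativa": 11,
    "Escherichia coli": 9, "Populus trichocarpa": 5, "Equus caballus": 5,
    "Solanum lycopersicum": 4, "Zea mays": 4, "Bacillus subtilis": 2,
    "Plasmodium falciparum": 1, "Gibberella zeae": 1,
}

# Assumption: no pathway is listed under more than one species.
total = sum(pathway_counts.values())
human_share = pathway_counts["Homo sapiens"] / total

print(f"{total} pathways in total, {human_share:.1%} human")
# prints "2185 pathways in total, 47.8% human"
```

For this data release, almost half of all pathways are therefore human pathways.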
Counting metabolites is tricky, as metabolites that are biologically the same (e.g. different charge states) can have different identifiers. A further complication is that not all metabolites in WikiPathways have their stereochemistry defined, for example because it is biologically obvious, as for amino acids. But we can count the number of Wikidata identifiers to get a reasonable estimate:
SPARQL sparql/metaboliteCountBySpecies.rq
PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
PREFIX dcterms: <http://purl.org/dc/terms/>
# Count the distinct Wikidata identifiers of metabolites per species.
select (count(distinct ?wikidata) as ?count) (str(?label) as ?species) where {
  ?metabolite a wp:Metabolite ;
    wp:bdbWikidata ?wikidata ;
    dcterms:isPartOf ?pw .
  ?pw wp:organismName ?label .
} GROUP BY ?label ORDER BY DESC(?count)
This tells us:
count | species |
2893 | Homo sapiens |
843 | Bos taurus |
840 | Mus musculus |
489 | Rattus norvegicus |
439 | Arabidopsis thaliana |
338 | Saccharomyces cerevisiae |
169 | Danio rerio |
125 | Canis familiaris |
104 | Pan troglodytes |
97 | Mycobacterium tuberculosis |
81 | Caenorhabditis elegans |
75 | Gallus gallus |
69 | Oryza sativa |
65 | Escherichia coli |
63 | Drosophila melanogaster |
58 | Zea mays |
49 | Anopheles gambiae |
39 | Solanum lycopersicum |
31 | Populus trichocarpa |
20 | Equus caballus |
13 | Plasmodium falciparum |
11 | Gibberella zeae |
8 | Bacillus subtilis |
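Combined with the pathway counts per species, these numbers allow a rough comparison across species. The Python sketch below divides the number of distinct Wikidata identifiers by the number of pathways for a few species, with all values copied from the two tables. This ratio is only an illustration: a metabolite shared by several pathways is counted once, so it is not a true per-pathway average.

```python
# (distinct Wikidata identifiers, pathways) for a few species, from the tables.
stats = {
    "Homo sapiens": (2893, 1044),
    "Bos taurus": (843, 274),
    "Mus musculus": (840, 194),
    "Arabidopsis thaliana": (439, 31),
}

# Distinct metabolite identifiers per pathway, as a rough indicator only.
ratio = {species: ids / pathways for species, (ids, pathways) in stats.items()}

for species, value in sorted(ratio.items(), key=lambda item: item[1], reverse=True):
    print(f"{species}: {value:.2f}")
```

In this selection, Arabidopsis thaliana has by far the highest ratio, with about 14 distinct identifiers per pathway against fewer than 3 for Homo sapiens.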