Pathway knowledge extracted from the CORD-19 dataset for the fight against COVID-19. This work is published under the CC0 waiver to be freely used and redistributed.
Link to repo with more information: https://github.com/wikipathways/cord-19
Important Note: This notebook is under active development to present knowledge resources and tools to help tackle the COVID-19 outbreak. It is NOT a guide to public information. The content presented here has NOT been filtered or reviewed.
COVID-19 Open Research Dataset (CORD-19)
https://pages.semanticscholar.org/coronavirus-research
“In response to the COVID-19 pandemic, the Allen Institute for AI has partnered with leading research groups to prepare and distribute the COVID-19 Open Research Dataset (CORD-19), a free resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community.”
The PMC set of articles is based on this query.
In total there are 9996 unique PMCIDs in the CORD-19 collection (based on a filtering of the metadata file).
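A count like the one above can be obtained by keeping only the metadata rows that carry a PMCID and counting the distinct values. The following is a minimal sketch on a toy table, not the actual filtering script: the `pmcid` column name and the example rows are assumptions made for illustration.

```r
# Toy stand-in for the CORD-19 metadata table (hypothetical rows and column names).
metadata <- data.frame(
  title = c("Paper A", "Paper B", "Paper B (duplicate)", "Paper C"),
  pmcid = c("PMC0000001", "PMC0000002", "PMC0000002", NA),
  stringsAsFactors = FALSE
)

# Keep rows that have a PMCID, then count the distinct identifiers.
has.pmcid <- !is.na(metadata$pmcid) & metadata$pmcid != ""
unique.pmcids <- unique(metadata$pmcid[has.pmcid])
length(unique.pmcids)
```

With the real metadata file, the toy data frame would be replaced by the table read from the CSV, and the column name adjusted to whatever that file uses.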
Pathway Figure OCR (PFOCR)
Independently, the WikiPathways team, led by the Pico group at Gladstone, has mined pathway figures from PMC articles published over the past 25 years (1995-2019) using a combination of image queries and machine learning, arriving at a set of 64,643 pathway figure images published and indexed by PMC.
cord19.pfocr <- readRDS("cord19_pfocr.rds")
sprintf("Of the 9996 PMC papers in CORD-19, %i papers contain a total of %i pathway figures.", length(unique(cord19.pfocr$pmcid)), length(unique(cord19.pfocr$figid)))
[1] "Of the 9996 PMC papers in CORD-19, 189 papers contain a total of 221 pathway figures."
We then developed an entity recognition pipeline tailored for human gene mentions commonly found in pathway figures. This pipeline involves optical character recognition (OCR) followed by a series of normalizations and transformations applied to the OCR output while matching against a custom lexicon of human gene symbols. In addition to the genes we’ve recognized (as described below), we still have the raw OCR output as JSON, which may be of interest to the NLP community. We have collected the figure titles and captions associated with this set as well.
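The normalize-and-match idea described above can be illustrated with a small sketch. This is not the pipeline's actual code: the three-gene lexicon, the example OCR tokens and the helper `normalize_token` are hypothetical, and the real pipeline applies a longer series of transformations.

```r
# Hypothetical mini-lexicon: human gene symbols mapped to NCBI Gene (Entrez) IDs.
lexicon <- c(TNF = "7124", IL6 = "3569", STAT3 = "6774")

# Hypothetical raw OCR tokens from one figure.
ocr.tokens <- c("TNF-", "il6", "Stat3", "nucleus", "IL6")

# Normalize: uppercase, then drop anything that is not a letter or digit.
normalize_token <- function(x) gsub("[^A-Z0-9]", "", toupper(x))

normalized <- normalize_token(ocr.tokens)

# Keep tokens found in the lexicon and look up their identifiers.
hits <- normalized[normalized %in% names(lexicon)]
gene.mentions <- data.frame(
  symbol = hits,
  entrez = unname(lexicon[hits]),
  stringsAsFactors = FALSE
)
gene.mentions
```

Repeated tokens are kept as separate rows, which mirrors the distinction drawn below between total gene mentions and unique gene identifiers.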
cord19.pfocr.genes <- readRDS("cord19_pfocr_genes.rds")
sprintf("Of the %i pathway figures, we could identify one or more genes from %i of them among %i PMC papers.", length(unique(cord19.pfocr$figid)), length(unique(cord19.pfocr.genes$figid)), length(unique(cord19.pfocr.genes$pmcid)))
[1] "Of the 221 pathway figures, we could identify one or more genes from 216 of them among 184 PMC papers."
sprintf("These %i pathway figures contain a total of %i gene mentions mapping to %i unique gene identifiers (NCBI Gene Entrez IDs).", length(unique(cord19.pfocr.genes$figid)), length(cord19.pfocr.genes$symbol), length(unique(cord19.pfocr.genes$entrez)))
[1] "These 216 pathway figures contain a total of 4818 gene mentions mapping to 1523 unique gene identifiers (NCBI Gene Entrez IDs)."
library(dplyr)
library(tidyr)
library(ggplot2)
cord19.pfocr.genes.cnt <- cord19.pfocr.genes %>%
  group_by(hgnc_symbol) %>%
  summarise(count = n()) %>%
  arrange(desc(count), hgnc_symbol)
cord19.pfocr.genes.cnt$hgnc_symbol <- factor(cord19.pfocr.genes.cnt$hgnc_symbol, levels = cord19.pfocr.genes.cnt$hgnc_symbol)
# Plot the 40 most frequently mentioned genes (ties at the cutoff are kept).
p <- cord19.pfocr.genes.cnt %>%
  top_n(40, count) %>%
  ggplot(aes(hgnc_symbol, count)) +
  geom_bar(fill = "#CC6699", stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Most common gene mentions")
print(p)
We have performed disease enrichment analysis against these figure-based gene sets to characterize their collective functions. We used the filtered “knowledge” set from Jensen’s Diseases resource.
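The text does not say which statistical test was used for the enrichment analysis. One common choice for this kind of analysis is a hypergeometric over-representation test, sketched here under that assumption; all counts are hypothetical and chosen only to make the example run.

```r
# Hypothetical counts for one figure gene set and one disease term.
universe.size <- 20000  # genes in the background
disease.genes <- 150    # background genes annotated to the disease
figure.genes  <- 25     # genes recognized in the figure
overlap       <- 4      # figure genes annotated to the disease

# P(X >= overlap) under the hypergeometric distribution.
p.value <- phyper(overlap - 1, disease.genes, universe.size - disease.genes,
                  figure.genes, lower.tail = FALSE)
p.value
```

In practice such a test would be repeated for every figure and disease term pair, with a multiple-testing correction such as `p.adjust(..., method = "BH")` applied to the resulting p-values.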
cord19.pfocr.diseases <- readRDS("cord19_pfocr_diseases.rds")
cord19.pfocr.diseases.cnt <- cord19.pfocr.diseases %>%
  group_by(jensenknow7) %>%
  summarise(count = n()) %>%
  arrange(desc(count), jensenknow7)
cord19.pfocr.diseases.cnt$jensenknow7 <- factor(cord19.pfocr.diseases.cnt$jensenknow7, levels = cord19.pfocr.diseases.cnt$jensenknow7)
# Plot the 20 most frequently enriched disease terms (ties at the cutoff are kept).
p <- cord19.pfocr.diseases.cnt %>%
  top_n(20, count) %>%
  ggplot(aes(jensenknow7, count)) +
  geom_bar(fill = "#0073C2FF", stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Most common enriched disease terms")
print(p)
Beyond the overlap with cancer-related signaling pathways, we observe many autoimmune disease hits, such as Crohn’s disease, rheumatoid arthritis (RA), alopecia areata and lupus.