Human Protein Atlas in R (2024)

Laurent Gatto [1]

[1] de Duve Institute, UCLouvain, Belgium

1 May 2024

Abstract

The Human Protein Atlas (HPA) is a systematic study of the human proteome using antibody-based proteomics. Multiple tissues and cell lines are systematically assayed using affinity-purified antibodies and confocal microscopy. The hpar package is an R interface to the HPA project. It distributes several data sets, provides functionality to query them, and gives access to detailed information pages, including the confocal microscopy images available on the HPA web page.

Package

hpar 1.46.0

1.1 The HPA project

From the Human Protein Atlas (Uhlén et al. 2005; Uhlen et al. 2010) site:

The Swedish Human Protein Atlas project, funded by the Knut and Alice Wallenberg Foundation, has been set up to allow for a systematic exploration of the human proteome using Antibody-Based Proteomics. This is accomplished by combining high-throughput generation of affinity-purified antibodies with protein profiling in a multitude of tissues and cells assembled in tissue microarrays. Confocal microscopy analysis using human cell lines is performed for more detailed protein localisation. The program hosts the Human Protein Atlas portal with expression profiles of human proteins in tissues and cells.

The hpar package provides access to HPA data from the R interface. It also distributes the following data sets:

Several flat files are distributed by the HPA project and available within the package as data.frames; other datasets are available through a search query on the HPA website. The description below is taken from the HPA site:

  • hpaNormalTissue: Normal tissue data. Expression profiles for proteins in human tissues based on immunohistochemistry using tissue micro arrays. The tab-separated file includes Ensembl gene identifier (“Gene”), tissue name (“Tissue”), annotated cell type (“Cell type”), expression value (“Level”), and the gene reliability of the expression value (“Reliability”).

  • hpaNormalTissue16.1: Same as above, for version 16.1.

  • hpaCancer: Pathology data. Staining profiles for proteins in human tumor tissue based on immunohistochemistry using tissue micro arrays and log-rank P value for Kaplan-Meier analysis of correlation between mRNA expression level and patient survival. The tab-separated file includes Ensembl gene identifier (“Gene”), gene name (“Gene name”), tumor name (“Cancer”), the number of patients annotated for different staining levels (“High”, “Medium”, “Low” & “Not detected”) and log-rank p values for patient survival and mRNA correlation (“prognostic - favourable”, “unprognostic - favourable”, “prognostic - unfavourable”, “unprognostic - unfavourable”).

  • hpaCancer16.1: Same as above, for version 16.1.

  • rnaConsensusTissue: RNA consensus tissue gene data. Consensus transcript expression levels summarized per gene in 54 tissues based on transcriptomics data from HPA and GTEx. The consensus normalized expression (“nTPM”) value is calculated as the maximum nTPM value for each gene in the two data sources. For tissues with multiple sub-tissues (brain regions, lymphoid tissues and intestine) the maximum of all sub-tissues is used for the tissue type. The tab-separated file includes Ensembl gene identifier (“Gene”), analysed sample (“Tissue”) and normalized expression (“nTPM”).

  • rnaHpaTissue: RNA HPA tissue gene data. Transcript expression levels summarized per gene in 256 tissues based on RNA-seq. The tab-separated file includes Ensembl gene identifier (“Gene”), analysed sample (“Tissue”), transcripts per million (“TPM”), protein-transcripts per million (“pTPM”) and normalized expression (“nTPM”).

  • rnaGtexTissue: RNA GTEx tissue gene data. Transcript expression levels summarized per gene in 37 tissues based on RNA-seq. The tab-separated file includes Ensembl gene identifier (“Gene”), analysed sample (“Tissue”), transcripts per million (“TPM”), protein-transcripts per million (“pTPM”) and normalized expression (“nTPM”). The data was obtained from GTEx.

  • rnaFantomTissue: RNA FANTOM tissue gene data. Transcript expression levels summarized per gene in 60 tissues based on CAGE data. The tab-separated file includes Ensembl gene identifier (“Gene”), analysed sample (“Tissue”), tags per million (“Tags per million”), scaled-tags per million (“Scaled tags per million”) and normalized expression (“nTPM”). The data was obtained from FANTOM5.

  • rnaGeneTissue21.0: RNA HPA tissue gene data. Transcript expression levels summarized per gene in 37 tissues based on RNA-seq, for HPA version 21.0. The tab-separated file includes Ensembl gene identifier (“Gene”), analysed sample (“Tissue”), transcripts per million (“TPM”), protein-transcripts per million (“pTPM”).

  • rnaGeneCellLine: RNA HPA cell line gene data. Transcript expression levels summarized per gene in 69 cell lines. The tab-separated file includes Ensembl gene identifier (“Gene”), analysed sample (“Cell line”), transcripts per million (“TPM”), protein-coding transcripts per million (“pTPM”) and normalized expression (“nTPM”).

  • rnaGeneCellLine16.1: Same as above, for version 16.1.

  • hpaSubcellularLoc: Subcellular location data. Subcellular location of proteins based on immunofluorescently stained cells. The tab-separated file includes the following columns: Ensembl gene identifier (“Gene”), name of gene (“Gene name”), gene reliability score (“Reliability”), enhanced locations (“Enhanced”), supported locations (“Supported”), approved locations (“Approved”), uncertain locations (“Uncertain”), locations with single-cell variation in intensity (“Single-cell variation intensity”), locations with spatial single-cell variation (“Single-cell variation spatial”), locations with observed cell cycle dependency (type can be one or more of biological definition, custom data or correlation) (“Cell cycle dependency”), Gene Ontology Cellular Component term identifier (“GO id”).

  • hpaSubcellularLoc16.1: Same as above, for version 16.1.

  • hpaSubcellularLoc14: Same as above, for version 14.

  • hpaSecretome: Secretome data. The human secretome is here defined as all Ensembl genes with at least one predicted secreted transcript according to HPA predictions. The complete information about the HPA Secretome data is given on https://www.proteinatlas.org/humanproteome/blood/secretome. This dataset has 315 columns and includes the Ensembl gene identifier (“Gene”). Information about the additional variables can be found on the HPA site by clicking on Show/hide columns.

The hpar::allHparData() function returns a table listing all datasets (see below).
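The rnaConsensusTissue description above states that the consensus nTPM is the maximum of the nTPM values reported by the two sources. The following is a minimal sketch of that rule in base R. It uses invented toy values and hypothetical gene and tissue names, not HPA or GTEx data, and it is not the HPA pipeline itself.

```r
## Toy nTPM values for two genes in two tissues (invented numbers,
## not taken from HPA or GTEx).
hpa <- data.frame(Gene = c("geneA", "geneA", "geneB", "geneB"),
                  Tissue = c("liver", "kidney", "liver", "kidney"),
                  nTPM = c(10.2, 3.1, 0.0, 7.5))
gtex <- data.frame(Gene = c("geneA", "geneA", "geneB", "geneB"),
                   Tissue = c("liver", "kidney", "liver", "kidney"),
                   nTPM = c(8.4, 4.6, 1.2, 6.9))

## Align the two sources on gene and tissue, then keep the larger value.
both <- merge(hpa, gtex, by = c("Gene", "Tissue"),
              suffixes = c(".hpa", ".gtex"))
both$nTPM <- pmax(both$nTPM.hpa, both$nTPM.gtex)
consensus <- both[, c("Gene", "Tissue", "nTPM")]
consensus
```

The same pmax() idea would extend to tissues with several sub-tissues by first taking the maximum over the sub-tissues of each source.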

1.2 HPA data usage policy

The use of data and images from the HPA in publications and presentations is permitted provided that the following conditions are met:

  • The publication and/or presentation are solely for informational and non-commercial purposes.
  • The source of the data and/or image is referred to the HPA site (www.proteinatlas.org) and/or one or more of our publications are cited.

1.3 Installation

hpar is available through the Bioconductor project. Details about the package and the installation procedure can be found on its landing page. To install using the dedicated Bioconductor infrastructure, run:

## install BiocManager only once
install.packages("BiocManager")
## install hpar
BiocManager::install("hpar")

After installation, hpar will have to be explicitly loaded with

library("hpar")
## This is hpar version 1.46.0,
## based on the Human Protein Atlas
##   Version: 21.1
##   Release data: 2022.05.31
##   Ensembl build: 103.38
## See '?hpar' or 'vignette('hpar')' for details.

so that all the package’s functionality and data are available to the user.

2.1 Data sets

A table describing all datasets available in the package can be accessed with the allHparData() function.

hpa_data <- allHparData()

The Title variable corresponds to the names of the data sets, which can be downloaded locally and cached as part of the ExperimentHub infrastructure.

head(normtissue <- hpaNormalTissue())
## see ?hpar and browseVignettes('hpar') for documentation
## loading from cache
##              Gene Gene.name         Tissue          Cell.type        Level
## 1 ENSG00000000003    TSPAN6 adipose tissue         adipocytes Not detected
## 2 ENSG00000000003    TSPAN6  adrenal gland    glandular cells Not detected
## 3 ENSG00000000003    TSPAN6       appendix    glandular cells       Medium
## 4 ENSG00000000003    TSPAN6       appendix    lymphoid tissue Not detected
## 5 ENSG00000000003    TSPAN6    bone marrow hematopoietic cells Not detected
## 6 ENSG00000000003    TSPAN6         breast         adipocytes Not detected
##   Reliability
## 1    Approved
## 2    Approved
## 3    Approved
## 4    Approved
## 5    Approved
## 6    Approved

Note that because the HPA data is distributed as part of the ExperimentHub infrastructure, it is also possible to query ExperimentHub directly for relevant datasets.

library("ExperimentHub")
eh <- ExperimentHub()
query(eh, "hpar")
## ExperimentHub with 15 records
## # snapshotDate(): 2024-04-29
## # $dataprovider: Human Protein Atlas
## # $species: Homo sapiens
## # $rdataclass: data.frame
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH7765"]]'
##
##            title
##   EH7765 | hpaCancer16.1
##   EH7766 | hpaCancer
##   EH7767 | hpaNormalTissue16.1
##   EH7768 | hpaNormalTissue
##   EH7769 | hpaSecretome
##   ...      ...
##   EH7775 | rnaGeneCellLine16.1
##   EH7776 | rnaGeneCellLine
##   EH7777 | rnaGeneTissue21.0
##   EH7778 | rnaGtexTissue
##   EH7779 | rnaHpaTissue

2.2 HPA interface

Each data set described above is a data.frame and can be easily manipulated using standard R or tidyverse functionality.

names(normtissue)
## [1] "Gene"        "Gene.name"   "Tissue"      "Cell.type"   "Level"
## [6] "Reliability"
## Number of genes
length(unique(normtissue$Gene))
## [1] 15323
## Number of cell types
length(unique(normtissue$Cell.type))
## [1] 141
## Number of tissues
length(unique(normtissue$Tissue))
## [1] 63
## Number of genes highly and reliably expressed in each cell type
## in each tissue.
library("dplyr")
normtissue |>
    filter(Reliability == "Approved",
           Level == "High") |>
    count(Cell.type, Tissue) |>
    arrange(desc(n)) |>
    head()
##          Cell.type          Tissue    n
## 1  glandular cells     gallbladder 1578
## 2  glandular cells        duodenum 1551
## 3  glandular cells small intestine 1532
## 4  glandular cells          rectum 1490
## 5  glandular cells           colon 1404
## 6 cells in tubules          kidney 1357
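Since the text mentions standard R as an alternative to the tidyverse, here is a possible base R version of the same filter, count and sort steps. It is only a sketch run on an invented toy data.frame that borrows the column names of hpaNormalTissue(); the gene names and values are hypothetical.

```r
## Toy data with the same column names as hpaNormalTissue() (invented values).
toy <- data.frame(
  Gene = c("g1", "g2", "g3", "g1", "g2", "g3"),
  Tissue = c("colon", "colon", "colon", "kidney", "kidney", "kidney"),
  Cell.type = c("glandular cells", "glandular cells", "glandular cells",
                "cells in tubules", "cells in tubules", "cells in tubules"),
  Level = c("High", "High", "Low", "High", "Not detected", "High"),
  Reliability = c("Approved", "Approved", "Approved",
                  "Approved", "Approved", "Uncertain"))

## Keep the highly and reliably expressed rows ...
keep <- toy[toy$Reliability == "Approved" & toy$Level == "High", ]

## ... count them per cell type and tissue ...
counts <- aggregate(list(n = keep$Gene),
                    by = list(Cell.type = keep$Cell.type,
                              Tissue = keep$Tissue),
                    FUN = length)

## ... and sort by decreasing count.
counts <- counts[order(counts$n, decreasing = TRUE), ]
counts
```

On the real data, replacing toy with normtissue should give the same kind of table as the dplyr pipeline, with the count column named n.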

We will illustrate additional datasets using the TSPAN6 (tetraspanin 6) gene (ENSG00000000003) as an example.

id <- "ENSG00000000003"
subcell <- hpaSubcellularLoc()
## see ?hpar and browseVignettes('hpar') for documentation
## loading from cache
rna <- rnaGeneCellLine()
## see ?hpar and browseVignettes('hpar') for documentation
## loading from cache
## Combine protein immunohistochemistry data with the subcellular
## location and RNA expression levels.
filter(normtissue, Gene == id) |>
    full_join(filter(subcell, Gene == id)) |>
    full_join(filter(rna, Gene == id)) |>
    head()
## Joining with `by = join_by(Gene, Gene.name, Reliability)`
## Joining with `by = join_by(Gene, Gene.name)`
## Warning in full_join(full_join(filter(normtissue, Gene == id), filter(subcell, : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
##              Gene Gene.name         Tissue  Cell.type        Level Reliability
## 1 ENSG00000000003    TSPAN6 adipose tissue adipocytes Not detected    Approved
## 2 ENSG00000000003    TSPAN6 adipose tissue adipocytes Not detected    Approved
## 3 ENSG00000000003    TSPAN6 adipose tissue adipocytes Not detected    Approved
## 4 ENSG00000000003    TSPAN6 adipose tissue adipocytes Not detected    Approved
## 5 ENSG00000000003    TSPAN6 adipose tissue adipocytes Not detected    Approved
## 6 ENSG00000000003    TSPAN6 adipose tissue adipocytes Not detected    Approved
##            Main.location       Additional.location Extracellular.location
## 1 Cell Junctions;Cytosol Nucleoli fibrillar center
## 2 Cell Junctions;Cytosol Nucleoli fibrillar center
## 3 Cell Junctions;Cytosol Nucleoli fibrillar center
## 4 Cell Junctions;Cytosol Nucleoli fibrillar center
## 5 Cell Junctions;Cytosol Nucleoli fibrillar center
## 6 Cell Junctions;Cytosol Nucleoli fibrillar center
##   Enhanced Supported                                         Approved Uncertain
## 1                    Cell Junctions;Cytosol;Nucleoli fibrillar center
## 2                    Cell Junctions;Cytosol;Nucleoli fibrillar center
## 3                    Cell Junctions;Cytosol;Nucleoli fibrillar center
## 4                    Cell Junctions;Cytosol;Nucleoli fibrillar center
## 5                    Cell Junctions;Cytosol;Nucleoli fibrillar center
## 6                    Cell Junctions;Cytosol;Nucleoli fibrillar center
##   Single.cell.variation.intensity Single.cell.variation.spatial
## 1                         Cytosol
## 2                         Cytosol
## 3                         Cytosol
## 4                         Cytosol
## 5                         Cytosol
## 6                         Cytosol
##   Cell.cycle.dependency
## 1
## 2
## 3
## 4
## 5
## 6
##                                                                                     GO.id
## 1 Cell Junctions (GO:0030054);Cytosol (GO:0005829);Nucleoli fibrillar center (GO:0001650)
## 2 Cell Junctions (GO:0030054);Cytosol (GO:0005829);Nucleoli fibrillar center (GO:0001650)
## 3 Cell Junctions (GO:0030054);Cytosol (GO:0005829);Nucleoli fibrillar center (GO:0001650)
## 4 Cell Junctions (GO:0030054);Cytosol (GO:0005829);Nucleoli fibrillar center (GO:0001650)
## 5 Cell Junctions (GO:0030054);Cytosol (GO:0005829);Nucleoli fibrillar center (GO:0001650)
## 6 Cell Junctions (GO:0030054);Cytosol (GO:0005829);Nucleoli fibrillar center (GO:0001650)
##   Cell.line  TPM pTPM nTPM
## 1     A-431 21.3 25.6 24.8
## 2      A549 20.5 24.4 23.0
## 3      AF22 78.8 96.9 80.6
## 4    AN3-CA 38.7 47.1 42.2
## 5  ASC diff 19.2 22.1 28.7
## 6 ASC TERT1 11.3 13.1 16.6
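The many-to-many warning above arises because the tissue table and the cell line table each hold several rows for the gene, so every tissue row is paired with every cell line row. That is why the first six rows repeat the same tissue and differ only in the cell line. The sketch below reproduces the effect with base R merge() on invented toy tables (hypothetical gene, tissue and cell line names), and shows one possible way to avoid it by summarising one side first; whether a mean over cell lines is a sensible summary is an assumption that depends on the analysis.

```r
## Toy tables for a single gene (invented values): 3 tissue rows and
## 2 cell-line rows that share only the gene identifier.
tissue <- data.frame(Gene = "gene1",
                     Tissue = c("colon", "kidney", "liver"),
                     Level = c("High", "Low", "Not detected"))
cellline <- data.frame(Gene = "gene1",
                       Cell.line = c("lineA", "lineB"),
                       nTPM = c(24.8, 23.0))

## Joining on Gene pairs every tissue row with every cell-line row.
joined <- merge(tissue, cellline, by = "Gene", all = TRUE)
nrow(joined)  # 3 tissue rows times 2 cell-line rows

## Summarising the cell-line table to one row per gene first keeps the
## result at one row per tissue.
rna_summary <- aggregate(list(nTPM = cellline$nTPM),
                         by = list(Gene = cellline$Gene),
                         FUN = mean)
per_tissue <- merge(tissue, rna_summary, by = "Gene", all.x = TRUE)
per_tissue
```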

It is also possible to directly open the HPA page for a specific gene (see figure below).

browseHPA(id)

The HPA web page for the tetraspanin 6 gene (ENSG00000000003).

2.3 HPA release information

Information about the HPA release used to build the installed hpar package can be accessed with getHpaVersion, getHpaDate and getHpaEnsembl. Full release details can be found on the HPA release history page.

getHpaVersion()
## version
##  "21.1"
getHpaDate()
##         date
## "2022.05.31"
getHpaEnsembl()
##  ensembl
## "103.38"
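Because these accessors return character strings, comparing releases as plain text can mislead (as strings, "9.0" sorts after "21.1"). A small sketch of a safer comparison with base R follows. The values are copied from the output above rather than obtained by calling hpar, and the minimum release "21.0" is an arbitrary example threshold.

```r
## Values copied from the output shown above (not queried from hpar here).
hpa_version <- c(version = "21.1")
hpa_date <- c(date = "2022.05.31")

## Compare releases numerically, component by component.
is_recent <- numeric_version(hpa_version) >= "21.0"
is_recent

## Plain string comparison would get this one wrong.
numeric_version("9.0") < numeric_version("21.1")

## Parse the release date into a Date object.
release <- as.Date(hpa_date, format = "%Y.%m.%d")
format(release)
```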

Let’s compare the subcellular localisation annotation obtained from the HPA subcellular location data set and the information available in the Bioconductor annotation packages.

id <- "ENSG00000001460"
filter(subcell, Gene == id)
##              Gene Gene.name Reliability Main.location Additional.location
## 1 ENSG00000001460     STPG1    Approved   Nucleoplasm
##   Extracellular.location Enhanced Supported    Approved Uncertain
## 1                                           Nucleoplasm
##   Single.cell.variation.intensity Single.cell.variation.spatial
## 1
##   Cell.cycle.dependency                    GO.id
## 1                       Nucleoplasm (GO:0005654)

Below, we first extract all cellular component GO terms available for id from the org.Hs.eg.db human annotation and then retrieve their term definitions using the GO.db database.

library("org.Hs.eg.db")
library("GO.db")
ans <- AnnotationDbi::select(org.Hs.eg.db, keys = id,
                             columns = c("ENSEMBL", "GO", "ONTOLOGY"),
                             keytype = "ENSEMBL")
## 'select()' returned 1:many mapping between keys and columns
ans <- ans[ans$ONTOLOGY == "CC", ]
ans
##           ENSEMBL         GO EVIDENCE ONTOLOGY
## 2 ENSG00000001460 GO:0005634      IEA       CC
## 3 ENSG00000001460 GO:0005739      IEA       CC
sapply(as.list(GOTERM[ans$GO]), slot, "Term")
##      GO:0005634      GO:0005739
##       "nucleus" "mitochondrion"
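To put the two annotations side by side, the GO accessions first have to be pulled out of the HPA GO.id strings. The sketch below does this with a regular expression in base R. The strings and accessions are copied from the outputs shown in this document, and extract_go is a hypothetical helper that is not part of hpar.

```r
## GO.id strings copied from the outputs shown in this document.
go_stpg1 <- "Nucleoplasm (GO:0005654)"
go_tspan6 <- paste0("Cell Junctions (GO:0030054);Cytosol (GO:0005829);",
                    "Nucleoli fibrillar center (GO:0001650)")

## Hypothetical helper: pull every GO accession out of one GO.id string.
extract_go <- function(x) regmatches(x, gregexpr("GO:[0-9]{7}", x))[[1]]

hpa_go <- extract_go(go_stpg1)
## Cellular component accessions returned by org.Hs.eg.db above.
annot_go <- c("GO:0005634", "GO:0005739")

intersect(hpa_go, annot_go)  # accessions found in both sources
setdiff(hpa_go, annot_go)    # accessions reported by HPA only
extract_go(go_tspan6)        # works for multi-location strings too
```

Note that exact accession matching ignores the structure of the ontology: the nucleoplasm reported by HPA is a sub-compartment of the nucleus reported by org.Hs.eg.db, so an empty intersection here does not necessarily mean the two sources disagree.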
sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods
## [8] base
##
## other attached packages:
##  [1] ExperimentHub_2.12.0 AnnotationHub_3.12.0 BiocFileCache_2.12.0
##  [4] dbplyr_2.5.0         hpar_1.46.0          GO.db_3.19.1
##  [7] org.Hs.eg.db_3.19.1  AnnotationDbi_1.66.0 IRanges_2.38.0
## [10] S4Vectors_0.42.0     Biobase_2.64.0       BiocGenerics_0.50.0
## [13] dplyr_1.1.4          BiocStyle_2.32.0
##
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.44.0         xfun_0.43               bslib_0.7.0
##  [4] htmlwidgets_1.6.4       crosstalk_1.2.1         vctrs_0.6.5
##  [7] tools_4.4.0             generics_0.1.3          curl_5.2.1
## [10] tibble_3.2.1            fansi_1.0.6             RSQLite_2.3.6
## [13] blob_1.2.4              pkgconfig_2.0.3         lifecycle_1.0.4
## [16] GenomeInfoDbData_1.2.12 compiler_4.4.0          Biostrings_2.72.0
## [19] GenomeInfoDb_1.40.0     htmltools_0.5.8.1       sass_0.4.9
## [22] yaml_2.3.8              pillar_1.9.0            crayon_1.5.2
## [25] jquerylib_0.1.4         DT_0.33                 cachem_1.0.8
## [28] mime_0.12               tidyselect_1.2.1        digest_0.6.35
## [31] purrr_1.0.2             bookdown_0.39           BiocVersion_3.19.1
## [34] fastmap_1.1.1           cli_3.6.2               magrittr_2.0.3
## [37] utf8_1.2.4              withr_3.0.0             filelock_1.0.3
## [40] UCSC.utils_1.0.0        rappdirs_0.3.3          bit64_4.0.5
## [43] rmarkdown_2.26          XVector_0.44.0          httr_1.4.7
## [46] bit_4.0.5               png_0.1-8               memoise_2.0.1
## [49] evaluate_0.23           knitr_1.46              rlang_1.1.3
## [52] glue_1.7.0              DBI_1.2.2               BiocManager_1.30.22
## [55] jsonlite_1.8.8          R6_2.5.1                zlibbioc_1.50.0

References

Uhlen, Mathias, Per Oksvold, Linn Fagerberg, Emma Lundberg, Kalle Jonasson, Mattias Forsberg, Martin Zwahlen, et al. 2010. “Towards a knowledge-based Human Protein Atlas.” Nature Biotechnology 28 (12): 1248–50. https://doi.org/10.1038/nbt1210-1248.

Uhlén, Mathias, Erik Björling, Charlotta Agaton, Cristina Al-Khalili A. Szigyarto, Bahram Amini, Elisabet Andersen, Ann-Catrin C. Andersson, et al. 2005. “A human protein atlas for normal and cancer tissues based on antibody proteomics.” Molecular & Cellular Proteomics : MCP 4 (12): 1920–32. https://doi.org/10.1074/mcp.M500279-MCP200.
