HPA Subcellular dataset

Do you want to explore or work with the HPA Subcellular atlas images? Tired of getting one-liners as answers when you ask about them? Looking for more comprehensive information than what you get when you are redirected to a GitHub URL? Don't worry: some of us have got your back!

This growing document will guide you through all the aspects of the HPA Subcellular dataset: what it is, where you can get/access it, and how you can use it or generate your own data from it.

Subcellular atlas

You can find and play with the HPA Subcellular atlas by entering the "Subcellular" section of the HPA website: https://www.proteinatlas.org/ . There you will find a lot of fascinating information (particularly if you have biology knowledge) that we are not going to address here. What we are interested in is the dataset that feeds that website: immunofluorescence (IF) images and a huge CSV file collecting all the information gathered about them.

How to access the HPA Subcellular data from the HPA

IMPORTANT: you don't need to access the HPA data directly. All relevant files have already been downloaded and placed in our lab storage. The following information is mostly there to help you understand where it all comes from.

To access the data directly from the HPA servers, you need to request access to:
- https://lims.proteinatlas.org/: the HPA annotation software. You have to ask Emma Lundberg for permission and contact the HPA personnel (Kalle von Feilitzen). You also NEED to be connected to the Scilifelab VPN to access LIMS.
- https://if.proteinatlas.org/: you need to ask for separate permission to access the files. Again, contact the HPA personnel (Kalle von Feilitzen). You DON'T need to be connected to the Scilifelab VPN to access this data.

You can check your access directly in a browser at https://if.proteinatlas.org/ . There you will find a bunch of folders with images and the IF-image.csv file with all the metadata about them.

Normally you will not need to access the HPA servers on your own: we have already downloaded most of the dataset to our lab storage, and you can work with it faster and more efficiently there. You can request access to the lab storage by following the instructions in the file storage_information.pdf: the project/dataset name is HPA.

`IF-image.csv` file

This file contains all relevant information about the dataset images in CSV format. You will always find the IF-image.csv file in the base folder of the dataset. The most regularly used columns are:

filename: the location of the file in the dataset (replace the /archive/ part with /images/). This is the prefix for the blue (nuclei), red (microtubules), yellow (ER) and green (protein) channels of each image (Note: check Channel_scheme below).
status: the status of the image according to the annotation process in LIMS. Check the file Overview_data_and_indentifiers_HPA.pdf in the util folder for a lot of interesting information about how this works.
locations: in which subcellular compartments the protein is localized. This has been annotated by humans. Check the file SOP _annotation _2019.pdf in the util folder to understand the different categories.
unspecific: there are several reasons why an image might be flagged as unspecific. Check the files Overview_data_and_indentifiers_HPA.pdf and SOP _annotation _2019.pdf in the util folder to learn more.
antibody, ensembl_ids: related antibody and Ensembl gene ids for the image. Check the file Overview_data_and_indentifiers_HPA.pdf in the util folder for a more detailed explanation.
atlas_name: cell line. Check the file SOP _annotation _2019.pdf for the list of cell lines.
versions, earliest_version, latest_version: the HPA is a growing initiative that constantly adds new images to the dataset. A controlled batch of new images is added periodically, increasing the dataset version. At the time of writing, the current dataset version is v24. You can see when an image was added or re-annotated by looking at these 3 columns.
x, y, z: coordinates of the image in the well plate.
cell_count: number of cells segmented by the HPA cell segmentation model ( https://github.com/CellProfiling/HPA-Cell-Segmentation ).
Channel scheme: channel scheme of the images. Not all images follow the same scheme as the classical subcellular images (blue - nuclei, red - microtubules, yellow - ER and green - protein); the "Cilia" scheme, for example, is different.

This file is big (>600MB), so if you want to work with a subset of its images you can use a Python library like pandas to filter out what you need. The filtering can be complex: feel free to ask your colleagues or the HPA personnel (Kalle von Feilitzen) for advice.
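As a starting point, here is a minimal sketch of such a filter. It uses only the standard library csv module, so the file is streamed row by row instead of being loaded whole; the same logic can be expressed with pandas if you prefer. The comma delimiter, the way the values are written (status as a plain number, "U2OS" as a cell line name) and the sample rows are assumptions made for illustration, so check them against the real IF-image.csv.

```python
import csv
import io


def filter_rows(csv_file, atlas_name, status="35",
                columns=("filename", "locations", "atlas_name", "status")):
    """Stream the CSV and keep rows matching one cell line and one status.

    Assumption: values are compared as plain strings, e.g. status "35".
    """
    reader = csv.DictReader(csv_file)
    selected = []
    for row in reader:
        if row["atlas_name"] == atlas_name and row["status"] == status:
            # Keep only the requested columns to limit memory use.
            selected.append({name: row[name] for name in columns})
    return selected


# Tiny in-memory stand-in for IF-image.csv (made-up rows, NOT real data).
sample = io.StringIO(
    "filename,status,locations,atlas_name\n"
    "/archive/1/1_A1_1_,35,Nucleoplasm,U2OS\n"
    "/archive/1/1_A1_2_,20,Cytosol,U2OS\n"
    "/archive/2/2_B3_1_,35,Mitochondria,A-431\n"
)
rows = filter_rows(sample, atlas_name="U2OS")
print(rows)

# With the real file you would do something like:
# with open("IF-image.csv", newline="") as handle:
#     rows = filter_rows(handle, atlas_name="U2OS")
```

Restricting the output to a few columns keeps the result small even when many rows match.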

Public image files

In the images folder you will find all the publicly released HPA images (state 35, check Overview_data_and_indentifiers_HPA.pdf in the util folder).

The original images on the HPA servers are stored in compressed TIFF format. We have converted them to a more convenient uncompressed 16-bit PNG format. Although most of the images are 2048 x 2048, some old plates have a 1728 x 1728 resolution, so you have to account for both sizes.
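If you want to find out which resolution a given file has without decoding the whole image, one option is to read the PNG header directly. The sketch below relies only on the standard PNG layout (8-byte signature followed by the IHDR chunk); the header bytes in the example are built by hand for illustration and do not come from the dataset.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# The two resolutions mentioned above.
EXPECTED_SIZES = {(2048, 2048), (1728, 1728)}


def read_png_header(data):
    """Return (width, height, bit_depth) from the first bytes of a PNG file."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # IHDR data starts at byte 16: width (4 bytes), height (4), bit depth (1).
    width, height, bit_depth = struct.unpack(">IIB", data[16:25])
    return width, height, bit_depth


# Hand-built header of a 1728 x 1728, 16-bit grayscale PNG (illustration only).
fake_header = (
    PNG_SIGNATURE
    + struct.pack(">I", 13)  # IHDR data length
    + b"IHDR"
    + struct.pack(">IIBBBBB", 1728, 1728, 16, 0, 0, 0, 0)
)
width, height, bit_depth = read_png_header(fake_header)
print(width, height, bit_depth, (width, height) in EXPECTED_SIZES)

# With a real file you would read just the first bytes:
# with open(path, "rb") as handle:
#     width, height, bit_depth = read_png_header(handle.read(32))
```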

For each image prefix you will find a blue (nuclei), red (microtubules), yellow (ER) and green (protein) image channel, for the "Normal" scheme.
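To make the prefix idea concrete, here is a small sketch that turns a filename value from IF-image.csv into the four channel paths. The /archive/ to /images/ replacement comes from the column description above; everything else is an assumption for illustration: the dataset_root value is hypothetical, and the file naming (prefix + color + ".png") should be checked against the actual folder contents.

```python
# Channel colors of the "Normal" scheme and what they stain.
CHANNELS = {
    "blue": "nuclei",
    "red": "microtubules",
    "yellow": "ER",
    "green": "protein",
}


def channel_paths(filename_prefix, dataset_root="/path/to/HPA", extension=".png"):
    """Map each channel color to its assumed image path.

    Assumption: files are named <prefix><color><extension>.
    """
    relative = filename_prefix.replace("/archive/", "/images/", 1)
    return {
        color: f"{dataset_root}{relative}{color}{extension}"
        for color in CHANNELS
    }


# Hypothetical prefix, only to show the shape of the result.
paths = channel_paths("/archive/1/1_A1_1_")
for color, path in paths.items():
    print(color, CHANNELS[color], path)
```

Remember that images with a different channel scheme (for example "Cilia") will not match this mapping.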

Additional image works

There are other folders where you can find several items produced from the original images:

segmentations: this folder contains all nuclei and cell segmentation masks created using the HPACellSegmentator software.
crops: this folder contains a single-cell crop for each cell in the above-mentioned cell segmentations. In detail:
Each cell mask crop of size 1024 x 1024 (this is just the mask)
Each cell image (one per channel) crop of size 1024 x 1024 (this is the original image cropped with the cell at its center)
Each cell image masked (one per channel) crop of size 1024 x 1024 (this is the original image cropped with the cell at its center AND masked by the cell shape)
A general crop_info.csv in the crops base folder where you can find the cell position in the crop, for all cells.
v25: the images/segmentations/crops newly added/modified in the v25 version. All its contents have been merged into the aforementioned general folders; it is just a way to have direct access to the latest version of the HPA in case you want to compare with older versions you might have already downloaded.
private: this folder contains annotated but unpublished images for v23. Note: none of this dataset is to be shared with anyone outside Lundberg's lab, but this particular subset is even more sensitive in this regard, as it is data not available anywhere else and MUST be kept private.
micronuclei: this folder contains micronuclei detection, segmentation and quantification for all the v24 HPA images. In detail:
The segmentations folder contains all micronuclei segmentations for each FOV of the HPA v24 dataset. If the FOV name is not there, no micronuclei were segmented.
A general micronuclei_info.csv in the micronuclei base folder where you can find the data for each micronucleus. This data contains positional, morphological and intensity information about the micronuclei and their related nuclei.
other: a catchall folder with information that some people deemed useful before things were organized properly. Go there and explore at your own risk!
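Going back to the crops folder: the idea of "the original image cropped with the cell at its center" and then "masked by the cell shape" can be sketched as below. This is not the code that produced the crops. In particular, zero padding where the window runs over the image border is an assumption, and nested lists are used only to keep the example dependency-free (for real images you would use array slicing instead).

```python
def crop_window(center, image_size, crop_size=1024):
    """Return (src_start, src_stop, dst_start) along one axis."""
    start = center - crop_size // 2
    stop = start + crop_size
    src_start = max(start, 0)
    src_stop = min(stop, image_size)
    # Offset inside the crop where the copied region begins.
    return src_start, src_stop, src_start - start


def crop_centered(image, center_row, center_col, crop_size=1024, fill=0):
    """Crop a square window around a cell center, padding with `fill`."""
    r0, r1, dr = crop_window(center_row, len(image), crop_size)
    c0, c1, dc = crop_window(center_col, len(image[0]), crop_size)
    crop = [[fill] * crop_size for _ in range(crop_size)]
    for r in range(r0, r1):
        for c in range(c0, c1):
            crop[dr + r - r0][dc + c - c0] = image[r][c]
    return crop


def apply_mask(crop, mask):
    """Zero out every pixel that is outside the cell mask."""
    return [
        [value if inside else 0 for value, inside in zip(crop_row, mask_row)]
        for crop_row, mask_row in zip(crop, mask)
    ]


# Toy 4 x 4 "image" with values 1..16 and a 2 x 2 crop (real crops are 1024).
image = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
inner = crop_centered(image, 2, 2, crop_size=2)
corner = crop_centered(image, 0, 0, crop_size=2)
masked = apply_mask(inner, [[1, 0], [1, 1]])
print(inner, corner, masked)
```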

Utility scripts

util/download_simple.py: this utility script allows you to download the IF-image.csv, do some filtering over it and download all resulting images, all directly from the HPA servers. It's mainly used to update the dataset to the latest version when it's published.

WebDAV (lab storage) scripts

Although they work on their own, most of these scripts are guidelines in case you want to integrate direct connection to the lab storage into your own scripts.

util/webdav_download_file_list.py: this script allows you to download all paths included in a file list from the lab storage. Take it as an example; you can modify it to fit your own needs.
util/webdav_download_folder.py: this script allows you to download a folder (and all its contents) from the lab storage. Take it as an example; you can modify it to fit your own needs.
util/webdav_stream_file_list.py: this script serves as an example of how you could "stream" the data file by file from the lab storage. You can of course run multiple streaming lists in parallel, keep a certain number of files before removing them, etc.
util/webdav_upload_folder.py: this script allows you to upload the contents of a folder (it does not create the base folder itself) to a pre-existing folder in the lab storage. Take it as an example; you can modify it to fit your own needs.
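To illustrate the "streaming" idea (download a file, use it, and keep only a limited number of local copies), here is a minimal sketch. It is not the content of util/webdav_stream_file_list.py: the stream_files function, its parameters and the example paths are hypothetical, and the actual WebDAV download is replaced by a stand-in fetch function so the example runs without any connection.

```python
import os
import tempfile
from collections import deque


def stream_files(remote_paths, fetch, process, cache_dir, keep=2):
    """Fetch files one by one, process each, keep at most `keep` local copies.

    Assumption: remote file names are unique, since only the base name
    is used for the local copy.
    """
    cached = deque()
    for remote_path in remote_paths:
        local_path = os.path.join(cache_dir, os.path.basename(remote_path))
        fetch(remote_path, local_path)
        cached.append(local_path)
        process(local_path)
        # Evict the oldest files once the local cache is over its limit.
        while len(cached) > keep:
            os.remove(cached.popleft())
    return list(cached)


def fake_fetch(remote_path, local_path):
    """Stand-in for a real WebDAV download: writes dummy bytes locally."""
    with open(local_path, "wb") as handle:
        handle.write(remote_path.encode("utf-8"))


# Hypothetical remote paths, only for illustration.
remote = ["/HPA/images/1/a_blue.png", "/HPA/images/1/a_red.png",
          "/HPA/images/1/a_green.png"]
processed = []

with tempfile.TemporaryDirectory() as tmp:
    kept = stream_files(remote, fake_fetch,
                        lambda path: processed.append(os.path.basename(path)),
                        tmp, keep=2)
    remaining = sorted(os.listdir(tmp))

print(processed, remaining)
```

In your own scripts you would swap fake_fetch for a function that downloads from the lab storage, and process for whatever you want to do with each file.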

Additional resources

You might want to check these other resources related to several items produced from the HPA dataset:

HPACellSegmentatorPortable [ https://github.com/CellProfiling/HPACellSegmentatorPortable ]: this is the code used to segment the HPA images (it can generate the derived crops if you want).
cell_cropper [ https://github.com/CellProfiling/ell_code_template/tree/master/examples/cell_cropper ]: this is the code to generate just the crops from any arbitrary set of images/segmentation masks.
cellpose_segmentation [ https://github.com/CellProfiling/ell_code_template/tree/master/examples/cellpose_segmentation ]: this is the code for a newer segmentation method, in case you want a different segmentation for the HPA images.
micronuclei_segmentation [ https://github.com/CellProfiling/ell_code_template/tree/master/examples/micronuclei_segmentation ]: this is the code to detect and segment micronuclei in HPA-like images.
micronuclei_quantification [ https://github.com/CellProfiling/ell_code_template/tree/master/project_utils/micronuclei ]: this is the code to calculate the information related to the segmented micronuclei.