1 Downloading Data

The first step in the project is obtaining the data that we will analyze, as well as processing it so it is in the format we need for further study.

1.1 Downloading through recount 3

recount3 is a project that provides access to over 750,000 publicly available human and mouse RNA-seq samples, all uniformly processed. The original documentation can be found here.

For this project, we will be working with BRCA data from The Cancer Genome Atlas Program (TCGA). We will begin by loading the recount3 package and its dependencies.

## Load recount3 R package
library("recount3")

The data available in the recount3 project can be explored here. To download the data we are interested in, we will use the following code.

# Create RangedSummarizedExperiment (RSE) object
rse_BRCA <- recount3::create_rse_manual(
              project = "BRCA",
              project_home = "data_sources/tcga",
              organism = "human",
              annotation = "gencode_v29",
              type = "gene"
          )

rse_BRCA
## class: RangedSummarizedExperiment 
## dim: 64837 1256 
## metadata(8): time_created recount3_version ... annotation recount3_url
## assays(1): raw_counts
## rownames(64837): ENSG00000278704.1 ENSG00000277400.1 ... ENSG00000182484.15_PAR_Y ENSG00000227159.8_PAR_Y
## rowData names(10): source type ... havana_gene tag
## colnames(1256): 0b1cf6de-f98f-4020-baa7-c4fc518e5f00 20c90e97-e900-488e-af6a-6648d9a8bb01 ...
##   a340e38d-af36-4b13-add3-50be6819d4fe a79dd5b2-f97a-4b24-b762-fea1cd5238fe
## colData names(937): rail_id external_id ... recount_seq_qc.errq BigWigURL

This data set includes 63856 genes, and 1256 samples.

1.2 Obtaining read counts

The RSE object contains the raw base-pair counts, to compute the read counts, we use the following code.

# Saving read counts as an assay in the rse object
assay(rse_BRCA, "counts") <- compute_read_counts(rse_BRCA)

1.3 Exploring data

Our data is stored as a RangedSummarizedExperiment object, which includes data for both the samples (columns), and the genes (rows), as well as metadata and the assays (counts, in this case). We will first perform quality control, and then select some sample attributes for our model.

# Number of samples and number of sample attributes 
dim(colData(rse_BRCA))
## [1] 1256  937

For our 1256 samples, there are 937 columns, corresponding to 937 possible attributes to include in our model. To reduce the options, we will first filter by removing those attributes with only NA values.

# Keep only non-empty columns
colData(rse_BRCA) <- colData(rse_BRCA)[, colSums(is.na(colData(rse_BRCA))) != nrow(colData(rse_BRCA))]

dim(colData(rse_BRCA))
## [1] 1256  388

This leaves us with 388 columns. These attributes have terms separated by periods, we can explore the source of each attribute by selecting the first keyword of each one to see what kind of information we have.

# Split each string by '.'
split_col_names <- strsplit(colnames(colData(rse_BRCA)), ".", fixed=TRUE)

# Keep only first element 
split_col_names <- sapply(split_col_names, getElement, 1)

# Counting the results
table(split_col_names)
## split_col_names
##       BigWigURL     external_id         rail_id recount_project      recount_qc  recount_seq_qc           study 
##               1               1               1               5              79              12               1 
##            tcga 
##             288

Our main sources of information are from recount and from tcga.

1.4 Quality control

Let’s look at both recount_qc and recount_seq_qc for quality control. The meaning of the different fields can be found in the documentation.

colnames(colData(rse_BRCA))[grepl('^recount_seq_qc', colnames(colData(rse_BRCA)))]
##  [1] "recount_seq_qc.min_len"                  "recount_seq_qc.max_len"                 
##  [3] "recount_seq_qc.avg_len"                  "recount_seq_qc.#distinct_quality_values"
##  [5] "recount_seq_qc.#bases"                   "recount_seq_qc.%a"                      
##  [7] "recount_seq_qc.%c"                       "recount_seq_qc.%g"                      
##  [9] "recount_seq_qc.%t"                       "recount_seq_qc.%n"                      
## [11] "recount_seq_qc.avgq"                     "recount_seq_qc.errq"
head(colnames(colData(rse_BRCA))[grepl('^recount_qc', colnames(colData(rse_BRCA)))], 10)
##  [1] "recount_qc.aligned_reads%.chrm"                 "recount_qc.aligned_reads%.chrx"                
##  [3] "recount_qc.aligned_reads%.chry"                 "recount_qc.bc_auc.all_reads_all_bases"         
##  [5] "recount_qc.bc_auc.all_reads_annotated_bases"    "recount_qc.bc_auc.unique_reads_all_bases"      
##  [7] "recount_qc.bc_auc.unique_reads_annotated_bases" "recount_qc.bc_auc.all_%"                       
##  [9] "recount_qc.bc_auc.unique_%"                     "recount_qc.bc_frag.count"
# The field `recount_seq_qc.avgq` corresponds to the weighted average over Phred quality scores
summary(rse_BRCA$recount_seq_qc.avgq)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.10   36.90   37.40   37.31   37.90   39.00

All Phred scores are above 20, meaning all samples are estimated to be at least 99% accurate.

# Exploring unmapped reads 
summary(rse_BRCA$'recount_qc.star.%_of_reads_unmapped:_other')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01000 0.03000 0.04000 0.05249 0.05000 1.31000
summary(rse_BRCA$'recount_qc.star.%_of_reads_unmapped:_too_many_mismatches')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0       0
summary(rse_BRCA$'recount_qc.star.%_of_reads_unmapped:_too_short')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.030   4.990   5.650   6.319   6.522  56.120

There are few unmapped reads, except for some that are too short.

# Exploring mismatch rate
summary(rse_BRCA$'recount_qc.star.mismatch_rate_per_base,_%')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1400  0.2500  0.2900  0.3026  0.3400  0.7100

Very small percentage of mismatches per base. Overall, it seems that the samples are good quality, and we will proceed with the selection of attributes for our model.