1 Downloading Data
The first step in the project is obtaining the data that we will analyze, as well as processing it so it is in the format we need for further study.
1.1 Downloading through recount 3
recount3
is a project that provides access to over 750,000 publicly available human and mouse RNA-seq samples, all uniformly processed. The original documentation can be found here.
For this project, we will be working with BRCA data from The Cancer Genome Atlas Program (TCGA). We will begin by loading the recount3
package and its dependencies.
## Load recount3 R package
library("recount3")
The data available in the recount3
project can be explored here. To download the data we are interested in, we will use the following code.
# Create RangedSummarizedExperiment (RSE) object
<- recount3::create_rse_manual(
rse_BRCA project = "BRCA",
project_home = "data_sources/tcga",
organism = "human",
annotation = "gencode_v29",
type = "gene"
)
rse_BRCA
## class: RangedSummarizedExperiment
## dim: 64837 1256
## metadata(8): time_created recount3_version ... annotation recount3_url
## assays(1): raw_counts
## rownames(64837): ENSG00000278704.1 ENSG00000277400.1 ... ENSG00000182484.15_PAR_Y ENSG00000227159.8_PAR_Y
## rowData names(10): source type ... havana_gene tag
## colnames(1256): 0b1cf6de-f98f-4020-baa7-c4fc518e5f00 20c90e97-e900-488e-af6a-6648d9a8bb01 ...
## a340e38d-af36-4b13-add3-50be6819d4fe a79dd5b2-f97a-4b24-b762-fea1cd5238fe
## colData names(937): rail_id external_id ... recount_seq_qc.errq BigWigURL
This data set includes 63856 genes, and 1256 samples.
1.2 Obtaining read counts
The RSE
object contains the raw base-pair counts, to compute the read counts, we use the following code.
# Saving read counts as an assay in the rse object
assay(rse_BRCA, "counts") <- compute_read_counts(rse_BRCA)
1.3 Exploring data
Our data is stored as a RangedSummarizedExperiment
object, which includes data for both the samples (columns), and the genes (rows), as well as metadata and the assays (counts, in this case). We will first perform quality control, and then select some sample attributes for our model.
# Number of samples and number of sample attributes
dim(colData(rse_BRCA))
## [1] 1256 937
For our 1256 samples, there are 937 columns, corresponding to 937 possible attributes to include in our model. To reduce the options, we will first filter by removing those attributes with only NA values.
# Keep only non-empty columns
colData(rse_BRCA) <- colData(rse_BRCA)[, colSums(is.na(colData(rse_BRCA))) != nrow(colData(rse_BRCA))]
dim(colData(rse_BRCA))
## [1] 1256 388
This leaves us with 388 columns. These attributes have terms separated by periods, we can explore the source of each attribute by selecting the first keyword of each one to see what kind of information we have.
# Split each string by '.'
<- strsplit(colnames(colData(rse_BRCA)), ".", fixed=TRUE)
split_col_names
# Keep only first element
<- sapply(split_col_names, getElement, 1)
split_col_names
# Counting the results
table(split_col_names)
## split_col_names
## BigWigURL external_id rail_id recount_project recount_qc recount_seq_qc study
## 1 1 1 5 79 12 1
## tcga
## 288
Our main sources of information are from recount
and from tcga
.
1.4 Quality control
Let’s look at both recount_qc
and recount_seq_qc
for quality control. The meaning of the different fields can be found in the documentation.
colnames(colData(rse_BRCA))[grepl('^recount_seq_qc', colnames(colData(rse_BRCA)))]
## [1] "recount_seq_qc.min_len" "recount_seq_qc.max_len"
## [3] "recount_seq_qc.avg_len" "recount_seq_qc.#distinct_quality_values"
## [5] "recount_seq_qc.#bases" "recount_seq_qc.%a"
## [7] "recount_seq_qc.%c" "recount_seq_qc.%g"
## [9] "recount_seq_qc.%t" "recount_seq_qc.%n"
## [11] "recount_seq_qc.avgq" "recount_seq_qc.errq"
head(colnames(colData(rse_BRCA))[grepl('^recount_qc', colnames(colData(rse_BRCA)))], 10)
## [1] "recount_qc.aligned_reads%.chrm" "recount_qc.aligned_reads%.chrx"
## [3] "recount_qc.aligned_reads%.chry" "recount_qc.bc_auc.all_reads_all_bases"
## [5] "recount_qc.bc_auc.all_reads_annotated_bases" "recount_qc.bc_auc.unique_reads_all_bases"
## [7] "recount_qc.bc_auc.unique_reads_annotated_bases" "recount_qc.bc_auc.all_%"
## [9] "recount_qc.bc_auc.unique_%" "recount_qc.bc_frag.count"
# The field `recount_seq_qc.avgq` corresponds to the weighted average over Phred quality scores
summary(rse_BRCA$recount_seq_qc.avgq)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.10 36.90 37.40 37.31 37.90 39.00
All Phred scores are above 20, meaning all samples are estimated to be at least 99% accurate.
# Exploring unmapped reads
summary(rse_BRCA$'recount_qc.star.%_of_reads_unmapped:_other')
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01000 0.03000 0.04000 0.05249 0.05000 1.31000
summary(rse_BRCA$'recount_qc.star.%_of_reads_unmapped:_too_many_mismatches')
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
summary(rse_BRCA$'recount_qc.star.%_of_reads_unmapped:_too_short')
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.030 4.990 5.650 6.319 6.522 56.120
There are few unmapped reads, except for some that are too short.
# Exploring mismatch rate
summary(rse_BRCA$'recount_qc.star.mismatch_rate_per_base,_%')
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1400 0.2500 0.2900 0.3026 0.3400 0.7100
Very small percentage of mismatches per base. Overall, it seems that the samples are good quality, and we will proceed with the selection of attributes for our model.