Gene lists¶

About¶

We recommend restricting analysis to protein-coding genes, and bundle premade lists of coding genes for human and mouse with the scHPF code. The prep CLI command optionally uses these lists to filter input data. Although ENSEMBL ids are theoretically unambiguous and consistent across releases (ie stable identifiers), you may want to generate your own list from a different annotation (for example, one that matches the GENCODE version used for alignment) or with different parameters for gene inclusion (eg including lncRNA).

Premade lists¶

The scHPF code includes tab-delimited lists of ENSEMBL ids and names for genes with protein coding, T-cell receptor constant, or immunoglobulin constant biotypes for human and mouse.

Premade lists can be found in the code’s resources folder:

Human (GENCODE v24, v29, v31)

Mouse (GENCODE vM10, vM19)

Format¶

Example tab-delimited gene list:

ENSG00000186092     OR4F5
ENSG00000284733     OR4F29
ENSG00000284662     OR4F16
ENSG00000187634     SAMD11
ENSG00000188976     NOC2L
ENSG00000187961     KLHL17

By default, the prep command assumes a two-column, tab-delimited text file of ENSEMBL gene ids and names, and uses the first column (assumed to be ENSEMBL id) to filter genes. See the prep command documentation for other options.
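Before passing a custom list to prep, a quick format check can catch problems such as spaces used in place of tabs. The following is a minimal sketch and not part of scHPF: the function name check_gene_list is hypothetical, and the id pattern assumes ids that look like ENSG... (human) or ENSMUSG... (mouse).

```shell
# Hypothetical helper (not part of scHPF): verify that every line of a gene
# list has exactly two tab-delimited columns and that the first column looks
# like an ENSEMBL gene id. The id pattern below is an assumption.
check_gene_list() {
    awk -F '\t' '
        NF != 2 || $1 !~ /^ENS[A-Z]*G[0-9]+/ {
            # Report each offending line on stderr
            printf "line %d: unexpected format: %s\n", NR, $0 > "/dev/stderr"
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example usage (my_genes.txt is a placeholder file name):
#   check_gene_list my_genes.txt && echo "gene list looks OK"
```

The function exits with a non-zero status if any line fails the check, so it can be chained with && in a shell script.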

Note

ENSEMBL ids may end in a period followed by an unstable version number (eg ENSG00000186092.6). By default, the prep command ignores anything after the period, so [ENS-ID].[VERSION] is treated as equivalent to [ENS-ID]. See the prep command documentation for other options.
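If you want unversioned ids in a list of your own, the same normalization can be applied by hand. This is only an illustration of the behavior described in the note, not the prep command's actual implementation, and the function name strip_ensembl_version is hypothetical.

```shell
# Hypothetical helper: drop everything from the first period onward in the
# first column of a tab-delimited gene list read from stdin.
strip_ensembl_version() {
    awk 'BEGIN { FS = OFS = "\t" } { sub(/\..*$/, "", $1); print }'
}

# Example: a versioned id is reduced to its stable part
printf 'ENSG00000186092.6\tOR4F5\n' | strip_ensembl_version
```

The second column is passed through unchanged, so gene names that contain a period are not affected.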

Making custom gene lists¶

Although ENSEMBL ids aim to be unambiguous and consistent across releases (ie stable identifiers), you may want to generate your own list from a different annotation or with different parameters for gene inclusion.

Example creation script¶

Reference files of ids and names for genes with protein_coding, TR_C_gene, or IG_C_gene biotypes in the GENCODE main annotation (in this case gencode.v29.annotation.gtf) were generated as follows:

# Select genes with feature gene and level 1 or 2
awk '{if($3=="gene" && $0~"level (1|2);"){print $0}}' gencode.v29.annotation.gtf > gencode.v29.annotation.gene_l1l2.gtf

# Only include biotypes protein_coding, TR_C_g* and IG_C_g*
awk '{if($12~"TR_C_g" || $12~"IG_C_g" || $12~"protein_coding"){print $0}}' gencode.v29.annotation.gene_l1l2.gtf > gencode.v29.annotation.gene_l1l2.pc_TRC_IGC.gtf

# Retrieve ENSEMBL gene id and name
awk '{{OFS="\t"}{gsub(/"/, "", $10); gsub(/;/, "", $10); gsub(/"/, "", $14); gsub(/;/, "", $14); print $10, $14}}' gencode.v29.annotation.gene_l1l2.pc_TRC_IGC.gtf > gencode.v29.annotation.gene_l1l2.pc_TRC_IGC.stripped.txt

Note

For older GENCODE versions, you may need to adjust the field indices in the third awk command, which retrieves the ENSEMBL gene id and name (for example changing all instances of $14 to $16).
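One way to avoid editing field indices by hand is to look attributes up by key rather than by position. The following is a sketch under stated assumptions, not the script used to build the bundled lists: the function name gtf_gene_id_name is hypothetical, and it assumes the standard GTF layout in which the ninth tab-delimited column holds semicolon-separated attributes such as gene_id "..." and gene_name "...".

```shell
# Hypothetical helper: print gene_id and gene_name (tab-delimited) for each
# GTF line on stdin, finding each attribute by key instead of field position.
gtf_gene_id_name() {
    awk -F '\t' 'BEGIN { OFS = "\t" }
    {
        id = ""; name = ""
        # Column 9 holds the attributes, separated by semicolons
        n = split($9, attrs, ";")
        for (i = 1; i <= n; i++) {
            a = attrs[i]
            sub(/^ +/, "", a)   # trim leading spaces
            if (a ~ /^gene_id /)   { id = a;   sub(/^gene_id /, "", id);     gsub(/"/, "", id) }
            if (a ~ /^gene_name /) { name = a; sub(/^gene_name /, "", name); gsub(/"/, "", name) }
        }
        # Header/comment lines have no gene_id and are skipped
        if (id != "") print id, name
    }'
}

# Example usage on the filtered file from the second command above:
#   gtf_gene_id_name < gencode.v29.annotation.gene_l1l2.pc_TRC_IGC.gtf > genes.txt
```

Because the lookup is by key, the output does not depend on the order of the attributes. Splitting on semicolons assumes that attribute values do not themselves contain semicolons.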