scHPF prep

Basic usage

To preprocess genome-wide UMI counts for a typical run, use the command:

scHPF prep -i UMICOUNT_MATRIX -o OUTDIR -m 10 -w WHITELIST

As written, the command prepares a matrix of molecular counts for training and only includes genes that are:

  • on a whitelist, for example one of the lists of protein-coding genes bundled in the scHPF code’s resources folder (-w/--whitelist)

  • observed in at least 10 cells (-m/--min-cells).
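To make the -m/--min-cells criterion concrete, here is a minimal sketch in plain Python. It is not scHPF's implementation: the function name, the dictionary layout, and the reading of "observed" as "has a nonzero count" are assumptions made for illustration.

```python
def genes_in_min_cells(counts_by_gene, min_cells=10):
    """Return gene ids with a nonzero count in at least `min_cells` cells.

    counts_by_gene: dict mapping a gene id to a list of per-cell UMI counts
    (a hypothetical in-memory layout, not scHPF's internal format).
    """
    kept = []
    for gene, counts in counts_by_gene.items():
        # Count the cells in which this gene has at least one molecule.
        n_cells = sum(1 for c in counts if c > 0)
        if n_cells >= min_cells:
            kept.append(gene)
    return kept


# Toy data: three hypothetical genes measured in four cells.
toy = {
    "GENE_A": [1, 0, 3, 2],  # nonzero in 3 cells
    "GENE_B": [0, 0, 5, 0],  # nonzero in 1 cell
    "GENE_C": [0, 0, 0, 0],  # never observed
}
kept = genes_in_min_cells(toy, min_cells=2)
```

With a threshold of 2, only GENE_A survives. Note that the filter counts cells, not molecules, so a gene with many molecules in a single cell is still dropped.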

After running this command, OUTDIR should contain a matrix market file, filtered.mtx, and an ordered list of genes, genes.txt. An optional prefix argument can be added, which is prepended to the output file names.

Now we can train the model with the scHPF train utility.

Input matrix format

scHPF prep takes a molecular count matrix for an scRNA-seq experiment and formats it for training. The input matrix has two allowed formats:

  1. A whitespace-delimited matrix formatted as follows, with no header:

    ENSEMBL_ID    GENE_NAME    UMICOUNT_CELL0    UMICOUNT_CELL1 ...
    
  2. A loom file (see loompy docs). The loom file must have at least one of the row attributes Accession or Gene, where Accession is an ENSEMBL id and Gene is a gene name.
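As an illustration of the whitespace-delimited layout (format 1), the sketch below parses a single row into its three parts. The function name and the example gene are hypothetical, and this is not the parser scHPF itself uses.

```python
def parse_count_line(line):
    """Split one row into (ensembl_id, gene_name, per-cell counts).

    Assumes the layout described above: two identifier columns followed
    by one integer UMI count per cell, separated by any whitespace.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected an id, a gene name, and at least one count")
    ensembl_id, gene_name = fields[0], fields[1]
    counts = [int(x) for x in fields[2:]]
    return ensembl_id, gene_name, counts


# A made-up row with a hypothetical gene measured in three cells.
example = "ENSG00000000001.1\tGENE_A\t0\t2\t1"
gene_id, name, counts = parse_count_line(example)
```

Because the file has no header, every line is a data row, and the number of count fields should be the same on every line.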

Whitelisting genes

About

We recommend restricting analysis to protein-coding genes. The -w/--whitelist option removes all genes in the input data that are not in a two-column, tab-delimited text file of ENSEMBL gene ids and names. Symmetrically, the -b/--blacklist option removes all genes that are in a given file.

Whitelists for human and mouse are provided in the resources folder, and details on formatting and custom lists are in the gene list documentation.

Attention

ENSEMBL ids may end in a period followed by an unstable version number (e.g. ENSG00000186092.6). By default, the prep command ignores anything after the period. This means [ENS-ID].[VERSION] is equivalent to [ENS-ID]. This behavior can be overridden with the --no-split-on-dot flag.
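The default handling of versioned ids can be pictured with the following sketch. The function and its split_on_dot parameter are hypothetical stand-ins for the behavior described above, not scHPF's actual code.

```python
def normalize_ensembl_id(gene_id, split_on_dot=True):
    """Drop the version suffix from an ENSEMBL id unless told otherwise.

    split_on_dot=False mimics what the --no-split-on-dot flag is
    described as doing: the id is compared exactly as written.
    """
    if split_on_dot:
        # Keep only the text before the first period.
        return gene_id.split(".", 1)[0]
    return gene_id


stripped = normalize_ensembl_id("ENSG00000186092.6")
unchanged = normalize_ensembl_id("ENSG00000186092.6", split_on_dot=False)
```

Here stripped is ENSG00000186092 while unchanged keeps the .6 suffix, so with splitting disabled a versioned id in the data only matches a whitelist entry that carries the same version.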

Whitespace-delimited input matrix

For whitespace-delimited UMI-count files, filtering is performed using the input matrix’s first column (assumed to be a unique identifier) by default, but can be done with the gene name (next column) using the --filter-by-gene-name flag. This is useful for data that does not include a gene id.
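The choice of filtering column can be sketched as follows. The function, its by_gene_name parameter (standing in for --filter-by-gene-name), and the toy rows are all assumptions for illustration rather than scHPF internals.

```python
def filter_rows(rows, whitelist_ids, whitelist_names, by_gene_name=False):
    """Keep rows whose id (default) or gene name is on the whitelist.

    rows: iterable of (ensembl_id, gene_name, counts) tuples.
    whitelist_ids / whitelist_names: sets built from the two whitelist columns.
    """
    kept = []
    for ensembl_id, gene_name, counts in rows:
        if by_gene_name:
            keep = gene_name in whitelist_names
        else:
            keep = ensembl_id in whitelist_ids
        if keep:
            kept.append((ensembl_id, gene_name, counts))
    return kept


# Hypothetical rows; the second has a placeholder in the id column.
rows = [
    ("ENSG00000000001", "GENE_A", [1, 0]),
    ("NA", "GENE_B", [0, 4]),
]
ids = {"ENSG00000000001", "ENSG00000000002"}
names = {"GENE_A", "GENE_B"}

by_id = filter_rows(rows, ids, names)
by_name = filter_rows(rows, ids, names, by_gene_name=True)
```

Filtering by id keeps only the first row, while filtering by name keeps both, which is why the name-based mode helps when the id column is missing or unreliable.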

loom input matrix

For loom files, we filter the loom Accession row attribute against the whitelist’s ENSEMBL ids if Accession is present in the loom’s row attributes, and filter the loom’s Gene row attribute against the gene names in the whitelist otherwise.
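The attribute selection described above amounts to a simple precedence rule, sketched here. The function is hypothetical and takes a plain collection of attribute names rather than a loompy object, so it does not depend on the loompy API.

```python
def choose_loom_filter_attribute(row_attr_names):
    """Pick the row attribute to compare against the whitelist.

    Accession (ENSEMBL id) takes precedence; Gene (gene name) is the
    fallback. At least one of them must be present.
    """
    if "Accession" in row_attr_names:
        return "Accession"
    if "Gene" in row_attr_names:
        return "Gene"
    raise KeyError("loom file needs an Accession or Gene row attribute")


both = choose_loom_filter_attribute(["Gene", "Accession"])
name_only = choose_loom_filter_attribute(["Gene"])
```

When both attributes exist, both resolves to Accession, so gene names are consulted only for loom files that lack ENSEMBL ids.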

Complete options

For complete options, see the complete CLI reference or use the -h option on the command line:

scHPF prep -h