scHPF prep
Basic usage
To preprocess genome-wide UMI counts for a typical run, use the command:
scHPF prep -i UMICOUNT_MATRIX -o OUTDIR -m 10 -w WHITELIST
As written, the command prepares a matrix of molecular counts for training and only includes genes that are:
on a whitelist, for example one of the lists of protein coding genes bundled in the scHPF code’s reference folder (-w/--whitelist)
observed in at least 10 cells (-m/--min-cells).
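The two criteria above can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not scHPF's implementation: the function name `filter_genes`, the dict-based layout of the counts, the toy gene ids, and the reading of "observed in a cell" as "count greater than zero" are all assumptions made for the example.

```python
def filter_genes(counts, whitelist, min_cells=10):
    """Return gene ids that are whitelisted and observed in >= min_cells cells.

    counts: dict mapping gene id -> list of per-cell UMI counts
            (hypothetical layout, chosen for readability).
    whitelist: set of allowed gene ids.
    """
    kept = []
    for gene, per_cell in counts.items():
        # Assumption: a gene is "observed" in a cell if its count is > 0.
        n_cells = sum(1 for c in per_cell if c > 0)
        if gene in whitelist and n_cells >= min_cells:
            kept.append(gene)
    return kept


# Toy data: three made-up genes across four cells.
counts = {
    "GENE_A": [1, 0, 2, 5],  # whitelisted, observed in 3 cells
    "GENE_B": [0, 0, 0, 1],  # whitelisted, observed in 1 cell
    "GENE_C": [3, 3, 3, 3],  # observed everywhere, but not whitelisted
}
whitelist = {"GENE_A", "GENE_B"}

kept = filter_genes(counts, whitelist, min_cells=2)
print(kept)
```

With a threshold of 2 cells, only GENE_A passes both checks: GENE_B is seen in too few cells and GENE_C is absent from the whitelist.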
After running this command, OUTDIR should contain a matrix market file, filtered.mtx, and an ordered list of genes, genes.txt. An optional prefix argument can be added, which is prepended to the output file names.
Now we can train the model with the scHPF train utility.
Input matrix format
scHPF prep takes a molecular count matrix for an scRNA-seq experiment and formats it for training. The input matrix has two allowed formats:
A whitespace-delimited matrix formatted as follows, with no header:
ENSEMBL_ID GENE_NAME UMICOUNT_CELL0 UMICOUNT_CELL1 ...
A loom file (see loompy docs). The loom file must have at least one of the row attributes Accession or Gene, where Accession is an ENSEMBL id and Gene is a gene name.
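To make the whitespace-delimited layout concrete, the sketch below parses a single row into its three parts. It is an illustration under stated assumptions, not scHPF's parser: `parse_row` is a hypothetical helper, and the gene name `GENE_X` and the counts in the sample line are made up.

```python
def parse_row(line):
    """Split one whitespace-delimited row into (ensembl_id, gene_name, counts).

    Expected layout, with no header line in the file:
    ENSEMBL_ID GENE_NAME UMICOUNT_CELL0 UMICOUNT_CELL1 ...
    """
    fields = line.split()  # splits on any run of spaces or tabs
    if len(fields) < 3:
        raise ValueError("expected an id, a name, and at least one count")
    ensembl_id, gene_name = fields[0], fields[1]
    counts = [int(x) for x in fields[2:]]  # one UMI count per cell
    return ensembl_id, gene_name, counts


# Sample row: placeholder gene name and made-up counts for three cells.
ensembl_id, gene_name, cell_counts = parse_row("ENSG00000186092\tGENE_X 0 3 1")
print(ensembl_id, gene_name, cell_counts)
```

Because the split is on arbitrary whitespace, tab-delimited and space-delimited files are handled the same way in this sketch.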
Whitelisting genes
About
We recommend restricting analysis to protein-coding genes. The -w/--whitelist option removes all genes in the input data that are not in a two-column, tab-delimited text file of ENSEMBL gene ids and names. Symmetrically, the -b/--blacklist option removes all genes that are in the given file.
Whitelists for human and mouse are provided in the resources folder, and details on formatting and custom lists are in the gene list documentation.
Attention
ENSEMBL ids may end in a period followed by an unstable version number (e.g. ENSG00000186092.6). By default, the prep command ignores anything after the period. This means [ENS-ID].[VERSION] is equivalent to [ENS-ID]. This behavior can be overridden with the --no-split-on-dot flag.
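The default handling of version suffixes amounts to the following. This is a minimal sketch rather than the code scHPF runs: `strip_version` and its `split_on_dot` parameter are hypothetical names that mirror the described behavior and the --no-split-on-dot flag.

```python
def strip_version(ensembl_id, split_on_dot=True):
    """Drop everything after the first period in an ENSEMBL id.

    split_on_dot=False mimics the described effect of --no-split-on-dot:
    the id is compared exactly as written, version suffix included.
    """
    if split_on_dot:
        return ensembl_id.split(".", 1)[0]
    return ensembl_id


default_id = strip_version("ENSG00000186092.6")
unsplit_id = strip_version("ENSG00000186092.6", split_on_dot=False)
print(default_id, unsplit_id)
```

With the default, a versioned id in the input matches an unversioned id in the whitelist; with splitting disabled, the two only match if they are written identically.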
Whitespace-delimited input matrix
For whitespace-delimited UMI-count files, filtering is performed using the input matrix’s first column (assumed to be a unique identifier) by default, but can be done with the gene name (the second column) using the --filter-by-gene-name flag. This is useful for data that does not include a gene id.
loom input matrix
For loom files, we filter the loom Accession row attribute against the whitelist’s ENSEMBL ids if Accession is present in the loom’s row attributes, and filter the loom’s Gene row attribute against the gene names in the whitelist otherwise.
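The fallback rule for loom files can be written as a small decision function. This is a sketch under stated assumptions: a plain dict stands in for the loom file's row attributes, and `choose_filter_column` with its return labels is a hypothetical helper, not part of scHPF or loompy.

```python
def choose_filter_column(row_attrs):
    """Pick which loom row attribute to filter on, and against which whitelist column.

    row_attrs: mapping of row attribute name -> values (a dict stands in
               for the loom file's row attributes in this sketch).
    Returns (loom_attribute, whitelist_column).
    """
    if "Accession" in row_attrs:
        # Accession is preferred whenever it exists.
        return "Accession", "ensembl_id"
    if "Gene" in row_attrs:
        # Fall back to gene names only when Accession is absent.
        return "Gene", "gene_name"
    raise ValueError("loom file needs an Accession or Gene row attribute")


# Made-up attribute sets for three hypothetical loom files.
both = choose_filter_column({"Accession": ["ENSG00000186092"], "Gene": ["GENE_X"]})
gene_only = choose_filter_column({"Gene": ["GENE_X"]})
print(both, gene_only)
```

Note that when both attributes are present, the gene names are not consulted at all in this reading of the rule.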
Complete options
For complete options, see the complete CLI reference or use the -h option on the command line:
scHPF prep -h