Complete CLI Reference

scHPF prep

usage: scHPF prep [-h] -i INPUT [-o OUTDIR] [-p PREFIX] [-m MIN_CELLS]
                  [-w WHITELIST] [-b BLACKLIST] [-nvc N_VALIDATION_CELLS]
                  [-vgid VALIDATION_GROUP_IDS]
                  [--validation-max-group-frac VALIDATION_MAX_GROUP_FRAC]
                  [--filter-by-gene-name] [--no-split-on-dot]

Named Arguments

-i, --input

Input data. Currently accepts either: (1) a whitespace-delimited gene by cell UMI count matrix with 2 leading columns of gene attributes (ENSEMBL_ID and GENE_NAME respectively), or (2) a loom file with at least one of the row attributes Accession or Gene, where Accession is an ENSEMBL id and Gene is the name.

-o, --outdir

Output directory. Does not need to exist.

-p, --prefix

Prefix for output files. Optional.

Default: ''

-m, --min-cells

Minimum number of cells in which we must observe at least one transcript of a gene for the gene to pass filtering. If 0 < min_cells < 1, sets the threshold to min_cells * ncells, rounded to the nearest integer. [Default 0.01]

Default: 0.01
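The fractional form of --min-cells can be illustrated with a short Python sketch. This is not scHPF's own code: the function name is hypothetical, and the treatment of values of 1 or more as absolute counts is an assumption based on the description above.

```python
def min_cells_threshold(min_cells: float, ncells: int) -> int:
    """Resolve a --min-cells value into an absolute number of cells.

    Assumption: values strictly between 0 and 1 are a fraction of ncells;
    any other value is already an absolute count.
    """
    if 0 < min_cells < 1:
        # Python's round() rounds exact halves to the nearest even integer;
        # the tool's own tie-breaking may differ.
        return round(min_cells * ncells)
    return int(min_cells)


# With the default of 0.01 and 5000 cells, a gene must be seen in 50 cells.
threshold = min_cells_threshold(0.01, 5000)
```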

-w, --whitelist

Tab-delimited file where first column contains ENSEMBL gene ids to accept, and second column contains corresponding gene names. If given, genes not on the whitelist are filtered from the input matrix. Superseded by blacklist. Optional.

Default: ''

-b, --blacklist

Tab-delimited file where first column contains ENSEMBL gene ids to exclude, and second column is the corresponding gene name. Only performed if file given. Genes on the blacklist are excluded even if they are also on the whitelist. Optional.

Default: ''

-nvc, --n-validation-cells

Number of cells to randomly select for validation.

Default: 0

-vgid, --validation-group-ids

Single-column file of cell group ids readable with np.loadtxt. If --n-validation-cells is > 0, cells will be randomly selected approximately evenly across the groups in this file, under the constraint that at most --validation-max-group-frac * (ncells in group) are selected from every group.

--validation-max-group-frac

If -nvc > 0 and --validation-group-ids is a valid file, at most validation-max-group-frac * (ncells in group) cells are selected from each group.

Default: 0.5
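One way to picture group-balanced validation selection is the round-robin sketch below. It is an illustration only, not the algorithm scHPF uses: the function name, the seed parameter, and the round-robin strategy are all assumptions. It draws cells evenly across groups while never taking more than max_group_frac of any group.

```python
import random
from collections import defaultdict


def select_validation_cells(group_ids, n_validation, max_group_frac=0.5, seed=0):
    """Pick up to n_validation cell indices, spread evenly across groups."""
    rng = random.Random(seed)

    # Collect the cell indices that belong to each group.
    by_group = defaultdict(list)
    for cell, gid in enumerate(group_ids):
        by_group[gid].append(cell)

    # Shuffle within each group and compute the per-group cap.
    caps = {}
    for gid, cells in by_group.items():
        rng.shuffle(cells)
        caps[gid] = int(max_group_frac * len(cells))

    selected = []
    taken = {gid: 0 for gid in by_group}
    # Take one cell per group per pass until the quota or every cap is reached.
    while len(selected) < n_validation:
        progressed = False
        for gid, cells in by_group.items():
            if len(selected) >= n_validation:
                break
            if taken[gid] < caps[gid]:
                selected.append(cells[taken[gid]])
                taken[gid] += 1
                progressed = True
        if not progressed:
            break  # every group has hit its cap
    return sorted(selected)


groups = ["a"] * 10 + ["b"] * 4
validation = select_validation_cells(groups, n_validation=6)
```

In this example group "b" has only 4 cells, so at most 2 of them can be selected with the default fraction of 0.5; the remaining cells come from group "a".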

--filter-by-gene-name

Use gene name rather than ENSEMBL id to filter (with whitelist or blacklist). Useful for datasets where only gene symbols are given. Applies to both whitelist and blacklist. Used by default when input is a loom file (unless there is an Accession attribute in the loom).

Default: False

--no-split-on-dot

Don’t split gene symbol or name on the period before filtering against the whitelist and blacklist. Splitting is done by default for ENSEMBL ids.

Default: False
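The interaction between the whitelist, the blacklist, and splitting on the period can be sketched as follows. This is a hedged illustration rather than scHPF's implementation: the function names are hypothetical, and it assumes that "splitting" keeps the part of the id before the first period (for example, dropping an ENSEMBL version suffix).

```python
def base_id(gene_id, split_on_dot=True):
    """Return the id used for matching; assumed to be the text before the first period."""
    return gene_id.split(".")[0] if split_on_dot else gene_id


def filter_genes(gene_ids, whitelist=None, blacklist=None, split_on_dot=True):
    """Keep whitelisted genes, then drop blacklisted ones (the blacklist wins)."""
    kept = []
    for gene_id in gene_ids:
        key = base_id(gene_id, split_on_dot)
        if whitelist is not None and key not in whitelist:
            continue
        if blacklist is not None and key in blacklist:
            continue
        kept.append(gene_id)
    return kept


genes = ["ENSG1.2", "ENSG2.1", "ENSG3.5"]
kept = filter_genes(genes, whitelist={"ENSG1", "ENSG2"}, blacklist={"ENSG2"})
```

Here ENSG2 is on both lists and is excluded, matching the rule that the blacklist supersedes the whitelist.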

scHPF train

usage: scHPF train [-h] -i INPUT [-o OUTDIR] [-p PREFIX] [-t NTRIALS]
                   [-v VALIDATION_CELLS] [-M MAX_ITER] [-m MIN_ITER]
                   [-e EPSILON] [-f CHECK_FREQ]
                   [--better-than-n-ago BETTER_THAN_N_AGO] [-a A] [-c C]
                   [--float32] [-bs BATCHSIZE] [-sl SMOOTH_LOSS] [-bts] [-sa]
                   [-rp] [--quiet]
                   nfactors

Named Arguments

-i, --input

Training data. Expects either the mtx file output by the prep command or a tab-separated tsv file formatted like: CELL_ID GENE_ID UMI_COUNT. In the latter case, ids are assumed to be 0-indexed and we assume no duplicates.
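For readers preparing their own tsv input, the sketch below shows what the CELL_ID GENE_ID UMI_COUNT layout implies. It is an assumption-laden illustration, not scHPF's loader: the function name is hypothetical, and inferring the matrix shape from the largest index is a simplification.

```python
import csv
import io


def read_triplets(handle):
    """Parse tab-separated (cell, gene, count) rows with 0-indexed integer ids."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        cell, gene, umi = (int(field) for field in row)
        if (cell, gene) in counts:
            # The format assumes each (cell, gene) pair appears at most once.
            raise ValueError(f"duplicate entry for cell {cell}, gene {gene}")
        counts[(cell, gene)] = umi
    # Assumption: dimensions are inferred from the largest 0-indexed id seen.
    ncells = 1 + max((c for c, _ in counts), default=-1)
    ngenes = 1 + max((g for _, g in counts), default=-1)
    return counts, (ncells, ngenes)


example = io.StringIO("0\t0\t3\n0\t2\t1\n1\t1\t5\n")
counts, shape = read_triplets(example)
```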

-o, --outdir

Output directory for scHPF model. Will be created if it does not exist.

-p, --prefix

Prefix for output files. Optional.

Default: ''

nfactors

Number of factors.

-t, --ntrials

Number of times to run scHPF, selecting the trial with best loss (on training data unless validation is given). [Default 1]

Default: 1

-v, --validation-cells

Cells to use to assess convergence and choose a model. Expects same format as -i/--input. Training data used by default.

-M, --max-iter

Maximum iterations. [Default 1000].

Default: 1000

-m, --min-iter

Minimum iterations. [Default 30]

Default: 30

-e, --epsilon

Minimum percent decrease in loss between checks to continue inference (convergence criterion). [Default 0.001].

Default: 0.001

-f, --check-freq

Number of iterations to run between convergence checks. [Default 10].

Default: 10

--better-than-n-ago

Stop condition if loss is getting worse. Stops training if the loss is worse than it was better_than_n_ago * check_freq training steps ago and is still getting worse. Normally not necessary to change.

Default: 5
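The stopping options above (--max-iter, --min-iter, --epsilon, --check-freq, and --better-than-n-ago) can be read together as a single decision made at each convergence check. The sketch below is one plausible reading, not scHPF's actual logic: the function name is hypothetical, and it assumes the loss is recorded once per check and that epsilon is expressed in percent.

```python
def should_stop(losses, iteration, *, min_iter=30, max_iter=1000,
                epsilon=0.001, better_than_n_ago=5):
    """Decide whether to stop, given the losses recorded at each check so far."""
    if iteration >= max_iter:
        return True
    if iteration < min_iter or len(losses) < 2:
        return False

    prev, curr = losses[-2], losses[-1]
    if prev == 0:
        return curr >= 0  # loss cannot decrease relative to zero in this sketch

    # Converged: the loss decreased, but by less than epsilon percent.
    pct_decrease = 100.0 * (prev - curr) / abs(prev)
    if 0 <= pct_decrease < epsilon:
        return True

    # Diverging: worse than better_than_n_ago checks ago and still rising.
    if len(losses) > better_than_n_ago:
        n_ago = losses[-1 - better_than_n_ago]
        if curr > n_ago and curr > prev:
            return True
    return False
```

Because losses are recorded every check-freq iterations, looking back better_than_n_ago entries in this list corresponds to better_than_n_ago * check_freq training steps.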

-a

Value for hyperparameter a. Setting to -2 will auto-set to 1/sqrt(nfactors). [Default 0.3]

Default: 0.3

-c

Value for hyperparameter c. Setting to -2 will auto-set to 1/sqrt(nfactors). [Default 0.3]

Default: 0.3
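As a quick worked example of the -2 sentinel for -a and -c, the sketch below resolves the value the same way the descriptions state. The function name is hypothetical and the code is illustrative, not taken from scHPF.

```python
import math


def resolve_hyperparameter(value, nfactors):
    """Return 1/sqrt(nfactors) when the sentinel -2 is given, else the value itself."""
    return 1.0 / math.sqrt(nfactors) if value == -2 else value


# With 16 factors, the auto-set value is 1/sqrt(16) = 0.25.
auto_a = resolve_hyperparameter(-2, 16)
```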

--float32

Use 32-bit floats instead of the default 64-bit floats in variational distributions.

Default: False

-bs, --batchsize

Number of cells to use per training round. All cells used if 0. Note that using batches changes the order of updates during inference.

Default: 0

-sl, --smooth-loss

Average loss over the last --smooth-loss iterations. Intended for use with minibatches, where int(ncells/batchsize) is a reasonable value.

Default: 1
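The smoothing described for --smooth-loss amounts to a trailing average. The following is a minimal sketch under that reading; the function name is hypothetical and this is not scHPF's code.

```python
def smoothed_loss(losses, smooth_loss=1):
    """Average the most recent smooth_loss values; 1 means no smoothing."""
    if not losses or smooth_loss < 1:
        raise ValueError("need at least one loss and smooth_loss >= 1")
    window = losses[-smooth_loss:]
    return sum(window) / len(window)


# Suggested window when batching: one pass over the data,
# e.g. 10000 cells with a batch size of 512.
suggested_window = int(10000 / 512)
```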

-bts, --beta-theta-simultaneous

If False (default), compute the beta update, then compute theta based on the updated beta. Note that if batching is used, this order is reversed. If True, update both beta and theta based on values from the last training round. The latter slows the rate of convergence, especially for large numbers of cells, but sometimes results in better log-likelihoods.

Default: False

-sa, --save-all

Save all trials

Default: False

-rp, --reproject

Reproject data onto fixed global (gene) parameters after convergence, but before model selection. Recommended with batching

Default: False

--quiet

Don’t print intermediate llh.

Default: True

scHPF score

usage: scHPF score [-h] -m MODEL [-o OUTDIR] [-p PREFIX] [-g GENEFILE]
                   [--name-col NAME_COL]

Named Arguments

-m, --model

Saved scHPF model from the train command. Should have the extension .joblib.

-o, --outdir

Output directory for score files. If not given, a new subdirectory of the dir containing the model will be made with the same name as the model file (without extension)

-p, --prefix

Prefix for output files. Optional.

Default: ''

-g, --genefile

Create an additional file with gene names ranked by score for each factor. Expects the gene.txt file output by the scHPF prep command or a similarly formatted tab-delimited file without headers. Uses the zero-indexed --name-col’th column as gene names. Optional.

--name-col

The zero-indexed column of genefile to use as a gene name when (optionally) ranking genes. If --name-col is greater than the index of --genefile’s last column, it is automatically reset to the last column’s index. [Default 1]

Default: 1
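The column-clamping rule for --name-col can be sketched as below. The function name is hypothetical and the descending sort order is an assumption about what "ranked by score" means; this is not scHPF's implementation.

```python
def ranked_gene_names(gene_rows, scores, name_col=1):
    """Rank gene names by score for one factor, clamping name_col to the last column."""
    last_col = len(gene_rows[0]) - 1
    col = min(name_col, last_col)  # reset to the last column if out of range
    # Assumption: higher scores rank first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [gene_rows[i][col] for i in order]


rows = [["ENSG1", "TP53"], ["ENSG2", "MYC"], ["ENSG3", "GAPDH"]]
ranked = ranked_gene_names(rows, scores=[0.2, 0.9, 0.5])
```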

scHPF prep-like

usage: scHPF prep-like [-h] -i INPUT -r REFERENCE -o OUTDIR [-p PREFIX]
                       [--by-gene-name] [--no-split-on-dot]

Named Arguments

-i, --input

Input data to format. Currently accepts either: (1) a whitespace-delimited gene by cell UMI count matrix with 2 leading columns of gene attributes (ENSEMBL_ID and GENE_NAME respectively), or (2) a loom file with at least one of the row attributes Accession or Gene, where Accession is an ENSEMBL id and Gene is the name.

-r, --reference

Two-column tab-delimited file of ENSEMBL ids and gene names to select from input and order like. All genes in reference must be present in input.

-o, --outdir

Output directory. Does not need to exist.

-p, --prefix

Prefix for output files. Optional.

Default: ''

--by-gene-name

Use gene name rather than ENSEMBL id when matching against reference. Useful for datasets where only gene symbols are given. Used by default when input is a loom file (unless there is an Accession attr in the loom).

Default: False

--no-split-on-dot

Don’t split gene symbol or name on the period when matching to reference. Splitting is done by default for ENSEMBL ids.

Default: False
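What prep-like does conceptually, selecting the reference genes from the input and ordering them like the reference, can be sketched as follows. This is an illustration with a hypothetical function name, not scHPF's code; it only shows the index bookkeeping and the requirement that every reference gene be present in the input.

```python
def order_like_reference(input_ids, reference_ids):
    """Return input row indices that reproduce the reference gene order."""
    position = {gene_id: i for i, gene_id in enumerate(input_ids)}
    missing = [gene_id for gene_id in reference_ids if gene_id not in position]
    if missing:
        # All genes in the reference must be present in the input.
        raise KeyError(f"genes missing from input: {missing}")
    return [position[gene_id] for gene_id in reference_ids]


# Rows 2 and 0 of the input, in that order, match the reference ["C", "A"].
row_order = order_like_reference(["A", "B", "C", "D"], ["C", "A"])
```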

scHPF project

usage: scHPF project [-h] -m MODEL -i INPUT [-o OUTDIR] [-p PREFIX]
                     [--recalc-bp] [--max-iter MAX_ITER] [--min-iter MIN_ITER]
                     [--epsilon EPSILON] [--check-freq CHECK_FREQ]

Named Arguments

-m, --model

The model to project onto.

-i, --input

Data to project onto model. Expects either the mtx file output by the prep or prep-like commands or a tab-delimited tsv file formatted like: CELL_ID GENE_ID UMI_COUNT. In the latter case, ids are assumed to be 0-indexed and we assume no duplicates.

-o, --outdir

Output directory for projected scHPF model. Will be created if it does not exist.

-p, --prefix

Prefix for output files. Optional.

Default: ''

--recalc-bp

Recalculate hyperparameter bp for the new data

Default: False

--max-iter

Maximum iterations. [Default 500].

Default: 500

--min-iter

Minimum iterations. [Default 10]

Default: 10

--epsilon

Minimum percent decrease in loss between checks to continue inference (convergence criterion). [Default 0.001].

Default: 0.001

--check-freq

Number of iterations to run between convergence checks. [Default 10].

Default: 10