Complete CLI Reference
scHPF prep
usage: scHPF prep [-h] -i INPUT [-o OUTDIR] [-p PREFIX] [-m MIN_CELLS]
                  [-w WHITELIST] [-b BLACKLIST] [-nvc N_VALIDATION_CELLS]
                  [-vgid VALIDATION_GROUP_IDS]
                  [--validation-max-group-frac VALIDATION_MAX_GROUP_FRAC]
                  [--filter-by-gene-name] [--no-split-on-dot]
Named Arguments
- -i, --input
Input data. Currently accepts either: (1) a whitespace-delimited gene by cell UMI count matrix with 2 leading columns of gene attributes (ENSEMBL_ID and GENE_NAME respectively), or (2) a loom file with at least one of the row attributes Accession or Gene, where Accession is an ENSEMBL id and Gene is the name.
- -o, --outdir
Output directory. Does not need to exist.
- -p, --prefix
Prefix for output files. Optional.
Default: “”
- -m, --min-cells
Minimum number of cells in which we must observe at least one transcript of a gene for the gene to pass filtering. If 0 < min_cells < 1, sets threshold to be min_cells * ncells, rounded to the nearest integer. [Default 0.01]
Default: 0.01
- -w, --whitelist
Tab-delimited file where first column contains ENSEMBL gene ids to accept, and second column contains corresponding gene names. If given, genes not on the whitelist are filtered from the input matrix. Superseded by blacklist. Optional.
Default: “”
- -b, --blacklist
Tab-delimited file where first column contains ENSEMBL gene ids to exclude, and second column is the corresponding gene name. Only performed if file given. Genes on the blacklist are excluded even if they are also on the whitelist. Optional.
Default: “”
- -nvc, --n-validation-cells
Number of cells to randomly select for validation.
Default: 0
- -vgid, --validation-group-ids
Single-column file of cell group ids readable with np.loadtxt. If --n-validation-cells is > 0, cells will be randomly selected approximately evenly across the groups in this file, under the constraint that at most --validation-max-group-frac * (ncells in group) are selected from every group.
- --validation-max-group-frac
If -nvc > 0 and --validation-group-ids is a valid file, at most --validation-max-group-frac * (ncells in group) cells are selected from each group.
Default: 0.5
- --filter-by-gene-name
Use gene name rather than ENSEMBL id to filter (with whitelist or blacklist). Useful for datasets where only gene symbols are given. Applies to both whitelist and blacklist. Used by default when input is a loom file (unless there is an Accession attribute in the loom).
Default: False
- --no-split-on-dot
Don’t split gene symbols or names on the period before filtering against the whitelist and blacklist. Splitting is done by default for ENSEMBL ids.
Default: False
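The gene-filtering rules described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not scHPF's implementation: the helper names `min_cells_threshold` and `keep_gene` are hypothetical, treating values of 1 or more as absolute cell counts is an assumption, and Python's built-in `round` (which rounds exact halves to the nearest even integer) stands in for whatever rounding the tool applies.

```python
def min_cells_threshold(min_cells, ncells):
    """Turn a --min-cells value into an absolute cell count (hypothetical helper)."""
    if 0 < min_cells < 1:
        # Fractional values are interpreted relative to the number of cells.
        return round(min_cells * ncells)
    # Assumption: any other value is already an absolute count.
    return int(min_cells)


def keep_gene(gene_id, whitelist=None, blacklist=None, split_on_dot=True):
    """Apply the whitelist, then the blacklist; the blacklist wins on conflicts."""
    if split_on_dot:
        # Drop a version suffix such as ".12" from an ENSEMBL id.
        gene_id = gene_id.split(".")[0]
    if whitelist is not None and gene_id not in whitelist:
        return False
    if blacklist is not None and gene_id in blacklist:
        return False
    return True


print(min_cells_threshold(0.01, 5000))
print(keep_gene("ENSG00000141510.12",
                whitelist={"ENSG00000141510"},
                blacklist={"ENSG00000141510"}))
```

The second call shows the precedence rule: a gene that appears on both lists is excluded.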
scHPF train
usage: scHPF train [-h] -i INPUT [-o OUTDIR] [-p PREFIX] [-t NTRIALS]
                   [-v VALIDATION_CELLS] [-M MAX_ITER] [-m MIN_ITER]
                   [-e EPSILON] [-f CHECK_FREQ]
                   [--better-than-n-ago BETTER_THAN_N_AGO] [-a A] [-c C]
                   [--float32] [-bs BATCHSIZE] [-sl SMOOTH_LOSS] [-bts] [-sa]
                   [-rp] [--quiet]
                   nfactors
Named Arguments
- -i, --input
Training data. Expects either the mtx file output by the prep command or a tab-separated tsv file formatted like: CELL_ID GENE_ID UMI_COUNT. In the latter case, ids are assumed to be 0-indexed and we assume no duplicates.
- -o, --outdir
Output directory for scHPF model. Will be created if it does not exist.
- -p, --prefix
Prefix for output files. Optional.
Default: “”
- nfactors
Number of factors.
- -t, --ntrials
Number of times to run scHPF, selecting the trial with best loss (on training data unless validation is given). [Default 1]
Default: 1
- -v, --validation-cells
Cells to use to assess convergence and choose a model. Expects same format as -i/--input. Training data used by default.
- -M, --max-iter
Maximum iterations. [Default 1000].
Default: 1000
- -m, --min-iter
Minimum iterations. [Default 30]
Default: 30
- -e, --epsilon
Minimum percent decrease in loss between checks to continue inference (convergence criteria). [Default 0.001].
Default: 0.001
- -f, --check-freq
Number of iterations to run between convergence checks. [Default 10].
Default: 10
- --better-than-n-ago
Stop condition if loss is getting worse. Stops training if the loss is worse than it was better_than_n_ago * check_freq training steps ago and is still getting worse. Normally not necessary to change.
Default: 5
- -a
Value for hyperparameter a. Setting to -2 will auto-set to 1/sqrt(nfactors). [Default 0.3]
Default: 0.3
- -c
Value for hyperparameter c. Setting to -2 will auto-set to 1/sqrt(nfactors). [Default 0.3]
Default: 0.3
- --float32
Use 32-bit floats instead of default 64-bit floats in variational distributions.
Default: False
- -bs, --batchsize
Number of cells to use per training round. All cells used if 0. Note that using batches changes the order of updates during inference.
Default: 0
- -sl, --smooth-loss
Average loss over the last --smooth-loss iterations. Intended for when using minibatches, where int(ncells/batchsize) is a reasonable value.
Default: 1
- -bts, --beta-theta-simultaneous
If False (default), compute the beta update, then compute theta based on the updated beta. Note that if batching is used, this order is reversed. If True, update both beta and theta based on values from the last training round. The latter sometimes results in better log-likelihoods but slows convergence, especially for large numbers of cells.
Default: False
- -sa, --save-all
Save all trials
Default: False
- -rp, --reproject
Reproject data onto fixed global (gene) parameters after convergence, but before model selection. Recommended with batching
Default: False
- --quiet
Don’t print intermediate llh.
Default: True
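How the iteration and convergence options fit together can be sketched as follows. This is a minimal illustration under stated assumptions, not scHPF's training loop: `should_stop` is a hypothetical helper, the percent decrease is assumed to be 100 * (previous - current) / |previous| between consecutive checks, and the --better-than-n-ago condition is left out. The loss values are synthetic.

```python
def should_stop(losses, iteration, min_iter=30, max_iter=1000, epsilon=0.001):
    """Decide whether to stop, given the losses recorded at each check.

    `losses` holds one value per convergence check (one every check_freq
    iterations); `iteration` is the number of iterations run so far.
    """
    if iteration >= max_iter:
        return True
    if iteration < min_iter or len(losses) < 2:
        return False
    previous, current = losses[-2], losses[-1]
    # Assumed definition of "percent decrease in loss between checks".
    pct_decrease = 100.0 * (previous - current) / abs(previous)
    return pct_decrease < epsilon


check_freq = 10
# Synthetic loss curve, one value per check; not real scHPF output.
recorded = [5.0, 4.0, 3.9, 3.89999]
history = []
for check_number, loss in enumerate(recorded, start=1):
    history.append(loss)
    iteration = check_number * check_freq
    if should_stop(history, iteration):
        print(f"stopping at iteration {iteration}")
        break
```

In this sketch the minimum iteration count takes priority over the epsilon test, so a flat loss early in training does not end the run.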
scHPF score
usage: scHPF score [-h] -m MODEL [-o OUTDIR] [-p PREFIX] [-g GENEFILE]
                   [--name-col NAME_COL]
Named Arguments
- -m, --model
Saved scHPF model from train command. Should have the extension .joblib.
- -o, --outdir
Output directory for score files. If not given, a new subdirectory of the dir containing the model will be made with the same name as the model file (without extension)
- -p, --prefix
Prefix for output files. Optional.
Default: “”
- -g, --genefile
Create an additional file with gene names ranked by score for each factor. Expects the gene.txt file output by the scHPF prep command or a similarly formatted tab-delimited file without headers. Uses the zero-indexed --name-col’th column as gene names. Optional.
- --name-col
The zero-indexed column of genefile to use as a gene name when (optionally) ranking genes. If --name-col is greater than the index of --genefile’s last column, it is automatically reset to the last column’s index. [Default 1]
Default: 1
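The --name-col fallback and the per-factor ranking can be sketched as below. This is a hedged illustration, not the tool's code: `gene_names` and `rank_genes` are hypothetical helpers, the genefile rows and scores are toy values, and ranking from highest to lowest score is an assumption about what "ranked by score" means.

```python
def gene_names(rows, name_col=1):
    """Pick the gene-name column from parsed genefile rows (hypothetical helper)."""
    last_col = len(rows[0]) - 1
    # An out-of-range --name-col falls back to the last column's index.
    col = min(name_col, last_col)
    return [row[col] for row in rows]


def rank_genes(names, scores):
    """Return gene names ordered from highest to lowest score for one factor."""
    order = sorted(range(len(names)), key=lambda i: scores[i], reverse=True)
    return [names[i] for i in order]


# Toy genefile contents: ENSEMBL id, then gene name, tab-delimited, no header.
text = "ENSG00000141510\tTP53\nENSG00000012048\tBRCA1\nENSG00000146648\tEGFR\n"
rows = [line.split("\t") for line in text.splitlines()]

# 5 exceeds the last column index (1), so column 1 is used instead.
names = gene_names(rows, name_col=5)
print(rank_genes(names, scores=[0.2, 1.5, 0.7]))
```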
scHPF prep-like
usage: scHPF prep-like [-h] -i INPUT -r REFERENCE -o OUTDIR [-p PREFIX]
                       [--by-gene-name] [--no-split-on-dot]
Named Arguments
- -i, --input
Input data to format. Currently accepts either: (1) a whitespace-delimited gene by cell UMI count matrix with 2 leading columns of gene attributes (ENSEMBL_ID and GENE_NAME respectively), or (2) a loom file with at least one of the row attributes Accession or Gene, where Accession is an ENSEMBL id and Gene is the name.
- -r, --reference
Two-column tab-delimited file of ENSEMBL ids and gene names to select from input and order like. All genes in reference must be present in input.
- -o, --outdir
Output directory. Does not need to exist.
- -p, --prefix
Prefix for output files. Optional.
Default: “”
- --by-gene-name
Use gene name rather than ENSEMBL id when matching against reference. Useful for datasets where only gene symbols are given. Used by default when input is a loom file (unless there is an Accession attr in the loom).
Default: False
- --no-split-on-dot
Don’t split gene symbol or name on period before matching to reference. We do this by default for ENSEMBL ids.
Default: False
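Selecting and ordering input genes like a reference, as prep-like is described to do, might look roughly like this. It is a sketch under assumptions rather than scHPF's implementation: `strip_version` and `order_like_reference` are hypothetical names, the ids are placeholders, and raising a ValueError for missing genes is only one way to enforce the requirement that all reference genes be present in the input.

```python
def strip_version(gene_id):
    """Drop anything after the first period, e.g. an ENSEMBL version suffix."""
    return gene_id.split(".")[0]


def order_like_reference(input_ids, reference_ids, split_on_dot=True):
    """Return input row indices that select and order genes like the reference."""
    normalize = strip_version if split_on_dot else (lambda gene_id: gene_id)
    index_of = {normalize(g): i for i, g in enumerate(input_ids)}
    missing = [g for g in reference_ids if normalize(g) not in index_of]
    if missing:
        # The reference may not contain genes that are absent from the input.
        raise ValueError(f"reference genes missing from input: {missing}")
    return [index_of[normalize(g)] for g in reference_ids]


# Placeholder ids; the input carries version suffixes, the reference does not.
input_ids = ["ENSG3.1", "ENSG1.4", "ENSG2.2"]
reference_ids = ["ENSG1", "ENSG2"]
print(order_like_reference(input_ids, reference_ids))
```

The returned indices can then be used to subset and reorder the rows of the input count matrix.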
scHPF project
usage: scHPF project [-h] -m MODEL -i INPUT [-o OUTDIR] [-p PREFIX]
                     [--recalc-bp] [--max-iter MAX_ITER] [--min-iter MIN_ITER]
                     [--epsilon EPSILON] [--check-freq CHECK_FREQ]
Named Arguments
- -m, --model
The model to project onto.
- -i, --input
Data to project onto model. Expects either the mtx file output by the prep or prep-like commands or a tab-delimited tsv file formatted like: CELL_ID GENE_ID UMI_COUNT. In the latter case, ids are assumed to be 0-indexed and we assume no duplicates.
- -o, --outdir
Output directory for projected scHPF model. Will be created if it does not exist.
- -p, --prefix
Prefix for output files. Optional.
Default: “”
- --recalc-bp
Recalculate hyperparameter bp for the new data
Default: False
- --max-iter
Maximum iterations. [Default 500].
Default: 500
- --min-iter
Minimum iterations. [Default 10]
Default: 10
- --epsilon
Minimum percent decrease in loss between checks to continue inference (convergence criteria). [Default 0.001].
Default: 0.001
- --check-freq
Number of iterations to run between convergence checks. [Default 10].
Default: 10
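For the tsv alternative accepted by -i/--input, the CELL_ID GENE_ID UMI_COUNT layout can be produced from a dense count matrix as sketched below. This is an illustration only: `to_triplets` is a hypothetical helper, the counts are toy values, and listing only nonzero entries is an assumption about the expected file contents.

```python
def to_triplets(counts):
    """Convert a dense cells-by-genes count matrix to tab-separated triplet lines."""
    lines = []
    for cell_id, row in enumerate(counts):        # cell ids are 0-indexed
        for gene_id, umi in enumerate(row):       # gene ids are 0-indexed
            if umi > 0:
                # Each (cell, gene) pair is visited once, so no duplicates arise.
                lines.append(f"{cell_id}\t{gene_id}\t{umi}")
    return lines


# Toy matrix: 2 cells (rows) by 3 genes (columns).
counts = [
    [0, 3, 0],
    [1, 0, 2],
]
print("\n".join(to_triplets(counts)))
```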