Command-line interface¶
DriverPower uses a command-line interface. To view available sub-commands and usage of DriverPower, type
$ driverpower -h
usage: driverpower [-h] [-v] {model,infer} ...
DriverPower v1.0.0: Combined burden and functional impact tests for coding
and non-coding cancer driver discovery
optional arguments:
-h, --help show this help message and exit
-v, --version Print the version of DriverPower
The DriverPower sub-commands include:
{model,infer}
model Build the background mutation model
infer Infer driver elements
The model
sub-command¶
The model
sub-command is used to build the background mutation model with training data.
Command-line options for this command can be viewed as follows:
$ driverpower model -h
usage: driverpower model [-h] --feature str --response str [--featImp str]
--method {GLM,GBM} [--featImpCut float]
[--gbmParam str] [--gbmFold int] [--name str]
[--modelDir str]
optional arguments:
-h, --help show this help message and exit
input data:
--feature str Path to the training feature table (default: None)
--response str Path to the training response table (default: None)
--featImp str Path to the feature importance table [optional]
(default: None)
parameters:
--method {GLM,GBM} Algorithms to use (default: None)
--featImpCut float Cutoff of feature importance score [optional] (default:
0.5)
--gbmParam str Path to the parameter pickle [optional] (default: None)
--gbmFold int Train gbm with k-fold, k>=2 [optional] (default: 3)
--name str Identifier for output files [optional] (default:
DriverPower)
--modelDir str Directory of output model files [optional] (default:
./output/)
Input
- Required data:
--feature
and--response
for training elements. - Optional data:
--featImp
.
- Required data:
Output
- all output files are in
--modelDir
. [name].GLM.model.pkl
: the GLM model file.[name].GBM.model.fold[k]
: the GBM model files.[name].[GLM|GBM].model_info.pkl
: the corresponding GLM or GBM model information.[name].feature_importance.tsv
: the feature importance table, which is returned when no input--featImp
. For GLM, feature importance is the number of times a feature is used by randomized lasso. For GBM, feature importance is the average gain of the feature across all gradient boosting trees.
- all output files are in
Important
Please DO NOT rename the model files because their names are recorded in model information.
Model files can be moved to another directory as long as --modelDir
is specified in infer
.
Parameters
--method
: [required] method used for the background model. Must be GLM or GBM.--featImpCut
: [optional] cutoff for feature importance score. Features with score >= cutoff will be used in the model. Only used for GLM. Default is 0.5.--gbmParam
: [optional] path to the parameter pickle for GBM. The pickle file must contain a valid python dictionary for XGBoost parameters.--gbmFold
: [optional] Number of model fold to train for GBM. Fold must be an integer >= 2. Default value is 3.--name
: [optional] Prefix for output files. Default is ‘DriverPower’.--modelDir
: [optional] Directory for output model and model information files. Default is ‘./output/’.
Notes
- Input data requirements can be found at data format.
--gbmFold
controls the split of training data by k-fold cross-validation. For example, default 3-fold model means the training data are divided into 3 equal folds. Each time, two folds of data are used to train the model and the rest fold is used to validate the model. Hence, three model files will be generated at the end. The prediction will then be the average of three models.- Training phase can take hours for large training set and consume a large amount of memories. For example, using our default training set (~1M elements and 1373 features), training randomized lasso plus GLM with single core takes about 2-10 hours and 80G RAMs. Training 3-fold GBM with 15 cores takes about 10-24 hours and 80G RAMs.
- The default pickle file for
--gbmParam
is generated as follow:
import pickle
# Default parameter for XGBoost used in DriverPower
param = {'max_depth': 8,
'eta': 0.05, # learning rate
'subsample': 0.6,
'nthread': 15, # number of threads; recommend using the number of available CPUs
'objective': 'count:poisson',
'max_delta_step': 1.2,
'eval_metric': 'poisson-nloglik',
'silent': 1,
'verbose_eval': 100, # print evaluation every 100 rounds
'early_stopping_rounds': 5,
'num_boost_round': 5000 # max number of rounds; usually stop within 1500 rounds
}
# Dump to pickle
with open('xgb_param.pkl', 'wb') as f:
pickle.dump(param, f)
The infer
sub-command¶
The infer
sub-command is used to call drivers from test data with pre-trained models.
Command-line options for this command can be viewed as follows:
$ driverpower infer -h
usage: driverpower infer [-h] --feature str --response str --modelInfo str
[--funcScore str]
[--method {auto,binomial,negative_binomial}]
[--scale float] [--funcScoreCut str] [--geoMean bool]
[--modelDir str] [--name str] [--outDir str]
optional arguments:
-h, --help show this help message and exit
input data:
--feature str Path to the test feature table (default: None)
--response str Path to the test response table (default: None)
--modelInfo str Path to the model information (default: None)
--funcScore str Path to the functional score table [optional]
(default: None)
parameters:
--method {auto,binomial,negative_binomial}
Test method to use [optional] (default: auto)
--scale float Scaling factor for theta in negative binomial
distribution [optional] (default: 1)
--funcScoreCut str Score name:cutoff pairs for all scores
e.g.,"CADD:0.01;DANN:0.03;EIGEN:0.3" [optional]
(default: None)
--geoMean bool Use geometric mean in test [optional] (default: True)
--modelDir str Directory of the trained model(s) [optional] (default:
None)
--name str Identifier for output files [optional] (default:
DriverPower)
--outDir str Directory of output files [optional] (default:
./output/)
Input
- Required data:
--feature
and--response
for test elements;--modelInfo
fromdriverpower model
. - Optional data:
--funcScore
. Only required for test with functional adjustment.
- Required data:
Output
- Driver discovery result saved in
"[outDir]/[name].result.tsv"
.
- Driver discovery result saved in
Parameters
--method
: [optional] probability distribution used to generate p-values (binomial, negative_binomial or auto). Default is auto, which means decision will be made automatically based on the dispersion test of training data.--scale
: [optional] scaling factor of theta for negative binomial distribution. The theta is calculated from dispersion test. Default is 1. Only used for negative binomial distribution.--funcScoreCut
: [optional] Cutoff of each functional impact scores. The format of this parameter is a string in “NAME1:CUTOFF1;NAME2:CUTOFF2…”, such as “CADD:0.01;DANN:0.03;EIGEN:0.3”. Cutoff must in (0,1] and the name must match column names of--funcScore
.--geoMean
: [optional] Whether to use geometric mean of nMut and nSample in test. Default isTrue
.--modelDir
: [optional] Directory of model files fromdriverpower model
. Only required when models have been moved to a different directory.--name
: [optional] Prefix for output files. Default is ‘DriverPower’.--outDir
: [optional] Directory for output files. Default is ‘./output/’.
Notes