nncore.engine
Engine
- class nncore.engine.engine.Engine(model, data_loaders, stages=None, hooks=None, buffer_size=100000, logger=None, work_dir=None, seed=None, meta=None, amp=None, debug=False, **kwargs)[source]
An engine that can take over the whole training, validation, and testing process, with all the babysitting work (stage control, optimizer configuration, lr scheduling, checkpoint management, metrics & tensorboard writing, etc.) done automatically.
- Parameters:
model (nn.Module | cfg | str) – The model or config of the model. The forward method of the model should return a dict containing a _avg_factor field indicating the number of samples in the current batch, and optionally a _out field denoting the model outputs to be collected and evaluated.
data_loaders (dict | str) – The configs of data loaders for training, validation, and testing. The dict should be in the format of dict(train=train_loader, val=val_loader, test=test_loader).
stages (list[dict] | dict | None, optional) – The stage config or list of stage configs to be scheduled. Each stage config should be a dict containing the following fields:
epochs (int): Number of epochs in the stage.
optimizer (optim.Optimizer | dict): The optimizer or an optimizer config containing the following fields:
    type (str): Type of the optimizer, which can be accessed via torch.optim attributes, e.g. 'SGD'.
    configs for the optimizer, e.g. lr=0.01, momentum=0.9.
lr_schedule (dict, optional): The learning rate schedule config containing the following fields:
    type (str): Type of the learning rate schedule. Expected values include 'epoch' and 'iter', indicating updating learning rates every epoch or iteration.
    policy (str): The learning rate policy to use. Currently supported policies include step, cosine, exp, poly, and inv.
    configs for the learning rate policy, e.g. target_lr=0. Please refer to LrUpdaterHook for full configs.
warmup (dict, optional): The warm-up policy config containing the following fields:
    type (str): Type of the warm-up schedule. Expected values include 'epoch' and 'iter', indicating warming up for step epochs or iterations.
    policy (str): The warm-up policy to use. Currently supported policies include linear, exp, and constant.
    step (int): Number of iterations to warm up.
    ratio (float): The ratio of learning rate to start with. Expected values are in the range of 0 ~ 1.
validation (dict, optional): The validation config containing the following fields:
    interval (int, optional): The interval of performing validation. 0 means not performing validation. Default: 0.
    offset (int, optional): The number of epochs to skip before counting the interval. Default: 0.
Default: None.
hooks (list[Hook | dict | str] | None, optional) – The list of extra hooks to be registered. Each hook can be represented as a Hook, a dict or a str. Default: None.
buffer_size (int, optional) – Maximum size of the buffer. Default: 100000.
logger (logging.Logger | str | None, optional) – The logger or name of the logger to use. Default: None.
work_dir (str | None, optional) – Path to the working directory. If not specified, the default working directory will be used. Default: None.
seed (int | None, optional) – The random seed to use in data loaders. Default: None.
meta (any | None, optional) – A dictionary-like object containing meta data of this engine. Default: None.
amp (dict | str | bool | None, optional) – Whether to use automatic mixed precision training. Default: None.
debug (bool, optional) – Whether to activate debug mode. Default: False.
Example
>>> # Build model
>>> model = build_model()
...
>>> # Build data loaders
>>> train_loader = build_dataloader(split='train')
>>> val_loader = build_dataloader(split='val')
>>> data_loaders = dict(train=train_loader, val=val_loader)
...
>>> # Configure stages:
>>> # [Stage 1] Train the model for 5 epochs using Adam optimizer with
>>> #           a fixed learning rate (1e-3) and a linear warm-up policy.
>>> # [Stage 2] Train the model for another 3 epochs using SGD with
>>> #           momentum optimizer and an iter-based cosine learning rate
>>> #           schedule. Perform validation after every training epoch.
>>> stages = [
...     dict(
...         epochs=5,
...         optimizer=dict(type='Adam', lr=1e-3),
...         warmup=dict(type='iter', policy='linear', steps=500)),
...     dict(
...         epochs=3,
...         optimizer=dict(type='SGD', lr=1e-3, momentum=0.9),
...         lr_schedule=dict(type='iter', policy='cosine'),
...         validation=dict(interval=1))
... ]
...
>>> # Initialize and launch engine
>>> engine = Engine(model, data_loaders, stages=stages)
>>> engine.launch()
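The stage fields listed in the parameters can be sanity-checked before an engine is built. The following is a minimal sketch, assuming stage configs are plain dicts; check_stage and check_stages are hypothetical helpers and not part of nncore, and they only test the fields and expected values named in this section.

```python
# Hypothetical helpers, not part of nncore: they validate stage configs
# against the fields documented for Engine(stages=...).
SCHEDULE_TYPES = {'epoch', 'iter'}
LR_POLICIES = {'step', 'cosine', 'exp', 'poly', 'inv'}
WARMUP_POLICIES = {'linear', 'exp', 'constant'}


def check_stage(stage):
    """Return a list of problems found in a single stage config."""
    problems = []
    if not isinstance(stage.get('epochs'), int):
        problems.append("'epochs' must be an int")
    if 'optimizer' not in stage:
        problems.append("'optimizer' is required")
    lr = stage.get('lr_schedule')
    if lr is not None:
        if lr.get('type') not in SCHEDULE_TYPES:
            problems.append("lr_schedule.type must be 'epoch' or 'iter'")
        if lr.get('policy') not in LR_POLICIES:
            problems.append('unknown lr_schedule.policy')
    warmup = stage.get('warmup')
    if warmup is not None:
        if warmup.get('type') not in SCHEDULE_TYPES:
            problems.append("warmup.type must be 'epoch' or 'iter'")
        if warmup.get('policy') not in WARMUP_POLICIES:
            problems.append('unknown warmup.policy')
        ratio = warmup.get('ratio')
        if ratio is not None and not 0 <= ratio <= 1:
            problems.append('warmup.ratio must be within 0 ~ 1')
    return problems


def check_stages(stages):
    """Accept a single stage dict or a list of them, as `stages` allows."""
    if isinstance(stages, dict):
        stages = [stages]
    report = {}
    for index, stage in enumerate(stages):
        problems = check_stage(stage)
        if problems:
            report[index] = problems
    return report
```

An empty result from check_stages means none of the documented constraints were violated; it says nothing about options that are forwarded to the optimizer or policy.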
- register_hook(hook, before=None, overwrite=True, **kwargs)[source]
Register a hook or a list of hooks into the engine.
- Parameters:
hook (list | Hook | dict | str) – The hook or list of hooks to be registered. Each hook can be represented as a Hook, a dict or a str.
before (str, optional) – Name of the hook to be inserted before. If not specified, the new hook will be added to the end of the hook list. Default: None.
overwrite (bool, optional) – Whether to overwrite the old hook with the same name if it exists. Default: True.
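The before and overwrite options can be pictured with a toy hook list. This is only a sketch: NamedHook and register are hypothetical stand-ins, and both the error raised when overwrite is False and the position of an overwritten hook are assumptions rather than confirmed nncore behavior.

```python
# Toy stand-in for the engine's hook list; all names are hypothetical.
class NamedHook:
    def __init__(self, name):
        self.name = name


def register(hooks, hook, before=None, overwrite=True):
    """Insert `hook` into the list `hooks`, mimicking the documented options."""
    names = [h.name for h in hooks]
    if hook.name in names:
        if not overwrite:
            # Assumption: refusing to overwrite is reported as an error.
            raise KeyError(f'hook {hook.name!r} already exists')
        # Drop the old hook with the same name before inserting the new one.
        hooks.pop(names.index(hook.name))
        names = [h.name for h in hooks]
    if before is None:
        hooks.append(hook)  # default: add to the end of the hook list
    else:
        hooks.insert(names.index(before), hook)
    return hooks
```

Because hooks are called in list order, before is what lets a new hook run ahead of one that depends on its results.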
- unregister_hook(hook)[source]
Unregister a hook or a list of hooks from the engine.
- Parameters:
hook (list | Hook | str) – The hook or list of hooks to be unregistered. Each hook can be represented as a Hook or a str.
- load_checkpoint(checkpoint, **kwargs)[source]
Load checkpoint from a file or a URL.
- Parameters:
checkpoint (dict | str) – A dict, a filename, a URL or a torchvision://<model_name> str indicating the checkpoint.
- resume(checkpoint, **kwargs)[source]
Resume from a checkpoint file.
- Parameters:
checkpoint (dict | str) – A dict, a filename or a URL indicating the checkpoint.
Buffer
- class nncore.engine.buffer.Buffer(max_size=100000, logger=None)[source]
A buffer that tracks a series of values and provides access to smoothed scalar values over a window.
- Parameters:
max_size (int, optional) – Maximal number of internal values that can be stored in the buffer. When the capacity of the buffer is exhausted, old values will be removed. Default: 100000.
logger (logging.Logger | str | None, optional) – The logger or name of the logger to use. Default: None.
- update(key, value, warning=True)[source]
Add a new value. If the length of the buffer exceeds self._max_size, the oldest element will be removed from the buffer.
- Parameters:
key (str) – The key of the values.
value (any) – The new value.
warning (bool, optional) – Whether to display a warning when removing values. Default: True.
- count(key)[source]
Return the number of values according to the key.
- Parameters:
key (str) – The key of the values.
- latest(key)[source]
Return the latest value in the buffer.
- Parameters:
key (str) – The key of the values.
- median(key, window_size=None)[source]
Return the median of the latest window_size values in the buffer.
- Parameters:
key (str) – The key of the values.
window_size (int | None, optional) – The window size of the values to be computed. If not specified, all the values will be taken into account. Default: None.
- Returns:
The median of the latest window_size values.
- Return type:
float
- mean(key, window_size=None)[source]
Return the mean of the latest window_size values in the buffer.
- Parameters:
key (str) – The key of the values.
window_size (int | None, optional) – The window size of the values to be computed. If not specified, all the values will be taken into account. Default: None.
- Returns:
The mean of the latest window_size values.
- Return type:
float
- sum(key, window_size=None)[source]
Return the sum of the latest window_size values in the buffer.
- Parameters:
key (str) – The key of the values.
window_size (int | None, optional) – The window size of the values to be computed. If not specified, all the values will be taken into account. Default: None.
- Returns:
The sum of the latest window_size values.
- Return type:
float
- avg(key, factor='_avg_factor', window_size=None)[source]
Return the average of the latest window_size values in the buffer. Note that since not all the values in the buffer are computed from the same number of samples, the exact average of these values should be computed with the number of samples.
- Parameters:
key (str) – The key of the values.
factor (str, optional) – The key of the average factor. Default: '_avg_factor'.
window_size (int | None, optional) – The window size of the values to be computed. If not specified, all the values will be taken into account. Default: None.
- Returns:
The average of the latest window_size values.
- Return type:
float
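The difference between mean and avg is easiest to see with batches of unequal size. The sketch below uses a hypothetical MiniBuffer class (not the nncore implementation) and assumes that avg weights each value by the entry stored under the factor key at the same position.

```python
from collections import defaultdict, deque
from statistics import median


class MiniBuffer:
    """Toy stand-in for Buffer: one bounded deque of values per key."""

    def __init__(self, max_size=100000):
        self._data = defaultdict(lambda: deque(maxlen=max_size))

    def update(self, key, value):
        # deque(maxlen=...) silently drops the oldest value when full.
        self._data[key].append(value)

    def _window(self, key, window_size=None):
        values = list(self._data[key])
        return values if window_size is None else values[-window_size:]

    def mean(self, key, window_size=None):
        values = self._window(key, window_size)
        return sum(values) / len(values)

    def median(self, key, window_size=None):
        return median(self._window(key, window_size))

    def avg(self, key, factor='_avg_factor', window_size=None):
        # Weight each value by the number of samples it was computed from.
        values = self._window(key, window_size)
        weights = self._window(factor, window_size)
        return sum(v * w for v, w in zip(values, weights)) / sum(weights)


buf = MiniBuffer()
# Three batches; the last one is smaller than the others.
for loss, batch_size in [(1.0, 8), (2.0, 8), (4.0, 2)]:
    buf.update('loss', loss)
    buf.update('_avg_factor', batch_size)
```

Here buf.mean('loss') treats the three batches equally, while buf.avg('loss') gives the small last batch less influence, which is why the model is asked to report _avg_factor.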
Comm
- nncore.engine.comm.init_dist(launcher=None, backend='nccl', method='spawn', **kwargs)[source]
Initialize a distributed process group.
- Parameters:
launcher (str | None, optional) – Launcher for the process group. Expected values include 'torch', 'slurm', and None. If not specified, this method will try to determine the launcher automatically. Default: None.
backend (dist.Backend | str, optional) – The distributed backend to use. This field should be given as a dist.Backend object or a str which can be accessed via dist.Backend attributes. Depending on build-time configurations, valid values are 'nccl' and 'gloo'. If using multiple processes per machine with the nccl backend, each process must have exclusive access to every GPU it uses, as sharing GPUs between processes can result in deadlocks. Default: 'nccl'.
method (str, optional) – The method used to start subprocesses. Expected values include 'spawn', 'fork', and 'forkserver'. Default: 'spawn'.
- Returns:
The launcher and backend info.
- Return type:
str | None
- nncore.engine.comm.get_launcher()[source]
Detect the launcher of the current process.
- Returns:
The name of the launcher.
- Return type:
str | None
- nncore.engine.comm.is_elastic()[source]
Check whether the current process was launched with dist.elastic.
- Returns:
Whether the current process was launched with dist.elastic.
- Return type:
bool
- nncore.engine.comm.is_slurm()[source]
Check whether the current process was launched with Slurm.
- Returns:
Whether the current process was launched with Slurm.
- Return type:
bool
- nncore.engine.comm.is_distributed()[source]
Check whether the current process is distributed.
- Returns:
Whether the current process is distributed.
- Return type:
bool
- nncore.engine.comm.get_rank(group=None)[source]
Get the rank of the current process in a process group.
- Parameters:
group (dist.ProcessGroup | None, optional) – The process group to use. If not specified, the default process group will be used. Default: None.
- Returns:
The process rank.
- Return type:
int
- nncore.engine.comm.get_world_size(group=None)[source]
Get the world size of a process group.
- Parameters:
group (dist.ProcessGroup | None, optional) – The process group to use. If not specified, the default process group will be used. Default: None.
- Returns:
The world size.
- Return type:
int
- nncore.engine.comm.get_dist_info(group=None)[source]
Get the rank of the current process and the world size of a process group.
- Parameters:
group (dist.ProcessGroup | None, optional) – The process group to use. If not specified, the default process group will be used. Default: None.
- Returns:
The process rank and the world size.
- Return type:
tuple[int]
- nncore.engine.comm.is_main_process()[source]
Check whether the current process is the main process.
- Returns:
Whether the current process is the main process.
- Return type:
bool
- nncore.engine.comm.broadcast(data=None, src=0, group=None)[source]
Perform dist.broadcast on arbitrary serializable data.
- Parameters:
data (any, optional) – Any serializable object.
src (int, optional) – The source rank. Default: 0.
group (dist.ProcessGroup | None, optional) – The process group to use. If not specified, the default process group will be used. Default: None.
- Returns:
The data broadcast from the source rank.
- Return type:
any
- nncore.engine.comm.all_gather(data, group=None)[source]
Perform dist.all_gather on arbitrary serializable data.
- Parameters:
data (any) – Any serializable object.
group (dist.ProcessGroup | None, optional) – The process group to use. If not specified, the default process group will be used. Default: None.
- Returns:
The list of data gathered from each rank.
- Return type:
list
- nncore.engine.comm.gather(data, dst=0, group=None)[source]
Perform dist.gather on arbitrary serializable data.
- Parameters:
data (any) – Any serializable object.
dst (int, optional) – The destination rank. Default: 0.
group (dist.ProcessGroup | None, optional) – The process group to use. If not specified, the default process group will be used. Default: None.
- Returns:
On dst, a list of data gathered from each rank. Otherwise, None.
- Return type:
list | None
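Tensor collectives need equally sized buffers, so helpers that move arbitrary objects commonly pickle each object and pad the payloads to a shared length. The sketch below simulates that idea for several ranks inside one process using only the standard library. It is an assumption about the general technique, not a description of how nncore implements these functions, and serialize, pad_payloads and simulated_all_gather are hypothetical names.

```python
import pickle


def serialize(obj):
    """Turn any picklable object into a bytes payload."""
    return pickle.dumps(obj)


def pad_payloads(payloads):
    """Pad every payload to the longest length, keeping the true sizes."""
    sizes = [len(p) for p in payloads]
    max_size = max(sizes)
    padded = [p + b'\x00' * (max_size - len(p)) for p in payloads]
    return padded, sizes


def simulated_all_gather(per_rank_data):
    """Simulate gathering one object from each rank in a single process."""
    padded, sizes = pad_payloads([serialize(d) for d in per_rank_data])
    # Each rank would receive all padded buffers, then trim and unpickle.
    return [pickle.loads(buf[:size]) for buf, size in zip(padded, sizes)]
```

This also suggests why the data arguments are described as serializable: anything that cannot be pickled could not take this path.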
Hooks
- class nncore.engine.hooks.Hook(name=None)[source]
Base class for hooks that can be registered into Engine.
Each hook can implement several methods. In hook methods, users should provide an argument engine to access more properties about the context. All hooks will be called one by one according to the order in engine.hooks.
- class nncore.engine.hooks.CheckpointHook(interval=1, save_optimizer=True, create_symlink=False, out=None)[source]
Save checkpoints periodically during training. The checkpoint of the last epoch will always be saved regardless of interval.
- Parameters:
interval (int, optional) – The interval of epochs to save checkpoints. Default: 1.
save_optimizer (bool, optional) – Whether to incorporate optimizer states into checkpoints. Default: True.
create_symlink (bool, optional) – Whether to create a symlink to the latest checkpoint. This argument is invalid on Windows due to the limitations of its file system. Default: False.
out (str | None, optional) – Path to the output directory. If not specified, engine.work_dir will be used as the default path. Default: None.
- class nncore.engine.hooks.ClosureHook(name, func)[source]
Customize the hooks using self-defined functions.
- Parameters:
name (list[str] | str) – Name or a list of names of the hooks. Expected values include 'before_launch', 'after_launch', 'before_stage', 'after_stage', 'before_epoch', 'after_epoch', 'before_iter', 'after_iter', 'before_train_epoch', 'after_train_epoch', 'before_val_epoch', 'after_val_epoch', 'before_train_iter', 'after_train_iter', 'before_val_iter', and 'after_val_iter'.
func (list[function] | function) – A function or a list of functions for the hooks. These functions should receive an argument engine to access more properties about the context.
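The pairing of hook names with functions can be sketched with two toy classes. ToyClosureHook and ToyEngine are hypothetical stand-ins written for illustration; the real classes may bind and dispatch functions differently.

```python
class ToyClosureHook:
    """Hypothetical stand-in: binds functions to hook method names."""

    def __init__(self, name, func):
        names = [name] if isinstance(name, str) else list(name)
        funcs = [func] if callable(func) else list(func)
        for n, f in zip(names, funcs):
            # Stored on the instance, so each function receives only `engine`.
            setattr(self, n, f)


class ToyEngine:
    """Calls every registered hook in order, passing itself as `engine`."""

    def __init__(self, hooks):
        self.hooks = hooks
        self.log = []

    def call_hooks(self, name):
        for hook in self.hooks:
            method = getattr(hook, name, None)
            if method is not None:
                method(self)


hook = ToyClosureHook(
    ['before_epoch', 'after_epoch'],
    [lambda engine: engine.log.append('start'),
     lambda engine: engine.log.append('end')])
engine = ToyEngine([hook])
engine.call_hooks('before_epoch')
engine.call_hooks('after_epoch')
engine.call_hooks('after_iter')  # no function bound, so nothing happens
```

Passing the engine into each function is what gives a closure access to the training context without subclassing Hook.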
- class nncore.engine.hooks.EvalHook(interval=1, run_test=False, high_keys=[], low_keys=[])[source]
Perform evaluation periodically during training.
- Parameters:
interval (int, optional) – The interval of epochs to perform evaluation. Default: 1.
run_test (bool, optional) – Whether to run the model on the test split before performing evaluation. Default: False.
high_keys (list[str], optional) – The list of metrics (higher is better) to be compared. Default: [].
low_keys (list[str], optional) – The list of metrics (lower is better) to be compared. Default: [].
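The split into high_keys and low_keys tells the comparison which direction counts as an improvement. A minimal sketch of that bookkeeping follows; is_better and update_best are hypothetical helpers, and how EvalHook actually stores or reports the best values is not covered here.

```python
def is_better(key, new, best, high_keys=(), low_keys=()):
    """Return True if `new` beats `best` for the metric `key`."""
    if key in high_keys:
        return new > best
    if key in low_keys:
        return new < best
    raise KeyError(f'{key!r} is in neither high_keys nor low_keys')


def update_best(best, metrics, high_keys=(), low_keys=()):
    """Update `best` in place and return the keys that improved."""
    improved = []
    for key in list(high_keys) + list(low_keys):
        if key not in metrics:
            continue
        first_time = key not in best
        if first_time or is_better(key, metrics[key], best[key],
                                   high_keys, low_keys):
            best[key] = metrics[key]
            improved.append(key)
    return improved
```

For example, an accuracy metric would go in high_keys and a loss in low_keys, so a drop in loss and a rise in accuracy both register as improvements.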
- class nncore.engine.hooks.CommandLineWriter[source]
Write logs to the command line using logging.Logger.
- class nncore.engine.hooks.EventWriterHook(interval=50, writers=['CommandLineWriter', 'JSONWriter', 'TensorboardWriter'])[source]
Write logs periodically during training. This hook relies on TimerHook and it works with several Writer objects to log metrics, images, videos, audio, etc. In distributed training, only the main process will write the logs.
- Parameters:
interval (int, optional) – The interval of iterations to write logs. Default: 50.
writers (list[Writer] | list[str], optional) – The list of writers or names of writers to use. Currently supported writers include CommandLineWriter, JSONWriter and TensorboardWriter. Default: ['CommandLineWriter', 'JSONWriter', 'TensorboardWriter'].
- class nncore.engine.hooks.JSONWriter(filename='metrics.json')[source]
Write logs to JSON files.
- Parameters:
filename (str, optional) – Path to the output JSON file. Default: 'metrics.json'.
- class nncore.engine.hooks.TensorboardWriter(log_dir=None, input_to_model=None, **kwargs)[source]
Write logs to Tensorboard.
- Parameters:
log_dir (str, optional) – Directory of the tensorboard logs. Default: None.
input_to_model (any, optional) – The input data, data_loader or name of the data_loader for constructing the model graph. If not specified, the graph will not be added. Please check torch.utils.tensorboard.SummaryWriter.add_graph for more details about adding a graph to tensorboard. Default: None.
- class nncore.engine.hooks.LrUpdaterHook(name=None)[source]
Update learning rate periodically during training. Currently supported learning rate policies are step, cosine, exp, poly, and inv, and supported warm-up policies are linear, exp, and constant.
- Learning rate policy configs:
step: step (list[int]), gamma (float, Default: 0.1)
cosine: target_lr (float, Default: 0)
exp: gamma (float)
poly: power (float, Default: 1), min_lr (float, Default: 0)
inv: gamma (float), power (float, Default: 1)
- Warm-up policy configs:
linear: ratio (float)
exp: ratio (float)
constant: ratio (float)
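To give a feel for what these policy configs control, the sketch below implements the commonly used textbook forms of the step, cosine and poly schedules and of linear warm-up. These formulas are assumptions for illustration; the exact expressions used by LrUpdaterHook may differ, and the function names and the progress argument (fraction of training completed, in [0, 1]) are hypothetical.

```python
import math


def step_lr(base_lr, cur_step, step, gamma=0.1):
    """Decay by `gamma` once for every milestone in `step` already reached."""
    passed = sum(1 for milestone in step if cur_step >= milestone)
    return base_lr * gamma ** passed


def cosine_lr(base_lr, progress, target_lr=0):
    """Cosine annealing from base_lr to target_lr; progress is in [0, 1]."""
    scale = 0.5 * (1 + math.cos(math.pi * progress))
    return target_lr + (base_lr - target_lr) * scale


def poly_lr(base_lr, progress, power=1, min_lr=0):
    """Polynomial decay from base_lr to min_lr; progress is in [0, 1]."""
    return min_lr + (base_lr - min_lr) * (1 - progress) ** power


def linear_warmup(lr, cur_step, step, ratio):
    """Scale lr linearly from ratio * lr up to lr over `step` steps."""
    if cur_step >= step:
        return lr
    scale = ratio + (1 - ratio) * cur_step / step
    return lr * scale
```

Under these forms, ratio sets the fraction of the scheduled learning rate used at the very first warm-up step, and target_lr or min_lr set the value the schedule approaches at the end of training.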
- class nncore.engine.hooks.EmptyCacheHook(names=[])[source]
Empty cache periodically during training.
- Parameters:
names (list[str], optional) – The list of hook names to empty cache. Expected values include 'before_launch', 'after_launch', 'before_stage', 'after_stage', 'before_epoch', 'after_epoch', 'before_iter', 'after_iter', 'before_train_epoch', 'after_train_epoch', 'before_val_epoch', 'after_val_epoch', 'before_train_iter', 'after_train_iter', 'before_val_iter', and 'after_val_iter'. Default: [].
- class nncore.engine.hooks.OptimizerHook(interval=1, coalesce=True, bucket_size_mb=-1, grad_scale=None)[source]
Perform back propagation and update parameters of the model periodically. This hook supports CPU, single GPU and distributed training.
- Parameters:
interval (int, optional) – The interval of iterations to update parameters. Default: 1.
coalesce (bool, optional) – Whether to coalesce the weights in distributed training. Default: True.
bucket_size_mb (int, optional) – Size of the bucket. -1 means not restricting the bucket size. Default: -1.
grad_scale (dict | bool | None, optional) – Whether to scale the gradients. If not specified, this module will automatically scale the gradients when amp is activated. Default: None.
- class nncore.engine.hooks.PreciseBNHook(interval=1, num_iters=200)[source]
Compute Precise-BN using EMA periodically during training. This hook will also run at the end of training.
- Parameters:
interval (int, optional) – The interval of epochs to compute the stats. Default: 1.
num_iters (int, optional) – Number of iterations to compute the stats. This number will be overwritten by the length of the training data loader. Default: 200.
Builder
- nncore.engine.builder.build_dataloader(cfg, seed=None, dist=None, group=None, **kwargs)[source]
Build a data loader from a dict. The dataset should be registered in DATASETS.
- Parameters:
cfg (dict) – The config of the dataset.
seed (int | None, optional) – The random seed to use. Default: None.
dist (bool | None, optional) – Whether the data loader is distributed. If not specified, this method will determine it automatically. Default: None.
group (dist.ProcessGroup | None, optional) – The process group to use. If not specified, the default process group will be used. Default: None.
- Returns:
The constructed data loader.
- Return type:
DataLoader
Utils
- nncore.engine.utils.generate_random_seed(sync=True, src=0, group=None)[source]
Generate a random seed.
- Parameters:
sync (bool, optional) – Whether to synchronize the random seed among the processes in the group in distributed settings. Default: True.
src (int, optional) – The source rank of the process in distributed settings. This argument is valid only when sync==True. Default: 0.
group (dist.ProcessGroup | None, optional) – The process group to use in distributed settings. This argument is valid only when sync==True. If not specified, the default process group will be used. Default: None.
- Returns:
The generated random seed.
- Return type:
int
- nncore.engine.utils.set_random_seed(seed=None, benchmark=False, deterministic=False, **kwargs)[source]
Set random seed for random, numpy, and torch packages. If seed is not specified, this method will generate and return a new random seed.
- Parameters:
seed (int | None, optional) – The random seed to use. If not specified, a new random seed will be generated. Default: None.
benchmark (bool, optional) – Whether to enable benchmark mode. Default: False.
deterministic (bool, optional) – Whether to enable deterministic mode. Default: False.
- Returns:
The actually used random seed.
- Return type:
int
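The seed-then-return pattern described above can be sketched without nncore. The helper below, seed_everything, is hypothetical; it seeds Python's random module and, only if they happen to be installed, numpy and torch. The benchmark and deterministic switches are left out because their exact effect is not described here.

```python
import random


def seed_everything(seed=None):
    """Seed the available RNGs and return the seed actually used."""
    if seed is None:
        # Draw a fresh seed when none is given, as the description states.
        seed = random.SystemRandom().randint(0, 2**31 - 1)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # numpy is optional in this sketch
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # torch is optional in this sketch
    return seed
```

Returning the seed matters when none was supplied: logging the returned value is the only way to reproduce that run later.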
- nncore.engine.utils.get_checkpoint(file_or_url, map_location=None, **kwargs)[source]
Get checkpoint from a file or a URL.
- Parameters:
file_or_url (str) – The filename or URL of the checkpoint.
map_location (str | None, optional) – Same as the torch.load interface. Default: None.
- Returns:
The loaded checkpoint.
- Return type:
OrderedDict | dict
- nncore.engine.utils.load_checkpoint(model, checkpoint, map_location=None, strict=False, keys=None, logger=None, **kwargs)[source]
Load checkpoint from a file or a URL.
- Parameters:
model (nn.Module) – The module to load checkpoint.
checkpoint (dict | str) – A dict, a filename, a URL or a torchvision://<model_name> str indicating the checkpoint.
map_location (str | None, optional) – Same as the torch.load interface. Default: None.
strict (bool, optional) – Whether to require the params of the model and the checkpoint to match exactly. If True, raise an error when the params do not match exactly. Default: False.
keys (list[str] | None, optional) – The list of parameter keys to load. Default: None.
logger (logging.Logger | str | None, optional) – The logger or name of the logger for displaying error messages. Default: None.
- Returns:
The loaded checkpoint.
- Return type:
OrderedDict | dict
- nncore.engine.utils.save_checkpoint(model, filename, optimizer=None, meta=None)[source]
Save checkpoint to a file.
The checkpoint object will have 3 fields: meta, state_dict and optimizer, where meta contains the version of nncore and the time info by default.
- Parameters:
model (nn.Module) – The model whose params are to be saved.
filename (str) – Path to the checkpoint file.
optimizer (optim.Optimizer | None, optional) – The optimizer to be saved. Default: None.
meta (dict | None, optional) – The metadata to be saved. Default: None.
- Returns:
The saved checkpoint.
- Return type:
dict
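The three-field layout can be illustrated with plain Python objects. In this sketch build_checkpoint and save_and_reload are hypothetical, pickle stands in for the actual serialization, a list stands in for real tensors, and storing None under optimizer when no optimizer is given is an assumption.

```python
import pickle
import tempfile
import time
from pathlib import Path


def build_checkpoint(state_dict, optimizer_state=None, meta=None):
    """Assemble a checkpoint dict with the three documented fields."""
    meta = dict(meta or {})
    # The documented default meta holds the nncore version and time info;
    # only a timestamp is recorded in this sketch.
    meta.setdefault('time', time.asctime())
    return dict(meta=meta, state_dict=state_dict, optimizer=optimizer_state)


def save_and_reload(checkpoint, filename):
    """Write the checkpoint with pickle and read it back."""
    path = Path(filename)
    path.write_bytes(pickle.dumps(checkpoint))
    return pickle.loads(path.read_bytes())


ckpt = build_checkpoint({'fc.weight': [0.1, 0.2]}, meta={'epoch': 3})
with tempfile.TemporaryDirectory() as tmp:
    restored = save_and_reload(ckpt, Path(tmp) / 'epoch_3.pth')
```

Keeping user metadata and the model weights under separate keys lets a loader read meta, for example the epoch to resume from, without touching the parameters.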