nncore.engine

Engine

class nncore.engine.engine.Engine(model, data_loaders, stages=None, hooks=None, buffer_size=100000, logger=None, work_dir=None, seed=None, meta=None, amp=None, debug=False, **kwargs)[source]

An engine that can take over the whole training, validation, and testing process, with all the babysitting work (stage control, optimizer configuration, lr scheduling, checkpoint management, metrics & Tensorboard writing, etc.) done automatically.

Parameters:
  • model (nn.Module | cfg | str) – The model or config of the model. The forward method of the model should return a dict containing a _avg_factor field indicating the number of samples in the current batch, and optionally a _out field denoting the model outputs to be collected and evaluated.

  • data_loaders (dict | str) – The configs of data loaders for training, validation, and testing. The dict should be in the format of dict(train=train_loader, val=val_loader, test=test_loader).

  • stages (list[dict] | dict | None, optional) –

    The stage config or list of stage configs to be scheduled. Each stage config should be a dict containing the following fields:

    • epochs (int): Number of epochs in the stage.

    • optimizer (optim.Optimizer | dict): The optimizer or an optimizer config containing the following fields:

      • type (str): Type of the optimizer, which can be accessed via torch.optim attributes, e.g. 'SGD'.

      • configs for the optimizer, e.g. lr=0.01, momentum=0.9.

    • lr_schedule (dict, optional): The learning rate schedule config containing the following fields:

      • type (str): Type of the learning rate schedule. Expected values include 'epoch' and 'iter', indicating updating learning rates every epoch or iteration.

      • policy (str): The learning rate policy to use. Currently supported policies include step, cosine, exp, poly, and inv.

      • configs for the learning rate policy, e.g. target_lr=0. Please refer to LrUpdaterHook for full configs.

    • warmup (dict, optional): The warm-up policy config containing the following fields:

      • type (str): Type of the warm-up schedule. Expected values include 'epoch' and 'iter', indicating warming up for steps epochs or iterations.

      • policy (str): The warm-up policy to use. Currently supported policies include linear, exp and constant.

      • steps (int): Number of epochs or iterations to warm up.

      • ratio (float): The ratio of the learning rate to start with. Expected values are in the range of 0 to 1.

    • validation (dict, optional): The validation config containing the following fields:

      • interval (int, optional): The interval of performing validation. 0 means not performing validation. Default: 0.

      • offset (int, optional): The number of epochs to skip before counting the interval. Default: 0.

    Default: None.

  • hooks (list[Hook | dict | str] | None, optional) – The list of extra hooks to be registered. Each hook can be represented as a Hook, a dict or a str. Default: None.

  • buffer_size (int, optional) – Maximum size of the buffer. Default: 100000.

  • logger (logging.Logger | str | None, optional) – The logger or name of the logger to use. Default: None.

  • work_dir (str | None, optional) – Path to the working directory. If not specified, the default working directory will be used. Default: None.

  • seed (int | None, optional) – The random seed to use in data loaders. Default: None.

  • meta (any | None, optional) – A dictionary-like object containing the metadata of this engine. Default: None.

  • amp (dict | str | bool | None, optional) – Whether to use automatic mixed precision training. Default: None.

  • debug (bool, optional) – Whether to activate debug mode. Default: False.

Example

>>> # Build model
>>> model = build_model()
...
>>> # Build data loaders
>>> train_loader = build_dataloader(split='train')
>>> val_loader = build_dataloader(split='val')
>>> data_loaders = dict(train=train_loader, val=val_loader)
...
>>> # Configure stages:
>>> # [Stage 1] Train the model for 5 epochs using Adam optimizer with
>>> # a fixed learning rate (1e-3) and a linear warm-up policy.
>>> # [Stage 2] Train the model for another 3 epochs using SGD with
>>> # momentum optimizer and an iter-based cosine learning rate
>>> # schedule. Perform validation after every training epoch.
>>> stages = [
...     dict(
...         epochs=5,
...         optimizer=dict(type='Adam', lr=1e-3),
...         warmup=dict(type='iter', policy='linear', steps=500)),
...     dict(
...         epochs=3,
...         optimizer=dict(type='SGD', lr=1e-3, momentum=0.9),
...         lr_schedule=dict(type='iter', policy='cosine'),
...         validation=dict(interval=1))
... ]
...
>>> # Initialize and launch engine
>>> engine = Engine(model, data_loaders, stages=stages)
>>> engine.launch()
register_hook(hook, before=None, overwrite=True, **kwargs)[source]

Register a hook or a list of hooks into the engine.

Parameters:
  • hook (list | Hook | dict | str) – The hook or list of hooks to be registered. Each hook can be represented as a Hook, a dict or a str.

  • before (str, optional) – Name of the hook to be inserted before. If not specified, the new hook will be added to the end of the hook list. Default: None.

  • overwrite (bool, optional) – Whether to overwrite the old hook with the same name if exists. Default: True.
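
For instance, hooks can be registered by name, by config dict, or as Hook objects (a minimal sketch; the dict-based config assumes the hook type is given in a type field, mirroring the stage configs above, and the interval value is illustrative):

>>> engine.register_hook('TimerHook')
>>> engine.register_hook(dict(type='CheckpointHook', interval=2))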

unregister_hook(hook)[source]

Unregister a hook or a list of hooks from the engine.

Parameters:

hook (list | Hook | str) – The hook or list of hooks to be unregistered. Each hook can be represented as a Hook or a str.

load_checkpoint(checkpoint, **kwargs)[source]

Load a checkpoint from a file or a URL.

Parameters:

checkpoint (dict | str) – A dict, a filename, a URL, or a torchvision://<model_name> str indicating the checkpoint.

resume(checkpoint, **kwargs)[source]

Resume from a checkpoint file.

Parameters:

checkpoint (dict | str) – A dict, a filename, or a URL indicating the checkpoint.

evaluate()[source]

Perform evaluation. This method is expected to be called after validation or testing.

launch(eval=False, **kwargs)[source]

Launch the engine.

Parameters:

eval (bool, optional) – Whether to run evaluation only. Default: False.
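
For example, a trained model can be evaluated without further training (a minimal sketch; the checkpoint path is hypothetical):

>>> engine = Engine(model, data_loaders, stages=stages)
>>> engine.load_checkpoint('checkpoint.pth')
>>> engine.launch(eval=True)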

Buffer

class nncore.engine.buffer.Buffer(max_size=100000, logger=None)[source]

A buffer that tracks a series of values and provides access to smoothed scalar values over a window.

Parameters:
  • max_size (int, optional) – Maximum number of internal values that can be stored in the buffer. When the capacity of the buffer is exhausted, old values will be removed. Default: 100000.

  • logger (logging.Logger | str | None, optional) – The logger or name of the logger to use. Default: None.

update(key, value, warning=True)[source]

Add a new value. If the length of the buffer exceeds self._max_size, the oldest element will be removed from the buffer.

Parameters:
  • key (str) – The key of the values.

  • value (any) – The new value.

  • warning (bool, optional) – Whether to display a warning when removing values. Default: True.

count(key)[source]

Return the number of values stored under the given key.

Parameters:

key (str) – The key of the values.

clear()[source]

Remove all values from the buffer.

latest(key)[source]

Return the latest value in the buffer.

Parameters:

key (str) – The key of the values.

median(key, window_size=None)[source]

Return the median of the latest window_size values in the buffer.

Parameters:
  • key (str) – The key of the values.

  • window_size (int | None, optional) – The window size of the values to be computed. If not specified, all the values will be taken into account. Default: None.

Returns:

The median of the latest window_size values.

Return type:

float

mean(key, window_size=None)[source]

Return the mean of the latest window_size values in the buffer.

Parameters:
  • key (str) – The key of the values.

  • window_size (int | None, optional) – The window size of the values to be computed. If not specified, all the values will be taken into account. Default: None.

Returns:

The mean of the latest window_size values.

Return type:

float

sum(key, window_size=None)[source]

Return the sum of the latest window_size values in the buffer.

Parameters:
  • key (str) – The key of the values.

  • window_size (int | None, optional) – The window size of the values to be computed. If not specified, all the values will be taken into account. Default: None.

Returns:

The sum of the latest window_size values.

Return type:

float

avg(key, factor='_avg_factor', window_size=None)[source]

Return the average of the latest window_size values in the buffer. Note that since the values in the buffer may be computed from different numbers of samples, the exact average is weighted by the number of samples indicated by the average factor rather than being a simple mean.

Parameters:
  • key (str) – The key of the values.

  • factor (str, optional) – The key of the average factor. Default: '_avg_factor'.

  • window_size (int | None, optional) – The window size of the values to be computed. If not specified, all the values will be taken into account. Default: None.

Returns:

The average of the latest window_size values.

Return type:

float
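
A minimal sketch of typical Buffer usage with scalar values:

>>> buffer = Buffer(max_size=1000)
>>> for i in range(5):
...     buffer.update('loss', 0.5 * i)
>>> buffer.latest('loss')
2.0
>>> buffer.mean('loss', window_size=3)
1.5
>>> buffer.count('loss')
5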

Comm

nncore.engine.comm.init_dist(launcher=None, backend='nccl', method='spawn', **kwargs)[source]

Initialize a distributed process group.

Parameters:
  • launcher (str | None, optional) – Launcher for the process group. Expected values include 'torch', 'slurm', and None. If not specified, this method will try to determine the launcher automatically. Default: None.

  • backend (dist.Backend | str, optional) – The distribution backend to use. This field should be given as a dist.Backend object or a str which can be accessed via dist.Backend attributes. Depending on build-time configurations, valid values are 'nccl' and 'gloo'. If using multiple processes per machine with nccl backend, each process must have exclusive access to every GPU it uses, as sharing GPUs between processes can result in deadlocks. Default: 'nccl'.

  • method (str, optional) – The method used to start subprocesses. Expected values include 'spawn', 'fork', and 'forkserver'. Default: 'spawn'.

Returns:

The launcher and backend info.

Return type:

str | None
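
A minimal sketch of initializing a process group before distributed training (assuming these functions are re-exported by nncore.engine):

>>> from nncore.engine import init_dist, get_dist_info
>>> init_dist(launcher='torch')
>>> rank, world_size = get_dist_info()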

nncore.engine.comm.get_launcher()[source]

Detect the launcher of the current process.

Returns:

The name of the launcher.

Return type:

str | None

nncore.engine.comm.is_elastic()[source]

Check whether the current process was launched with dist.elastic.

Returns:

Whether the current process was launched with dist.elastic.

Return type:

bool

nncore.engine.comm.is_slurm()[source]

Check whether the current process was launched with Slurm.

Returns:

Whether the current process was launched with Slurm.

Return type:

bool

nncore.engine.comm.is_distributed()[source]

Check whether the current process is distributed.

Returns:

Whether the current process is distributed.

Return type:

bool

nncore.engine.comm.get_rank(group=None)[source]

Get the rank of the current process in a process group.

Parameters:

group (dist.ProcessGroup | None, optional) – The process group to use. If not specified, the default process group will be used. Default: None.

Returns:

The process rank.

Return type:

int

nncore.engine.comm.get_world_size(group=None)[source]

Get the world size of a process group.

Parameters:

group (dist.ProcessGroup | None, optional) – The process group to use. If not specified, the default process group will be used. Default: None.

Returns:

The world size.

Return type:

int

nncore.engine.comm.get_dist_info(group=None)[source]

Get the rank of the current process and the world size of a process group.

Parameters:

group (dist.ProcessGroup | None, optional) – The process group to use. If not specified, the default process group will be used. Default: None.

Returns:

The process rank and the world size.

Return type:

tuple[int]

nncore.engine.comm.is_main_process()[source]

Check whether the current process is the main process.

Returns:

Whether the current process is the main process.

Return type:

bool

nncore.engine.comm.sync(group=None)[source]

Synchronize all processes in a process group.

nncore.engine.comm.broadcast(data=None, src=0, group=None)[source]

Perform dist.broadcast on arbitrary serializable data.

Parameters:
  • data (any, optional) – Any serializable object.

  • src (int, optional) – The source rank. Default: 0.

  • group (dist.ProcessGroup | None, optional) – The process group to use. If not specified, the default process group will be used. Default: None.

Returns:

The data broadcasted from the source rank.

Return type:

any

nncore.engine.comm.all_gather(data, group=None)[source]

Perform dist.all_gather on arbitrary serializable data.

Parameters:
  • data (any) – Any serializable object.

  • group (dist.ProcessGroup | None, optional) – The process group to use. If not specified, the default process group will be used. Default: None.

Returns:

The list of data gathered from each rank.

Return type:

list

nncore.engine.comm.gather(data, dst=0, group=None)[source]

Perform dist.gather on arbitrary serializable data.

Parameters:
  • data (any) – Any serializable object.

  • dst (int, optional) – The destination rank. Default: 0.

  • group (dist.ProcessGroup | None, optional) – The process group to use. If not specified, the default process group will be used. Default: None.

Returns:

On the destination rank dst, a list of data gathered from each rank; None on other ranks.

Return type:

list | None

nncore.engine.comm.main_only(func)[source]

A decorator that restricts a function to be executed only in the main process.
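
A minimal sketch of a common pattern in distributed evaluation combining the collectives above (local_results and save_results are hypothetical names):

>>> results = all_gather(local_results)  # list with one entry per rank
>>> @main_only
... def save_results(results):
...     ...  # executed on the main process only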

Hooks

class nncore.engine.hooks.Hook(name=None)[source]

Base class for hooks that can be registered into Engine.

Each hook can implement several methods. Every hook method receives an engine argument that provides access to more properties about the context. All hooks will be called one by one according to their order in engine.hooks.
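
A minimal sketch of a custom hook (assuming the engine exposes the current epoch index as engine.epoch):

>>> class CustomHook(Hook):
...     def after_train_epoch(self, engine):
...         # engine.epoch is assumed to hold the current epoch index
...         print('finished epoch {}'.format(engine.epoch))
>>> engine.register_hook(CustomHook())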

class nncore.engine.hooks.CheckpointHook(interval=1, save_optimizer=True, create_symlink=False, out=None)[source]

Save checkpoints periodically during training. The checkpoint of the last epoch will always be saved regardless of the interval.

Parameters:
  • interval (int, optional) – The interval of epochs to save checkpoints. Default: 1.

  • save_optimizer (bool, optional) – Whether to include the optimizer state in checkpoints. Default: True.

  • create_symlink (bool, optional) – Whether to create a symlink to the latest checkpoint. This argument is invalid on Windows due to the limitations of its file system. Default: False.

  • out (str | None, optional) – Path to the output directory. If not specified, engine.work_dir will be used as the default path. Default: None.

class nncore.engine.hooks.ClosureHook(name, func)[source]

Customize hooks using user-defined functions.

Parameters:
  • name (list[str] | str) – Name or a list of names of the hooks. Expected values include 'before_launch', 'after_launch', 'before_stage', 'after_stage', 'before_epoch', 'after_epoch', 'before_iter', 'after_iter', 'before_train_epoch', 'after_train_epoch', 'before_val_epoch', 'after_val_epoch', 'before_train_iter', 'after_train_iter', 'before_val_iter', and 'after_val_iter'.

  • func (list[function] | function) – A function or a list of functions for the hooks. Each function receives an engine argument that provides access to more properties about the context.
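
For example, a one-off callback can be attached without defining a new Hook subclass:

>>> hook = ClosureHook('after_train_epoch', lambda engine: print('epoch done'))
>>> engine.register_hook(hook)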

class nncore.engine.hooks.EvalHook(interval=1, run_test=False, high_keys=[], low_keys=[])[source]

Perform evaluation periodically during training.

Parameters:
  • interval (int, optional) – The interval of epochs to perform evaluation. Default: 1.

  • run_test (bool, optional) – Whether to run the model on the test split before performing evaluation. Default: False.

  • high_keys (list[str], optional) – The list of metrics (higher is better) to be compared. Default: [].

  • low_keys (list[str], optional) – The list of metrics (lower is better) to be compared. Default: [].

class nncore.engine.hooks.CommandLineWriter[source]

Write logs to the command line using logging.Logger.

class nncore.engine.hooks.EventWriterHook(interval=50, writers=['CommandLineWriter', 'JSONWriter', 'TensorboardWriter'])[source]

Write logs periodically during training. This hook relies on TimerHook and works with several Writer objects to log metrics, images, videos, audio, etc. In distributed training, only the main process will write the logs.

Parameters:
  • interval (int, optional) – The interval of iterations to write logs. Default: 50.

  • writers (list[Writer] | list[str], optional) – The list of writers or names of writers to use. Currently supported writers include CommandLineWriter, JSONWriter, and TensorboardWriter. Default: ['CommandLineWriter', 'JSONWriter', 'TensorboardWriter'].

class nncore.engine.hooks.JSONWriter(filename='metrics.json')[source]

Write logs to JSON files.

Parameters:

filename (str, optional) – Path to the output JSON file. Default: 'metrics.json'.

class nncore.engine.hooks.TensorboardWriter(log_dir=None, input_to_model=None, **kwargs)[source]

Write logs to Tensorboard.

Parameters:
  • log_dir (str, optional) – Directory of the tensorboard logs. Default: None.

  • input_to_model (any, optional) – The input data, data_loader or name of the data_loader for constructing the model graph. If not specified, the graph will not be added. Please check torch.utils.tensorboard.SummaryWriter.add_graph for more details about adding a graph to tensorboard. Default: None.

class nncore.engine.hooks.WandbWriter(**kwargs)[source]

Write logs to Weights & Biases.

class nncore.engine.hooks.LrUpdaterHook(name=None)[source]

Update learning rate periodically during training. Currently supported learning rate policies are step, cosine, exp, poly, and inv, and currently supported warm-up policies are linear, exp, and constant.

Learning rate policy configs:
  • step: step (list[int]), gamma (float, Default: 0.1)

  • cosine: target_lr (float, Default: 0)

  • exp: gamma (float)

  • poly: power (float, Default: 1), min_lr (float, Default: 0)

  • inv: gamma (float), power (float, Default: 1)

Warm-up policy configs:
  • linear: ratio (float)

  • exp: ratio (float)

  • constant: ratio (float)
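
A minimal sketch of a stage config combining these policies (the epoch indices and warm-up settings are illustrative):

>>> stage = dict(
...     epochs=12,
...     optimizer=dict(type='SGD', lr=0.1, momentum=0.9),
...     lr_schedule=dict(type='epoch', policy='step', step=[8, 11], gamma=0.1),
...     warmup=dict(type='iter', policy='linear', steps=500, ratio=0.001))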

class nncore.engine.hooks.EmptyCacheHook(names=[])[source]

Empty cache periodically during training.

Parameters:

names (list[str], optional) – The list of hook method names at which to empty the cache. Expected values include 'before_launch', 'after_launch', 'before_stage', 'after_stage', 'before_epoch', 'after_epoch', 'before_iter', 'after_iter', 'before_train_epoch', 'after_train_epoch', 'before_val_epoch', 'after_val_epoch', 'before_train_iter', 'after_train_iter', 'before_val_iter', and 'after_val_iter'. Default: [].

class nncore.engine.hooks.OptimizerHook(interval=1, coalesce=True, bucket_size_mb=-1, grad_scale=None)[source]

Perform backpropagation and update the parameters of the model periodically. This hook supports CPU, single-GPU, and distributed training.

Parameters:
  • interval (int, optional) – The interval of iterations to update parameters. Default: 1.

  • coalesce (bool, optional) – Whether to coalesce the weights in distributed training. Default: True.

  • bucket_size_mb (int, optional) – Size of the bucket. -1 means not restricting the bucket size. Default: -1.

  • grad_scale (dict | bool | None, optional) – Whether to scale the gradients. If not specified, this module will automatically scale the gradients when amp is activated. Default: None.

class nncore.engine.hooks.PreciseBNHook(interval=1, num_iters=200)[source]

Compute Precise-BN using EMA periodically during training. This hook will also run at the end of training.

Parameters:
  • interval (int, optional) – The interval of epochs to compute the stats. Default: 1.

  • num_iters (int, optional) – Number of iterations to compute the stats. This number will be overwritten by the length of the training data loader. Default: 200.

class nncore.engine.hooks.SamplerSeedHook(name=None)[source]

Update sampler seeds every epoch. This hook is normally used in distributed training.

class nncore.engine.hooks.TimerHook[source]

Compute and save timings into engine.buffer during training.

Builder

nncore.engine.builder.build_dataloader(cfg, seed=None, dist=None, group=None, **kwargs)[source]

Build a data loader from a dict. The dataset should be registered in DATASETS.

Parameters:
  • cfg (dict) – The config of the dataset.

  • seed (int | None, optional) – The random seed to use. Default: None.

  • dist (bool | None, optional) – Whether the data loader is distributed. If not specified, this method will determine it automatically. Default: None.

  • group (dist.ProcessGroup | None, optional) – The process group to use. If not specified, the default process group will be used. Default: None.

Returns:

The constructed data loader.

Return type:

DataLoader
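
A minimal sketch (the dataset type and its fields are hypothetical; MyDataset must be registered in DATASETS):

>>> cfg = dict(type='MyDataset', split='train')
>>> loader = build_dataloader(cfg, seed=42)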

nncore.engine.builder.build_hook(cfg, **kwargs)[source]

Build a hook from a dict or str. The hook should be registered in HOOKS.

Parameters:

cfg (dict | str) – The config or name of the hook.

Returns:

The constructed hook.

Return type:

Hook

Utils

nncore.engine.utils.generate_random_seed(sync=True, src=0, group=None)[source]

Generate a random seed.

Parameters:
  • sync (bool, optional) – Whether to synchronize the random seed among the processes in the group in distributed settings. Default: True.

  • src (int, optional) – The source rank of the process in distributed settings. This argument is valid only when sync==True. Default: 0.

  • group (dist.ProcessGroup | None, optional) – The process group to use in distributed settings. This argument is valid only when sync==True. If not specified, the default process group will be used. Default: None.

Returns:

The generated random seed.

Return type:

int

nncore.engine.utils.set_random_seed(seed=None, benchmark=False, deterministic=False, **kwargs)[source]

Set the random seed for the random, numpy, and torch packages. If seed is not specified, this method will generate and return a new random seed.

Parameters:
  • seed (int | None, optional) – The random seed to use. If not specified, a new random seed will be generated. Default: None.

  • benchmark (bool, optional) – Whether to enable benchmark mode. Default: False.

  • deterministic (bool, optional) – Whether to enable deterministic mode. Default: False.

Returns:

The random seed actually used.

Return type:

int
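
For example:

>>> seed = set_random_seed(42, deterministic=True)  # use a fixed seed
>>> seed = set_random_seed()  # generate, apply, and return a new seed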

nncore.engine.utils.get_checkpoint(file_or_url, map_location=None, **kwargs)[source]

Get a checkpoint from a file or a URL.

Parameters:
  • file_or_url (str) – The filename or URL of the checkpoint.

  • map_location (str | None, optional) – Same as the torch.load interface. Default: None.

Returns:

The loaded checkpoint.

Return type:

OrderedDict | dict

nncore.engine.utils.load_checkpoint(model, checkpoint, map_location=None, strict=False, keys=None, logger=None, **kwargs)[source]

Load a checkpoint from a file or a URL.

Parameters:
  • model (nn.Module) – The model to load the checkpoint into.

  • checkpoint (dict | str) – A dict, a filename, a URL, or a torchvision://<model_name> str indicating the checkpoint.

  • map_location (str | None, optional) – Same as the torch.load interface. Default: None.

  • strict (bool, optional) – Whether to strictly enforce that the parameters of the model and the checkpoint match. If True, an error is raised when the parameters do not match exactly. Default: False.

  • keys (list[str] | None, optional) – The list of parameter keys to load. Default: None.

  • logger (logging.Logger | str | None, optional) – The logger or name of the logger for displaying error messages. Default: None.

Returns:

The loaded checkpoint.

Return type:

OrderedDict | dict

nncore.engine.utils.save_checkpoint(model, filename, optimizer=None, meta=None)[source]

Save checkpoint to a file.

The checkpoint object will have 3 fields: meta, state_dict and optimizer, where meta contains the version of nncore and the time info by default.

Parameters:
  • model (nn.Module) – The model whose params are to be saved.

  • filename (str) – Path to the checkpoint file.

  • optimizer (optim.Optimizer | None, optional) – The optimizer to be saved. Default: None.

  • meta (dict | None, optional) – The metadata to be saved. Default: None.

Returns:

The saved checkpoint.

Return type:

dict
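
A minimal sketch of a save/load round trip (the filename and meta fields are illustrative):

>>> save_checkpoint(model, 'epoch_5.pth', optimizer=optimizer, meta=dict(epoch=5))
>>> load_checkpoint(model, 'epoch_5.pth', map_location='cpu')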