doubt.models.tree package

Submodules

doubt.models.tree.forest module

Quantile regression forests

class doubt.models.tree.forest.QuantileRegressionForest(n_estimators: int = 100, criterion: str = 'mse', splitter: str = 'best', max_features: Optional[Union[int, float, str]] = None, max_depth: Optional[int] = None, min_samples_split: Union[int, float] = 2, min_samples_leaf: Union[int, float] = 5, min_weight_fraction_leaf: float = 0.0, max_leaf_nodes: Optional[int] = None, n_jobs: int = -1, random_seed: Optional[int] = None, verbose: bool = False)

Bases: doubt.models._model.BaseModel

A random forest for regression which can output quantiles as well.

Parameters
  • n_estimators (int, optional) – The number of trees in the forest. Defaults to 100.

  • criterion (string, optional) – The function to measure the quality of a split. Supported criteria are ‘mse’ for the mean squared error, which is equivalent to variance reduction as a feature selection criterion, and ‘mae’ for the mean absolute error. Defaults to ‘mse’.

  • splitter (string, optional) – The strategy used to choose the split at each node. Supported strategies are ‘best’ to choose the best split and ‘random’ to choose the best random split. Defaults to ‘best’.

  • max_features (int, float, string or None, optional) –

    The number of features to consider when looking for the best split:

    • If int, then consider max_features features at each split.

    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

    • If ‘auto’, then max_features=n_features.

    • If ‘sqrt’, then max_features=sqrt(n_features).

    • If ‘log2’, then max_features=log2(n_features).

    • If None, then max_features=n_features.

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if this requires inspecting more than max_features features. Defaults to None.

  • max_depth (int or None, optional) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Defaults to None.

  • min_samples_split (int or float, optional) –

    The minimum number of samples required to split an internal node:

    • If int, then consider min_samples_split as the minimum number.

    • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split. Defaults to 2.

  • min_samples_leaf (int or float, optional) –

    The minimum number of samples required to be at a leaf node:

    • If int, then consider min_samples_leaf as the minimum number.

    • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node. Defaults to 5.

  • min_weight_fraction_leaf (float, optional) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided. Defaults to 0.0.

  • max_leaf_nodes (int or None, optional) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined by their relative reduction in impurity. If None, the number of leaf nodes is unlimited. Defaults to None.

  • n_jobs (int, optional) – The number of CPU cores used in fitting and predicting. If -1 then all available CPU cores will be used. Defaults to -1.

  • random_seed (int, RandomState instance or None, optional) – If int, random_seed is the seed used by the random number generator; if a RandomState instance, random_seed is the random number generator itself; if None, the random number generator is the RandomState instance used by np.random. Defaults to None.

  • verbose (bool, optional) – Whether extra output should be printed during training and inference. Defaults to False.
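The int/float/string rules for max_features, min_samples_split and min_samples_leaf can be made concrete with a small sketch. The helper names below are hypothetical and are not part of doubt; truncating the ‘sqrt’ and ‘log2’ results to an integer is an assumption of this sketch. The sample and feature counts are those of the Concrete dataset used in the examples.

```python
import math


def resolve_max_features(max_features, n_features):
    """Turn a max_features setting into a feature count (illustrative only)."""
    if max_features is None or max_features == "auto":
        return n_features
    if max_features == "sqrt":
        # Truncating to an integer is an assumption of this sketch.
        return int(math.sqrt(n_features))
    if max_features == "log2":
        return int(math.log2(n_features))
    if isinstance(max_features, int):
        return max_features
    # Float: a fraction of the available features.
    return int(max_features * n_features)


def resolve_min_samples(value, n_samples):
    """Turn min_samples_split or min_samples_leaf into a sample count."""
    if isinstance(value, int):
        return value
    # Float: a fraction of the training samples, rounded up.
    return math.ceil(value * n_samples)


n_samples, n_features = 1030, 8
print(resolve_max_features(0.5, n_features))     # 4
print(resolve_max_features("sqrt", n_features))  # 2
print(resolve_min_samples(0.01, n_samples))      # 11
print(resolve_min_samples(5, n_samples))         # 5
```

Rounding min_samples_* up with ceil while truncating max_features with int follows the formulas quoted in the parameter descriptions above.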

Examples

Fitting and predicting follows scikit-learn syntax:

>>> import numpy as np
>>> from doubt.datasets import Concrete
>>> from doubt.models.tree.forest import QuantileRegressionForest
>>> X, y = Concrete().split()
>>> forest = QuantileRegressionForest(random_seed=42,
...                                   max_leaf_nodes=8)
>>> forest.fit(X, y).predict(X).shape
(1030,)
>>> preds = forest.predict(np.ones(8))
>>> 16 < preds < 17
True

Instead of only returning the prediction, we can also return a prediction interval:

>>> preds, interval = forest.predict(np.ones(8), uncertainty=0.25)
>>> interval[0] < preds < interval[1]
True

fit(X, y, verbose: Optional[bool] = None)

Fit decision trees in parallel.

Parameters
  • X (array-like or sparse matrix) – The input samples, of shape [n_samples, n_features]. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.

  • y (array-like) – The target values (real numbers), of shape [n_samples] or [n_samples, n_outputs].

  • verbose (bool or None, optional) – Whether extra output should be printed during training. If None then the initialised value of the verbose parameter will be used. Defaults to None.

predict(X: Sequence[Union[float, int]], uncertainty: Optional[float] = None, quantiles: Optional[Sequence[float]] = None, verbose: Optional[bool] = None) -> Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]]

Predict regression value for X.

Parameters
  • X (array-like or sparse matrix) – The input samples, of shape [n_samples, n_features]. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.

  • uncertainty (float or None, optional) – Value ranging from 0 to 1. If None then no prediction intervals will be returned. Defaults to None.

  • quantiles (sequence of floats or None, optional) – The quantiles to output, as an alternative to the uncertainty argument. Ignored if uncertainty is set. If None then uncertainty is used. Defaults to None.

  • verbose (bool or None, optional) – Whether extra output should be printed during inference. If None then the initialised value of the verbose parameter will be used. Defaults to None.

Returns

Either an array with predictions, of shape [n_samples,], or a pair of arrays, the first being the predictions and the second the desired quantiles/intervals, of shape [2, n_samples] if uncertainty is not None, and [n_quantiles, n_samples] if quantiles is not None.

Return type

Array or pair of arrays
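The documentation above does not spell out how an uncertainty value relates to the quantiles argument. One common convention, assumed here and not confirmed by this page, is a central interval that leaves half of the uncertainty in each tail. The helper uncertainty_to_quantiles is hypothetical and only illustrates that convention.

```python
def uncertainty_to_quantiles(uncertainty):
    """Map an uncertainty level to a (lower, upper) quantile pair.

    Assumption of this sketch: the interval is central, so half of the
    uncertainty mass lies below the lower quantile and half above the upper.
    """
    if not 0.0 <= uncertainty <= 1.0:
        raise ValueError("uncertainty must lie between 0 and 1")
    return uncertainty / 2, 1 - uncertainty / 2


lower_q, upper_q = uncertainty_to_quantiles(0.25)
print(lower_q, upper_q)  # 0.125 0.875
```

Under this assumption, uncertainty=0.25 would correspond to passing quantiles=[0.125, 0.875], and a smaller uncertainty gives a wider interval.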

doubt.models.tree.tree module

Quantile regression trees

class doubt.models.tree.tree.BaseTreeQuantileRegressor(*, criterion, splitter, max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_features, max_leaf_nodes, random_state, min_impurity_decrease, min_impurity_split, class_weight=None, ccp_alpha=0.0)

Bases: sklearn.tree._classes.BaseDecisionTree

fit(X: Sequence[Union[float, int]], y: Sequence[Union[float, int]], sample_weight: Optional[Sequence[Union[float, int]]] = None, check_input: bool = True, X_idx_sorted: Optional[Sequence[Union[float, int]]] = None)

Build a decision tree regressor from the training set (X, y).

Parameters
  • X (array-like or sparse matrix) – The training input samples, of shape [n_samples, n_features]. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csc_matrix.

  • y (array-like) – The target values (real numbers), of shape [n_samples] or [n_samples, n_outputs].

  • sample_weight (array-like or None, optional) – Sample weights of shape = [n_samples]. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node. Defaults to None.

  • check_input (boolean, optional) – Allows several input checks to be bypassed. Don’t use this parameter unless you know what you are doing. Defaults to True.

  • X_idx_sorted (array-like or None, optional) – The indexes of the sorted training input samples, of shape [n_samples, n_features]. If many trees are grown on the same dataset, this allows the ordering to be cached between trees. If None, the data will be sorted here. Don’t use this parameter unless you know what you are doing. Defaults to None.

predict(X: Sequence[Union[float, int]], uncertainty: Optional[float] = None, quantiles: Optional[Sequence[float]] = None, check_input: bool = True) -> Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]]

Predict regression value for X.

Parameters
  • X (array-like or sparse matrix) – The input samples, of shape [n_samples, n_features]. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.

  • uncertainty (float or None, optional) – Value ranging from 0 to 1. If None then no prediction intervals will be returned. Defaults to None.

  • quantiles (sequence of floats or None, optional) – The quantiles to output, as an alternative to the uncertainty argument. Ignored if uncertainty is set. If None then uncertainty is used. Defaults to None.

  • check_input (boolean, optional) – Allows several input checks to be bypassed. Don’t use this parameter unless you know what you are doing. Defaults to True.

Returns

Either an array with predictions, of shape [n_samples,], or a pair of arrays, the first being the predictions and the second the desired quantiles/intervals, of shape [n_samples, 2] if uncertainty is not None, and [n_samples, n_quantiles] if quantiles is not None.

Return type

Array or pair of arrays

class doubt.models.tree.tree.QuantileRegressionTree(criterion: str = 'mse', splitter: str = 'best', max_features: Optional[Union[int, float, str]] = None, max_depth: Optional[int] = None, min_samples_split: Union[int, float] = 2, min_samples_leaf: Union[int, float] = 1, min_weight_fraction_leaf: float = 0.0, max_leaf_nodes: Optional[int] = None, random_seed: Optional[Union[int, numpy.random.mtrand.RandomState]] = None)

Bases: sklearn.tree._classes.DecisionTreeRegressor, doubt.models.tree.tree.BaseTreeQuantileRegressor

A decision tree regressor that provides quantile estimates.

Parameters
  • criterion (string, optional) – The function to measure the quality of a split. Supported criteria are ‘mse’ for the mean squared error, which is equivalent to variance reduction as a feature selection criterion, and ‘mae’ for the mean absolute error. Defaults to ‘mse’.

  • splitter (string, optional) – The strategy used to choose the split at each node. Supported strategies are ‘best’ to choose the best split and ‘random’ to choose the best random split. Defaults to ‘best’.

  • max_features (int, float, string or None, optional) –

    The number of features to consider when looking for the best split:

    • If int, then consider max_features features at each split.

    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

    • If ‘auto’, then max_features=n_features.

    • If ‘sqrt’, then max_features=sqrt(n_features).

    • If ‘log2’, then max_features=log2(n_features).

    • If None, then max_features=n_features.

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if this requires inspecting more than max_features features. Defaults to None.

  • max_depth (int or None, optional) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Defaults to None.

  • min_samples_split (int or float, optional) –

    The minimum number of samples required to split an internal node:

    • If int, then consider min_samples_split as the minimum number.

    • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split. Defaults to 2.

  • min_samples_leaf (int or float, optional) –

    The minimum number of samples required to be at a leaf node:

    • If int, then consider min_samples_leaf as the minimum number.

    • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node. Defaults to 1.

  • min_weight_fraction_leaf (float, optional) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided. Defaults to 0.0.

  • max_leaf_nodes (int or None, optional) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined by their relative reduction in impurity. If None, the number of leaf nodes is unlimited. Defaults to None.

  • random_seed (int, RandomState instance or None, optional) – If int, random_seed is the seed used by the random number generator; if a RandomState instance, random_seed is the random number generator itself; if None, the random number generator is the RandomState instance used by np.random. Defaults to None.

feature_importances_

The feature importances, of shape = [n_features]. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

Type

array

max_features_

The inferred value of max_features.

Type

int

n_features_

The number of features when fit is performed.

Type

int

n_outputs_

The number of outputs when fit is performed.

Type

int

tree_

The underlying Tree object.

Type

Tree object

y_train_

Train target values.

Type

array-like

y_train_leaves_

A cache of the leaf node that each training sample falls into. y_train_leaves_[i] is the leaf that y_train_[i] ends up at.

Type

array-like
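To see why caching the training targets and their leaves is useful, here is a minimal pure-Python sketch of a leaf-based quantile estimate: group the training targets by leaf, then read a quantile off the targets that share a leaf with the query point. The function names and the toy data are hypothetical, and the unweighted linear interpolation is an assumption of this sketch, not necessarily what the library does.

```python
from collections import defaultdict


def group_targets_by_leaf(y_train, y_train_leaves):
    """Collect the sorted training targets that ended up in each leaf."""
    groups = defaultdict(list)
    for target, leaf in zip(y_train, y_train_leaves):
        groups[leaf].append(target)
    return {leaf: sorted(targets) for leaf, targets in groups.items()}


def leaf_quantile(sorted_targets, quantile):
    """Quantile of one leaf's targets, using linear interpolation."""
    position = quantile * (len(sorted_targets) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_targets) - 1)
    fraction = position - lower
    return sorted_targets[lower] + fraction * (
        sorted_targets[upper] - sorted_targets[lower]
    )


# Toy data: five training targets and the leaf each one falls into.
y_train = [10.0, 12.0, 30.0, 34.0, 38.0]
y_train_leaves = [3, 3, 4, 4, 4]

by_leaf = group_targets_by_leaf(y_train, y_train_leaves)
print(leaf_quantile(by_leaf[4], 0.5))  # 34.0
print(leaf_quantile(by_leaf[3], 0.5))  # 11.0
```

A query point that lands in leaf 4 would get its quantiles from the three targets cached for that leaf, without revisiting the training inputs.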

doubt.models.tree.utils module

Utility functions used in tree models

doubt.models.tree.utils.weighted_percentile(arr: Sequence[Union[float, int]], quantile: float, weights: Optional[Sequence[Union[float, int]]] = None, sorter: Optional[Sequence[Union[float, int]]] = None)

Returns the weighted percentile of an array.

See [1] for an explanation of this concept.

Parameters
  • arr (array-like) – Samples at which the quantile should be computed, of shape [n_samples,].

  • quantile (float) – Quantile, between 0.0 and 1.0.

  • weights (array-like, optional) – The weights, of shape [n_samples,]. Here weights[i] is the weight given to point arr[i] while computing the quantile. If weights[i] is zero, arr[i] is simply ignored during the percentile computation. If None then uniform weights will be used. Defaults to None.

  • sorter (array-like, optional) – Array of shape [n_samples,], indicating the indices sorting arr. Thus, if provided, we assume that arr[sorter] is sorted. If None then arr will be sorted. Defaults to None.

Returns

Weighted percentile of arr at quantile.

Return type

float

Raises

ValueError – If quantile is not between 0.0 and 1.0, or if arr and weights are of different lengths.

Sources:

[1]: https://en.wikipedia.org/wiki/Percentile#The_weighted_percentile_method
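As a rough illustration of the weighted percentile method described in [1], the sketch below assigns each sorted sample the rank (S_n - w_n / 2) / S_N, where S_n is the cumulative weight, and interpolates linearly between neighbouring samples. The name weighted_percentile_sketch is hypothetical; this is a pure-Python approximation of the idea, not the library's implementation, and it omits the sorter argument.

```python
from bisect import bisect_left
from itertools import accumulate


def weighted_percentile_sketch(arr, quantile, weights=None):
    """Weighted percentile of arr at quantile (illustrative sketch)."""
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be between 0.0 and 1.0")
    if weights is None:
        weights = [1.0] * len(arr)
    if len(arr) != len(weights):
        raise ValueError("arr and weights must have the same length")

    # Drop zero-weight points, then sort values together with their weights.
    pairs = sorted((a, w) for a, w in zip(arr, weights) if w != 0)
    values = [a for a, _ in pairs]
    sorted_weights = [w for _, w in pairs]

    # Rank of each sample: cumulative weight minus half its own weight,
    # normalised by the total weight.
    cumulative = list(accumulate(sorted_weights))
    total = cumulative[-1]
    ranks = [(c - w / 2) / total for c, w in zip(cumulative, sorted_weights)]

    # Index of the last sample whose rank lies strictly below the quantile.
    start = bisect_left(ranks, quantile) - 1
    if start == -1:
        return values[0]
    if start == len(values) - 1:
        return values[-1]

    # Interpolate linearly between the two neighbouring samples.
    fraction = (quantile - ranks[start]) / (ranks[start + 1] - ranks[start])
    return values[start] + fraction * (values[start + 1] - values[start])


print(weighted_percentile_sketch([1, 2, 3, 4], 0.5))             # 2.5
print(weighted_percentile_sketch([3, 1, 2], 0.5, [1, 1, 0]))     # 2.0
```

In the second call the point 2 has zero weight, so the result is interpolated between 1 and 3 only.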

Module contents