doubt.models.tree package

Submodules

doubt.models.tree.forest module

Quantile regression forests

class doubt.models.tree.forest.QuantileRegressionForest(n_estimators: int = 100, criterion: str = 'mse', splitter: str = 'best', max_features: Optional[Union[int, float, str]] = None, max_depth: Optional[int] = None, min_samples_split: Union[int, float] = 2, min_samples_leaf: Union[int, float] = 5, min_weight_fraction_leaf: float = 0.0, max_leaf_nodes: Optional[int] = None, n_jobs: int = -1, random_seed: Optional[int] = None, verbose: bool = False)

Bases: doubt.models._model.BaseModel

A random forest for regression which can output quantiles as well.

Parameters

n_estimators (int, optional) – The number of trees in the forest. Defaults to 100.
criterion (string, optional) – The function to measure the quality of a split. Supported criteria are ‘mse’ for the mean squared error, which is equal to variance reduction as feature selection criterion, and ‘mae’ for the mean absolute error. Defaults to ‘mse’.
splitter (string, optional) – The strategy used to choose the split at each node. Supported strategies are ‘best’ to choose the best split and ‘random’ to choose the best random split. Defaults to ‘best’.
max_features (int, float, string or None, optional) –
The number of features to consider when looking for the best split:
- If int, then consider max_features features at each split.
- If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
- If ‘auto’, then max_features=n_features.
- If ‘sqrt’, then max_features=sqrt(n_features).
- If ‘log2’, then max_features=log2(n_features).
- If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if that requires inspecting more than max_features features. Defaults to None.
max_depth (int or None, optional) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Defaults to None.
min_samples_split (int or float, optional) –
The minimum number of samples required to split an internal node:
- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split. Defaults to 2.
min_samples_leaf (int or float, optional) –
The minimum number of samples required to be at a leaf node:
- If int, then consider min_samples_leaf as the minimum number.
- If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node. Defaults to 5.
min_weight_fraction_leaf (float, optional) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided. Defaults to 0.0.
max_leaf_nodes (int or None, optional) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined by their relative reduction in impurity. If None, the number of leaf nodes is unlimited. Defaults to None.
n_jobs (int, optional) – The number of CPU cores used in fitting and predicting. If -1 then all available CPU cores will be used. Defaults to -1.
random_seed (int, RandomState instance or None, optional) – If int, random_seed is the seed used by the random number generator; if a RandomState instance, random_seed is the random number generator; if None, the random number generator is the RandomState instance used by np.random. Defaults to None.
verbose (bool, optional) – Whether extra output should be printed during training and inference. Defaults to False.
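
The rules for max_features, min_samples_split and min_samples_leaf listed above can be sketched in plain Python. This is only an illustration of the documented rules, not the library's own code: the helper names resolve_max_features and resolve_min_samples are hypothetical, and truncating sqrt(n_features) and log2(n_features) to an integer is an assumption.

```python
import math
from typing import Optional, Union


def resolve_max_features(
    max_features: Optional[Union[int, float, str]], n_features: int
) -> int:
    """Hypothetical helper: number of features considered at each split."""
    if max_features is None or max_features == "auto":
        return n_features
    if max_features == "sqrt":
        # Assumption: the square root is truncated to an integer.
        return int(math.sqrt(n_features))
    if max_features == "log2":
        # Assumption: the logarithm is truncated to an integer.
        return int(math.log2(n_features))
    if isinstance(max_features, int):
        return max_features
    if isinstance(max_features, float):
        # A float is read as a fraction of the available features.
        return int(max_features * n_features)
    raise ValueError(f"Unsupported max_features value: {max_features!r}")


def resolve_min_samples(value: Union[int, float], n_samples: int) -> int:
    """Hypothetical helper for min_samples_split and min_samples_leaf."""
    if isinstance(value, int):
        return value
    # A float is read as a fraction of the training samples, rounded up.
    return math.ceil(value * n_samples)


# With 8 features, max_features=0.5 considers 4 features per split.
features_per_split = resolve_max_features(0.5, 8)
# With 1030 samples, min_samples_leaf=0.01 requires 11 samples per leaf.
samples_per_leaf = resolve_min_samples(0.01, 1030)
```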

Examples

Fitting and predicting follows scikit-learn syntax:

>>> import numpy as np
>>> from doubt.datasets import Concrete
>>> from doubt.models.tree.forest import QuantileRegressionForest
>>> X, y = Concrete().split()
>>> forest = QuantileRegressionForest(random_seed=42,
...                                   max_leaf_nodes=8)
>>> forest.fit(X, y).predict(X).shape
(1030,)
>>> preds = forest.predict(np.ones(8))
>>> 16 < preds < 17
True

Instead of only returning the prediction, we can also return a prediction interval:

>>> preds, interval = forest.predict(np.ones(8), uncertainty=0.25)
>>> interval[0] < preds < interval[1]
True

fit(X, y, verbose: Optional[bool] = None)

Fit decision trees in parallel.

Parameters

X (array-like or sparse matrix) – The input samples, of shape [n_samples, n_features]. Internally, it is converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.
y (array-like) – The target values (real numbers), of shape [n_samples] or [n_samples, n_outputs].
verbose (bool or None, optional) – Whether extra output should be printed during training. If None then the initialised value of the verbose parameter will be used. Defaults to None.

predict(X: Sequence[Union[float, int]], uncertainty: Optional[float] = None, quantiles: Optional[Sequence[float]] = None, verbose: Optional[bool] = None) → Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]]

Predict regression value for X.

Parameters

X (array-like or sparse matrix) – The input samples, of shape [n_samples, n_features]. Internally, it is converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.
uncertainty (float or None, optional) – Value ranging from 0 to 1. If None then no prediction intervals will be returned. Defaults to None.
quantiles (sequence of floats or None, optional) – The quantiles to output, as an alternative to the uncertainty argument. Ignored if uncertainty is set. If None then uncertainty is used. Defaults to None.
verbose (bool or None, optional) – Whether extra output should be printed during inference. If None then the initialised value of the verbose parameter will be used. Defaults to None.

Returns

Either array with predictions, of shape [n_samples,], or a pair of arrays with the first one being the predictions and the second one being the desired quantiles/intervals, of shape [2, n_samples] if uncertainty is not None, and [n_quantiles, n_samples] if quantiles is not None.

Return type

Array or pair of arrays
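
The precedence between uncertainty and quantiles described above can be sketched as follows. The helper name select_quantiles is hypothetical, and mapping uncertainty to the symmetric pair (uncertainty / 2, 1 - uncertainty / 2) is an assumption about how the interval is derived, not something stated in this reference.

```python
from typing import List, Optional, Sequence


def select_quantiles(
    uncertainty: Optional[float] = None,
    quantiles: Optional[Sequence[float]] = None,
) -> Optional[List[float]]:
    """Hypothetical helper: decide which quantiles predict() should output."""
    if uncertainty is not None:
        if not 0.0 <= uncertainty <= 1.0:
            raise ValueError("uncertainty must be between 0 and 1")
        # Assumption: a symmetric interval around the median, so that
        # uncertainty=0.25 leaves 12.5% of the mass in each tail.
        return [uncertainty / 2, 1 - uncertainty / 2]
    if quantiles is not None:
        # Only consulted when uncertainty is not set.
        return list(quantiles)
    # Neither argument set: only point predictions are returned.
    return None


interval_levels = select_quantiles(uncertainty=0.25, quantiles=[0.1, 0.9])
explicit_levels = select_quantiles(quantiles=[0.1, 0.5, 0.9])
```

Under this assumption the uncertainty branch always yields two quantiles, which matches the documented interval shape of [2, n_samples], while the quantiles branch yields one row per requested quantile.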

doubt.models.tree.tree module

Quantile regression trees

class doubt.models.tree.tree.BaseTreeQuantileRegressor(*, criterion, splitter, max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_features, max_leaf_nodes, random_state, min_impurity_decrease, min_impurity_split, class_weight=None, ccp_alpha=0.0)

Bases: sklearn.tree._classes.BaseDecisionTree

fit(X: Sequence[Union[float, int]], y: Sequence[Union[float, int]], sample_weight: Optional[Sequence[Union[float, int]]] = None, check_input: bool = True, X_idx_sorted: Optional[Sequence[Union[float, int]]] = None)

Build a decision tree regressor from the training set (X, y).

Parameters

X (array-like or sparse matrix) – The training input samples, of shape [n_samples, n_features]. Internally, it is converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csc_matrix.
y (array-like) – The target values (real numbers), of shape [n_samples] or [n_samples, n_outputs].
sample_weight (array-like or None, optional) – Sample weights of shape = [n_samples]. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node. Defaults to None.
check_input (boolean, optional) – Allows several input checks to be bypassed. Don’t use this parameter unless you know what you are doing. Defaults to True.
X_idx_sorted (array-like or None, optional) – The indexes of the sorted training input samples, of shape [n_samples, n_features]. If many trees are grown on the same dataset, this allows the ordering to be cached between trees. If None, the data will be sorted here. Don’t use this parameter unless you know what you are doing. Defaults to None.

predict(X: Sequence[Union[float, int]], uncertainty: Optional[float] = None, quantiles: Optional[Sequence[float]] = None, check_input: bool = True) → Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]]

Predict regression value for X.

Parameters

X (array-like or sparse matrix) – The input samples, of shape [n_samples, n_features]. Internally, it is converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.
uncertainty (float or None, optional) – Value ranging from 0 to 1. If None then no prediction intervals will be returned. Defaults to None.
quantiles (sequence of floats or None, optional) – The quantiles to output, as an alternative to the uncertainty argument. Ignored if uncertainty is set. If None then uncertainty is used. Defaults to None.
check_input (boolean, optional) – Allows several input checks to be bypassed. Don’t use this parameter unless you know what you are doing. Defaults to True.

Returns

Either array with predictions, of shape [n_samples,], or a pair of arrays with the first one being the predictions and the second one being the desired quantiles/intervals, of shape [n_samples, 2] if uncertainty is not None, and [n_samples, n_quantiles] if quantiles is not None.

Return type

Array or pair of arrays

class doubt.models.tree.tree.QuantileRegressionTree(criterion: str = 'mse', splitter: str = 'best', max_features: Optional[Union[int, float, str]] = None, max_depth: Optional[int] = None, min_samples_split: Union[int, float] = 2, min_samples_leaf: Union[int, float] = 1, min_weight_fraction_leaf: float = 0.0, max_leaf_nodes: Optional[int] = None, random_seed: Optional[Union[int, numpy.random.mtrand.RandomState]] = None)

Bases: sklearn.tree._classes.DecisionTreeRegressor, doubt.models.tree.tree.BaseTreeQuantileRegressor

A decision tree regressor that provides quantile estimates.

Parameters

criterion (string, optional) – The function to measure the quality of a split. Supported criteria are ‘mse’ for the mean squared error, which is equal to variance reduction as feature selection criterion, and ‘mae’ for the mean absolute error. Defaults to ‘mse’.
splitter (string, optional) – The strategy used to choose the split at each node. Supported strategies are ‘best’ to choose the best split and ‘random’ to choose the best random split. Defaults to ‘best’.
max_features (int, float, string or None, optional) –
The number of features to consider when looking for the best split:
- If int, then consider max_features features at each split.
- If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
- If ‘auto’, then max_features=n_features.
- If ‘sqrt’, then max_features=sqrt(n_features).
- If ‘log2’, then max_features=log2(n_features).
- If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if that requires inspecting more than max_features features. Defaults to None.
max_depth (int or None, optional) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Defaults to None.
min_samples_split (int or float, optional) –
The minimum number of samples required to split an internal node:
- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split. Defaults to 2.
min_samples_leaf (int or float, optional) –
The minimum number of samples required to be at a leaf node:
- If int, then consider min_samples_leaf as the minimum number.
- If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node. Defaults to 1.
min_weight_fraction_leaf (float, optional) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided. Defaults to 0.0.
max_leaf_nodes (int or None, optional) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined by their relative reduction in impurity. If None, the number of leaf nodes is unlimited. Defaults to None.
random_seed (int, RandomState instance or None, optional) – If int, random_seed is the seed used by the random number generator; if a RandomState instance, random_seed is the random number generator; if None, the random number generator is the RandomState instance used by np.random. Defaults to None.

feature_importances_

The feature importances, of shape = [n_features]. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

Type: array

max_features_

The inferred value of max_features.

Type: int

n_features_

The number of features when fit is performed.

Type: int

n_outputs_

The number of outputs when fit is performed.

Type: int

tree_

The underlying Tree object.

Type: Tree object

y_train_

Train target values.

Type: array-like

y_train_leaves_

A cache of the leaf node that each training sample falls into. y_train_leaves_[i] is the leaf that y_train_[i] ends up in.

Type: array-like
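
To illustrate why y_train_ and y_train_leaves_ are cached, here is a minimal sketch of how a quantile could be read off a single leaf. It assumes that a prediction for a sample uses the training targets that share its leaf. The function name leaf_quantile is hypothetical, and the plain linear interpolation used here is a stand-in, not the library's weighted_percentile.

```python
import math
from typing import Sequence


def leaf_quantile(
    y_train: Sequence[float],
    y_train_leaves: Sequence[int],
    leaf: int,
    quantile: float,
) -> float:
    """Hypothetical helper: quantile of the training targets in one leaf."""
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be between 0.0 and 1.0")

    # Keep only the targets whose training sample landed in this leaf.
    in_leaf = sorted(
        y for y, sample_leaf in zip(y_train, y_train_leaves) if sample_leaf == leaf
    )
    if not in_leaf:
        raise ValueError(f"no training samples fall into leaf {leaf}")

    # Linear interpolation between neighbouring order statistics.
    position = quantile * (len(in_leaf) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    return in_leaf[lower] + (position - lower) * (in_leaf[upper] - in_leaf[lower])


y_train = [1.0, 5.0, 2.0, 9.0, 3.0]
y_train_leaves = [0, 1, 0, 1, 0]

# Leaf 0 holds the targets 1.0, 2.0 and 3.0; leaf 1 holds 5.0 and 9.0.
median_leaf_0 = leaf_quantile(y_train, y_train_leaves, leaf=0, quantile=0.5)
median_leaf_1 = leaf_quantile(y_train, y_train_leaves, leaf=1, quantile=0.5)
```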

doubt.models.tree.utils module

Utility functions used in tree models

doubt.models.tree.utils.weighted_percentile(arr: Sequence[Union[float, int]], quantile: float, weights: Optional[Sequence[Union[float, int]]] = None, sorter: Optional[Sequence[Union[float, int]]] = None)

Returns the weighted percentile of an array.

See [1] for an explanation of this concept.

Parameters

arr (array-like) – Samples at which the quantile should be computed, of shape [n_samples,].
quantile (float) – Quantile, between 0.0 and 1.0.
weights (array-like, optional) – The weights, of shape [n_samples,]. Here weights[i] is the weight given to point arr[i] while computing the quantile. If weights[i] is zero, arr[i] is simply ignored during the percentile computation. If None then uniform weights will be used. Defaults to None.
sorter (array-like, optional) – Array of shape [n_samples,], indicating the indices sorting arr. Thus, if provided, we assume that arr[sorter] is sorted. If None then arr will be sorted. Defaults to None.

Returns

Weighted percentile of arr at quantile.

Return type

float

Raises

ValueError – If quantile is not between 0.0 and 1.0, or if arr and weights are of different lengths.

Sources: [1] https://en.wikipedia.org/wiki/Percentile#The_weighted_percentile_method
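
The following is a minimal pure-Python sketch of one reading of the weighted percentile method in [1], where each sorted point is ranked at the centre of its share of the total weight and values are interpolated linearly between neighbouring ranks. It is not the library's implementation: the name weighted_percentile_sketch is hypothetical, the sorter argument is omitted, and the behaviour when every weight is zero is an assumption.

```python
from bisect import bisect_left
from itertools import accumulate
from typing import Optional, Sequence


def weighted_percentile_sketch(
    arr: Sequence[float],
    quantile: float,
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Hypothetical sketch of a weighted percentile with linear interpolation."""
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be between 0.0 and 1.0")
    if weights is None:
        weights = [1.0] * len(arr)
    if len(arr) != len(weights):
        raise ValueError("arr and weights must have the same length")

    # Ignore zero-weight points, then sort the values with their weights.
    pairs = sorted((v, w) for v, w in zip(arr, weights) if w != 0)
    if not pairs:
        # Assumption: with no weighted points there is nothing to return.
        raise ValueError("at least one weight must be non-zero")
    values = [v for v, _ in pairs]
    sorted_weights = [w for _, w in pairs]

    # Rank each point at the centre of its weight mass, scaled to [0, 1].
    cumulative = list(accumulate(sorted_weights))
    total = cumulative[-1]
    ranks = [(c - w / 2.0) / total for c, w in zip(cumulative, sorted_weights)]

    # Clamp outside the rank range, otherwise interpolate between neighbours.
    upper = bisect_left(ranks, quantile)
    if upper == 0:
        return float(values[0])
    if upper == len(values):
        return float(values[-1])
    lower = upper - 1
    fraction = (quantile - ranks[lower]) / (ranks[upper] - ranks[lower])
    return values[lower] + fraction * (values[upper] - values[lower])


uniform_median = weighted_percentile_sketch([1, 2, 3, 4], 0.5)
weighted_median = weighted_percentile_sketch([10, 20], 0.5, weights=[3, 1])
```

With uniform weights the sketch returns the usual midpoint between the two central values. Giving the first point three times the weight of the second pulls the median toward it.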
