doubt.models.tree package
Submodules
doubt.models.tree.forest module
Quantile regression forests
- class doubt.models.tree.forest.QuantileRegressionForest(n_estimators: int = 100, criterion: str = 'mse', splitter: str = 'best', max_features: Optional[Union[int, float, str]] = None, max_depth: Optional[int] = None, min_samples_split: Union[int, float] = 2, min_samples_leaf: Union[int, float] = 5, min_weight_fraction_leaf: float = 0.0, max_leaf_nodes: Optional[int] = None, n_jobs: int = -1, random_seed: Optional[int] = None, verbose: bool = False)
Bases: doubt.models._model.BaseModel
A random forest for regression which can output quantiles as well.
- Parameters
n_estimators (int, optional) – The number of trees in the forest. Defaults to 100.
criterion (string, optional) – The function to measure the quality of a split. Supported criteria are ‘mse’ for the mean squared error, which is equal to variance reduction as feature selection criterion, and ‘mae’ for the mean absolute error. Defaults to ‘mse’.
splitter (string, optional) – The strategy used to choose the split at each node. Supported strategies are ‘best’ to choose the best split and ‘random’ to choose the best random split. Defaults to ‘best’.
max_features (int, float, string or None, optional) –
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.
If ‘auto’, then max_features=n_features.
If ‘sqrt’, then max_features=sqrt(n_features).
If ‘log2’, then max_features=log2(n_features).
If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features. Defaults to None.
max_depth (int or None, optional) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. Defaults to None.
min_samples_split (int or float, optional) –
The minimum number of samples required to split an internal node:
If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a percentage and ceil(min_samples_split * n_samples) are the minimum number of samples for each split. Defaults to 2.
min_samples_leaf (int or float, optional) –
The minimum number of samples required to be at a leaf node:
If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a percentage and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node. Defaults to 5.
min_weight_fraction_leaf (float, optional) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided. Defaults to 0.0.
max_leaf_nodes (int or None, optional) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes. Defaults to None.
n_jobs (int, optional) – The number of CPU cores used in fitting and predicting. If -1 then all available CPU cores will be used. Defaults to -1.
random_seed (int, RandomState instance or None, optional) – If int, random_seed is the seed used by the random number generator; if RandomState instance, random_seed is the random number generator; if None, the random number generator is the RandomState instance used by np.random. Defaults to None.
verbose (bool, optional) – Whether extra output should be printed during training and inference. Defaults to False.
Examples
Fitting and predicting follows scikit-learn syntax:
>>> import numpy as np
>>> from doubt.datasets import Concrete
>>> from doubt.models.tree.forest import QuantileRegressionForest
>>> X, y = Concrete().split()
>>> forest = QuantileRegressionForest(random_seed=42,
...                                   max_leaf_nodes=8)
>>> forest.fit(X, y).predict(X).shape
(1030,)
>>> preds = forest.predict(np.ones(8))
>>> 16 < preds < 17
True
Instead of only returning the prediction, we can also return a prediction interval:
>>> preds, interval = forest.predict(np.ones(8), uncertainty=0.25)
>>> interval[0] < preds < interval[1]
True
- fit(X, y, verbose: Optional[bool] = None)
Fit decision trees in parallel.
- Parameters
X (array-like or sparse matrix) – The input samples, of shape [n_samples, n_features]. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.
y (array-like) – The target values (real numbers), of shape [n_samples] or [n_samples, n_outputs].
verbose (bool or None, optional) – Whether extra output should be printed during training. If None then the initialised value of the verbose parameter will be used. Defaults to None.
- predict(X: Sequence[Union[float, int]], uncertainty: Optional[float] = None, quantiles: Optional[Sequence[float]] = None, verbose: Optional[bool] = None) → Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]]
Predict regression value for X.
- Parameters
X (array-like or sparse matrix) – The input samples, of shape [n_samples, n_features]. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.
uncertainty (float or None, optional) – Value ranging from 0 to 1. If None then no prediction intervals will be returned. Defaults to None.
quantiles (sequence of floats or None, optional) – List of quantiles to output, as an alternative to the uncertainty argument. Ignored if uncertainty is set. If None then uncertainty is used. Defaults to None.
verbose (bool or None, optional) – Whether extra output should be printed during inference. If None then the initialised value of the verbose parameter will be used. Defaults to None.
- Returns
Either array with predictions, of shape [n_samples,], or a pair of arrays with the first one being the predictions and the second one being the desired quantiles/intervals, of shape [2, n_samples] if uncertainty is not None, and [n_quantiles, n_samples] if quantiles is not None.
- Return type
Array or pair of arrays
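The interplay between uncertainty and quantiles described above can be sketched as follows. This is an illustration only: resolve_quantile_levels is a hypothetical helper, not part of doubt, and the mapping of uncertainty to the symmetric pair (uncertainty / 2, 1 - uncertainty / 2) is an assumed convention that this page does not confirm.

```python
from typing import List, Optional, Sequence


def resolve_quantile_levels(
    uncertainty: Optional[float] = None,
    quantiles: Optional[Sequence[float]] = None,
) -> Optional[List[float]]:
    """Hypothetical helper: pick the quantile levels implied by the arguments."""
    if uncertainty is not None:
        if not 0.0 <= uncertainty <= 1.0:
            raise ValueError("uncertainty must be between 0 and 1")
        # Assumption: a symmetric central interval, so two levels are returned.
        # As documented, quantiles is ignored when uncertainty is set.
        return [uncertainty / 2, 1 - uncertainty / 2]
    if quantiles is not None:
        return list(quantiles)
    # Neither argument given: only point predictions would be returned.
    return None


print(resolve_quantile_levels(uncertainty=0.25))  # [0.125, 0.875]
```

Under this assumption, uncertainty=0.25 asks for a central interval covering 75% of the predictive distribution, which fits the two-row interval shape given under Returns.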
doubt.models.tree.tree module
Quantile regression trees
- class doubt.models.tree.tree.BaseTreeQuantileRegressor(*, criterion, splitter, max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_features, max_leaf_nodes, random_state, min_impurity_decrease, min_impurity_split, class_weight=None, ccp_alpha=0.0)
Bases: sklearn.tree._classes.BaseDecisionTree
- fit(X: Sequence[Union[float, int]], y: Sequence[Union[float, int]], sample_weight: Optional[Sequence[Union[float, int]]] = None, check_input: bool = True, X_idx_sorted: Optional[Sequence[Union[float, int]]] = None)
Build a decision tree regressor from the training set (X, y).
- Parameters
X (array-like or sparse matrix) – The training input samples, of shape [n_samples, n_features]. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csc_matrix.
y (array-like) – The target values (real numbers), of shape [n_samples] or [n_samples, n_outputs].
sample_weight (array-like or None, optional) – Sample weights of shape = [n_samples]. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node. Defaults to None.
check_input (boolean, optional) – Whether to validate the input. Setting this to False bypasses several input checks; don’t change this parameter unless you know what you are doing. Defaults to True.
X_idx_sorted (array-like or None, optional) – The indexes of the sorted training input samples, of shape [n_samples, n_features]. If many trees are grown on the same dataset, this allows the ordering to be cached between trees. If None, the data will be sorted here. Don’t use this parameter unless you know what you are doing. Defaults to None.
- predict(X: Sequence[Union[float, int]], uncertainty: Optional[float] = None, quantiles: Optional[Sequence[float]] = None, check_input: bool = True) → Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]]
Predict regression value for X.
- Parameters
X (array-like or sparse matrix) – The input samples, of shape [n_samples, n_features]. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.
uncertainty (float or None, optional) – Value ranging from 0 to 1. If None then no prediction intervals will be returned. Defaults to None.
quantiles (sequence of floats or None, optional) – List of quantiles to output, as an alternative to the uncertainty argument. Ignored if uncertainty is set. If None then uncertainty is used. Defaults to None.
check_input (boolean, optional) – Whether to validate the input. Setting this to False bypasses several input checks; don’t change this parameter unless you know what you are doing. Defaults to True.
- Returns
Either array with predictions, of shape [n_samples,], or a pair of arrays with the first one being the predictions and the second one being the desired quantiles/intervals, of shape [n_samples, 2] if uncertainty is not None, and [n_samples, n_quantiles] if quantiles is not None.
- Return type
Array or pair of arrays
- class doubt.models.tree.tree.QuantileRegressionTree(criterion: str = 'mse', splitter: str = 'best', max_features: Optional[Union[int, float, str]] = None, max_depth: Optional[int] = None, min_samples_split: Union[int, float] = 2, min_samples_leaf: Union[int, float] = 1, min_weight_fraction_leaf: float = 0.0, max_leaf_nodes: Optional[int] = None, random_seed: Optional[Union[int, numpy.random.mtrand.RandomState]] = None)
Bases: sklearn.tree._classes.DecisionTreeRegressor, doubt.models.tree.tree.BaseTreeQuantileRegressor
A decision tree regressor that provides quantile estimates.
- Parameters
criterion (string, optional) – The function to measure the quality of a split. Supported criteria are ‘mse’ for the mean squared error, which is equal to variance reduction as feature selection criterion, and ‘mae’ for the mean absolute error. Defaults to ‘mse’.
splitter (string, optional) – The strategy used to choose the split at each node. Supported strategies are ‘best’ to choose the best split and ‘random’ to choose the best random split. Defaults to ‘best’.
max_features (int, float, string or None, optional) –
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.
If ‘auto’, then max_features=n_features.
If ‘sqrt’, then max_features=sqrt(n_features).
If ‘log2’, then max_features=log2(n_features).
If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features. Defaults to None.
max_depth (int or None, optional) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. Defaults to None.
min_samples_split (int or float, optional) –
The minimum number of samples required to split an internal node:
If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a percentage and ceil(min_samples_split * n_samples) are the minimum number of samples for each split. Defaults to 2.
min_samples_leaf (int or float, optional) –
The minimum number of samples required to be at a leaf node:
If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a percentage and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node. Defaults to 1.
min_weight_fraction_leaf (float, optional) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided. Defaults to 0.0.
max_leaf_nodes (int or None, optional) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes. Defaults to None.
random_seed (int, RandomState instance or None, optional) – If int, random_seed is the seed used by the random number generator; if RandomState instance, random_seed is the random number generator; if None, the random number generator is the RandomState instance used by np.random. Defaults to None.
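The int/float/string rules for max_features, min_samples_split and min_samples_leaf listed above can be sketched in a few lines. The helper names resolve_max_features and resolve_min_samples are hypothetical and exist only for this illustration. Truncating the ‘sqrt’ and ‘log2’ results to an integer with a floor of 1 is an assumption, since the parameter descriptions do not state how those values are rounded.

```python
import math
from typing import Optional, Union


def resolve_max_features(
    max_features: Optional[Union[int, float, str]], n_features: int
) -> int:
    """Hypothetical helper: number of features inspected at each split."""
    if max_features is None or max_features == "auto":
        return n_features
    if max_features == "sqrt":
        # Assumption: truncate to an integer and use at least one feature.
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    if isinstance(max_features, float):
        # A float is read as a fraction of the available features.
        return int(max_features * n_features)
    raise ValueError(f"Unsupported max_features value: {max_features!r}")


def resolve_min_samples(value: Union[int, float], n_samples: int) -> int:
    """Hypothetical helper for min_samples_split and min_samples_leaf."""
    if isinstance(value, int):
        return value
    # A float is read as a fraction of the training samples, rounded up.
    return math.ceil(value * n_samples)


print(resolve_max_features("sqrt", 8))  # 2
print(resolve_max_features(0.5, 8))     # 4
print(resolve_min_samples(0.01, 1030))  # 11
```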
- feature_importances_
The feature importances, of shape = [n_features]. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
- Type
array
- max_features_
The inferred value of max_features.
- Type
int
- n_features_
The number of features when fit is performed.
- Type
int
- n_outputs_
The number of outputs when fit is performed.
- Type
int
- tree_
The underlying Tree object.
- Type
Tree object
- y_train_
Train target values.
- Type
array-like
- y_train_leaves_
Cache of the leaf nodes that each training sample falls into. y_train_leaves_[i] is the leaf that y_train_[i] ends up at.
- Type
array-like
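One way a cache like y_train_leaves_ can support quantile estimates is sketched below: collect the training targets that share a leaf with the query sample and read empirical quantiles off them. This is an assumption about how such a cache might be used, not a description of doubt's implementation. The function leaf_quantiles and the nearest-rank quantile rule are hypothetical choices made for this illustration.

```python
import math
from typing import List, Sequence


def leaf_quantiles(
    y_train: Sequence[float],
    y_train_leaves: Sequence[int],
    leaf_id: int,
    quantiles: Sequence[float],
) -> List[float]:
    """Hypothetical helper: empirical quantiles of the targets in one leaf."""
    # Keep only the training targets whose sample landed in the given leaf.
    leaf_targets = sorted(
        y for y, leaf in zip(y_train, y_train_leaves) if leaf == leaf_id
    )
    if not leaf_targets:
        raise ValueError(f"No training samples fell into leaf {leaf_id}")

    results = []
    for q in quantiles:
        # Nearest-rank rule: the smallest target such that at least a
        # fraction q of the leaf's targets are less than or equal to it.
        rank = max(1, math.ceil(q * len(leaf_targets)))
        results.append(leaf_targets[rank - 1])
    return results


y_train = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0, 40.0]
y_train_leaves = [3, 3, 3, 5, 5, 5, 5]  # leaf index of each training sample
print(leaf_quantiles(y_train, y_train_leaves, leaf_id=5, quantiles=[0.25, 0.75]))
# [10.0, 30.0]
```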
doubt.models.tree.utils module
Utility functions used in tree models
- doubt.models.tree.utils.weighted_percentile(arr: Sequence[Union[float, int]], quantile: float, weights: Optional[Sequence[Union[float, int]]] = None, sorter: Optional[Sequence[Union[float, int]]] = None)
Returns the weighted percentile of an array.
See [1] for an explanation of this concept.
- Parameters
arr (array-like) – Samples at which the quantile should be computed, of shape [n_samples,].
quantile (float) – Quantile, between 0.0 and 1.0.
weights (array-like, optional) – The weights, of shape = (n_samples,). Here weights[i] is the weight given to point arr[i] while computing the quantile. If weights[i] is zero, arr[i] is simply ignored during the percentile computation. If None then uniform weights will be used. Defaults to None.
sorter (array-like, optional) – Array of shape [n_samples,], indicating the indices sorting arr. Thus, if provided, we assume that arr[sorter] is sorted. If None then arr will be sorted. Defaults to None.
- Returns
Weighted percentile of arr at quantile.
- Return type
float
- Raises
ValueError – If quantile is not between 0.0 and 1.0, or if arr and weights are of different lengths.
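A simplified sketch of the idea follows. It is not the library's implementation: weighted_percentile_sketch is a hypothetical name, it omits the sorter argument, and it returns the smallest value whose cumulative weight reaches quantile times the total weight, without the interpolation that a full implementation may apply. Only the documented behaviour (zero weights are ignored, uniform weights by default, ValueError on bad input) is taken from the description above.

```python
from typing import Optional, Sequence, Union

Number = Union[float, int]


def weighted_percentile_sketch(
    arr: Sequence[Number],
    quantile: float,
    weights: Optional[Sequence[Number]] = None,
) -> Number:
    """Simplified weighted percentile without interpolation (illustrative)."""
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be between 0.0 and 1.0")
    if weights is None:
        weights = [1.0] * len(arr)  # uniform weights by default
    if len(arr) != len(weights):
        raise ValueError("arr and weights must have the same length")

    # Drop zero-weight points, then sort the remaining points by value.
    pairs = sorted((v, w) for v, w in zip(arr, weights) if w > 0)
    if not pairs:
        raise ValueError("at least one positive weight is required")

    threshold = quantile * sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= threshold:
            return value
    return pairs[-1][0]  # guard against floating-point round-off


print(weighted_percentile_sketch([1, 2, 3, 4], 0.5))                # 2
print(weighted_percentile_sketch([1, 2, 3, 4], 0.5, [1, 1, 1, 5]))  # 4
```

With uniform weights the median of [1, 2, 3, 4] under this rule is 2. Giving the last point a weight of 5 moves half of the total weight past 3, so the result becomes 4.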