doubt.datasets package

Submodules

doubt.datasets.airfoil module

Airfoil data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.airfoil.Airfoil(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

The NASA data set comprises different size NACA 0012 airfoils at various wind tunnel speeds and angles of attack. The span of the airfoil and the observer position were the same in all of the experiments.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
int:

Frequency, in Hertz

float:

Angle of attack, in degrees

float:

Chord length, in meters

float:

Free-stream velocity, in meters per second

float:

Suction side displacement thickness, in meters

Targets:
float:

Scaled sound pressure level, in decibels

Source:

https://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise

Examples

Load in the data set:

>>> dataset = Airfoil()
>>> dataset.shape
(1503, 6)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((1503, 5), (1503,))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((1181, 5), (1181,), (322, 5), (322,))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
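
The cache parameter documented above also accepts None, in which case no cache file is written to disk (a minimal sketch; without a cache the data presumably has to be fetched anew on each instantiation):

>>> dataset_without_cache = Airfoil(cache=None)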

doubt.datasets.bike_sharing_daily module

Daily bike sharing data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.bike_sharing_daily.BikeSharingDaily(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

Bike sharing systems are a new generation of traditional bike rentals, where the whole process from membership to rental and return has become automatic. Through these systems, a user is able to easily rent a bike from a particular position and return it at another position. Currently, there are over 500 bike-sharing programs around the world, composed of over 500 thousand bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.

Apart from interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected via monitoring these data.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
instant (int):

Record index

season (int):

The season, with 1 = winter, 2 = spring, 3 = summer and 4 = autumn

yr (int):

The year, with 0 = 2011 and 1 = 2012

mnth (int):

The month, from 1 to 12 inclusive

holiday (int):

Whether day is a holiday or not, binary valued

weekday (int):

The day of the week, from 0 to 6 inclusive

workingday (int):

Working day, 1 if day is neither weekend nor holiday, otherwise 0

weathersit (int):

Weather, encoded as

  1. Clear, few clouds, partly cloudy

  2. Mist and cloudy, mist and broken clouds, mist and few clouds

  3. Light snow, light rain and thunderstorm and scattered clouds, light rain and scattered clouds

  4. Heavy rain and ice pellets and thunderstorm and mist, or snow and fog

temp (float):

Max-min normalised temperature in Celsius, from -8 to +39

atemp (float):

Max-min normalised feeling temperature in Celsius, from -16 to +50

hum (float):

Scaled max-min normalised humidity, from 0 to 1

windspeed (float):

Scaled max-min normalised wind speed, from 0 to 1

Targets:
casual (int):

Count of casual users

registered (int):

Count of registered users

cnt (int):

Sum of casual and registered users

Source:

https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

Examples

Load in the data set:

>>> dataset = BikeSharingDaily()
>>> dataset.shape
(731, 15)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((731, 12), (731, 3))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((574, 12), (574, 3), (157, 12), (157, 3))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
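
This data set has three target columns. A single target such as cnt (the sum of casual and registered users) can also be selected directly from the DataFrame; a minimal sketch, assuming the column names listed above:

>>> df = BikeSharingDaily().to_pandas()
>>> feature_cols = [col for col in df.columns if col not in ('casual', 'registered', 'cnt')]
>>> X_cnt, y_cnt = df[feature_cols], df['cnt']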

doubt.datasets.bike_sharing_hourly module

Hourly bike sharing data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.bike_sharing_hourly.BikeSharingHourly(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

Bike sharing systems are a new generation of traditional bike rentals, where the whole process from membership to rental and return has become automatic. Through these systems, a user is able to easily rent a bike from a particular position and return it at another position. Currently, there are over 500 bike-sharing programs around the world, composed of over 500 thousand bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.

Apart from interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected via monitoring these data.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
instant (int):

Record index

season (int):

The season, with 1 = winter, 2 = spring, 3 = summer and 4 = autumn

yr (int):

The year, with 0 = 2011 and 1 = 2012

mnth (int):

The month, from 1 to 12 inclusive

hr (int):

The hour of the day, from 0 to 23 inclusive

holiday (int):

Whether day is a holiday or not, binary valued

weekday (int):

The day of the week, from 0 to 6 inclusive

workingday (int):

Working day, 1 if day is neither weekend nor holiday, otherwise 0

weathersit (int):

Weather, encoded as

  1. Clear, few clouds, partly cloudy

  2. Mist and cloudy, mist and broken clouds, mist and few clouds

  3. Light snow, light rain and thunderstorm and scattered clouds, light rain and scattered clouds

  4. Heavy rain and ice pellets and thunderstorm and mist, or snow and fog

temp (float):

Max-min normalised temperature in Celsius, from -8 to +39

atemp (float):

Max-min normalised feeling temperature in Celsius, from -16 to +50

hum (float):

Scaled max-min normalised humidity, from 0 to 1

windspeed (float):

Scaled max-min normalised wind speed, from 0 to 1

Targets:
casual (int):

Count of casual users

registered (int):

Count of registered users

cnt (int):

Sum of casual and registered users

Source:

https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

Examples

Load in the data set:

>>> dataset = BikeSharingHourly()
>>> dataset.shape
(17379, 16)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((17379, 13), (17379, 3))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((13873, 13), (13873, 3), (3506, 13), (3506, 3))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>

doubt.datasets.blog module

Blog post data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.blog.Blog(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

This data originates from blog posts. The raw HTML documents of the blog posts were crawled and processed. The prediction task associated with the data is the prediction of the number of comments in the upcoming 24 hours. In order to simulate this situation, we choose a basetime (in the past) and select the blog posts that were published at most 72 hours before the selected base date/time. Then, we calculate all the features of the selected blog posts from the information that was available at the basetime; each instance therefore corresponds to a blog post. The target is the number of comments that the blog post received in the next 24 hours relative to the basetime.

In the train data, the basetimes were in the years 2010 and 2011. In the test data, the basetimes were in February and March 2012. This simulates the real-world situation in which training data from the past is available to predict events in the future.

The train data was generated from different basetimes that may temporally overlap. Therefore, if you simply split the train data into disjoint partitions, the underlying time intervals may overlap. You should thus use the provided, temporally disjoint train and test splits in order to ensure that the evaluation is fair.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
Features 0-49 (float):

50 features containing the average, standard deviation, minimum, maximum and median of features 50-59 for the source of the current blog post, by which we mean the blog on which the post appeared. For example, myblog.blog.org would be the source of the post myblog.blog.org/post_2010_09_10

Feature 50 (int):

Total number of comments before basetime

Feature 51 (int):

Number of comments in the last 24 hours before the basetime

Feature 52 (int):

If T1 is the datetime 48 hours before basetime and T2 is the datetime 24 hours before basetime, then this is the number of comments in the time period between T1 and T2

Feature 53 (int):

Number of comments in the first 24 hours after the publication of the blog post, but before basetime

Feature 54 (int):

The difference between Feature 51 and Feature 52

Features 55-59 (int):

The same thing as Features 50-54, but for links (trackbacks) instead of comments

Feature 60 (float):

The length of time between the publication of the blog post and basetime

Feature 61 (int):

The length of the blog post

Features 62-261 (int):

The 200 bag of words features for 200 frequent words of the text of the blog post

Features 262-268 (int):

Binary indicators for the weekday (Monday-Sunday) of the basetime

Features 269-275 (int):

Binary indicators for the weekday (Monday-Sunday) of the date of publication of the blog post

Feature 276 (int):

Number of parent pages: we consider a blog post P as a parent of blog post B if B is a reply (trackback) to P

Features 277-279 (float):

Minimum, maximum and average of the number of comments the parents received

Targets:
int:

The number of comments in the next 24 hours, relative to the basetime

Source:

https://archive.ics.uci.edu/ml/datasets/BlogFeedback

Examples

Load in the data set:

>>> dataset = Blog()
>>> dataset.shape
(52397, 281)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((52397, 279), (52397,))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((41949, 279), (41949,), (10448, 279), (10448,))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>

doubt.datasets.concrete module

Concrete data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.concrete.Concrete(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
Cement (float):

Kg of cement in an m3 mixture

Blast Furnace Slag (float):

Kg of blast furnace slag in an m3 mixture

Fly Ash (float):

Kg of fly ash in an m3 mixture

Water (float):

Kg of water in an m3 mixture

Superplasticiser (float):

Kg of superplasticiser in an m3 mixture

Coarse Aggregate (float):

Kg of coarse aggregate in an m3 mixture

Fine Aggregate (float):

Kg of fine aggregate in an m3 mixture

Age (int):

Age in days, between 1 and 365 inclusive

Targets:
Concrete Compressive Strength (float):

Concrete compressive strength in megapascals

Source:

https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength

Examples

Load in the data set:

>>> dataset = Concrete()
>>> dataset.shape
(1030, 9)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((1030, 8), (1030,))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((807, 8), (807,), (223, 8), (223,))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>

doubt.datasets.cpu module

CPU data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.cpu.CPU(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

Relative CPU Performance Data, described in terms of its cycle time, memory size, etc.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
vendor_name (string):

Name of the vendor, 30 unique values

model_name (string):

Name of the model

myct (int):

Machine cycle time in nanoseconds

mmin (int):

Minimum main memory in kilobytes

mmax (int):

Maximum main memory in kilobytes

cach (int):

Cache memory in kilobytes

chmin (int):

Minimum channels in units

chmax (int):

Maximum channels in units

Targets:
prp (int):

Published relative performance

Source:

https://archive.ics.uci.edu/ml/datasets/Computer+Hardware

Examples

Load in the data set:

>>> dataset = CPU()
>>> dataset.shape
(209, 9)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((209, 8), (209,))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((162, 8), (162,), (47, 8), (47,))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>

doubt.datasets.facebook_comments module

Facebook comments data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.facebook_comments.FacebookComments(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

Instances in this dataset contain features extracted from Facebook posts. The task associated with the data is to predict how many comments the post will receive.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
page_popularity (int):

Defines the popularity or support for the source of the document

page_checkins (int):

Describes how many individuals so far visited this place. This feature is only associated with places; e.g., some institution, place, theater, etc.

page_talking_about (int):

Defines the daily interest of individuals towards the source of the document/post, i.e. the people who actually come back to the page after liking it. This includes activities such as comments, likes of a post, shares, etc. by visitors to the page

page_category (int):

Defines the category of the source of the document; e.g., place, institution, branch etc.

agg[n] for n=0..24 (float):

These features are aggregated by page, by calculating min, max, average, median and standard deviation of essential features

cc1 (int):

The total number of comments before selected base date/time

cc2 (int):

The number of comments in the last 24 hours, relative to base date/time

cc3 (int):

The number of comments in the last 48 to last 24 hours relative to base date/time

cc4 (int):

The number of comments in the first 24 hours after the publication of post but before base date/time

cc5 (int):

The difference between cc2 and cc3

base_time (int):

Selected time in order to simulate the scenario, ranges from 0 to 71

post_length (int):

Character count in the post

post_share_count (int):

This feature counts the number of shares of the post, i.e. how many people have shared this post onto their timeline

post_promotion_status (int):

Binary feature. To reach more people with posts in News Feed, individuals can promote their post and this feature indicates whether the post is promoted or not

h_local (int):

This describes the hours for which we have received the target variable/comments. Ranges from 0 to 23

day_published[n] for n=0..6 (int):

Binary feature. This represents the day (Sunday-Saturday) on which the post was published

day[n] for n=0..6 (int):

Binary feature. This represents the day (Sunday-Saturday) on selected base date/time

Targets:

ncomments (int): The number of comments in the next h_local hours

Source:

https://archive.ics.uci.edu/ml/datasets/Facebook+Comment+Volume+Dataset

Examples

Load in the data set:

>>> dataset = FacebookComments()
>>> dataset.shape
(199030, 54)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((199030, 54), (199030,))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((159211, 54), (159211,), (39819, 54), (39819,))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>

doubt.datasets.facebook_metrics module

Facebook metrics data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.facebook_metrics.FacebookMetrics(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

The data is related to posts published during the year 2014 on the Facebook page of a renowned cosmetics brand.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
page_likes (int):

The total number of likes of the Facebook page at the given time.

post_type (int):

The type of post. Here 0 means ‘Photo’, 1 means ‘Status’, 2 means ‘Link’ and 3 means ‘Video’

post_category (int):

The category of the post.

post_month (int):

The month the post was posted, from 1 to 12 inclusive.

post_weekday (int):

The day of the week the post was posted, from 1 to 7 inclusive.

post_hour (int):

The hour the post was posted, from 0 to 23 inclusive

paid (int):

Binary feature, whether the post was paid for.

Targets:
total_reach (int):

The lifetime post total reach.

total_impressions (int):

The lifetime post total impressions.

engaged_users (int):

The lifetime engaged users.

post_consumers (int):

The lifetime post consumers.

post_consumptions (int):

The lifetime post consumptions.

post_impressions (int):

The lifetime post impressions by people who liked the page.

post_reach (int):

The lifetime post reach by people who liked the page.

post_engagements (int):

The lifetime people who have liked the page and engaged with the post.

comments (int):

The number of comments.

shares (int):

The number of shares.

total_interactions (int):

The total number of interactions

Source:

https://archive.ics.uci.edu/ml/datasets/Facebook+metrics

Examples

Load in the data set:

>>> dataset = FacebookMetrics()
>>> dataset.shape
(500, 18)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((500, 7), (500, 11))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((388, 7), (388, 11), (112, 7), (112, 11))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>

doubt.datasets.fish_bioconcentration module

Fish bioconcentration data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.fish_bioconcentration.FishBioconcentration(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

This dataset contains manually-curated experimental bioconcentration factor (BCF) for 1058 molecules (continuous values). Each row contains a molecule, identified by a CAS number, a name (if available), and a SMILES string. Additionally, the KOW (experimental or predicted) is reported. In this database, you will also find Extended Connectivity Fingerprints (binary vectors of 1024 bits), to be used as independent variables to predict the BCF.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
logkow (float):

Octanol-water partitioning coefficient (experimental or predicted, as indicated by kow_exp)

kow_exp (int):

Indicates whether logKOW is experimental or predicted, with 1 denoting experimental and 0 denoting predicted

smiles_[idx] for idx = 0..125 (int):

Encoding of the SMILES string identifying the 2D molecular structure. The encoding is as follows, where ‘x’ is a padding character used to ensure that all the SMILES strings have the same length (a decoding sketch follows the list):

  • 0 = ‘x’

  • 1 = ‘#’

  • 2 = ‘(‘

  • 3 = ‘)’

  • 4 = ‘+’

  • 5 = ‘-‘

  • 6 = ‘/’

  • 7 = ‘1’

  • 8 = ‘2’

  • 9 = ‘3’

  • 10 = ‘4’

  • 11 = ‘5’

  • 12 = ‘6’

  • 13 = ‘7’

  • 14 = ‘8’

  • 15 = ‘=’

  • 16 = ‘@’

  • 17 = ‘B’

  • 18 = ‘C’

  • 19 = ‘F’

  • 20 = ‘H’

  • 21 = ‘I’

  • 22 = ‘N’

  • 23 = ‘O’

  • 24 = ‘P’

  • 25 = ‘S’

  • 26 = ‘[‘

  • 27 = ‘\’

  • 28 = ‘]’

  • 29 = ‘c’

  • 30 = ‘i’

  • 31 = ‘l’

  • 32 = ‘n’

  • 33 = ‘o’

  • 34 = ‘r’

  • 35 = ‘s’
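
The encoding above can be inverted to recover a readable SMILES string from a row of the data set. A minimal sketch, assuming the columns are named smiles_0 through smiles_125 and that code 27 is the backslash character:

>>> ALPHABET = list("x#()+-/12345678=@BCFHINOPS[\\]cilnors")
>>> def decode_smiles(row):
...     codes = [int(row[f"smiles_{idx}"]) for idx in range(126)]
...     return "".join(ALPHABET[code] for code in codes).rstrip("x")
>>> df = FishBioconcentration().to_pandas()
>>> smiles = decode_smiles(df.iloc[0])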

Targets:
logbcf (float):

Experimental fish bioconcentration factor (logarithm form)

Source:

https://archive.ics.uci.edu/ml/datasets/QSAR+fish+bioconcentration+factor+%28BCF%29

Examples

Load in the data set:

>>> dataset = FishBioconcentration()
>>> dataset.shape
(1054, 129)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((1054, 128), (1054,))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((825, 128), (825,), (229, 128), (229,))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>

doubt.datasets.fish_toxicity module

Fish toxicity data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.fish_toxicity.FishToxicity(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

This dataset was used to develop quantitative regression QSAR models to predict acute aquatic toxicity towards the fish Pimephales promelas (fathead minnow) on a set of 908 chemicals. LC50 data, which is the concentration that causes death in 50% of test fish over a test duration of 96 hours, was used as the model response.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
CIC0 (float):

Information indices

SM1_Dz(Z) (float):

2D matrix-based descriptors

GATS1i (float):

2D autocorrelations

NdsCH (int):

Atom-type counts

NdssC (int):

Atom-type counts

MLOGP (float):

Molecular properties

Targets:
LC50 (float):

A concentration that causes death in 50% of test fish over a test duration of 96 hours. In -log(mol/L) units.

Source:

https://archive.ics.uci.edu/ml/datasets/QSAR+fish+toxicity

Examples

Load in the data set:

>>> dataset = FishToxicity()
>>> dataset.shape
(908, 7)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((908, 6), (908,))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((708, 6), (708,), (200, 6), (200,))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>

doubt.datasets.forest_fire module

Forest fire data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.forest_fire.ForestFire(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

This is a difficult regression task, where the aim is to predict the burned area of forest fires, in the northeast region of Portugal, by using meteorological and other data.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
X (float):

The x-axis spatial coordinate within the Montesinho park map. Ranges from 1 to 9.

Y (float):

The y-axis spatial coordinate within the Montesinho park map. Ranges from 2 to 9.

month (int):

Month of the year. Ranges from 0 to 11

day (int):

Day of the week. Ranges from 0 to 6

FFMC (float):

FFMC index from the FWI system. Ranges from 18.7 to 96.20

DMC (float):

DMC index from the FWI system. Ranges from 1.1 to 291.3

DC (float):

DC index from the FWI system. Ranges from 7.9 to 860.6

ISI (float):

ISI index from the FWI system. Ranges from 0.0 to 56.1

temp (float):

Temperature in Celsius degrees. Ranges from 2.2 to 33.3

RH (float):

Relative humidity in %. Ranges from 15.0 to 100.0

wind (float):

Wind speed in km/h. Ranges from 0.4 to 9.4

rain (float):

Outside rain in mm/m2. Ranges from 0.0 to 6.4

Targets:
area (float):

The burned area of the forest (in ha). Ranges from 0.00 to 1090.84

Notes

The target variable is very skewed towards 0.0, so it may make sense to model its logarithm transform instead, as sketched below.
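
A minimal sketch of that transform, assuming the common log(1 + area) convention so that the many zero-area records remain well defined:

>>> import numpy as np
>>> from doubt.datasets.forest_fire import ForestFire
>>> X, y = ForestFire().split()
>>> y_log = np.log1p(y)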

Source:

https://archive.ics.uci.edu/ml/datasets/Forest+Fires

Examples

Load in the data set:

>>> dataset = ForestFire()
>>> dataset.shape
(517, 13)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((517, 12), (517,))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((401, 12), (401,), (116, 12), (116,))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>

doubt.datasets.gas_turbine module

Gas turbine data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.gas_turbine.GasTurbine(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

Data have been generated from a sophisticated simulator of a Gas Turbine (GT), mounted on a Frigate characterized by a COmbined Diesel eLectric And Gas (CODLAG) propulsion plant type.

The experiments have been carried out by means of a numerical simulator of a naval vessel (Frigate) characterized by a Gas Turbine (GT) propulsion plant. The different blocks forming the complete simulator (Propeller, Hull, GT, Gear Box and Controller) have been developed and fine-tuned over the years on several similar real propulsion plants. In view of these observations the available data are in agreement with a possible real vessel.

In this release of the simulator it is also possible to take into account the performance decay over time of the GT components such as GT compressor and turbines.

The propulsion system behaviour has been described with these parameters:

  • Ship speed (linear function of the lever position lp).

  • Compressor degradation coefficient kMc.

  • Turbine degradation coefficient kMt.

so that each possible degradation state can be described by a combination of the triple (lp, kMt, kMc).

The range of decay of compressor and turbine has been sampled with a uniform grid of precision 0.001 so as to have a good granularity of representation.

In particular for the compressor decay state discretization the kMc coefficient has been investigated in the domain [1; 0.95], and the turbine coefficient in the domain [1; 0.975].

Ship speed has been investigated by sampling the range of feasible speeds from 3 knots to 27 knots with a granularity of representation equal to three knots.

A series of measures (16 features) which indirectly represent the state of the system subject to performance decay have been acquired and stored in the dataset over the parameter space.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
lever_position (float):

The position of the lever

ship_speed (float):

The ship speed, in knots

shaft_torque (float):

The shaft torque of the gas turbine, in kN m

turbine_revolution_rate (float):

The gas turbine rate of revolutions, in rpm

generator_revolution_rate (float):

The gas generator rate of revolutions, in rpm

starboard_propeller_torque (float):

The torque of the starboard propeller, in kN

port_propeller_torque (float):

The torque of the port propeller, in kN

turbine_exit_temp (float):

High pressure turbine exit temperature, in Celsius

inlet_temp (float):

Gas turbine compressor inlet air temperature, in Celsius

outlet_temp (float):

Gas turbine compressor outlet air temperature, in Celsius

turbine_exit_pres (float):

High pressure turbine exit pressure, in bar

inlet_pres (float):

Gas turbine compressor inlet air pressure, in bar

outlet_pres (float):

Gas turbine compressor outlet air pressure, in bar

exhaust_pres (float):

Gas turbine exhaust gas pressure, in bar

turbine_injection_control (float):

Turbine injection control, in percent

fuel_flow (float):

Fuel flow, in kg/s

Targets:
compressor_decay (float):

Gas turbine compressor decay state coefficient

turbine_decay (float):

Gas turbine decay state coefficient

Source:

https://archive.ics.uci.edu/ml/datasets/Condition+Based+Maintenance+of+Naval+Propulsion+Plants

Examples

Load in the data set:

>>> dataset = GasTurbine()
>>> dataset.shape
(11934, 18)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((11934, 16), (11934, 2))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((9516, 16), (9516, 2), (2418, 16), (2418, 2))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>

doubt.datasets.nanotube module

Nanotube data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.nanotube.Nanotube(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

CASTEP can simulate a wide range of material properties using density functional theory (DFT). DFT is the most successful method, as it calculates atomic coordinates faster than other mathematical approaches while also reaching more accurate results. The dataset is generated with CASTEP using CNT geometry optimization. Many CNTs are simulated in CASTEP, then geometry optimizations are calculated. Initial coordinates of all carbon atoms are generated randomly. Different chiral vectors are used for each CNT simulation.

The atom type is selected as carbon and the bond length is set to 1.42 Å (the default value). CNT calculation parameters are used as default parameters. To finalize the computation, CASTEP uses a parameter named elec_energy_tol (electrical energy tolerance, default 1x10^-5 eV), which requires that the change in the total energy from one iteration to the next remains below some tolerance value per atom for a few self-consistent field steps. Initial atomic coordinates (u, v, w), chiral vector (n, m) and calculated atomic coordinates (u, v, w) are obtained from the output files.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
Chiral indice n (int):

n parameter of the selected chiral vector

Chiral indice m (int):

m parameter of the selected chiral vector

Initial atomic coordinate u (float):

Randomly generated u parameter of the initial atomic coordinates of all carbon atoms.

Initial atomic coordinate v (float):

Randomly generated v parameter of the initial atomic coordinates of all carbon atoms.

Initial atomic coordinate w (float):

Randomly generated w parameter of the initial atomic coordinates of all carbon atoms.

Targets:
Calculated atomic coordinates u (float):

Calculated u parameter of the atomic coordinates of all carbon atoms

Calculated atomic coordinates v (float):

Calculated v parameter of the atomic coordinates of all carbon atoms

Calculated atomic coordinates w (float):

Calculated w parameter of the atomic coordinates of all carbon atoms

Sources:

https://archive.ics.uci.edu/ml/datasets/Carbon+Nanotubes https://doi.org/10.1007/s00339-016-0153-1 https://doi.org/10.17341/gazimmfd.337642

Examples

Load in the data set:

>>> dataset = Nanotube()
>>> dataset.shape
(10721, 8)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((10721, 5), (10721, 3))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((8541, 5), (8541, 3), (2180, 5), (2180, 3))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>

doubt.datasets.new_taipei_housing module

New Taipei Housing data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.new_taipei_housing.NewTaipeiHousing(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

The “real estate valuation” task is a regression problem. The market historical data set of real estate valuation was collected from Sindian Dist., New Taipei City, Taiwan.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
transaction_date (float):

The transaction date encoded as a floating point value. For instance, 2013.250 is March 2013 and 2013.500 is June 2013 (a decoding sketch follows the target list below)

house_age (float):

The age of the house

mrt_distance (float):

Distance to the nearest MRT station

n_stores (int):

Number of convenience stores

lat (float):

Latitude

lng (float):

Longitude

Targets:
house_price (float):

House price of unit area
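
As noted above, the transaction date is encoded as a fractional year. A minimal sketch of decoding it back into a calendar year and month, assuming the fraction is simply the month divided by 12 (consistent with the two examples given):

>>> def decode_transaction_date(value):
...     year = int(value)
...     month = round((value - year) * 12)
...     return year, month
>>> decode_transaction_date(2013.250)
(2013, 3)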

Source:

https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set

Examples

Load in the data set:

>>> dataset = NewTaipeiHousing()
>>> dataset.shape
(414, 7)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((414, 6), (414,))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((323, 6), (323,), (91, 6), (91,))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>

doubt.datasets.parkinsons module

Parkinsons data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.parkinsons.Parkinsons(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

This dataset is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson’s disease recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. The recordings were automatically captured in the patients’ homes.

Columns in the table contain subject number, subject age, subject gender, time interval from baseline recruitment date, motor UPDRS, total UPDRS, and 16 biomedical voice measures. Each row corresponds to one of 5,875 voice recordings from these individuals. The main aim of the data is to predict the motor and total UPDRS scores (‘motor_UPDRS’ and ‘total_UPDRS’) from the 16 voice measures.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
subject# (int):

Integer that uniquely identifies each subject

age (int):

Subject age

sex (int):

Binary feature. Subject sex, with 0 being male and 1 female

test_time (float):

Time since recruitment into the trial. The integer part is the number of days since recruitment

Jitter(%) (float):

Measure of variation in fundamental frequency

Jitter(Abs) (float):

Measure of variation in fundamental frequency

Jitter:RAP (float):

Measure of variation in fundamental frequency

Jitter:PPQ5 (float):

Measure of variation in fundamental frequency

Jitter:DDP (float):

Measure of variation in fundamental frequency

Shimmer (float):

Measure of variation in amplitude

Shimmer(dB) (float):

Measure of variation in amplitude

Shimmer:APQ3 (float):

Measure of variation in amplitude

Shimmer:APQ5 (float):

Measure of variation in amplitude

Shimmer:APQ11 (float):

Measure of variation in amplitude

Shimmer:DDA (float):

Measure of variation in amplitude

NHR (float):

Measure of ratio of noise to tonal components in the voice

HNR (float):

Measure of ratio of noise to tonal components in the voice

RPDE (float):

A nonlinear dynamical complexity measure

DFA (float):

Signal fractal scaling exponent

PPE (float):

A nonlinear measure of fundamental frequency variation

Targets:
motor_UPDRS (float):

Clinician’s motor UPDRS score, linearly interpolated

total_UPDRS (float):

Clinician’s total UPDRS score, linearly interpolated

Source:

https://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring

Examples

Load in the data set:

>>> dataset = Parkinsons()
>>> dataset.shape
(5875, 22)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((5875, 20), (5875, 2))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((4659, 20), (4659, 2), (1216, 20), (1216, 2))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
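
Since there are two target columns, the motor and total UPDRS scores can be separated from the target array; a minimal sketch, assuming the target columns appear in the order listed above:

>>> X, y = Parkinsons().split()
>>> motor_updrs, total_updrs = y[:, 0], y[:, 1]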

doubt.datasets.power_plant module

Power plant data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.power_plant.PowerPlant(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. Features consist of hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net hourly electrical energy output (EP) of the plant.

A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the Vacuum is collected from and has an effect on the Steam Turbine, the other three ambient variables affect the GT performance.

For comparability with our baseline studies, and to allow 5x2 fold statistical tests be carried out, we provide the data shuffled five times. For each shuffling 2-fold CV is carried out and the resulting 10 measurements are used for statistical testing.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
AT (float):

Hourly average temperature in Celsius, ranges from 1.81 to 37.11

V (float):

Hourly average exhaust vacuum in cm Hg, ranges from 25.36 to 81.56

AP (float):

Hourly average ambient pressure in millibar, ranges from 992.89 to 1033.30

RH (float):

Hourly average relative humidity in percent, ranges from 25.56 to 100.16

Targets:
PE (float):

Net hourly electrical energy output in MW, ranges from 420.26 to 495.76

Source:

https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

Examples

Load in the data set:

>>> dataset = PowerPlant()
>>> dataset.shape
(9568, 5)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((9568, 4), (9568,))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((7633, 4), (7633,), (1935, 4), (1935,))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>

doubt.datasets.protein module

Protein data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.protein.Protein(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

This is a data set of Physicochemical Properties of Protein Tertiary Structure. The data set is taken from CASP 5-9. There are 45730 decoys, with sizes varying from 0 to 21 angstroms.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
F1 (float):

Total surface area

F2 (float):

Non polar exposed area

F3 (float):

Fractional area of exposed non polar residue

F4 (float):

Fractional area of exposed non polar part of residue

F5 (float):

Molecular mass weighted exposed area

F6 (float):

Average deviation from standard exposed area of residue

F7 (float):

Euclidean distance

F8 (float):

Secondary structure penalty

F9 (float):

Spacial Distribution constraints (N,K Value)

Targets:
RMSD (float):

Size of the residue

Source:

https://archive.ics.uci.edu/ml/datasets/Physicochemical+Properties+of+Protein+Tertiary+Structure

Examples

Load in the data set:

>>> dataset = Protein()
>>> dataset.shape
(45730, 10)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((45730, 9), (45730,))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((36580, 9), (36580,), (9150, 9), (9150,))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>

doubt.datasets.servo module

Servo data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.servo.Servo(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

Data was from a simulation of a servo system.

Ross Quinlan:

This data was given to me by Karl Ulrich at MIT in 1986. I didn’t record his description at the time, but here’s his subsequent (1992) recollection:

“I seem to remember that the data was from a simulation of a servo system involving a servo amplifier, a motor, a lead screw/nut, and a sliding carriage of some sort. It may have been one of the translational axes of a robot on the 9th floor of the AI lab. In any case, the output value is almost certainly a rise time, or the time required for the system to respond to a step change in a position set point.”

(Quinlan, ML’93)

“This is an interesting collection of data provided by Karl Ulrich. It covers an extremely non-linear phenomenon - predicting the rise time of a servomechanism in terms of two (continuous) gain settings and two (discrete) choices of mechanical linkages.”

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
motor (int):

Motor, ranges from 0 to 4 inclusive

screw (int):

Screw, ranges from 0 to 4 inclusive

pgain (int):

PGain, ranges from 3 to 6 inclusive

vgain (int):

VGain, ranges from 1 to 5 inclusive

Targets:
class (float):

Class values, ranges from 0.13 to 7.10 inclusive

Source:

https://archive.ics.uci.edu/ml/datasets/Servo

Examples

Load in the data set:

>>> dataset = Servo()
>>> dataset.shape
(167, 5)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((167, 4), (167,))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((131, 4), (131,), (36, 4), (36,))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>

doubt.datasets.solar_flare module

Solar flare data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.solar_flare.SolarFlare(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

Each class attribute counts the number of solar flares of a certain class that occur in a 24 hour period.

The database contains 3 potential classes, one for the number of times a certain type of solar flare occurred in a 24 hour period.

Each instance represents captured features for 1 active region on the sun.

The data are divided into two sections. The second section (flare.data2) has had much more error correction applied to it, and has consequently been treated as more reliable.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
class (int):

Code for class (modified Zurich class). Ranges from 0 to 6 inclusive

spot_size (int):

Code for largest spot size. Ranges from 0 to 5 inclusive

spot_distr (int):

Code for spot distribution. Ranges from 0 to 3 inclusive

activity (int):

Binary feature indicating activity, with 1 = reduced and 2 = unchanged

evolution (int):

0 = decay, 1 = no growth and 2 = growth

flare_activity (int):

Previous 24 hour flare activity code, where 0 = nothing as big as an M1, 1 = one M1 and 2 = more activity than one M1

is_complex (int):

Binary feature indicating historically complex

became_complex (int):

Binary feature indicating whether the region became historically complex on this pass across the sun’s disk

large (int):

Binary feature, indicating whether area is large

large_spot (int):

Binary feature, indicating whether the area of the largest spot is greater than 5

Targets:
C-class (int):

C-class flares produced by this region in the following 24 hours (common flares)

M-class (int):

M-class flares produced by this region in the following 24 hours (moderate flares)

X-class (int):

X-class flares produced by this region in the following 24 hours (severe flares)

Source:

https://archive.ics.uci.edu/ml/datasets/Solar+Flare

Examples

Load in the data set:

>>> dataset = SolarFlare()
>>> dataset.shape
(1066, 13)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((1066, 10), (1066, 3))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((837, 10), (837, 3), (229, 10), (229, 3))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
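
Since this data set has three target columns, the array y returned by split() is two-dimensional. The following is a minimal sketch of separating the individual flare counts, assuming the target columns follow the order listed above (C-class, M-class, X-class):

>>> from doubt.datasets.solar_flare import SolarFlare
>>> dataset = SolarFlare()
>>> X, y = dataset.split()
>>> # y has shape (1066, 3); pick out one column per flare class,
>>> # assuming the columns follow the listed order C, M, X
>>> y_c, y_m, y_x = y[:, 0], y[:, 1], y[:, 2]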

doubt.datasets.space_shuttle module

Space shuttle data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.space_shuttle.SpaceShuttle(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

The motivation for collecting this database was the explosion of the USA Space Shuttle Challenger on 28 January, 1986. An investigation ensued into the reliability of the shuttle’s propulsion system. The explosion was eventually traced to the failure of one of the three field joints on one of the two solid booster rockets. Each of these six field joints includes two O-rings, designated as primary and secondary, which fail when phenomena called erosion and blowby both occur.

The night before the launch a decision had to be made regarding launch safety. The discussion among engineers and managers leading to this decision included concern that the probability of failure of the O-rings depended on the temperature t at launch, which was forecast to be 31 degrees F. There are strong engineering reasons based on the composition of O-rings to support the judgment that failure probability may rise monotonically as temperature drops. One other variable, the pressure s at which safety testing for field joint leaks was performed, was available, but its relevance to the failure process was unclear.

Draper’s paper includes a menacing figure graphing the number of field joints experiencing stress vs. liftoff temperature for the 23 shuttle flights previous to the Challenger disaster. No previous liftoff temperature was under 53 degrees F. Although tremendous extrapolation must be done from the given data to assess risk at 31 degrees F, it is obvious even to the layman “to foresee the unacceptably high risk created by launching at 31 degrees F.” For more information, see Draper (1993) or the other previous analyses.

The task is to predict the number of O-rings that will experience thermal distress for a given flight when the launch temperature is below freezing.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
idx (int):

Temporal order of flight

temp (int):

Launch temperature in Fahrenheit

pres (int):

Leak-check pressure in psi

n_risky_rings (int):

Number of O-rings at risk on a given flight

Targets:
n_distressed_rings (int):

Number of O-rings experiencing thermal distress

Source:

https://archive.ics.uci.edu/ml/datasets/Challenger+USA+Space+Shuttle+O-Ring

Examples

Load in the data set:

>>> dataset = SpaceShuttle()
>>> dataset.shape
(23, 5)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((23, 4), (23,))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((20, 4), (20,), (3, 4), (3,))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
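
The prediction task described above concerns launch temperatures below freezing. As a minimal sketch, the underlying DataFrame can be filtered on the launch temperature, assuming the column carries the documented feature name temp and taking 32 degrees F as the freezing point:

>>> from doubt.datasets.space_shuttle import SpaceShuttle
>>> dataset = SpaceShuttle()
>>> df = dataset.to_pandas()
>>> # 'temp' is assumed to be the column name for the documented launch
>>> # temperature feature, and 32 F is taken as the freezing point. Note
>>> # that no recorded liftoff temperature was under 53 degrees F, so this
>>> # frame will be empty for the historical flights.
>>> cold_flights = df[df['temp'] < 32]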

doubt.datasets.stocks module

Stocks data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.stocks.Stocks(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

There are three disadvantages of weighted scoring stock selection models. First, they cannot identify the relations between weights of stock-picking concepts and performances of portfolios. Second, they cannot systematically discover the optimal combination for weights of concepts to optimize the performances. Third, they are unable to meet various investors’ preferences.

This study aims to more efficiently construct weighted scoring stock selection models to overcome these disadvantages. Since the weights of stock-picking concepts in a weighted scoring stock selection model can be regarded as components in a mixture, we used the simplex centroid mixture design to obtain the experimental sets of weights. These sets of weights are simulated with US stock market historical data to obtain their performances. Performance prediction models were built with the simulated performance data set and artificial neural networks.

Furthermore, the optimization models to reflect investors’ preferences were built up, and the performance prediction models were employed as the kernel of the optimization models so that the optimal solutions can now be solved with optimization techniques. The empirical values of the performances of the optimal weighting combinations generated by the optimization models showed that they can meet various investors’ preferences and outperform those of the S&P 500 not only during the training period but also during the testing period.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
bp (float):

Large B/P

roe (float):

Large ROE

sp (float):

Large S/P

return_rate (float):

Large return rate in the last quarter

market_value (float):

Large market value

small_risk (float):

Small systematic risk

orig_annual_return (float):

Annual return

orig_excess_return (float):

Excess return

orig_risk (float):

Systematic risk

orig_total_risk (float):

Total risk

orig_abs_win_rate (float):

Absolute win rate

orig_rel_win_rate (float):

Relative win rate

Targets:
annual_return (float):

Annual return

excess_return (float):

Excess return

risk (float):

Systematic risk

total_risk (float):

Total risk

abs_win_rate (float):

Absolute win rate

rel_win_rate (float):

Relative win rate

Source:

https://archive.ics.uci.edu/ml/datasets/Stock+portfolio+performance

Examples

Load in the data set:

>>> dataset = Stocks()
>>> dataset.shape
(252, 19)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((252, 12), (252, 6))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((197, 12), (197, 6), (55, 12), (55, 6))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
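
With six target columns, the array y returned by split() is two-dimensional. The following is a minimal sketch of wrapping the targets in a labelled DataFrame, assuming the columns follow the order listed above:

>>> import pandas as pd
>>> from doubt.datasets.stocks import Stocks
>>> dataset = Stocks()
>>> X, y = dataset.split()
>>> # y has shape (252, 6); the column order below is assumed to match
>>> # the order of the targets listed above
>>> target_names = ['annual_return', 'excess_return', 'risk',
...                 'total_risk', 'abs_win_rate', 'rel_win_rate']
>>> targets = pd.DataFrame(y, columns=target_names)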

doubt.datasets.superconductivity module

Superconductivity data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.superconductivity.Superconductivity(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

This dataset contains data on 21,263 superconductors and their relevant features. The goal here is to predict the critical temperature based on the features extracted.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
  • number_of_elements (int)

  • mean_atomic_mass (float)

  • wtd_mean_atomic_mass (float)

  • gmean_atomic_mass (float)

  • wtd_gmean_atomic_mass (float)

  • entropy_atomic_mass (float)

  • wtd_entropy_atomic_mass (float)

  • range_atomic_mass (float)

  • wtd_range_atomic_mass (float)

  • std_atomic_mass (float)

  • wtd_std_atomic_mass (float)

  • mean_fie (float)

  • wtd_mean_fie (float)

  • gmean_fie (float)

  • wtd_gmean_fie (float)

  • entropy_fie (float)

  • wtd_entropy_fie (float)

  • range_fie (float)

  • wtd_range_fie (float)

  • std_fie (float)

  • wtd_std_fie (float)

  • mean_atomic_radius (float)

  • wtd_mean_atomic_radius (float)

  • gmean_atomic_radius (float)

  • wtd_gmean_atomic_radius (float)

  • entropy_atomic_radius (float)

  • wtd_entropy_atomic_radius (float)

  • range_atomic_radius (float)

  • wtd_range_atomic_radius (float)

  • std_atomic_radius (float)

  • wtd_std_atomic_radius (float)

  • mean_Density (float)

  • wtd_mean_Density (float)

  • gmean_Density (float)

  • wtd_gmean_Density (float)

  • entropy_Density (float)

  • wtd_entropy_Density (float)

  • range_Density (float)

  • wtd_range_Density (float)

  • std_Density (float)

  • wtd_std_Density (float)

  • mean_ElectronAffinity (float)

  • wtd_mean_ElectronAffinity (float)

  • gmean_ElectronAffinity (float)

  • wtd_gmean_ElectronAffinity (float)

  • entropy_ElectronAffinity (float)

  • wtd_entropy_ElectronAffinity (float)

  • range_ElectronAffinity (float)

  • wtd_range_ElectronAffinity (float)

  • std_ElectronAffinity (float)

  • wtd_std_ElectronAffinity (float)

  • mean_FusionHeat (float)

  • wtd_mean_FusionHeat (float)

  • gmean_FusionHeat (float)

  • wtd_gmean_FusionHeat (float)

  • entropy_FusionHeat (float)

  • wtd_entropy_FusionHeat (float)

  • range_FusionHeat (float)

  • wtd_range_FusionHeat (float)

  • std_FusionHeat (float)

  • wtd_std_FusionHeat (float)

  • mean_ThermalConductivity (float)

  • wtd_mean_ThermalConductivity (float)

  • gmean_ThermalConductivity (float)

  • wtd_gmean_ThermalConductivity (float)

  • entropy_ThermalConductivity (float)

  • wtd_entropy_ThermalConductivity (float)

  • range_ThermalConductivity (float)

  • wtd_range_ThermalConductivity (float)

  • std_ThermalConductivity (float)

  • wtd_std_ThermalConductivity (float)

  • mean_Valence (float)

  • wtd_mean_Valence (float)

  • gmean_Valence (float)

  • wtd_gmean_Valence (float)

  • entropy_Valence (float)

  • wtd_entropy_Valence (float)

  • range_Valence (float)

  • wtd_range_Valence (float)

  • std_Valence (float)

  • wtd_std_Valence (float)

Targets:
  • critical_temp (float)

Source:

https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data

Examples

Load in the data set:

>>> dataset = Superconductivity()
>>> dataset.shape
(21263, 82)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((21263, 81), (21263,))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((17004, 81), (17004,), (4259, 81), (4259,))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
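
A minimal sketch of summarising the critical temperature target via the underlying DataFrame, assuming the target column carries the documented name critical_temp:

>>> from doubt.datasets.superconductivity import Superconductivity
>>> dataset = Superconductivity()
>>> df = dataset.to_pandas()
>>> # 'critical_temp' is the documented target name, assumed here to be
>>> # the label of the corresponding DataFrame column
>>> summary = df['critical_temp'].describe()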

doubt.datasets.tehran_housing module

Tehran housing data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.tehran_housing.TehranHousing(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

The data set includes construction cost, sale prices, project variables, and economic variables corresponding to real estate single-family residential apartments in Tehran, Iran.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
start_year (int):

Start year in the Persian calendar

start_quarter (int):

Start quarter in the Persian calendar

completion_year (int):

Completion year in the Persian calendar

completion_quarter (int):

Completion quarter in the Persian calendar

V-1..V-8 (floats):

Project physical and financial variables

V-11-1..29-1 (floats):

Economic variables and indices in time, lag 1

V-11-2..29-2 (floats):

Economic variables and indices in time, lag 2

V-11-3..29-3 (floats):

Economic variables and indices in time, lag 3

V-11-4..29-4 (floats):

Economic variables and indices in time, lag 4

V-11-5..29-5 (floats):

Economic variables and indices in time, lag 5

Targets:
construction_cost (float)

sale_price (float)

Source:

https://archive.ics.uci.edu/ml/datasets/Residential+Building+Data+Set

Examples

Load in the data set:

>>> dataset = TehranHousing()
>>> dataset.shape
(371, 109)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((371, 107), (371, 2))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((288, 107), (288, 2), (83, 107), (83, 2))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
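
The two targets come back together in y; below is a minimal sketch of separating construction cost and sale price, assuming the columns follow the order listed above:

>>> from doubt.datasets.tehran_housing import TehranHousing
>>> dataset = TehranHousing()
>>> X, y = dataset.split()
>>> # y has shape (371, 2); column 0 is assumed to be construction_cost
>>> # and column 1 sale_price, following the order listed above
>>> construction_cost, sale_price = y[:, 0], y[:, 1]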

doubt.datasets.yacht module

Yacht data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

class doubt.datasets.yacht.Yacht(cache: Optional[str] = '.dataset_cache')

Bases: doubt.datasets._dataset.BaseDataset

Prediction of residuary resistance of sailing yachts at the initial design stage is of great value for evaluating the ship’s performance and for estimating the required propulsive power. Essential inputs include the basic hull dimensions and the boat velocity.

The Delft data set comprises 308 full-scale experiments, which were performed at the Delft Ship Hydromechanics Laboratory for that purpose.

These experiments include 22 different hull forms, derived from a parent form closely related to the “Standfast 43” designed by Frans Maas.

Parameters

cache (str or None, optional) – The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to ‘.dataset_cache’.

cache

The name of the cache.

Type

str or None

shape

Dimensions of the data set

Type

tuple of integers

columns

List of column names in the data set

Type

list of strings

Features:
pos (float):

Longitudinal position of the center of buoyancy, adimensional

prismatic (float):

Prismatic coefficient, adimensional

displacement (float):

Length-displacement ratio, adimensional

beam_draught (float):

Beam-draught ratio, adimensional

length_beam (float):

Length-beam ratio, adimensional

froude_no (float):

Froude number, adimensional

Targets:
resistance (float):

Residuary resistance per unit weight of displacement, adimensional

Source:

https://archive.ics.uci.edu/ml/datasets/Yacht+Hydrodynamics

Examples

Load in the data set:

>>> dataset = Yacht()
>>> dataset.shape
(308, 7)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((308, 6), (308,))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((235, 6), (235,), (73, 6), (73,))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
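
A minimal sketch of loading the data set without writing a cache file and drawing a reproducible train/test split, using only the cache, test_size and random_seed parameters documented above:

>>> from doubt.datasets.yacht import Yacht
>>> # Passing cache=None skips writing the '.dataset_cache' file
>>> dataset = Yacht(cache=None)
>>> # Re-using the same random_seed reproduces the same split
>>> X_train, X_test, y_train, y_test = dataset.split(test_size=0.2, random_seed=42)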

Module contents