Preparations and Settings

Silas machine learning requires three configuration files to run correctly: a metadata settings file, a metadata file, and a machine learning settings file. Each of these is described below.

We shall use command templates of the following form:

silas command [OPTIONS] [para1] [para2] ...

where OPTIONS include optional parameters, and para1, para2, etc. are mandatory parameters.

You can skip the sections below and generate all of these files automatically, as described in Generating All Configuration Files Automatically.

Metadata Settings

Metadata settings include two fields:

  • attribute-settings: lists the data type for each attribute,

  • missing-value-place-holders: lists string place holders for missing values.

The settings are stored in JSON format. They are used to generate the Metadata file described later.

An example metadata settings file would look like the following:

"attribute-settings":
[
    {
        "type": "nominal",
        "name": "Month"
    },
    {
        "type": "numerical",
        "name": "DayofMonth"
    }
],
"missing-value-place-holders":
[
    "",
    "NA",
    "?"
]

Attribute Type

There are two attribute types supported in Silas:

  • numerical: Used when the values of an attribute are real/continuous numbers.

  • nominal: Used when the values of an attribute are discrete. Example: east, west, north, south.

Generating Metadata Settings Automatically

To generate the metadata settings file, use the following command:

silas gen-metadata-settings [OPTIONS] [data_files...]

where data_files is a list of file paths for data sets, and OPTIONS include:

  • -h: print help message and exit.

  • -o file_path: output the metadata settings in the given file. If this option is not supplied, the metadata settings will be stored in metadata-settings.json in the directory where the command is issued.

  • --nh: a flag that indicates that the data set files do not have headers. In this case, Silas will generate new data set files that contain the same data and have headers. The new data set files will be saved in the same directory as the original data set files, and their file names will end with “-w-headers”.

If you have multiple data set files, you can use any one of them in the command. For instance, to generate a metadata settings file from data/dataset1.csv and output the settings in data/metadata-settings.json, run the following command:

silas gen-metadata-settings -o data/metadata-settings.json data/dataset1.csv
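
If the data set file does not have a header row, the --nh flag described above can be added. The sketch below assumes that data/dataset1.csv has no headers; Silas would then also write a copy of the data set with generated headers, whose file name ends with “-w-headers”:

silas gen-metadata-settings --nh -o data/metadata-settings.json data/dataset1.csv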

The user is encouraged to inspect the metadata settings file and select the data types for features manually.

Metadata

A metadata file gives the details of the attributes and the features. These definitions are stored in JSON format.

You can skip the details below and jump to the Generating Metadata Automatically section.

Metadata for Attributes

The metadata file includes a list of attributes and their detailed information.

An attribute definition specifies the type of the attribute, its name, its data type in C++, and its (range of) values.

If the type of an attribute is numerical, the definition of the attribute includes the name of the attribute, the C++ data type, the min value, and the max value. For example, we define the attribute “ratio” as follows, where “f32” means 32-bit floating-point:

"type": "numerical",
"name": "ratio",
"data-type": "f32",
"bounds":
{
    "min": 0.0,
    "max": 1.0
}

If the type of an attribute is nominal, the definition of the attribute includes the name of the attribute, the C++ data type, whether its values can be ordered, and the list of values. For example, we define the attribute “size” as follows:

"type": "nominal",
"name": "size",
"data-type": "u8",
"ordered": true,
"values":
[
    "Short",
    "Tall",
    "Grande"
]

Supported C++ data types include:

  • bool: Boolean.

  • u8: 8-bit unsigned integer.

  • u16: 16-bit unsigned integer.

  • u32: 32-bit unsigned integer.

  • u64: 64-bit unsigned integer.

  • i8: 8-bit signed integer.

  • i16: 16-bit signed integer.

  • i32: 32-bit signed integer.

  • i64: 64-bit signed integer.

  • f32: 32-bit floating-point.

  • f64: 64-bit floating-point.

The user is encouraged to use the smallest data type that is sufficient to represent all the values of an attribute.
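
For instance, the DayofMonth attribute from the metadata settings example above only takes integer values between 1 and 31, so an 8-bit unsigned integer is sufficient. A sketch of such an attribute definition (the exact bounds are assumptions about the data) could be:

"type": "numerical",
"name": "DayofMonth",
"data-type": "u8",
"bounds":
{
    "min": 1,
    "max": 31
}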

Feature

Each feature is defined by its name and its attribute. For instance, below are some features in a flight data set:

{
    "feature-name": "Dest",
    "attribute-name": "Location"
},
{
    "feature-name": "Origin",
    "attribute-name": "Location"
},
{
    "feature-name": "Distance",
    "attribute-name": "Distance"
},
{
    "feature-name": "Month",
    "attribute-name": "Month"
},

We distinguish features from attributes because some features, e.g., Dest and Origin, share the same attribute. This organisation of information reduces redundancy.
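
For instance, the shared attribute Location could be defined once in the list of attributes and then referenced by both the Dest and Origin features above. A sketch of such a definition (the listed values are assumptions) could be:

"type": "nominal",
"name": "Location",
"data-type": "u8",
"ordered": false,
"values":
[
    "SYD",
    "MEL",
    "LAX"
]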

Generating Metadata Automatically

Silas comes with a tool that can generate metadata automatically from the data set. To do so, use the following command:

silas gen-metadata [OPTIONS] [metadata_settings] [data_files...]

where metadata_settings is the file path for Metadata Settings, data_files is a list of file paths of data sets, and OPTIONS include:

  • -h: Print help message and exit.

  • -o file_path: output the metadata in the given file. If this option is not supplied, the metadata will be stored in metadata.json in the directory where the command is issued.

For instance, to output the metadata in metadata1.json using metadata settings in data/metadata-settings1.json and data set source files data/dataset1.csv and data/dataset2.csv, use the following command:

silas gen-metadata -o metadata1.json data/metadata-settings1.json data/dataset1.csv data/dataset2.csv

Note that the gen-metadata command will also output the statistics of features in feature-stats.json in the same directory as the output file.

Plot Graphs of Feature Stats

Silas provides a simple visualisation of the dataset in the terminal with the following command:

silas draw [OPTIONS] data_stats_file

where data_stats_file is the feature statistics file generated by silas gen-metadata, and OPTIONS only include a -h flag to show the help message.
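
For example, assuming that silas gen-metadata produced feature-stats.json in the current directory, the graphs can be drawn with:

silas draw feature-stats.json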

Machine Learning Settings

Parameters in Settings

The machine learning settings file defines the parameters for Silas machine learning. The settings are stored in JSON format. The parameters include the following (a sketch of a complete settings file is given after this list):

  • output-feature: The feature to be predicted or classified. Sometimes called class or target in the literature.

  • metadata-file: The path of the metadata file.

  • ignored-features: A list of features that will not be used in training.

  • learner-settings: The settings for the machine learning process:
    • mode: Either classification or regression.

    • reduction-strategy: The strategy used in multi-class classification. Permitted values are:
      • none: Use decision trees to classify multiple classes natively. This process only builds 1 forest.

      • one-vs-rest: Build 1 forest per class. Obtain prediction by weighted voting.

      • one-vs-one: Build 1 forest per each pair of classes. Obtain prediction by weighted voting.

    • grower-settings: The settings for growing trees and forests:
      • forest-settings: The settings for growing forests:
        • type: The type of the forest. Permitted values are:
          • ClassicForest: An ensemble algorithm that is similar to Random Forest but with customised balanced-sub-sampling. This method is usually fast and works well for imbalanced datasets. Only permitted when the mode is classification.

          • PrototypeSampleForest: Use a customised sampling method inspired by prototype selection and condensed nearest neighbour to obtain a balanced subsample for building each tree. This method is slower than ClassicForest but sometimes performs better. Also works well for imbalanced datasets. Only permitted when the mode is classification.

          • SimpleForest: Sample the training set only based on sampling-proportion when building each tree. This method works well for roughly balanced datasets. Only permitted when the mode is classification.

          • SimpleValueForest: Same as SimpleForest, except that each leaf node stores a voted result instead of a distribution. This significantly saves memory and often works best with leaf size 1. Use this method for datasets with thousands of classes. Only permitted when the mode is classification.

          • SimpleRegForest: Use bagging and perform regression. Only permitted when the mode is regression.

          • SimpleOOBRegForest: Sub-sample the training set for each tree and use the OOB sample to compute the weight for each tree. Only permitted when the mode is regression.

        • number-of-trees: The number of decision trees in the ensemble model.

        • sampling-proportion: The proportion of data instances in the subsample (compared to the entire training set) for building each tree.

        • oob-proportion: The proportion of data instances in the out-of-bag (OOB) sample. If oob-proportion is 0.1, then 10% of the sampled data are used as OOB instances, which are not used when building a decision tree. The OOB split occurs after the sampling of the data set. Example: for a data set of 1 million instances, if the forest type is “ClassicForest”, sampling-proportion = 0.7, and oob-proportion = 0.1, then only 630K instances are used for training a decision tree, and 70K instances are in the OOB set, which is used to evaluate the decision tree. N.B. This parameter is disabled in version v0.86 for classification tasks.

      • tree-settings: The settings for growing trees:
        • type: The type of the tree. Permitted values are:
          • GreedyNarrow1D: Grow each tree in a greedy manner. Choose the best cut-point for each feature and then choose the best predicate from considered features. When choosing the best cut-point, use a customised sampling on data points. Only permitted when the mode is classification.

          • RdGreedy1D: Randomly choose a cut-point for each feature and then choose the best predicate from considered features. It’s faster than other tree growers. When used in combination with SimpleForest and sampling-proportion 1.0, it is equivalent to Extremely Randomised Trees (Extra Trees). Only permitted when the mode is classification.

          • SimpleTreeGrower: Similar to GreedyNarrow1D but does not sample data points when choosing the best cut-point for each feature. Only permitted when the mode is classification.

          • RdGreedyReg1D: Similar to RdGreedy1D but for regression. Only permitted when the mode is regression.

        • max-depth: The maximum depth of the decision trees. When the max depth is n, the tree has at most \(2^{n+1} - 1\) nodes, of which \(2^n\) are leaf nodes. This parameter is used as a stopping condition to control the growth of trees. If max-depth is 32, the tree building algorithm stops expanding a branch when the depth of the branch reaches 32.

        • desired-leaf-size: The desired number of data instances contained in a leaf node. This parameter is used as a stopping condition to control the growth of trees. If desired-leaf-size is 32, the tree building algorithm stops expanding a branch when the leaf node contains fewer than 32 data instances.

        • feature-proportion: The proportion of the number of features used when building decision trees. Permitted values are:
          • “sqrt”: The square root of the total number of features (default).

          • “log”: natural log of the total number of features.

          • “log2”: log base 2 of the total number of features.

          • “golden”: 0.618 times the total number of features.

          • Any floating-point number (without double quotes) between 0.0 and 1.0.

    • training-dataset: Details of the training set:
      • type: The type of the file. Permitted values for the Edu version are:
        • CSV: comma separated values.

      • path: File path of the training set.

    • validation-settings: Settings for validation and testing.
      • type: Permitted values are:
        • CV: Cross-validation.

        • TT: Train and test.

    If the type is CV, the following two settings are available:
    • number-of-runs: The number of runs of the experiment.

    • number-of-partitions: The number of partitions in the cross-validation.

    If the type is TT, the following settings are available:
    • testing-dataset: Information of the testing set:
      • type: The type of the file. Permitted values for the Edu version are:
        • CSV: comma separated values.

      • path: File path of the testing set.
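
To illustrate how these parameters fit together, the sketch below shows a possible settings file for a classification task validated with cross-validation. The nesting follows the description above, and all values, including the output feature name "Label" and the file paths, are illustrative assumptions; the exact set of fields may differ between versions, so use silas gen-settings (described in the next sections) to generate a correct file for your data.

{
    "output-feature": "Label",
    "metadata-file": "metadata.json",
    "ignored-features": [],
    "learner-settings":
    {
        "mode": "classification",
        "reduction-strategy": "none",
        "grower-settings":
        {
            "forest-settings":
            {
                "type": "ClassicForest",
                "number-of-trees": 100,
                "sampling-proportion": 0.7,
                "oob-proportion": 0.1
            },
            "tree-settings":
            {
                "type": "RdGreedy1D",
                "max-depth": 32,
                "desired-leaf-size": 32,
                "feature-proportion": "sqrt"
            }
        },
        "training-dataset":
        {
            "type": "CSV",
            "path": "data/dataset1.csv"
        },
        "validation-settings":
        {
            "type": "CV",
            "number-of-runs": 1,
            "number-of-partitions": 10
        }
    }
}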

Generating Machine Learning Settings Automatically

To generate a settings file automatically, use the following command:

silas gen-settings [OPTIONS] [metadata_file] [data_files...]

where metadata_file specifies the metadata file path, data_files is a list of data set file paths, and OPTIONS include:

  • -h: Print help message and exit.

  • -v validation_mode: Specify the validation mode. If this option is not supplied, the validation mode will be deduced from the number of data set files: cross-validation if only one data set file is supplied, and training and testing if more than one data set file is supplied, in which case the first file will be used for training and the second file will be used for testing. There are two options:
    • “cv”: Cross-validation. This means that the user has to specify at least 1 data set file. If multiple files are supplied, the first file will be used for training and validation.

    • “tt”: Training and testing. This means that the user has to specify at least 2 data set files. If multiple files are supplied, the first data set file will be used for training and the second one will be used for testing.

  • -o file_path: output the settings in the given file. If this option is not supplied, the settings will be stored in settings.json in the directory where the command is issued.

For example, to generate a settings file called settings1.json from the metadata file data/metadata.json and the data set file data/dataset1.csv, use the following command:

silas gen-settings -o settings1.json data/metadata.json data/dataset1.csv

Generating All Configuration Files Automatically

To generate all the files required in Silas machine learning automatically, use the following command:

silas gen-all [OPTIONS] [data_files...]

where data_files is a list of file paths of data sets and OPTIONS include:

  • -h: Print help message and exit.

  • -v validation_mode: Specify the validation mode. If this option is not supplied, the validation mode will be deduced from the number of data set files: cross-validation if only one data set file is supplied, and training and testing if more than one data set file is supplied, in which case the first file will be used for training and the second file will be used for testing. There are two options:
    • “cv”: Cross-validation. This means that the user has to specify at least 1 data set file. If multiple files are supplied, the first data set file will be used for training and validation. The remaining data set files will be used only for computing the statistics of the data sets.

    • “tt”: Training and testing. This means that the user has to specify at least 2 data set files. If multiple files are supplied, the first data set file will be used for training and the second one will be used for testing. The remaining data set files will be used only for computing the statistics of the data sets.

  • -o directory: output the configuration files in the specified directory. If this option is not supplied, the configuration files will be stored in the directory where the command is issued.

  • --nh: a flag that indicates that the data set files do not have headers. In this case, Silas will generate new data set files that contain the same data and have headers. The new data set files will be saved in the same directory as the original data set files, and their file names will end with “-w-headers”.

For instance, to generate all the configuration files from data/train.csv and data/test.csv for training and testing, and to store the configuration files in config/, run the following command:

silas gen-all -v tt -o config data/train.csv data/test.csv

Sanitise the Data

If the dataset contains missing values or incorrectly formatted entries, you can sanitise it using the following command:

silas sanitise [OPTIONS] [metadata_settings] [feature_stats_file] [metadata] [data_files...]

where metadata_settings is the file path for Metadata Settings; feature_stats_file is the file path for the feature statistics file (feature-stats.json), which is generated together with the metadata; metadata is the file path for Metadata; data_files is a list of file paths of data sets; and OPTIONS include:

  • -c: the strategy for missing categorical values, with the following options:
    • new: replace missing categorical values with new category.

    • most-common: replace missing categorical values with the most common category.

    • least-common: replace missing categorical values with the least common category.

    • remove: remove missing categorical values.

  • -n: the strategy for missing numerical values, with the following options:
    • mean: replace missing numerical values with the mean value.

    • median: replace missing numerical values with the median value.

    • new-above-max: replace missing numerical values with max + 1.

    • new-under-min: replace missing numerical values with min - 1.

    • remove: remove missing numerical values.

By default, the strategy for categorical values is to create a new category (“-c new”) and the strategy for numerical values is to use the mean (“-n mean”).

For instance, to sanitise example/data.csv using example/metadata-settings.json, example/feature-stats.json, and example/metadata.json, using the strategy that replaces categorical values with the most common category and replaces numerical values with the median, run the following command:

silas sanitise -c most-common -n median example/metadata-settings.json example/feature-stats.json example/metadata.json example/data.csv