Advanced Tutorial

In this tutorial, we will use the “kick” dataset. The goal of this dataset is to predict whether a used car at an auto auction is a “kick”, i.e., a car that has some serious issues and is therefore hard to sell to customers.

Download Dataset

Navigate to the directory where the Silas executable is located, let’s call it bin. Create a new directory called tutorial2 in bin.

Download the dataset and put it in bin/tutorial2/data/.

Generate Metadata Settings

In the bin directory, run the following command to generate metadata settings:

silas gen-metadata-settings -o tutorial2/metadata-settings.json tutorial2/data/kick.csv

This command will output metadata settings in tutorial2/metadata-settings.json. This file contains the configuration of each feature in the dataset.

Open tutorial2/metadata-settings.json. The first field is “missing_value_place_holders”, which gives Silas a default list of strings that represent missing values in the dataset. You can cross-check this list against the dataset and add any other strings that stand for missing values. The second field is “feature_type_settings”, which assigns a data type to each feature. By default, the generator assigns a data type to a feature based on what it infers from the dataset. However, this inference is not always accurate, so it is best to go through each feature and assign the data type at your discretion.

For instance, the generator thinks that the first feature “IsBadBuy” is numeric because its values are numbers. However, you may notice that there are only two possible values for this feature, and it’s the outcome feature. So it’s best to make it a “collection” feature.

Similarly, although the “WheelTypeID” feature only has numeric values, each value is an ID and there is no meaningful ordering between the values, so we should treat it as a collection feature. The user can try different settings and see how they impact the machine learning performance.

On the other hand, “WarrantyCost” is a number feature and “Size” is an enumeration feature.

The rules of thumb are:

  • If the feature has numeric values and there are many unique values, then its data type is number. Examples: age, year, height, weight, price, etc.

  • If the feature only has a small number of unique values and these values can be ordered, then its data type is enumeration. Examples: size, day of week, education, rank in a system, etc.

  • If the feature only has a small number of unique values and these values cannot be ordered, then its data type is collection. Examples: sex, direction, colour, etc.
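
If you are unsure which rule applies to a feature, a quick inspection of the raw CSV can help. Below is a minimal sketch using pandas; this is not part of Silas, and the missing-value placeholder strings passed to na_values are assumptions that you should cross-check against “missing_value_place_holders”:

import pandas as pd

# The placeholder strings for missing values are an assumption; cross-check them
# against "missing_value_place_holders" in metadata-settings.json.
df = pd.read_csv("tutorial2/data/kick.csv", na_values=["?", "NULL"], low_memory=False)

# For each feature, report the number of unique values, the inferred dtype and
# the number of missing entries. Few unique values suggest "collection" or
# "enumeration"; many distinct numeric values suggest "number".
for column in df.columns:
    print(f"{column}: {df[column].nunique()} unique values, "
          f"dtype={df[column].dtype}, missing={df[column].isna().sum()}")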

After editing, the list of “feature_type_settings” should be:

"feature_type_settings": [
     {
         "feature_name": "IsBadBuy",
         "data_type": "collection"
     },
     {
         "feature_name": "PurchDate",
         "data_type": "number"
     },
     {
         "feature_name": "Auction",
         "data_type": "collection"
     },
     {
         "feature_name": "VehYear",
         "data_type": "number"
     },
     {
         "feature_name": "VehicleAge",
         "data_type": "number"
     },
     {
         "feature_name": "Make",
         "data_type": "collection"
     },
     {
         "feature_name": "Model",
         "data_type": "collection"
     },
     {
         "feature_name": "Trim",
         "data_type": "collection"
     },
     {
         "feature_name": "SubModel",
         "data_type": "collection"
     },
     {
         "feature_name": "Color",
         "data_type": "collection"
     },
     {
         "feature_name": "Transmission",
         "data_type": "collection"
     },
     {
         "feature_name": "WheelTypeID",
         "data_type": "collection"
     },
     {
         "feature_name": "WheelType",
         "data_type": "collection"
     },
     {
         "feature_name": "VehOdo",
         "data_type": "number"
     },
     {
         "feature_name": "Nationality",
         "data_type": "collection"
     },
     {
         "feature_name": "Size",
         "data_type": "enumeration"
     },
     {
         "feature_name": "TopThreeAmericanName",
         "data_type": "collection"
     },
     {
         "feature_name": "MMRAcquisitionAuctionAveragePrice",
         "data_type": "number"
     },
     {
         "feature_name": "MMRAcquisitionAuctionCleanPrice",
         "data_type": "number"
     },
     {
         "feature_name": "MMRAcquisitionRetailAveragePrice",
         "data_type": "number"
     },
     {
         "feature_name": "MMRAcquisitonRetailCleanPrice",
         "data_type": "number"
     },
     {
         "feature_name": "MMRCurrentAuctionAveragePrice",
         "data_type": "number"
     },
     {
         "feature_name": "MMRCurrentAuctionCleanPrice",
         "data_type": "number"
     },
     {
         "feature_name": "MMRCurrentRetailAveragePrice",
         "data_type": "number"
     },
     {
         "feature_name": "MMRCurrentRetailCleanPrice",
         "data_type": "number"
     },
     {
         "feature_name": "PRIMEUNIT",
         "data_type": "collection"
     },
     {
         "feature_name": "AUCGUART",
         "data_type": "collection"
     },
     {
         "feature_name": "BYRNO",
         "data_type": "collection"
     },
     {
         "feature_name": "VNZIP1",
         "data_type": "collection"
     },
     {
         "feature_name": "VNST",
         "data_type": "collection"
     },
     {
         "feature_name": "VehBCost",
         "data_type": "number"
     },
     {
         "feature_name": "IsOnlineSale",
         "data_type": "collection"
     },
     {
         "feature_name": "WarrantyCost",
         "data_type": "number"
     }
 ]

Generate Metadata

Run the following command to generate the metadata in tutorial2:

silas gen-metadata -o tutorial2/metadata.json tutorial2/metadata-settings.json tutorial2/data/kick.csv

This command will also output feature-type-stats.json in tutorial2. This file contains some statistics of the features. You will see a warning that says the kick.csv data file needs to be sanitised. Run:

silas draw tutorial2/feature-type-stats.json

to see some bare-bones visualisations in your terminal. You may have to maximise the terminal window for the graphs to display properly. Note that some features have missing values; we have to deal with the missing values before we can proceed to machine learning.

Sanitise The Data

This part can be done with other tools, but to keep things self-contained, Silas provides a simple feature to handle missing data. To use it, run the following command:

silas sanitise -c new -n mean tutorial2/metadata-settings.json tutorial2/feature-type-stats.json tutorial2/metadata.json tutorial2/data/kick.csv

The new dataset file will be saved as “clean-kick.csv” in the same directory as the original dataset; the new metadata file will be saved as “clean-metadata.json” in the same directory as the original metadata. The two flags “-c” and “-n” specify the treatments for missing categorical values and missing numerical values respectively. The options are:

  • -c with the following options:
    • new: replace missing categorical values with a new category.

    • most-common: replace missing categorical values with the most common category.

    • least-common: replace missing categorical values with the least common category.

    • remove: remove missing categorical values.

  • -n with the following options:
    • mean: replace missing numerical values with the mean value.

    • median: replace missing numerical values with the median value.

    • new-above-max: replace missing numerical values with max + 1.

    • new-under-min: replace missing numerical values with min - 1.

    • remove: remove missing numerical values.

If you would like to undertake more complicated cleaning and preprocessing, there are plenty of free tools that can achieve this, but they are outside the scope of this tutorial.
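
For reference, the treatment chosen above (-c new together with -n mean) corresponds roughly to the following pandas sketch. This is only an illustration of the idea, not how Silas implements sanitisation, and the missing-value placeholders and the new-category naming are assumptions:

import pandas as pd

# Placeholder strings for missing values are an assumption; adjust as needed.
df = pd.read_csv("tutorial2/data/kick.csv", na_values=["?", "NULL"])

for column in df.columns:
    if pd.api.types.is_numeric_dtype(df[column]):
        # -n mean: replace missing numerical values with the mean of the column
        df[column] = df[column].fillna(df[column].mean())
    else:
        # -c new: replace missing categorical values with a new category
        df[column] = df[column].fillna(f"missing-{column}")

# Written under a different name so it is not confused with Silas's own output.
df.to_csv("tutorial2/data/clean-kick-sketch.csv", index=False)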

Open tutorial2/clean-metadata.json. Most of the automatically generated metadata for the features is fine, but we need to make a minor change to the feature “Size”. Since we have made “Size” an enumeration feature, the generator creates a mapping from each value to a number that specifies how the value is ranked. However, the generator is not smart enough to rank the values correctly because it doesn’t know enough about car sizes. We change the “value_map” to the following:

"value_map": {
   "missing-Size": 0,
   "SMALL SUV": 5,
   "SPECIALTY": 12,
   "SPORTS": 2,
   "LARGE SUV": 7,
   "SMALL TRUCK": 10,
   "LARGE TRUCK": 11,
   "MEDIUM": 3,
   "LARGE": 4,
   "CROSSOVER": 9,
   "COMPACT": 1,
   "VAN": 8,
   "MEDIUM SUV": 6
}

Feel free to change the above mappings and see how they impact the predictive performance.

Generate Machine Learning Settings

We generate the settings file in tutorial2/settings.json using the following command:

silas gen-settings -o tutorial2/settings.json -v cv tutorial2/clean-metadata.json tutorial2/data/clean-kick.csv

The option -v cv tells the generator to use the cross-validation template.

Now open the settings file (tutorial2/settings.json). Note that the generator incorrectly takes the last feature as the outcome feature, whereas in this dataset the first feature is the outcome feature. Swap “WarrantyCost” in “outcome_feature” with “IsBadBuy” in “selected_features”; we then have a correct settings file as follows:

{
   "outcome_feature": "IsBadBuy",
   "metadata_file": "clean-metadata.json",
   "number_of_trees": 100,
   "max_depth": 64,
   "desired_leaf_size": 64,
   "number_of_outcome_subintervals": 10,
   "selected_features": [
      "WarrantyCost",
      "PurchDate",
      "Auction",
      "VehYear",
      "VehicleAge",
      "Make",
      "Model",
      "Trim",
      "SubModel",
      "Color",
      "Transmission",
      "WheelTypeID",
      "WheelType",
      "VehOdo",
      "Nationality",
      "Size",
      "TopThreeAmericanName",
      "MMRAcquisitionAuctionAveragePrice",
      "MMRAcquisitionAuctionCleanPrice",
      "MMRAcquisitionRetailAveragePrice",
      "MMRAcquisitonRetailCleanPrice",
      "MMRCurrentAuctionAveragePrice",
      "MMRCurrentAuctionCleanPrice",
      "MMRCurrentRetailAveragePrice",
      "MMRCurrentRetailCleanPrice",
      "PRIMEUNIT",
      "AUCGUART",
      "BYRNO",
      "VNZIP1",
      "VNST",
      "VehBCost",
      "IsOnlineSale"
   ],
   "feature_proportion": "sqrt",
   "sampling_method": "balancing",
   "sampling_proportion": 1.0,
   "oob_proportion": 0.05,
   "validation_method": "CV",
   "cv_settings": {
      "dataset_file": "data/clean-kick.csv",
      "number_of_runs": 1,
      "number_of_cross_validation_partitions": 10
   }
}

Run Machine Learning

The settings file specifies that we will run a 10-fold cross-validation on the dataset. This splits the dataset into 10 partitions, builds a predictive model on each combination of 9 partitions and validates it on the remaining partition. In total, 10 predictive models will be built, and the learning process reports the average accuracy and AUC at the end. To do this, run the following command:

silas learn -o model/kick tutorial2/settings.json

The default settings will probably get you around 0.736 accuracy and 0.769 AUC. The accuracy does not seem very high, but that is mainly because the default settings use balanced down-sampling and the reward functions are optimised to improve the performance on the minority class. Since this dataset is highly imbalanced, with 87.7% “good buy” cases, even a trivial model that says everything is a good buy will get 0.877 accuracy. If you prefer a higher accuracy, change the “sampling_method” in tutorial2/settings.json from “balancing” to “uniform” and run the learning process again; you will probably get 0.90+ accuracy but a slightly lower AUC.
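
To see why accuracy alone is misleading here, you can compute the majority-class baseline yourself. A minimal pandas sketch, reading the cleaned dataset (the cast to string is just to be robust to how the outcome happens to be stored):

import pandas as pd

df = pd.read_csv("tutorial2/data/clean-kick.csv")

# Proportion of "good buy" cases (IsBadBuy == 0); around 0.877 in this dataset.
p_good = (df["IsBadBuy"].astype(str) == "0").mean()

# A classifier that always predicts "good buy" achieves this accuracy while
# never identifying a single bad buy, which is why AUC and balanced sampling
# matter more than raw accuracy here.
print(f"majority-class baseline accuracy: {p_good:.3f}")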

There are many ways to improve the predictive performance. For instance, you can play with the parameters in the settings file. The following parameters are particularly important:

  • number_of_trees: More often than not, increasing the number of decision trees in the ensemble model will improve the predictive performance. However, it will slow down the learning process. We recommend always using a number that is a multiple of the number of threads your computer can run. This maximises the concurrent computation. For instance, if your computer has 4 cores and can run 8 threads at the same time, then you should use 96, 120, 160, 200, 504, etc.

  • max_depth: The maximum depth determines how big each tree can grow. The default value (64) is a safe choice that lets trees grow as large as they need to. However, sometimes you may want to grow a large number of very small trees, in which case you can reduce this number.

  • desired_leaf_size: This parameter also determines how big the tree can grow. The tree building process will stop at a node if the number of data entries is less than the desired leaf size. Again, the default value (64) works well in a range of datasets, but it may not be the best.

  • feature_proportion: This parameter determines how many features are used when building a decision tree. The default value is “sqrt”, i.e., the square root of the total number of features. This ensures that each tree is sufficiently different from the others. Use 1.0 (all features) to generate trees that are more similar to each other, or a smaller value such as 0.5 to generate trees that are more diverse.

  • sampling_method: There are two options (a small sketch of the difference follows this list). (1) “balancing” (the default option) down-samples majority classes to the same number of data entries as the minority class. This leads to faster computation as fewer data entries are considered. (2) “uniform” samples each class uniformly (randomly).

  • sampling_proportion: This parameter determines the proportion of the data being sampled by the previous parameter. The default value is 1.0 because the default sampling method is balancing which already down samples the dataset. If you use uniform sampling then you may want to reduce this number to increase the computation speed.

  • oob_proportion: The proportion of data used as the out-of-bag set. These data entries are not used in the learning process, but are used to evaluate and give a score to each decision tree.
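
Below is a minimal sketch of the difference between the two sampling methods, using pandas on the cleaned dataset. It only illustrates the idea; Silas performs its own sampling internally during learning:

import pandas as pd

df = pd.read_csv("tutorial2/data/clean-kick.csv")
minority_size = df["IsBadBuy"].value_counts().min()

# "balancing": down-sample every class to the size of the minority class.
balanced = (df.groupby("IsBadBuy", group_keys=False)
              .apply(lambda grp: grp.sample(n=minority_size, random_state=0)))

# "uniform": sample the whole dataset uniformly at random; the class imbalance
# is preserved (here with a sampling proportion of 0.5 as an example).
uniform = df.sample(frac=0.5, random_state=0)

print(balanced["IsBadBuy"].value_counts())  # equal counts per class
print(uniform["IsBadBuy"].value_counts())   # roughly the original 87.7% / 12.3% split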

For instance, you can increase the number of trees to 500 and reduce the desired leaf size to 1 and keep the sampling method as “balancing”. This will probably give you 0.77+ AUC, which is better than the result from default settings.

You can also play around with the features. For instance, if the value mapping for “Size” is not ideal and you are not sure how to rank those sizes, then perhaps making it a collection feature is easier. You can remove some unimportant features from the “selected_features” list and focus on the important ones.

For more advanced feature engineering, you can even create new features using an arithmetic expression that combines existing features. See Section Arithmetic Compound Feature for details.

Use Machine Learning To Perform Prediction

See Section Use the Model to Perform Prediction from the basic tutorial.

Understand the Machine Learning Model

As a step towards white-box machine learning, the user should be able to understand how the predictive model works, and should even be able to interact with the model and tweak it. This section deals with the former.

Let’s assume that we have built predictive models using the following (fragment of) settings:

"number_of_trees": 300,
"max_depth": 64,
"desired_leaf_size": 1,
"feature_proportion": "sqrt",
"sampling_method": "balancing",
"sampling_proportion": 1.0,
"oob_proportion": 0.01,
"validation_method": "CV",
"cv_settings": {
    "dataset_file": "data/clean-kick.csv",
    "number_of_runs": 1,
    "number_of_cross_validation_partitions": 10
}

and the models are stored at bin/model/kick. Since the settings specify a 10-fold cross-validation, the learning process has built 10 models. We shall analyse the best model in this tutorial. In my case, the model with the highest AUC is forest_0_8 (it could be different in your case).

Issue the following command at the bin directory to generate settings for logical formula extraction:

silas gen-extract-settings -o tutorial2/extract-settings.json

Open tutorial2/extract-settings.json and change the methods and proportions based on the dataset we are working on. There are three options for each method: (1) best, (2) uniform (i.e., random sampling), (3) worst. The best and worst sampling methods use various reward functions and scores to rank the decision trees, the branches in each tree, and the nodes on each branch.

As for the proportions, the more we sample, the more concrete the logical analysis result will be, and the slower the computation. When the proportions are high, the logical analysis will give you very narrow and specific cases, whereas when the proportions are low, you will get relatively more general cases.

In this tutorial, let’s change the extract settings to the following:

{
   "tree_sampling": {
      "method": "best",
      "proportion": 0.02
   },
   "branch_sampling": {
      "method": "best",
      "proportion": 0.1
   },
   "node_sampling": {
      "method": "best",
      "proportion": 0.1
   }
}
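
To put these numbers in perspective: with the settings fragment above (300 trees), a tree_sampling proportion of 0.02 should keep roughly 300 × 0.02 = 6 of the best-scoring trees for the analysis, and the branch and node proportions of 0.1 then presumably keep about the top 10% of branches per selected tree and the top 10% of nodes per selected branch.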

Then run the following command to extract the logical formulae that represent the decision-making of the predictive model for predicting the class “1” (i.e., bad buy):

silas extract -o formulae/kick tutorial2/extract-settings.json model/kick/forest_0_8/ 1

The extracted logical formulae will be stored at formulae/kick. Next, run the following command to perform automated reasoning on the logical formulae and obtain a simplified “core” of the decision-making:

silas introspect formulae/kick/

This computation may take a while. In the end, it outputs a big chunk of logical formula that gives the reason why the model predicts a bad buy. The formula is most likely a conjunction (∧) of many sub-formulae, and each sub-formula describes the condition on a feature. There should be as many sub-formulae (conjuncts) as there are features in the dataset, which explains why the resultant formula is so big: there are a lot of features in this dataset. Let us look at my result in detail (yours might be different, but you can read it similarly).

Copy the formula into a text editor, and go through the sub-formulae feature by feature. The first feature we look at is VehicleAge, and there are two sub-formulae about this feature:

(VehicleAge ≥ 2.0039e+00)
(VehicleAge ≤ 9.0000e+00)

Since VehicleAge is a number feature, it is typical to get two sub-formulae for it: one for the lower bound and one for the upper bound. Cross-checking the above sub-formulae with the min and max of VehicleAge in tutorial2/metadata.json, we basically get that VehicleAge ≥ 2 might imply a bad buy.
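
If you prefer to do this cross-check programmatically rather than by reading the metadata file, a small pandas sketch:

import pandas as pd

df = pd.read_csv("tutorial2/data/clean-kick.csv")

# Compare the extracted bounds with the actual range of the feature. A bound
# that coincides with the feature's min or max carries no information, so only
# the remaining bound (here the lower bound, VehicleAge >= 2) is meaningful.
print(df["VehicleAge"].min(), df["VehicleAge"].max())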

Next, look at the feature WheelTypeID. There is only one sub-formula about it:

WheelTypeID ∈ {1,2,3}

Again, cross-checking the above with the metadata, we obtain that WheelTypeIDs 1, 2 and 3 are related to a bad buy, whereas WheelTypeID 0 is not.

The two largest sub-formulae are for the features Model and SubModel respectively. These sub-formulae cover almost the full range of possible values of the two features, which indicates that the predictive model does not really use these two features to make decisions (at least not at the best decision nodes), so we can remove them from the core formula. The reason why the core formula still contains these features is that the logical analysis is a procedure that narrows down the logical conditions for each feature from their full range; the fact that these features are not narrowed down much implies that they are not used often in the decision-making.

Using the same logic, we conclude that the features BYRNO, Color, IsOnlineSale, Nationality, PRIMEUNIT, TopThreeAmericanName, Transmission, VNST are not strong features in the decision-making.

The above features are all collection features. The same reasoning applies to number/enumeration features too. For instance, the sub-formula for MMRAcquisitionAuctionAveragePrice is given below:

(MMRAcquisitionAuctionAveragePrice ≥ 0.0000e+00) ∧ (MMRAcquisitionAuctionAveragePrice ≤ 3.5722e+04)

It is identical to the min and max of the feature, so this feature does not contribute much to the decision-making. Similarly, we can ignore the features MMRAcquisitionRetailAveragePrice, MMRAcquisitonRetailCleanPrice, MMRCurrentAuctionCleanPrice, MMRCurrentRetailCleanPrice, PurchDate, AUCGUART, VehOdo and WarrantyCost, because they are not strong features. This does not mean that they are useless: the features might have been used somewhere in the decision trees, but they are just not at the most important decision nodes.

On the other hand, the above means that the user can tweak some settings for these features, and the data scientist could do some feature engineering on them. For instance, the condition for the feature VNZIP1 includes only numerical values; maybe we should treat it as a number feature instead of a collection feature.

The above leaves us with the more interesting features. The sub-formulae for MMRCurrentAuctionAveragePrice are as follows:

(MMRCurrentAuctionAveragePrice ≥ 4.7094e+03)
(MMRCurrentAuctionAveragePrice ≤ 3.5722e+04)

Comparing this with its min and max, we obtain that 4709.4 ≤ MMRCurrentAuctionAveragePrice ≤ 35722 is related to a bad buy.

From:

(MMRCurrentRetailAveragePrice ≥ 5.9918e+03)
(MMRCurrentRetailAveragePrice ≤ 3.9080e+04)

We obtain that 5991.8 ≤ MMRCurrentRetailAveragePrice ≤ 39080 is related to a bad buy.

Similarly, from:

(MMRAcquisitionAuctionCleanPrice ≥ 6.0472e+03)
(MMRAcquisitionAuctionCleanPrice < 7.9189e+03)

We obtain that 6047.2 ≤ MMRAcquisitionAuctionCleanPrice < 7918.9 is related to a bad buy.

The sub-formulae for Size are:

(Size ≥ 1.0000e+00)
(Size < 1.0000e+01)

These indicate that, according to our mapping for sizes, those between 1 and 9 are more prone to be a bad buy, whereas very large cars tend to be okay.

The following sub-formula:

Make ∈ {ACURA,CHEVROLET,CHRYSLER,DODGE,HONDA,ISUZU,KIA,MAZDA,MITSUBISHI,SATURN,SUZUKI}

lists the brands that tend to have problems.

Also:

Trim ∈ {Bas,ES,LE,LS,LT,LX,Lar,Nor,SE,STX,XE,i}

gives the list of trims that are suspicious.

Next:

WheelType ∈ {Alloy,Covers}

shows that only “Special” WheelType cars are usually good buys.

The sub-formula:

Auction ∈ {MANHEIM}

singles out “MANHEIM” as the one that tends to have bad buys.

The sub-formulae:

(VehBCost ≥ 7.0166e+03)
(VehBCost ≤ 4.5469e+04)

show that 7016.6 ≤ VehBCost ≤ 45469 may imply a bad buy.

The sub-formula:

AUCGUART ∈ {AUCGUART-missing-value-category}

might mean that if the data on AUCGUART is missing then the car is a bit dodgy.

Lastly:

(VehYear ≥ 2.0050e+03)
(VehYear ≤ 2.0100e+03)

show that 2005 ≤ VehYear ≤ 2010 may imply a bad buy.

If we change the extraction settings to the following:

{
   "tree_sampling": {
      "method": "best",
      "proportion": 0.02
   },
   "branch_sampling": {
      "method": "best",
      "proportion": 0.1
   },
   "node_sampling": {
      "method": "uniform",
      "proportion": 0.1
   }
}

then the logical formulae will contain all sorts of decision nodes, and the analysis result will be more concrete. My resultant formula even includes the following:

Make ∈ {CHEVROLET}
Model ∈ {CONCORDE 3.5L V6 MPI}
SubModel ∈ {4D SUV 4.7L}

But since we are sampling decision-nodes of various qualities, the above may not be the most accurate.

Verify the Machine Learning Model

Imagine the following hypothetical scenario: recently a car maker found out that there were manufacturing issues with a certain model it had been making, and the company recalled the cars of that model made during a certain period.

The predictive model may not have picked up this information because the past data may not contain enough evidence. As a first step, we want to check if the above situation is captured by the predictive model.

Let’s concretise this example. Suppose the recalled make and model are “BUICK” “CENTURY V6 3.1L V6 S”, and the year is 2005. Note that these details are randomly selected. The specification we want to write is “if the make is BUICK and the model is CENTURY V6 3.1L V6 S and the year is 2005, then it’s a bad buy”. Logically, this is equivalent to its contrapositive: “if this car is not a bad buy, then the make is not BUICK, or the model is not CENTURY V6 3.1L V6 S, or the year is not 2005”. We can express this as follows:

[
   {
      "name": "spec_example_1",
      "outcome": "0",
      "constraint_formula": {
         "type": "||",
         "operands": [
            {
               "type": "!",
               "internal": {
                  "type": "Membership",
                  "name": "Make",
                  "subset": [
                     "BUICK"
                  ]
               }
            },
            {
               "type": "!",
               "internal": {
                  "type": "Membership",
                  "name": "Model",
                  "subset": [
                     "CENTURY V6 3.1L V6 S"
                  ]
               }
            },
            {
               "type": "!",
               "internal": {
                  "type": "=",
                  "left": {
                     "type": "ArithmeticVariable",
                     "name": "VehYear"
                  },
                  "right": {
                     "type": "ArithmeticConstant",
                     "value": 2005
                  }
               }
            }
         ]
      }
   }
]
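
To make the logic concrete, here is a plain-Python restatement of what the specification asserts for each data entry (an illustration only; it is not how Silas evaluates specifications):

def constraint_example_1(car: dict) -> bool:
    """Value of the constraint formula in spec_example_1: the make is not BUICK,
    or the model is not CENTURY V6 3.1L V6 S, or the year is not 2005."""
    return (car["Make"] != "BUICK"
            or car["Model"] != "CENTURY V6 3.1L V6 S"
            or car["VehYear"] != 2005)

# The specification requires that whenever the model predicts outcome "0"
# (not a bad buy), this constraint holds for the car in question. For a 2005
# BUICK CENTURY V6 3.1L V6 S the constraint is False, so the model must never
# predict "0" for such a car.
print(constraint_example_1({"Make": "BUICK",
                            "Model": "CENTURY V6 3.1L V6 S",
                            "VehYear": 2005}))  # False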

Create a file named spec.json at bin/tutorial2/, copy the JSON specification above into it and save it. Now run:

silas verify model/kick/forest_0_8/ tutorial2/spec.json

The verification may take a long time. Although the predictive model you have built may differ from mine, the above specification is most likely invalid in your model too, because we chose the make, model and year randomly in this example. The verification will output the list of decision trees that are not compliant with the specification.

Enforce User Specifications in Machine Learning

Suppose we really want to ensure that the predictive model is compliant with the specification. We can use the enforcement learning feature to build new predictive models that are correct by construction. To do so, run the following command:

silas learn -e tutorial2/spec.json -o model/kick-enforced tutorial2/settings.json

The new models will be stored at model/kick-enforced. All of these models are guaranteed to be compliant with the specification. You can run the verification again to double-check, although you really don’t have to. The new models are not just the old ones with a few extra steps bolted on to satisfy the specification: enforcement learning takes the specification into the learning process whilst optimising the predictive performance under the specified conditions.