COMPX521-24B Assignment 2
Separate-and-conquer algorithms provide a well-known approach to learning rules in supervised learning. With rule sets generated using standard algorithms of this type, a prediction is often obtained from the first rule that applies, which corresponds to interpreting the rules as a decision list. An alternative approach to rule learning, which generates a rule set in which order is immaterial, is to treat the rule set as an ensemble of individual predictors, where the predictions of the rules are combined in a manner dictated by the underlying ensemble learning algorithm used to generate the rule set: a prediction is obtained by applying a suitable mechanism to combine the consequents of all rules whose antecedents apply. A particularly appealing approach to generating this kind of predictive model is to build an ensemble of rules using a boosting algorithm. In this case, the “weak” learner that is boosted is a learning algorithm that generates a single rule. The first algorithm of this type, SLIPPER (Cohen & Singer, 1999), was based on a variant of AdaBoost (Freund & Schapire, 1996), but the same idea can be applied in conjunction with more recent boosting algorithms.
In this assignment, we implement weka.classifiers.rules.XGBoostRule, a method that grows a single rule using the basic covering algorithm to greedily optimise the objective function of the XGBoost algorithm (Chen & Guestrin, 2016). Once a rule has been grown, given a particular instance for which a prediction is to be made, the weak learner should predict zero if the antecedent of the rule does not apply; on the other hand, if the rule does apply, the optimum prediction is the same as the one for a leaf of an XGBoost decision tree,
$$-\frac{G}{H + \lambda},$$
where G is the sum of the gradients of the training instances covered by the rule and H is the corresponding sum of Hessians.
The individual per-instance gradients and Hessians are provided by the class weka.classifiers.meta.XGBoost in the same way as in Assignment 1. Note that predicting zero means that the rule abstains from making a contribution to the ensemble prediction: the ensemble prediction is only modified if the antecedent of the rule applies. This means individual ensemble members can be very specific in their contribution to the XGBoost ensemble.
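As a concrete illustration, the following is a minimal sketch of how the per-instance prediction of the weak learner could be computed; the names coversInstance, sumOfGradients, sumOfHessians and lambda are assumptions made purely for this sketch and are not prescribed by the assignment.

import weka.core.Instance;

public class RulePredictionSketch {

    // Assumed fields: statistics of the training instances covered by the rule
    // and the L2 regularisation parameter lambda.
    private double sumOfGradients;
    private double sumOfHessians;
    private double lambda;

    // Assumed helper: tests whether the rule's antecedent applies to the instance.
    private boolean coversInstance(Instance instance) {
        return true; // placeholder for illustration
    }

    public double predict(Instance instance) {
        // A rule that does not cover the instance abstains by predicting zero,
        // leaving the ensemble prediction unchanged.
        if (!coversInstance(instance)) {
            return 0.0;
        }
        // Optimal XGBoost-style prediction for the covered instances: -G / (H + lambda).
        return -sumOfGradients / (sumOfHessians + lambda);
    }
}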
The question remains how to grow the rule to optimise the XGBoost objective. As indicated above, a basic greedy covering algorithm can be applied (finding the optimum rule is generally too expensive), but the test selection metrics discussed in class, such as the metric p/t used in PRISM, are clearly unsuitable. Instead, the rule learning algorithm should implement the following metric for choosing which attribute-value test to add to the rule that is grown:
$$\frac{1}{2}\,\frac{G^2}{H + \lambda} - \gamma T,$$
where the sums of the gradients G and Hessians H, respectively, are computed across those instances that would be covered by the rule if the test were added to the rule.
The user-specified parameter λ has the same regularisation effect as in the XGBoost tree learner, providing a way to control the L2 regularisation term in the XGBoost objective. The second parameter in the decision tree case, γ, can also be applied, but the variable T must be redefined to make sense for rules. In this assignment, we use T to refer to the length of the rule that is grown, given by the number of attribute-value tests it contains.
The test that maximises the above metric is chosen because it directly corresponds to the reduction in the XGBoost loss. You should convince yourself that this is the case by inspecting the derivation of the split selection criterion used in XGBoost decision trees. If there is no test whose addition improves the above metric, the rule should not be extended. Note that a rule with an empty antecedent is a valid rule and may be better, according to the above metric, than any of the rules that can be obtained by adding a test.
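To make the greedy selection of tests concrete, here is a minimal self-contained sketch, under the assumption that the metric takes the form given above; the method names ruleMetric and bestCandidate, and the representation of candidate tests as coverage masks, are choices made purely for illustration.

import java.util.List;

public class RuleMetricSketch {

    // Computes 0.5 * G^2 / (H + lambda) - gamma * T over the instances flagged
    // as covered, where T is the length of the (extended) rule.
    static double ruleMetric(double[] gradients, double[] hessians, boolean[] covered,
                             double lambda, double gamma, int ruleLength) {
        double G = 0.0, H = 0.0;
        for (int i = 0; i < covered.length; i++) {
            if (covered[i]) {
                G += gradients[i];
                H += hessians[i];
            }
        }
        return 0.5 * G * G / (H + lambda) - gamma * ruleLength;
    }

    // Returns the index of the candidate test whose coverage yields the largest
    // metric value, or -1 if no candidate improves on the current rule's metric,
    // in which case the rule should not be extended.
    static int bestCandidate(double[] gradients, double[] hessians,
                             List<boolean[]> candidateCoverage, double lambda,
                             double gamma, int currentLength, double currentMetric) {
        int best = -1;
        double bestMetric = currentMetric;
        for (int c = 0; c < candidateCoverage.size(); c++) {
            double m = ruleMetric(gradients, hessians, candidateCoverage.get(c),
                                  lambda, gamma, currentLength + 1);
            if (m > bestMetric) {
                bestMetric = m;
                best = c;
            }
        }
        return best;
    }
}

In a full implementation, this selection step would be repeated, shrinking the set of covered instances after each added test, until no improving test remains or the maximum rule length is reached.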
The other hyperparameters from the XGBoost tree learner implemented in Assignment 1 should also be included in XGBoostRule, in an analogous manner, so that attributes can be randomly subsampled before a test is selected, instances can be subsampled before a rule is grown, and a learning rate can be applied. The minimum child weight, which should also be implemented, corresponds to the minimum value of H that a rule is required to have to be considered a candidate. The maximum depth parameter from the tree learner should be replaced by a corresponding length parameter that limits the number of tests that a rule is permitted to have, and should be called max_length instead of max_depth. Use the same default parameters as in Assignment 1 for the hyperparameters, including a default maximum rule length of 6.
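As a small sketch of how two of these constraints might be enforced (the field names are assumptions, and the actual option-handling code can be adapted from XGBoostTree):

public class RuleConstraintsSketch {

    // Assumed fields: the maximum number of tests in a rule (default 6, as specified)
    // and the minimum sum of Hessians a rule must cover (minimum child weight).
    private int m_maxLength = 6;
    private double m_minChildWeight = 1.0; // placeholder; use the Assignment 1 default

    // A candidate rule is only admissible if it covers enough Hessian mass
    // and does not exceed the maximum permitted number of tests.
    boolean admissible(double sumOfHessians, int numberOfTests) {
        return sumOfHessians >= m_minChildWeight && numberOfTests <= m_maxLength;
    }
}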
Once you have implemented XGBoostRule, compare it to XGBoostTree using the same experimental methodology as in Assignment 1 and the same datasets. You can use the XGBoostTree implementation posted on the course webpage in Moodle. You can also re-use as much code from this implementation in your XGBoostRule code as you like, in particular, the code for setting the hyperparameters. However, the rest of the code in your XGBoostRule class must be your own code, and all modifications must be yours. (Make sure the modified code grows a single rule, not a tree or partial tree.)
Note that to enable you to consider the size of the models in your experiments, and not just predictive accuracy, the XGBoost code that has been posted on the course webpage in Moodle, when used in WEKA’s Experimenter, will compute the number of rules included in the ensemble as an additional measure, by implementing WEKA’s AdditionalMeasureProducer interface. The name of the additional measure must be measureNumRules. In the case of XGBoostTree, this is computed by counting the number of leaves in the decision tree. Take a look at how AdditionalMeasureProducer is implemented there and adapt the code accordingly.
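For orientation, a class exposing this measure through weka.core.AdditionalMeasureProducer could look roughly like the sketch below; the helper numRules() is an assumption made here and stands in for whatever counts the rules (or leaves) in the grown model.

import java.util.Enumeration;
import java.util.Vector;

import weka.core.AdditionalMeasureProducer;

public class NumRulesMeasureSketch implements AdditionalMeasureProducer {

    // Assumed helper: returns the number of rules in the grown ensemble
    // (for XGBoostTree, the number of leaves would be counted instead).
    private double numRules() {
        return 1.0; // placeholder for illustration
    }

    // Advertises the names of the additional measures that are available.
    @Override
    public Enumeration<String> enumerateMeasures() {
        Vector<String> measures = new Vector<String>();
        measures.add("measureNumRules");
        return measures.elements();
    }

    // Returns the value of the named additional measure.
    @Override
    public double getMeasure(String additionalMeasureName) {
        if (additionalMeasureName.equalsIgnoreCase("measureNumRules")) {
            return numRules();
        }
        throw new IllegalArgumentException(additionalMeasureName
            + " is not supported by this classifier");
    }
}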
In the literature, the closest method to what we consider in this assignment is the method proposed and evaluated in (Rapp, Mencía, Fürnkranz, Nguyen & Hüllermeier, 2021), which is also based on XGBoost. One difference is that it does not include regularisation based on the length of the rules. It also considers the more general setting of multi-label classification. Going back further, there is also earlier work on using gradient boosting based on first derivatives to grow rule sets (Dembczyński, Kotłowski & Słowiński, 2010). Please take a look at this work, and the work on SLIPPER (Cohen & Singer, 1999), and cite them appropriately in your report.
As in the first assignment, when analysing the experimental results in WEKA’s Experimenter, make sure that you pay particular attention to the statistical significance of the observed differences in performance, and once the experiments are done, write a PDF report on your findings. Make a .zip or .tgz archive with your code (sources and compiled classes) and the report, and submit it on Moodle. Once extracted, it must be possible to run your implementations by adding the top level of the extracted directory structure to the Java CLASSPATH (assuming the main WEKA 3.8 weka.jar file is already in the CLASSPATH).
The criteria for grading are the same as for Assignment 1. Grades will be based on the quality of the report and the program. The report should be approximately three pages long in standard LaTeX report format. It should contain an introduction, a section explaining the algorithm that was implemented, a section with results and a corresponding discussion, and some conclusions. Relevant references to the literature should also be included. Results should be presented as tables or graphs when appropriate, with corresponding captions and references in the text that discusses them.
Make sure you start the assignment early. There will be no extensions, except for documented medical reasons.
References
Chen, T. & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). Association for Computing Machinery.
Cohen, W. W. & Singer, Y. (1999). A simple, fast, and effective rule learner. In Proceedings of the 16th National Conference on Artificial Intelligence (pp. 335–342). American Association for Artificial Intelligence.
Dembczyński, K., Kotłowski, W. & Słowiński, R. (2010). ENDER: a statistical framework for boosting decision rules. Data Mining and Knowledge Discovery, 21, 52–90.
Freund, Y. & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on International Conference on Machine Learning (pp. 148–156). Morgan Kaufmann Publishers Inc.