
After spending several weeks trying to fit a regression model to my flood damage data (x1 = water height, x2 = adaptation height, x3 = x1 - x2, y = damage), it is now time for my very first question on Stack Exchange. It is also my first time working with all of these different regression models, so please bear with me.

I am aiming for a simple model that captures the information in my dataset while respecting monotonicity constraints: increasing water height should lead to higher damage, and increasing adaptation height should lead to lower damage. I concluded that a regression tree or random forest would be a good option, since the assumptions of a linear model are violated. I tried several packages to fit a single regression tree or a random forest (DecisionTree.jl, XGBoost, scikit-learn trees, LightGBM, ...). I tried XGBoost and LightGBM in particular because they offer monotonicity constraints; however, the models I get with these constraints always fit the real data really poorly, while the unconstrained models fit really well but strongly violate the constraints.

My two questions are therefore:

  1. Is there a way to fit a single regression tree and afterwards modify it slightly so that the monotonicity constraints hold? (See the sketch right after this list for the kind of single-tree model I mean.)

or

  2. How could I improve the fit of the monotonically constrained models (e.g. LightGBM)?
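To make question 1 concrete, here is a minimal sketch of what I mean by a single-tree model (df is the DataFrame defined in the code further down; note that scikit-learn 1.4+ also accepts a monotonic_cst argument, which imposes the constraints at fit time rather than after the fact):

from sklearn.tree import DecisionTreeRegressor

# A single unconstrained regression tree on the two raw features.
tree = DecisionTreeRegressor(min_samples_leaf=1)
tree.fit(df[['slr_extreme', 'adaptation_height']], df['damage'])

# scikit-learn >= 1.4: impose the constraints at fit time instead of post hoc.
# +1: prediction non-decreasing in slr_extreme (water height),
# -1: prediction non-increasing in adaptation_height.
constrained_tree = DecisionTreeRegressor(min_samples_leaf=1, monotonic_cst=[1, -1])
constrained_tree.fit(df[['slr_extreme', 'adaptation_height']], df['damage'])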

I'll attach a Python code example with LightGBM below in case anyone is interested in taking a deeper look. I would be really grateful for any help on this topic! Suggestions for other modelling methods are also highly welcome (so far I've tried linear models and Gaussian processes). :)

import pandas as pd
from math import sqrt
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import lightgbm as lgb

df = pd.DataFrame( {'slr_extreme': {0: 30, 1: 30, 2: 50, 3: 50, 4: 70, 5: 70, 6: 90, 7: 90, 8: 110, 9: 110, 10: 130, 11: 130, 12: 150, 13: 150, 14: 160, 15: 160, 16: 170, 17: 170, 18: 180, 19: 180, 20: 190, 21: 190, 22: 193, 23: 198, 24: 200, 25: 200, 26: 200, 27: 210, 28: 217, 29: 220, 30: 221, 31: 229, 32: 230, 33: 230, 34: 231, 35: 240, 36: 244, 37: 245, 38: 246, 39: 250, 40: 254, 41: 255, 42: 256, 43: 259, 44: 261, 45: 267, 46: 270, 47: 270, 48: 272, 49: 273, 50: 274, 51: 279, 52: 280, 53: 282, 54: 286, 55: 287, 56: 287, 57: 288, 58: 288, 59: 290, 60: 293, 61: 296, 62: 300, 63: 301, 64: 305, 65: 310, 66: 310, 67: 310, 68: 317, 69: 317, 70: 320, 71: 330, 72: 331, 73: 331, 74: 334, 75: 336, 76: 340, 77: 340, 78: 340, 79: 342, 80: 352, 81: 352, 82: 352, 83: 354, 84: 362, 85: 362, 86: 362, 87: 363, 88: 363, 89: 368, 90: 368, 91: 370, 92: 370, 93: 376, 94: 381, 95: 382, 96: 383, 97: 383, 98: 385, 99: 386, 100: 387, 101: 391, 102: 392, 103: 395, 104: 398, 105: 399, 106: 400, 107: 400, 108: 400, 109: 400, 110: 400, 111: 400, 112: 400, 113: 400, 114: 400, 115: 400, 116: 400, 117: 400, 118: 400, 119: 400, 120: 400, 121: 400, 122: 406, 123: 407, 124: 410, 125: 413, 126: 414, 127: 429, 128: 431, 129: 436, 130: 437, 131: 438, 132: 438, 133: 442, 134: 443, 135: 444, 136: 446, 137: 447, 138: 449, 139: 450, 140: 451, 141: 453, 142: 454, 143: 456, 144: 457, 145: 460, 146: 461, 147: 469, 148: 470, 149: 471, 150: 471, 151: 475, 152: 479, 153: 481, 154: 489, 155: 493, 156: 495, 157: 496, 158: 497, 159: 498, 160: 499, 161: 500, 162: 500, 163: 500, 164: 500, 165: 501, 166: 502, 167: 504, 168: 505, 169: 505, 170: 510, 171: 510, 172: 513, 173: 518, 174: 518, 175: 519, 176: 520, 177: 527, 178: 528, 179: 529, 180: 535, 181: 541, 182: 541, 183: 544, 184: 544, 185: 544, 186: 550, 187: 562, 188: 564, 189: 567, 190: 571, 191: 571, 192: 576, 193: 576, 194: 579, 195: 579, 196: 582, 197: 586, 198: 589, 199: 589, 200: 590, 201: 590, 202: 591, 203: 592, 204: 592, 205: 600, 206: 600, 207: 600, 208: 600, 209: 600, 210: 600, 211: 602, 212: 603, 213: 605, 214: 612, 215: 618, 216: 619, 217: 622, 218: 622, 219: 623, 220: 623, 221: 624, 222: 626, 223: 628, 224: 629, 225: 630, 226: 633, 227: 635, 228: 635, 229: 641, 230: 642, 231: 647, 232: 648, 233: 648, 234: 650, 235: 652, 236: 653, 237: 655, 238: 655, 239: 657, 240: 657, 241: 662, 242: 666, 243: 669, 244: 669, 245: 670, 246: 671, 247: 672, 248: 675, 249: 688, 250: 692, 251: 695, 252: 700, 253: 700, 254: 700, 255: 700, 256: 700, 257: 700, 258: 700, 259: 700, 260: 700, 261: 700, 262: 700, 263: 700, 264: 700, 265: 700, 266: 700, 267: 712, 268: 713, 269: 721, 270: 737, 271: 747, 272: 748, 273: 755, 274: 764, 275: 773, 276: 780}, 'adaptation_height': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0, 22: 433, 23: 443, 24: 100, 25: 50, 26: 0, 27: 0, 28: 770, 29: 0, 30: 430, 31: 460, 32: 209, 33: 0, 34: 279, 35: 0, 36: 539, 37: 0, 38: 795, 39: 0, 40: 512, 41: 472, 42: 393, 43: 299, 44: 687, 45: 343, 46: 720, 47: 0, 48: 229, 49: 292, 50: 732, 51: 492, 52: 0, 53: 288, 54: 270, 55: 451, 56: 735, 57: 296, 58: 388, 59: 0, 60: 638, 61: 564, 62: 0, 63: 259, 64: 744, 65: 579, 66: 611, 67: 0, 68: 453, 69: 659, 70: 0, 71: 0, 72: 705, 73: 764, 74: 678, 75: 307, 76: 248, 77: 416, 78: 0, 79: 378, 80: 293, 81: 252, 82: 630, 83: 509, 84: 593, 85: 490, 86: 592, 87: 532, 88: 518, 89: 482, 90: 528, 91: 402, 92: 551, 93: 748, 94: 412, 95: 559, 96: 749, 97: 605, 98: 417, 99: 437, 100: 249, 101: 641, 102: 
662, 103: 665, 104: 725, 105: 485, 106: 150, 107: 0, 108: 50, 109: 100, 110: 150, 111: 0, 112: 50, 113: 100, 114: 150, 115: 0, 116: 50, 117: 100, 118: 150, 119: 0, 120: 50, 121: 100, 122: 359, 123: 527, 124: 285, 125: 652, 126: 372, 127: 517, 128: 765, 129: 278, 130: 677, 131: 612, 132: 702, 133: 367, 134: 335, 135: 466, 136: 401, 137: 727, 138: 474, 139: 303, 140: 740, 141: 261, 142: 627, 143: 204, 144: 681, 145: 446, 146: 693, 147: 782, 148: 202, 149: 458, 150: 390, 151: 602, 152: 357, 153: 225, 154: 264, 155: 742, 156: 585, 157: 639, 158: 330, 159: 308, 160: 334, 161: 0, 162: 0, 163: 0, 164: 0, 165: 754, 166: 325, 167: 596, 168: 760, 169: 544, 170: 523, 171: 479, 172: 498, 173: 318, 174: 684, 175: 363, 176: 312, 177: 469, 178: 427, 179: 239, 180: 255, 181: 273, 182: 424, 183: 634, 184: 382, 185: 398, 186: 240, 187: 589, 188: 616, 189: 660, 190: 618, 191: 376, 192: 776, 193: 758, 194: 572, 195: 554, 196: 562, 197: 716, 198: 669, 199: 215, 200: 464, 201: 283, 202: 541, 203: 784, 204: 355, 205: 328, 206: 0, 207: 0, 208: 0, 209: 0, 210: 0, 211: 794, 212: 350, 213: 623, 214: 536, 215: 574, 216: 730, 217: 231, 218: 514, 219: 567, 220: 315, 221: 349, 222: 368, 223: 384, 224: 697, 225: 346, 226: 421, 227: 699, 228: 495, 229: 576, 230: 582, 231: 774, 232: 216, 233: 339, 234: 546, 235: 437, 236: 606, 237: 244, 238: 654, 239: 707, 240: 236, 241: 220, 242: 501, 243: 323, 244: 267, 245: 787, 246: 223, 247: 646, 248: 478, 249: 673, 250: 208, 251: 769, 252: 100, 253: 50, 254: 150, 255: 100, 256: 50, 257: 150, 258: 100, 259: 50, 260: 150, 261: 100, 262: 50, 263: 150, 264: 100, 265: 50, 266: 150, 267: 714, 268: 550, 269: 691, 270: 625, 271: 711, 272: 790, 273: 407, 274: 505, 275: 447, 276: 650}, 'damage': {0: 2132, 1: 2132, 2: 33375, 3: 33375, 4: 257107, 5: 257107, 6: 758311, 7: 758311, 8: 4846285, 9: 4846285, 10: 25193143, 11: 25193143, 12: 37192942, 13: 37192942, 14: 68625782, 15: 68625782, 16: 72085371, 17: 72085371, 18: 96208447, 19: 96208447, 20: 113176187, 21: 113176187, 22: 17207, 23: 17777, 24: 25253, 25: 16844491, 26: 180725684, 27: 216044410, 28: 18974, 29: 253466555, 30: 19503, 31: 22169, 32: 27409, 33: 279673348, 34: 22343, 35: 315745661, 36: 24400, 37: 344889055, 38: 25987, 39: 359087022, 40: 26582, 41: 26207, 42: 27154, 43: 27068, 44: 28456, 45: 28911, 46: 27770, 47: 492143747, 48: 261362, 49: 29726, 50: 27635, 51: 28111, 52: 538782860, 53: 28725, 54: 33966, 55: 29217, 56: 29477, 57: 30605, 58: 30866, 59: 598434834, 60: 28894, 61: 31412, 62: 664315099, 63: 1963231, 64: 32861, 65: 31661, 66: 33627, 67: 735658332, 68: 32697, 69: 33437, 70: 774815984, 71: 836202742, 72: 35038, 73: 7776, 74: 35130, 75: 1612155, 76: 50188206, 77: 35967, 78: 871620832, 79: 34393, 80: 55241491, 81: 35399785, 82: 34579, 83: 37033, 84: 38809, 85: 37795, 86: 38037, 87: 37818, 88: 37891, 89: 37926, 90: 37302, 91: 39176, 92: 37764, 93: 37905, 94: 649241, 95: 39957, 96: 630007, 97: 650339, 98: 628154, 99: 639232, 100: 911135311, 101: 658599, 102: 630476, 103: 693128, 104: 713779, 105: 617451, 106: 1228363520, 107: 1170492313, 108: 1173308300, 109: 1173852997, 110: 1228363520, 111: 1170492313, 112: 1173308300, 113: 1173852997, 114: 1228363520, 115: 1170492313, 116: 1173308300, 117: 1173852997, 118: 1228363520, 119: 1170492313, 120: 1173308300, 121: 1173852997, 122: 29962757, 123: 990516, 124: 283411036, 125: 588663, 126: 20890700, 127: 28757, 128: 900872, 129: 1049376853, 130: 1066421, 131: 1149868, 132: 1084773, 133: 309164536, 134: 1339016742, 135: 974289, 136: 5018292, 137: 987802, 138: 1359626, 139: 1215046932, 140: 
1071044, 141: 228296699, 142: 962042, 143: 1547414910, 144: 978270, 145: 1251094, 146: 1438944, 147: 1627049, 148: 1531207637, 149: 1326279, 150: 110926711, 151: 1632563, 152: 1193710330, 153: 1640604168, 154: 1692006214, 155: 1321474, 156: 1590079, 157: 1128916, 158: 1728144756, 159: 1770536491, 160: 1733938960, 161: 1775063218, 162: 1775063218, 163: 1775063218, 164: 1775063218, 165: 1156540, 166: 1766307168, 167: 965208, 168: 755326, 169: 917120, 170: 908369, 171: 1356136, 172: 1113295, 173: 1958488899, 174: 980414, 175: 1957173621, 176: 1919050915, 177: 1853560, 178: 281832510, 179: 2050243979, 180: 2065173884, 181: 2008871945, 182: 1739690633, 183: 1180765, 184: 1824958381, 185: 2162917576, 186: 2233560864, 187: 1612742, 188: 1493167, 189: 7964275, 190: 8033201, 191: 2310957838, 192: 10893265, 193: 11109662, 194: 10611393, 195: 10693799, 196: 11924688, 197: 12345743, 198: 10720584, 199: 2440040389, 200: 391613496, 201: 2423881195, 202: 13530638, 203: 12974310, 204: 2484629667, 205: 2480809167, 206: 2524047104, 207: 2524047104, 208: 2524047104, 209: 2524047104, 210: 2524047104, 211: 12204528, 212: 2541470661, 213: 12597469, 214: 14510690, 215: 16159782, 216: 15807155, 217: 2676142112, 218: 15132382, 219: 16120664, 220: 2632091594, 221: 2653068109, 222: 2563103088, 223: 2680157352, 224: 13411429, 225: 2721086656, 226: 2703131792, 227: 13691012, 228: 652457950, 229: 17417511, 230: 17288278, 231: 19864307, 232: 2827712914, 233: 2835421223, 234: 17436469, 235: 2849478388, 236: 25319278, 237: 2848134489, 238: 21037281, 239: 21631118, 240: 2888070249, 241: 2949796391, 242: 2511557027, 243: 2951615954, 244: 3005020592, 245: 23467024, 246: 2967442697, 247: 28913065, 248: 2953288911, 249: 31693516, 250: 3120286741, 251: 35367700, 252: 3146352265, 253: 3146416805, 254: 3146633128, 255: 3146352265, 256: 3146416805, 257: 3146633128, 258: 3146352265, 259: 3146416805, 260: 3146633128, 261: 3146352265, 262: 3146416805, 263: 3146633128, 264: 3146352265, 265: 3146416805, 266: 3146633128, 267: 36126449, 268: 1139889056, 269: 38869499, 270: 41857920, 271: 42742203, 272: 827833, 273: 3587224405, 274: 3485575626, 275: 3629888460, 276: 48446772}})

# Attempt to imitate a single regression tree (note: n_estimators defaults
# to 100, so this actually fits 100 boosting rounds at learning_rate=1.0).
model = lgb.LGBMRegressor(num_leaves=277, min_child_samples=1, min_data_in_bin=1, learning_rate=1.0)
model.fit(df[['slr_extreme', 'adaptation_height']], df['damage'])
 
prediction = model.predict(df[['slr_extreme','adaptation_height']])

# The RMSE on the training data is very low.
sqrt(mean_squared_error(df['damage'], prediction))


# ...but the monotonicity constraint is violated: predicted damage increases
# with adaptation (dike) height.
grid = pd.DataFrame({'slr_extreme': [400.0] * 800, 'adaptation_height': range(800)})
curve = model.predict(grid)
plt.plot(curve)
plt.show()

[Plot: predicted damage vs. adaptation height at a fixed water height of 400; the curve rises in places, violating the constraint.]
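A quick numerical check of the same violation (a small sketch, reusing the curve computed above):

import numpy as np

# Count the steps where the prediction *increases* as adaptation height
# increases; a monotonically non-increasing model should produce zero.
print((np.diff(curve) > 0).sum(), 'increasing steps out of', len(curve) - 1)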


# Now the same model with monotonicity constraints.
# In LightGBM, 1 = non-decreasing and -1 = non-increasing, in feature-column
# order, so [1, -1] encodes "damage rises with water height, falls with
# adaptation height".
model = lgb.LGBMRegressor(num_leaves=277, min_child_samples=1, min_data_in_bin=1, learning_rate=1.0, monotone_constraints=[1, -1])
model.fit(df[['slr_extreme', 'adaptation_height']], df['damage'])
 
prediction = model.predict(df[['slr_extreme','adaptation_height']])

# Now the RMSE on the training data is very high!
sqrt(mean_squared_error(df['damage'], prediction))
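For reference, a sketch of the train/validation split mentioned in the comments below (default LightGBM parameters, purely to illustrate the evaluation pattern; the variable names are illustrative):

from sklearn.model_selection import train_test_split

X = df[['slr_extreme', 'adaptation_height']]
y = df['damage']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Evaluate the constrained model on held-out data instead of the training set.
holdout_model = lgb.LGBMRegressor(monotone_constraints=[1, -1])
holdout_model.fit(X_train, y_train)
print(sqrt(mean_squared_error(y_val, holdout_model.predict(X_val))))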


Comments:

  • Are you aware that it does not make sense to evaluate a grossly overfitting model on the training data? And that the chosen parameters are very strange (e.g. no number of rounds, but min_data_in_bin)? Furthermore, when the assumptions of a linear model are not fulfilled, one can try changing its specification! – Commented Sep 15, 2023 at 15:59
  • Thank you for your comment, I appreciate it! I agree that the model is overfitting; my assumption was that if I could fulfil the monotonicity constraints, the overfitting would not be a problem for my use case. I tried splitting the data into training and validation sets but faced the same problem. I got the parameters from another post where someone said they would mimic a single regression tree. Is that not the case? Do you have other suggestions for choosing the parameters? And can you explain what you mean by specification? I couldn't see any linear relationship in the data. – Commented Sep 15, 2023 at 17:32
  • The default number of trees is 100, I think. Linear models can represent almost any non-linear relationship, e.g. with splines. – Commented Sep 15, 2023 at 17:37
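To make the spline suggestion from the last comment concrete, a minimal sketch using scikit-learn's SplineTransformer (illustrative only: the basis expansion captures the non-linearity, but does not by itself enforce monotonicity):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LinearRegression

# Expand each feature into cubic B-spline basis functions, then fit an
# ordinary linear model on the expanded features.
spline_model = make_pipeline(
    SplineTransformer(degree=3, n_knots=8),
    LinearRegression(),
)
spline_model.fit(df[['slr_extreme', 'adaptation_height']], df['damage'])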
