After spending several weeks trying to fit a regression model to my flood damage data (x1 = water height, x2 = adaptation height, x3 = x1 - x2, y = damage), it is time for my very first question on StackExchange. It is also my first time working with these different regression models, so please bear with me.
I am aiming for a simple model that captures the information in my dataset while respecting monotonicity constraints: increasing water height must lead to higher damage, and increasing adaptation height must lead to lower damage. Since linearity assumptions are violated, I concluded that a regression tree or random forest would be a good option. I tried several packages to fit a single regression tree or a random forest (DecisionTree.jl, XGBoost, scikit-learn trees, LightGBM, ...). I chose XGBoost and LightGBM in particular because they support monotone constraints; however, the models I get with the constraints enabled always fit the real data very poorly, while models without the constraints fit very well but strongly violate the constraints.
My two questions are therefore:
- Is there a way to fit a single regression tree first and afterwards modify the tree slightly so that the monotone constraints hold?
or
- How could I improve the fit of the monotone-constrained models (e.g. LightGBM)?
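To make the "modify afterwards" idea in my first question concrete, here is a toy sketch of the kind of post-processing I have in mind (my own illustration, not something from the packages above): take a model's prediction curve along one feature and project it onto the nearest monotone curve with isotonic regression.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy stand-in for an unconstrained model's predictions along
# adaptation_height: roughly decreasing, but with local violations.
heights = np.arange(0, 800, 10).astype(float)
raw_preds = 1e9 * np.exp(-heights / 300) + 5e7 * np.sin(heights / 40)

# Least-squares projection onto a non-increasing curve,
# i.e. predicted damage must not rise with dike height.
iso = IsotonicRegression(increasing=False)
monotone_preds = iso.fit_transform(heights, raw_preds)
```

Of course this only fixes a single 1-D slice; enforcing the constraints jointly over both features across the whole tree is the part I don't know how to do, hence the question.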
I'll attach a Python code example using LightGBM in case anyone wants to take a deeper look. I would be really grateful for any help on this topic! Suggestions for other modeling approaches are also highly welcome (so far I've tried linear models and Gaussian processes). :)
import pandas as pd
from math import sqrt
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import lightgbm as lgb
df = pd.DataFrame( {'slr_extreme': {0: 30, 1: 30, 2: 50, 3: 50, 4: 70, 5: 70, 6: 90, 7: 90, 8: 110, 9: 110, 10: 130, 11: 130, 12: 150, 13: 150, 14: 160, 15: 160, 16: 170, 17: 170, 18: 180, 19: 180, 20: 190, 21: 190, 22: 193, 23: 198, 24: 200, 25: 200, 26: 200, 27: 210, 28: 217, 29: 220, 30: 221, 31: 229, 32: 230, 33: 230, 34: 231, 35: 240, 36: 244, 37: 245, 38: 246, 39: 250, 40: 254, 41: 255, 42: 256, 43: 259, 44: 261, 45: 267, 46: 270, 47: 270, 48: 272, 49: 273, 50: 274, 51: 279, 52: 280, 53: 282, 54: 286, 55: 287, 56: 287, 57: 288, 58: 288, 59: 290, 60: 293, 61: 296, 62: 300, 63: 301, 64: 305, 65: 310, 66: 310, 67: 310, 68: 317, 69: 317, 70: 320, 71: 330, 72: 331, 73: 331, 74: 334, 75: 336, 76: 340, 77: 340, 78: 340, 79: 342, 80: 352, 81: 352, 82: 352, 83: 354, 84: 362, 85: 362, 86: 362, 87: 363, 88: 363, 89: 368, 90: 368, 91: 370, 92: 370, 93: 376, 94: 381, 95: 382, 96: 383, 97: 383, 98: 385, 99: 386, 100: 387, 101: 391, 102: 392, 103: 395, 104: 398, 105: 399, 106: 400, 107: 400, 108: 400, 109: 400, 110: 400, 111: 400, 112: 400, 113: 400, 114: 400, 115: 400, 116: 400, 117: 400, 118: 400, 119: 400, 120: 400, 121: 400, 122: 406, 123: 407, 124: 410, 125: 413, 126: 414, 127: 429, 128: 431, 129: 436, 130: 437, 131: 438, 132: 438, 133: 442, 134: 443, 135: 444, 136: 446, 137: 447, 138: 449, 139: 450, 140: 451, 141: 453, 142: 454, 143: 456, 144: 457, 145: 460, 146: 461, 147: 469, 148: 470, 149: 471, 150: 471, 151: 475, 152: 479, 153: 481, 154: 489, 155: 493, 156: 495, 157: 496, 158: 497, 159: 498, 160: 499, 161: 500, 162: 500, 163: 500, 164: 500, 165: 501, 166: 502, 167: 504, 168: 505, 169: 505, 170: 510, 171: 510, 172: 513, 173: 518, 174: 518, 175: 519, 176: 520, 177: 527, 178: 528, 179: 529, 180: 535, 181: 541, 182: 541, 183: 544, 184: 544, 185: 544, 186: 550, 187: 562, 188: 564, 189: 567, 190: 571, 191: 571, 192: 576, 193: 576, 194: 579, 195: 579, 196: 582, 197: 586, 198: 589, 199: 589, 200: 590, 201: 590, 202: 591, 203: 592, 204: 592, 205: 600, 206: 600, 207: 600, 
208: 600, 209: 600, 210: 600, 211: 602, 212: 603, 213: 605, 214: 612, 215: 618, 216: 619, 217: 622, 218: 622, 219: 623, 220: 623, 221: 624, 222: 626, 223: 628, 224: 629, 225: 630, 226: 633, 227: 635, 228: 635, 229: 641, 230: 642, 231: 647, 232: 648, 233: 648, 234: 650, 235: 652, 236: 653, 237: 655, 238: 655, 239: 657, 240: 657, 241: 662, 242: 666, 243: 669, 244: 669, 245: 670, 246: 671, 247: 672, 248: 675, 249: 688, 250: 692, 251: 695, 252: 700, 253: 700, 254: 700, 255: 700, 256: 700, 257: 700, 258: 700, 259: 700, 260: 700, 261: 700, 262: 700, 263: 700, 264: 700, 265: 700, 266: 700, 267: 712, 268: 713, 269: 721, 270: 737, 271: 747, 272: 748, 273: 755, 274: 764, 275: 773, 276: 780}, 'adaptation_height': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0, 22: 433, 23: 443, 24: 100, 25: 50, 26: 0, 27: 0, 28: 770, 29: 0, 30: 430, 31: 460, 32: 209, 33: 0, 34: 279, 35: 0, 36: 539, 37: 0, 38: 795, 39: 0, 40: 512, 41: 472, 42: 393, 43: 299, 44: 687, 45: 343, 46: 720, 47: 0, 48: 229, 49: 292, 50: 732, 51: 492, 52: 0, 53: 288, 54: 270, 55: 451, 56: 735, 57: 296, 58: 388, 59: 0, 60: 638, 61: 564, 62: 0, 63: 259, 64: 744, 65: 579, 66: 611, 67: 0, 68: 453, 69: 659, 70: 0, 71: 0, 72: 705, 73: 764, 74: 678, 75: 307, 76: 248, 77: 416, 78: 0, 79: 378, 80: 293, 81: 252, 82: 630, 83: 509, 84: 593, 85: 490, 86: 592, 87: 532, 88: 518, 89: 482, 90: 528, 91: 402, 92: 551, 93: 748, 94: 412, 95: 559, 96: 749, 97: 605, 98: 417, 99: 437, 100: 249, 101: 641, 102: 662, 103: 665, 104: 725, 105: 485, 106: 150, 107: 0, 108: 50, 109: 100, 110: 150, 111: 0, 112: 50, 113: 100, 114: 150, 115: 0, 116: 50, 117: 100, 118: 150, 119: 0, 120: 50, 121: 100, 122: 359, 123: 527, 124: 285, 125: 652, 126: 372, 127: 517, 128: 765, 129: 278, 130: 677, 131: 612, 132: 702, 133: 367, 134: 335, 135: 466, 136: 401, 137: 727, 138: 474, 139: 303, 140: 740, 141: 261, 142: 627, 143: 204, 144: 681, 145: 446, 146: 693, 147: 782, 
148: 202, 149: 458, 150: 390, 151: 602, 152: 357, 153: 225, 154: 264, 155: 742, 156: 585, 157: 639, 158: 330, 159: 308, 160: 334, 161: 0, 162: 0, 163: 0, 164: 0, 165: 754, 166: 325, 167: 596, 168: 760, 169: 544, 170: 523, 171: 479, 172: 498, 173: 318, 174: 684, 175: 363, 176: 312, 177: 469, 178: 427, 179: 239, 180: 255, 181: 273, 182: 424, 183: 634, 184: 382, 185: 398, 186: 240, 187: 589, 188: 616, 189: 660, 190: 618, 191: 376, 192: 776, 193: 758, 194: 572, 195: 554, 196: 562, 197: 716, 198: 669, 199: 215, 200: 464, 201: 283, 202: 541, 203: 784, 204: 355, 205: 328, 206: 0, 207: 0, 208: 0, 209: 0, 210: 0, 211: 794, 212: 350, 213: 623, 214: 536, 215: 574, 216: 730, 217: 231, 218: 514, 219: 567, 220: 315, 221: 349, 222: 368, 223: 384, 224: 697, 225: 346, 226: 421, 227: 699, 228: 495, 229: 576, 230: 582, 231: 774, 232: 216, 233: 339, 234: 546, 235: 437, 236: 606, 237: 244, 238: 654, 239: 707, 240: 236, 241: 220, 242: 501, 243: 323, 244: 267, 245: 787, 246: 223, 247: 646, 248: 478, 249: 673, 250: 208, 251: 769, 252: 100, 253: 50, 254: 150, 255: 100, 256: 50, 257: 150, 258: 100, 259: 50, 260: 150, 261: 100, 262: 50, 263: 150, 264: 100, 265: 50, 266: 150, 267: 714, 268: 550, 269: 691, 270: 625, 271: 711, 272: 790, 273: 407, 274: 505, 275: 447, 276: 650}, 'damage': {0: 2132, 1: 2132, 2: 33375, 3: 33375, 4: 257107, 5: 257107, 6: 758311, 7: 758311, 8: 4846285, 9: 4846285, 10: 25193143, 11: 25193143, 12: 37192942, 13: 37192942, 14: 68625782, 15: 68625782, 16: 72085371, 17: 72085371, 18: 96208447, 19: 96208447, 20: 113176187, 21: 113176187, 22: 17207, 23: 17777, 24: 25253, 25: 16844491, 26: 180725684, 27: 216044410, 28: 18974, 29: 253466555, 30: 19503, 31: 22169, 32: 27409, 33: 279673348, 34: 22343, 35: 315745661, 36: 24400, 37: 344889055, 38: 25987, 39: 359087022, 40: 26582, 41: 26207, 42: 27154, 43: 27068, 44: 28456, 45: 28911, 46: 27770, 47: 492143747, 48: 261362, 49: 29726, 50: 27635, 51: 28111, 52: 538782860, 53: 28725, 54: 33966, 55: 29217, 56: 29477, 57: 30605, 58: 
30866, 59: 598434834, 60: 28894, 61: 31412, 62: 664315099, 63: 1963231, 64: 32861, 65: 31661, 66: 33627, 67: 735658332, 68: 32697, 69: 33437, 70: 774815984, 71: 836202742, 72: 35038, 73: 7776, 74: 35130, 75: 1612155, 76: 50188206, 77: 35967, 78: 871620832, 79: 34393, 80: 55241491, 81: 35399785, 82: 34579, 83: 37033, 84: 38809, 85: 37795, 86: 38037, 87: 37818, 88: 37891, 89: 37926, 90: 37302, 91: 39176, 92: 37764, 93: 37905, 94: 649241, 95: 39957, 96: 630007, 97: 650339, 98: 628154, 99: 639232, 100: 911135311, 101: 658599, 102: 630476, 103: 693128, 104: 713779, 105: 617451, 106: 1228363520, 107: 1170492313, 108: 1173308300, 109: 1173852997, 110: 1228363520, 111: 1170492313, 112: 1173308300, 113: 1173852997, 114: 1228363520, 115: 1170492313, 116: 1173308300, 117: 1173852997, 118: 1228363520, 119: 1170492313, 120: 1173308300, 121: 1173852997, 122: 29962757, 123: 990516, 124: 283411036, 125: 588663, 126: 20890700, 127: 28757, 128: 900872, 129: 1049376853, 130: 1066421, 131: 1149868, 132: 1084773, 133: 309164536, 134: 1339016742, 135: 974289, 136: 5018292, 137: 987802, 138: 1359626, 139: 1215046932, 140: 1071044, 141: 228296699, 142: 962042, 143: 1547414910, 144: 978270, 145: 1251094, 146: 1438944, 147: 1627049, 148: 1531207637, 149: 1326279, 150: 110926711, 151: 1632563, 152: 1193710330, 153: 1640604168, 154: 1692006214, 155: 1321474, 156: 1590079, 157: 1128916, 158: 1728144756, 159: 1770536491, 160: 1733938960, 161: 1775063218, 162: 1775063218, 163: 1775063218, 164: 1775063218, 165: 1156540, 166: 1766307168, 167: 965208, 168: 755326, 169: 917120, 170: 908369, 171: 1356136, 172: 1113295, 173: 1958488899, 174: 980414, 175: 1957173621, 176: 1919050915, 177: 1853560, 178: 281832510, 179: 2050243979, 180: 2065173884, 181: 2008871945, 182: 1739690633, 183: 1180765, 184: 1824958381, 185: 2162917576, 186: 2233560864, 187: 1612742, 188: 1493167, 189: 7964275, 190: 8033201, 191: 2310957838, 192: 10893265, 193: 11109662, 194: 10611393, 195: 10693799, 196: 11924688, 197: 
12345743, 198: 10720584, 199: 2440040389, 200: 391613496, 201: 2423881195, 202: 13530638, 203: 12974310, 204: 2484629667, 205: 2480809167, 206: 2524047104, 207: 2524047104, 208: 2524047104, 209: 2524047104, 210: 2524047104, 211: 12204528, 212: 2541470661, 213: 12597469, 214: 14510690, 215: 16159782, 216: 15807155, 217: 2676142112, 218: 15132382, 219: 16120664, 220: 2632091594, 221: 2653068109, 222: 2563103088, 223: 2680157352, 224: 13411429, 225: 2721086656, 226: 2703131792, 227: 13691012, 228: 652457950, 229: 17417511, 230: 17288278, 231: 19864307, 232: 2827712914, 233: 2835421223, 234: 17436469, 235: 2849478388, 236: 25319278, 237: 2848134489, 238: 21037281, 239: 21631118, 240: 2888070249, 241: 2949796391, 242: 2511557027, 243: 2951615954, 244: 3005020592, 245: 23467024, 246: 2967442697, 247: 28913065, 248: 2953288911, 249: 31693516, 250: 3120286741, 251: 35367700, 252: 3146352265, 253: 3146416805, 254: 3146633128, 255: 3146352265, 256: 3146416805, 257: 3146633128, 258: 3146352265, 259: 3146416805, 260: 3146633128, 261: 3146352265, 262: 3146416805, 263: 3146633128, 264: 3146352265, 265: 3146416805, 266: 3146633128, 267: 36126449, 268: 1139889056, 269: 38869499, 270: 41857920, 271: 42742203, 272: 827833, 273: 3587224405, 274: 3485575626, 275: 3629888460, 276: 48446772}})
# imitate a single tree regressor
model = lgb.LGBMRegressor(num_leaves=277, min_child_samples=1,
                          min_data_in_bin=1, learning_rate=1.0)
model.fit(df[['slr_extreme','adaptation_height']], df['damage'])
prediction = model.predict(df[['slr_extreme','adaptation_height']])
# RMSE is very low
sqrt(mean_squared_error(df["damage"], prediction))
# but the monotone constraint is violated: increasing the dike
# (adaptation) height can lead to higher predicted damage
grid = pd.DataFrame({'slr_extreme': 400.0, 'adaptation_height': range(800)})
plt.plot(model.predict(grid))
plt.show()
# now the same version with monotone constraints
# (signs follow the feature column order: damage must increase
# with slr_extreme and decrease with adaptation_height)
model = lgb.LGBMRegressor(num_leaves=277, min_child_samples=1,
                          min_data_in_bin=1, learning_rate=1.0,
                          monotone_constraints=[1, -1])
model.fit(df[['slr_extreme','adaptation_height']], df['damage'])
prediction = model.predict(df[['slr_extreme','adaptation_height']])
# RMSE is very high!!
sqrt(mean_squared_error(df["damage"], prediction))
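To quantify the violations rather than just eyeballing the plot above, I use a small helper (my own sketch; the function name is made up) that counts the adjacent steps of a prediction sweep that move against the expected direction:

```python
import numpy as np

def count_violations(curve, expected_sign):
    """Count adjacent steps whose sign contradicts the expected
    monotone direction (+1 = non-decreasing, -1 = non-increasing)."""
    steps = np.diff(np.asarray(curve, dtype=float))
    return int(np.sum(expected_sign * steps < 0))

# Toy curves standing in for model.predict(...) sweeps:
increasing = [0, 1, 2, 3]  # obeys a +1 constraint
wiggly = [3, 2, 4, 1]      # violates a -1 constraint once
print(count_violations(increasing, +1))  # 0 violations
print(count_violations(wiggly, -1))      # 1 violation (the 2 -> 4 step)
```

In my setup I feed it a sweep like `model.predict(grid)` over `adaptation_height` at fixed `slr_extreme` with `expected_sign=-1`; the unconstrained model reports many violations, the constrained one zero.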
