Training LLMs for spam classification: I added 14 experiments comparing different approaches: https://lnkd.in/gTNVvGcj
- which token to train
- which layers to train
- different model sizes
- LoRA
- unmasking
- and more!
Any additional experiments you'd like to see? And here are the takeaways for the table shown in the picture:
1. Training the Last vs. First Output Token (Row 1 vs. 2): Training the last output token results in substantially better performance than training the first. This improvement is expected due to the causal self-attention mask (see the sketch after this list).
2. Training the Last Transformer Block vs. Last Layer (Row 1 vs. 3): Training the entire last transformer block also results in substantially better results than training only the last layer.
3. Training All Layers vs. Last Transformer Block (Row 1 vs. 4): Training all layers shows a modest improvement of ~2% over training just the last transformer block, but it takes almost three times as long to train.
4. Using Larger Pretrained Models (Row 1 vs. 5, and Row 1 vs. 6 and 7): Employing a 3x larger pretrained model leads to worse results. However, using a 5x larger model improves performance compared to the initial model, as anticipated. Similarly, the 12x larger model improves predictive performance even further. (The medium model was perhaps not well pretrained, or this particular finetuning configuration does not work as well for it.)
5. Using a Model with Random Weights vs. Pretrained Weights (Row 1 vs. 8): Using a model with random weights yields results that are only slightly worse (by 1.3%) than using pretrained weights.
6. Using LoRA (Low-Rank Adaptation) vs. Training All Layers (Row 9 vs. 4): Keeping the model frozen and adding trainable LoRA layers (see Appendix E for details) is a viable alternative to training all model parameters and even improves performance by 1 percentage point. As can be seen from the 1% smaller gap between training and validation accuracy when using LoRA, this is likely due to less overfitting.
7. Padding Input to Full Context Length vs. Longest Training Example (Row 1 vs. 10): Padding the input to the full supported context length gives significantly worse results.
8. Padding vs. No Padding (Row 1 vs. 11 and 12): The `--no_padding` option disables the padding in the dataset, which requires training the model with a batch size of 1 since the inputs have variable lengths. This results in better test accuracy but takes longer to train. In row 12, we additionally enable gradient accumulation with 8 steps to achieve the same effective batch size as in the other experiments.
9. Disabling the Causal Attention Mask (Row 1 vs. 13): Disables the causal attention mask used in the multi-head attention module, so every token can attend to every other token. Model accuracy is slightly improved compared to the GPT model with the causal mask.
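To make takeaway 1 concrete, here is a minimal sketch (my own illustration, not the repository's code) of attaching a classification head to the last token's output: with a causal mask, only the last position has attended to the full sequence. The `SpamClassifier` class, the `backbone` argument, and the dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpamClassifier(nn.Module):
    def __init__(self, backbone, emb_dim, num_classes=2):
        super().__init__()
        self.backbone = backbone                       # GPT-style model returning (batch, seq_len, emb_dim)
        self.head = nn.Linear(emb_dim, num_classes)    # small classification head replacing the LM output layer

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)              # (batch, seq_len, emb_dim)
        # Under a causal mask, only the LAST token has seen the whole input,
        # which is why training on it beats the first token (row 1 vs. 2).
        last_token = hidden[:, -1, :]                  # (batch, emb_dim)
        return self.head(last_token)                   # (batch, num_classes)

# Toy shape check with a stand-in backbone (just an embedding layer):
backbone = nn.Sequential(nn.Embedding(50257, 768))
clf = SpamClassifier(backbone, emb_dim=768)
logits = clf(torch.randint(0, 50257, (4, 120)))        # -> torch.Size([4, 2])
```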
How to Improve Predictive Accuracy
Explore top LinkedIn content from expert professionals.
Summary
Improving predictive accuracy ensures models produce more reliable and precise forecasts by refining data inputs, methods, and evaluation strategies, ultimately enhancing decision-making and performance in dynamic environments.
- Refine data inputs: Explore new features or transform existing ones to provide the model with richer information that can lead to better predictions.
- Experiment with approaches: Test alternative modeling techniques, such as survival modeling or model merging, to align predictions with real-world challenges and shifting patterns.
- Analyze and iterate: Perform error analysis to identify the most frequent mistakes, then implement targeted fixes to address these specific issues efficiently.
-
Honestly, most AI developers are still stuck in the last century. It blows my mind how few people are aware of Error Analysis. This is *literally* the fastest and most effective way to evaluate AI applications, and most teams are still stuck chasing ghosts. Please, stop tracking generic metrics and follow these steps:
1. Collect failure samples. Start reviewing the responses generated by your application. Write notes about each response, especially those that were mistakes. You don't need to format your notes in any specific way. Focus on describing what went wrong with the response.
2. Categorize your notes. After you have reviewed a good set of responses, take an LLM and ask it to find common patterns in your notes. Ask it to classify each note based on these patterns. You'll end up with categories covering every type of mistake your application made.
3. Diagnose the most frequent mistakes. Begin by focusing on the most common type of mistake. You don't want to waste time working with rare mistakes. Drill into the conversations, inputs, and logs leading to those incorrect samples. Try to understand what might be causing the problems.
4. Design targeted fixes. At this point, you want to determine how to eliminate the mistakes you diagnosed in the previous step as quickly and cheaply as possible. For example, you could tweak your prompts, add extra validation rules, find more training data, or modify the model.
5. Automate the evaluation process. You need to implement a simple process to rerun an evaluation set through your application and evaluate whether your fixes were effective. My recommendation is to use an LLM-as-a-Judge to run samples through the application, score them with a PASS/FAIL tag, and compute the results (see the sketch after this list).
6. Keep an eye on your metrics. Each category you identified during error analysis is a metric you want to track over time. You will get nowhere by obsessing over "relevance", "correctness", "completeness", "coherence", and any other out-of-the-box metrics. Forget about these and focus on the real issues you found.
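Here is a minimal sketch of the step-5 idea: an LLM-as-a-Judge loop that grades each sample PASS/FAIL per error category and tallies the results. The `call_llm` function, the prompt wording, and the sample dictionary keys are placeholders, not any specific framework's API.

```python
from collections import Counter

def call_llm(prompt: str) -> str:
    """Stand-in for whatever LLM client you already use (hosted API, local model, ...)."""
    raise NotImplementedError

JUDGE_PROMPT = """You are grading an AI application's response.
Failure category to check: {category}
User input: {user_input}
Application response: {response}
Answer with exactly one word: PASS or FAIL."""

def evaluate(samples):
    """samples: list of dicts with 'category', 'input', and 'response' keys."""
    results = Counter()
    for s in samples:
        verdict = call_llm(JUDGE_PROMPT.format(
            category=s["category"],
            user_input=s["input"],
            response=s["response"],
        )).strip().upper()
        # One metric per error category you found during error analysis.
        results[(s["category"], verdict)] += 1
    return results   # e.g. {("hallucinated_price", "FAIL"): 7, ("hallucinated_price", "PASS"): 43, ...}
```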
-
Want to boost LLM performance? Merge two LLMs together. I used to be active in data science competitions on Kaggle. The way to win a Kaggle competition is generally to create the biggest ensemble of models you can. Each model excels in its own corner of the prediction space, and when you put them together, you generally get a performance boost. Kind of like asking the same question of a lot of smart people. This same technique is coming to large language models. It is called merging. Merging is cost-effective (no GPU required) and produces winners. For example, the Marcoro14-7B-slerp model, created using the mergekit library (link below), became the best-performing model on the Open LLM Leaderboard as of Feb 1, 2024. The most common model merging technique is called SLERP (Spherical Linear Interpolation). Here’s how it works:
1/ Normalization: The input vectors from the LLMs are normalized to unit length. This ensures they represent directions rather than magnitudes.
2/ Angle Calculation: The angle between these vectors is calculated using their dot product.
3/ Interpolation: Spherical Linear Interpolation (SLERP) is used to smoothly interpolate between the vectors. It maintains a constant rate of change and preserves the geometric properties of the spherical space in which the vectors reside.
4/ Weight Calculation: Scale factors based on the interpolation factor and the angle between the vectors are computed. These factors are used to weigh the original vectors.
5/ Vector Summation: The weighted vectors are then summed to obtain the interpolated vector. (A toy implementation is sketched at the end of this post.)
Another technique, BRANCH-SOLVE-MERGE (BSM) from Meta, has shown significant improvements in evaluation correctness and consistency for each LLM, enhancing human-LLM agreement by up to 26% and reducing length and pairwise position biases by up to 50%. It also improved the coherence of the stories while improving constraint satisfaction by 12%.
Want to try it out? Start with MergeKit (https://buff.ly/4bg4wU1)
Here are a few more resources:
BSM paper: https://buff.ly/3vn0uck
LLM-Slerp-Merge: https://buff.ly/4a6bREH
HuggingFace article on LLM merging: https://buff.ly/43s3hO1
#ArtificialIntelligence #AIResearch #DeepLearning #NLP #LLM #ModelMerging
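For illustration, here is a toy NumPy version of those five SLERP steps applied to a single pair of weight tensors. It is a sketch of the idea under my own simplifications, not mergekit's implementation, which applies the interpolation layer by layer across two full checkpoints.

```python
import numpy as np

def slerp(w_a: np.ndarray, w_b: np.ndarray, t: float) -> np.ndarray:
    """Spherically interpolate between two weight tensors of the same shape (0 <= t <= 1)."""
    a, b = w_a.ravel(), w_b.ravel()
    # 1) Normalize to unit length so the vectors represent directions.
    a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
    # 2) Angle between the vectors from their dot product.
    omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        # Nearly parallel directions: fall back to plain linear interpolation.
        return (1 - t) * w_a + t * w_b
    # 3-4) Scale factors from the interpolation factor t and the angle omega.
    s_a = np.sin((1 - t) * omega) / np.sin(omega)
    s_b = np.sin(t * omega) / np.sin(omega)
    # 5) Weighted sum of the original tensors gives the interpolated weights.
    return s_a * w_a + s_b * w_b

# Example: blend two (hypothetical) layers' weight matrices halfway.
merged = slerp(np.random.randn(768, 768), np.random.randn(768, 768), t=0.5)
```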
-
Machine learning models are built to learn from customer behavior and make predictions. But when that behavior shifts rapidly, like during the pandemic, even the most accurate models can fall behind. That’s exactly what the Data Science team at Booking.com experienced while working on cancellation prediction. In a recent blog post, they shared how they evolved their approach to stay aligned with changing user behavior.
Originally, the team used traditional classification models to predict whether a booking would be canceled. These models performed well when patterns were stable, but they struggled in fast-changing environments. One key issue: they relied on historical outcomes that often took time to materialize. Plus, they only answered if a cancellation might happen, not when.
To address these challenges, the team shifted to survival modeling, which estimates the time until an event occurs. This approach enabled them to generate dynamic, time-sensitive predictions over the course of each booking. With multiple enhancements to their survival modeling pipeline, the team saw improved predictive accuracy, especially in volatile conditions. The shift didn’t just boost performance; it showed how reframing a business problem through a different modeling lens can unlock smarter, more adaptable solutions.
#MachineLearning #DataScience #SurvivalModeling #Classification #SnacksWeeklyonDataScience
– – –
Check out the "Snacks Weekly on Data Science" podcast and subscribe, where I explain the concepts discussed in this and future posts in more detail:
-- Spotify: https://lnkd.in/gKgaMvbh
-- Apple Podcast: https://lnkd.in/gj6aPBBY
-- Youtube: https://lnkd.in/gcwPeBmR
https://lnkd.in/gU3TsMQP
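As a rough illustration of that reframing (not Booking.com's actual pipeline), a survival model such as a Cox proportional hazards fit treats bookings that were never cancelled as censored and returns a full time-to-cancellation curve instead of a single yes/no label. The column names and the tiny toy dataset below are made up purely to show the shape of the approach.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Toy data: duration is days until cancellation (or check-in, if never cancelled);
# cancelled=0 marks a censored booking. Real use needs far more rows and features.
df = pd.DataFrame({
    "days_until_cancel_or_checkin": [3, 45, 12, 30, 7, 21],
    "cancelled":                    [1,  0,  1,  0, 1,  0],
    "lead_time_days":               [10, 90, 75, 20, 14, 60],
    "is_refundable":                [1,  0,  1,  1,  0,  0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="days_until_cancel_or_checkin", event_col="cancelled")

# Time-sensitive output: probability each booking is still "alive" (not cancelled)
# at every horizon, rather than a single binary prediction.
survival_curves = cph.predict_survival_function(df[["lead_time_days", "is_refundable"]])
```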
-
One of the most common questions I get is "𝐌𝐲 𝐩𝐫𝐞𝐝𝐢𝐜𝐭𝐢𝐯𝐞 𝐦𝐨𝐝𝐞𝐥 𝐢𝐬𝐧'𝐭 𝐰𝐨𝐫𝐤𝐢𝐧𝐠 𝐰𝐞𝐥𝐥 𝐞𝐧𝐨𝐮𝐠𝐡...𝐰𝐡𝐚𝐭 𝐬𝐡𝐨𝐮𝐥𝐝 𝐈 𝐝𝐨?" If model performance is disappointing, there are three main levers we can pull to try to improve it.
🔷 The first and most powerful lever is changing the data the model is using. We can add more features to the model or transform the features we’ve already included. In my experience, this is the most powerful of the levers.
🔷 Another lever we can pull is changing the type of model or the type of feature selection. If a regression model isn’t working well, we can try a decision tree, for example. We can also try a penalized regression model that performs feature selection automatically during the modeling process.
🔷 The other lever is tuning the hyperparameters. A hyperparameter is like a setting knob on a model: it adjusts the model's rules, so changes in the hyperparameters can produce models with very different results.
Any combination of these three levers may be used to improve model performance. Depending on what data is accessible, it may not be feasible to add more features, so the data scientist must rely on hyperparameter tuning and model selection to improve the quality of predictions.
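As a small, generic illustration of the second and third levers (my own example, not from the original post), the scikit-learn snippet below swaps a plain linear regression for a penalized one (Lasso, which drops uninformative features via its penalty) and tunes its regularization strength with cross-validation on a synthetic dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic data: 30 features, only 8 of which actually carry signal.
X, y = make_regression(n_samples=500, n_features=30, n_informative=8, noise=10.0, random_state=0)

# Baseline model before pulling any lever.
baseline = cross_val_score(LinearRegression(), X, y, cv=5).mean()

# Lever 2: a different model family with built-in feature selection (Lasso).
# Lever 3: tune its hyperparameter (the penalty strength alpha) via grid search.
search = GridSearchCV(Lasso(max_iter=10_000), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)

print(f"baseline R^2:    {baseline:.3f}")
print(f"tuned Lasso R^2: {search.best_score_:.3f} (alpha={search.best_params_['alpha']})")
```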