From the course: Introduction to Machine Learning with KNIME

Using CRISP-DM to evaluate tools - KNIME Tutorial

Using CRISP-DM to evaluate tools

- [Instructor] Okay, we're going to keep organized by using CRISP-DM, the Cross-Industry Standard Process for Data Mining, as a structure by which we can move through the various nodes within KNIME, and I think it's actually a fabulous way to evaluate software like KNIME. How good a job, we might ask, does it do at all the different tasks?

So I find that the most reliable way to find the CRISP-DM document is to go to the Wikipedia page, and then if you go down to the references, you'll be able to find copies of it here. So I've got it open, and this is the famous circular diagram that many folks have come across, but what I want to draw your attention to is the task diagram. This shows us the 24 tasks that fall under the six phases.

Now we're going to stay focused on those tasks that are software oriented: the ones in data understanding, data preparation, and modeling. Clearly we're also very focused on deployment, but as you can see, some of the deployment tasks are not really software related, like producing the final report and so on. What we are going to do is go through each and every one of the tasks in data understanding, data prep, and modeling, and find nodes within KNIME that are capable of those tasks. That way, even though we're going to move briskly through KNIME, we're going to see a tremendous variety of nodes.

I want to give you a preview of just that. So the first task in the data understanding phase is collect initial data, which includes data loading, so we're going to see the File Reader node. Additional tasks in the data understanding phase include describe data and verify data quality, and the Data Explorer node is going to help us with that. But we're also going to do some data visualizations, specifically with the Scatter Plot node and the Box Plot node.

Moving on to data preparation, which of course is a very labor-intensive phase, we have a number of CRISP-DM tasks to discuss here.
So CRISP-DM describes integrate data as those methods where information is combined from multiple tables or records. This is of course an important task, so we're going to be doing merging with the Joiner node and aggregation with the GroupBy node.

Additional data preparation tasks include construct data, and lately folks have been talking about this kind of thing as feature engineering. We're going to do just a simple example with the Math Formula node.

CRISP-DM describes the select task as deciding on what data is going to be used for the analysis. You know, a lot of folks just try to use all the data, but you actually have to be more thoughtful about that. Nodes that will help us control what data is presented to the modeling algorithms include the Column Filter node and the Row Sampling node. In fact, we'll take a quick peek at balancing.

We're also going to use the Row Filter node to select only our complete data, and then use the Missing Value node to impute missing data. Finally, we'll correct some formatting issues with the Cell Splitter node.

Moving on to the modeling phase, we're going to see how to do train/test partitioning with the Partitioning node, and we'll do a linear regression example. Then we'll do a decision tree example, also with partitioning, but showing the ROC Curve node as well as the Scorer node.

In short, KNIME is comprehensive. It has nodes for all the different tasks that we would have to perform, so walking through CRISP-DM is a really good way to organize our time. It's also a good way for you to learn KNIME.
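
The preview above amounts to a coverage checklist: for each CRISP-DM task, which node handles it? A minimal sketch of that checklist in plain Python follows. The dictionary layout, the helper names, and some of the task labels are assumptions made for illustration (the video names only some tasks explicitly), and none of this is a KNIME API. The node names are the ones mentioned in the video.

```python
# Checklist sketch: CRISP-DM phase -> task -> KNIME nodes mentioned in the video.
# The structure and helper functions are hypothetical; this does not call KNIME.
COVERAGE = {
    "Data understanding": {
        "Collect initial data": ["File Reader"],
        "Describe data": ["Data Explorer"],
        "Verify data quality": ["Data Explorer"],
        # Task label assumed; the video only says "data visualizations".
        "Data visualization": ["Scatter Plot", "Box Plot"],
    },
    "Data preparation": {
        "Integrate data": ["Joiner", "GroupBy"],
        "Construct data (feature engineering)": ["Math Formula"],
        "Select data": ["Column Filter", "Row Sampling"],
        # Task labels below are assumed descriptions, not CRISP-DM wording.
        "Keep complete rows and impute missing data": ["Row Filter", "Missing Value"],
        "Fix formatting": ["Cell Splitter"],
    },
    "Modeling": {
        "Train/test partitioning": ["Partitioning"],
        # No specific node is named for this example in the video.
        "Linear regression example": [],
        "Decision tree example": ["Partitioning", "ROC Curve", "Scorer"],
    },
}


def nodes_for_phase(phase):
    """Return the distinct node names listed under one phase, in first-seen order."""
    seen = []
    for nodes in COVERAGE[phase].values():
        for node in nodes:
            if node not in seen:
                seen.append(node)
    return seen


def tasks_without_nodes():
    """Return (phase, task) pairs for which no node has been recorded."""
    return [
        (phase, task)
        for phase, tasks in COVERAGE.items()
        for task, nodes in tasks.items()
        if not nodes
    ]


if __name__ == "__main__":
    for phase in COVERAGE:
        print(phase, "->", ", ".join(nodes_for_phase(phase)))
    print("No node named yet:", tasks_without_nodes())
```

Filling in the same table for another tool gives a side-by-side comparison, and any task left with an empty list is a gap worth investigating.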
