Uploaded by Skyl.ai

How to perform Secure Data Labeling for Machine Learning

The document discusses secure data labeling for machine learning, featuring experts in the field sharing their experiences and insights. It covers the data labeling process, best practices, and the importance of quality datasets while outlining the capabilities of skyl.ai's data labeling solution. The webinar also includes a live demo and offers consulting services for AI adoption and implementation.

1.
How to perform Secure Data Labeling for Machine Learning
2.
The Speaker: Fahid Basheer, Solutions Analyst. Solutions Analyst with experience working at the forefront of cutting-edge technology and leading innovative projects. Areas of expertise include solutions analysis and design.
3.
The Speaker: Shruti Tanwar, Lead - Data Science. Extensive experience building future tech products using Machine Learning and Artificial Intelligence. Areas of expertise include Deep Learning, Data Analysis, full stack development and building world-class products in the ecommerce, travel and healthcare sectors.
4.
The Panelist: Mohit Juneja, Solutions Architect. Technology enthusiast with extensive experience working in the information technology and services industry. Leads cutting-edge solutions for businesses using Machine Learning and Artificial Intelligence. Areas of expertise include Architecture design, Solutioning, Data Engineering and Deep Learning.
5.
Getting familiar with ‘Zoom’: All dial-in participants will be muted to enable the presenters to speak without interruption. Questions can be submitted via the Zoom Questions chat window and will be addressed at the end during Q&A. The recording will be emailed to you after the webinar. Please familiarize yourself with the Zoom ‘Control Panel’ on your screen.
6.
...In the next 45 minutes: (1) Deep Dive into the Data Labeling Process, (2) Live Demo of Secure Data Labeling Platform
7.
A quick intro about Skyl.ai: Machine Learning automation platform for unstructured data
● Guided Machine Learning Workflow
● Build & deploy ML models faster on unstructured data
● Collaborative Data Collection & Labeling
● Easy-to-use & scalable AI SaaS platform
8.
POLL #1: At what stage of Machine Learning adoption is your organization?
⊚ Exploring - Curious about it
⊚ Planning - Creating AI/ML strategy
⊚ Experimenting - Building proof of concepts
⊚ Scaling up - Some departments are using it
⊚ In production - Using it in product features
⊚ Transforming - AI/ML driven business
9.
01 Deep Dive: Data Labeling Process
10.
What is Data Labeling? Data labeling, also called data annotation/tagging, is the process of preparing labeled datasets for machine learning. Diagram: Images → Data labeling → Image Classification ML Model
11.
Examples of Data Labeling
12.
Computer Vision - Image Classification
13.
Computer Vision - Object Detection
14.
NLP - Text Extraction (NER)
15.
Types of Data Labeling
By Collaborator (Human-in-loop)
● In-house employees - Assigning tasks to an in-house labeling team / employees of the organization.
● Hire data labeling companies.
Automated data labeling
● Data labeling through machine learning algorithms
● Reduces the number of labeling tasks in the data labeling process
● Speeds up the labeling process
16.
3 Aspects for Building Quality Labeled Dataset
● Right team to carry out the data labeling project
● Right data labeling process & workflow
● Right data labeling tools in place
17.
Best Practices to ensure Quality Labeled Dataset
⊚ Conducting mock data labeling test
⊚ Measuring data labeling consistency
⊚ Auditing (QC) of labeled dataset periodically as it gets labeled
18.
Labeling Quality: Conducting Mock Data Labeling Test. Qualify the right collaborator for your data labeling job.
19.
Labeling Quality: Measuring Data Labeling Consistency. Measuring how consistently collaborators agree with each other. Example labels: Negative sentiment, Neutral sentiment, Positive sentiment.
20.
Labeling Quality: Review of Labeled Dataset. Reviewing the labeled dataset by flagging badly labeled data.
21.
Data Security
⊚ Access Control
⊚ Audit Log
⊚ Data Encryption
⊚ Data sources behind firewalls
22.
Data Security: Access Control. Having the right access control throughout the data labeling process. Roles: Data Scientist, Project Lead / Data Manager, Data Labeler (Collaborator), Data Labeling Job Reviewer.
23.
Data Security: Audit Log. Gain insights into user activities to meet organization and compliance needs.
24.
Data Security: Encryption. Securing data assets while at rest, in motion and in use. Diagram: Data Labeling Tool with encrypted data at rest, data in use, and data in motion (TLS/SSL).
25.
Data Security: Firewall. Restricting data to a private network by using an on-prem data labeling solution. Diagram: Private network, Public network.
26.
02 Demo of how to perform secure data labeling
27.
Skyl Labelwise: Data Labeling Process
28.
Demo of how to perform secure Data Labeling
29.
Skyl.ai Labelwise
● Guided Workflow: Data labeling solution for Computer Vision & NLP
● Quality Labeled Dataset: Right process and metrics in place to ensure quality data labeling
● Effective Collaboration: Collaborate and manage data labeling projects efficiently
● Early Visibility: Get early visibility; visualize and affirm correctness at every step of the way
● Scalable High-Performance: Access on-demand and scalable, high-performance infrastructure
● Security & Compliance: Access control, data encryption, audit log and on-prem solution
30.
Our AI Consulting Services
We can help you with...
⊚ AI Adoption Assessment
⊚ AI Systems Integration
⊚ AI Performance Evaluation
⊚ AI-Enabled Software Development
www.skyl.ai contact@skyl.ai
31.
Special offer for you...
⊚ Free Trial + POC
⊚ Complimentary 30 min consultation
⊚ AI Implementation Playbook
www.skyl.ai contact@skyl.ai
32.
Questions?
33.
Thank you for joining! We hope to hear from you soon. 85 Broad Street, New York, NY, 10004. +1 718 300 2104, +1 646 202 9343. contact@skyl.ai

Editor's Notes

#2 Hello everyone and welcome. Thank you for joining today’s webinar on How to perform Secure Data Labeling for Machine Learning. My name is Edwin Martinez and I’ll be your host today. First off, I’d like to introduce 3 expert speakers for today’s webinar.
#3 First we have Fahid Basheer. Fahid is a Solutions Consultant with experience working at the forefront of cutting-edge technology and leading innovative projects. His areas of expertise include solutions analysis and design. Welcome Fahid.
#4 Next we have Shruti Tanwar - Shruti is an expert in data science who is a veteran in building SaaS products using Machine Learning and AI. Her expertise includes Deep Learning and Data Analysis, as well as full stack development and building tech products in various different fields such as ecommerce, travel, and healthcare. Welcome, Shruti!
#5 And as a panelist, we have Mohit Juneja joining us. Mohit leads cutting-edge solutions for businesses using Machine Learning and AI. He’s an expert in Architecture design, Data Engineering, and Deep Learning. Welcome Mohit!
#6 Before we begin, I’d like to briefly talk about some Zoom features that will be relevant to us. All participants in the webinar will be muted to avoid any interruptions during the session. Any questions you might have can be submitted to the Zoom Questions chat window in the control panel, located on the bottom of the screen. We’ll make sure to address your questions during the Q&A session. Also, the recording of the webinar will be emailed to you afterwards, just in case you’ve missed any talking points or wish to view it again. So that’s all for the introduction. Now we’ll get started with the webinar and I’ll hand over the session to Fahid.
#7 Thank you for the introductions, Edwin, and welcome everyone. My name is Fahid, and I'll be one of the presenters for you today. Now, without further ado, let's take a look at what we are going to cover in the next 45 minutes. The first part of this webinar will be presented by me, and it will be about the Data Labeling Process for Machine Learning projects. We will first take a look at what Data Labeling means for these projects, the types of Data Labeling processes, some examples for different image- and text-based solutions, and a few best practices on how to maintain Quality Labeled Datasets. We will also cover a very crucial part of Data Labeling, which is maintaining Data Security during these Data Labeling processes, so you can leave this webinar with a comprehensive understanding of Data Labeling Management for your Machine Learning projects. In the second section of this webinar, Shruti, our resident data scientist, will demonstrate live how to perform data labeling and build these high-quality datasets using a secure data labeling platform like Skyl.ai. As Edwin mentioned earlier, we will have a Q&A session at the end of the webinar, so if you have any questions about the sections we cover, we will address them at that time.
#8 Let me start with a quick intro about the Skyl.ai platform and its capabilities. Skyl.ai is a Machine Learning automation platform that works with unstructured data, which can include text, images, audio and so on. Using Skyl.ai's platform, businesses can build and deploy high-quality NLP and Computer Vision models in hours rather than days or weeks. So how exactly does Skyl.ai do that? Skyl.ai provides an easy-to-use unified platform for the entire machine learning workflow, which includes data collection, data labeling, feature engineering, training the Machine Learning model by choosing out-of-the-box algorithms at scale, and, once the model is trained, carrying out model evaluation and finally one-click deployment and monitoring of the model in production. So with the Skyl.ai platform you can basically manage all of your ML projects in one place. It allows you to take your AI experiments to production quickly and at scale, which leads to faster model release iteration cycles. The best part is doing all this with no infrastructure or MLOps effort required; the platform takes care of your infrastructure needs.
#9 Now I'd like to launch a poll, which will give us an idea of what stage of machine learning adoption your organization is at right now, so please go ahead and select the appropriate option for your organization. I'm just launching the poll. I'm waiting for a few more people; if you could complete it in a few seconds before I close the poll, that would be great. Okay, I'm about to close the poll. Alright, interesting: about one third of our attendees are in the mid stage, experimenting and building proofs of concept, which is amazing. Following that, about 22% of our attendees are exploring or scaling up, so they're just above and below that level. About 11 percent of attendees have their models used in production, and 11 percent are in the planning stage, so we have people at various stages and a fairly even distribution. At each stage of machine learning adoption you may have different types of questions, maybe on data labeling and management, or on taking ML projects to fulfillment, and we will be glad to answer those questions during the Q&A session.
#10 Alright, now we approach the main part of the webinar: a deep dive into the Data Labeling Process. All Machine Learning problems start with data, preferably lots of data for which you already know the ground truth or the target answer. This type of data is what we call labeled data. Supervised machine learning algorithms learn from this labeled dataset (data that has been tagged with labels). This means that programmers do not explicitly program machine learning algorithms on how to make decisions; they program the models to learn from these labeled datasets. Often, this data is not readily available in a labeled form, and collecting and preparing these high-quality datasets is the most important step in solving a Machine Learning problem.
#11 Alright, now we take a look at what Data Labeling is. Data labeling, which can be referred to interchangeably as data annotation or tagging, is the process of preparing a labeled dataset. This data can be in the form of images, text or audio, and the output of the data labeling process is one or more tags, or ground truth values, relating to the input data. You can see on the screen that the image of a T-shirt has been tagged with around six labels. Machine learning models learn to recognize repetitive patterns in this data; that is, supervised machine learning algorithms learn from the labeled dataset. After a sufficient amount of labeled data is processed, machine learning models can identify the same patterns in data that has not been labeled, so labeling the data is the first step to a working machine learning model.
#12 Okay, now as I said earlier we will be taking a look at some examples of data labeling, with reference to the kind of problem we are trying to solve.
#13 So the first example is a computer vision implementation, specifically Image Classification. Image classification Machine Learning models classify or categorize images based on one or more attributes or labels that they can infer from the image. For training such a computer vision Machine Learning model, we would require a dataset of images labeled with these attributes. In this particular example we are trying to classify attributes of apparel from images. These attributes could be the type of apparel (top wear, bottom wear or head wear), the base color of the apparel (blue, green, yellow), who the clothing is meant for (men or women), and so forth. As part of building this labeled dataset for model training, you would provide a series of such images to a data labeler or collaborator, who will then label these attributes for each image.
#14 Another example of a computer vision ML model is the Object Detection type, where we not only classify the attributes in an image but also pinpoint the location of each attribute or object with a rectangular region, referred to as a bounding box. In this example, as shown in the figure, in order to build a model which can detect surgical equipment in a tray, such as Mayo scissors, forceps, etc., we build a labeled dataset that has these attributes as well as the location of each attribute marked in the form of a bounding box. This process would again be done by a data labeler or workforce, after which the dataset can be used to train this particular ML model.
#15 Now this one is a Natural Language Processing example, where we extract text (the location of the words) from a given sentence and tag that text under various categories. This type of model is referred to as an NER, or Named Entity Recognition, model. In this example we are labeling sentences, and these sentences are customer reviews, like the ones you find under a product sold on Amazon. We are trying to extract key attributes from these sentences, which could be things like the pros and cons of that product or mentions of other products. This labeled dataset can then be used to train an NER model which can extract these key insights from other product reviews. Now that we have gone through these examples of data labeling, let's focus on who does the data labeling and how it is done.
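To make the three examples concrete, here is a minimal Python sketch of what one labeled record might look like for each task. The class names, field names and the sample review are illustrative assumptions, not the format used by Skyl.ai or any particular labeling tool.

```python
from dataclasses import dataclass


@dataclass
class ClassificationLabel:
    """Image classification: one value per attribute for the whole image."""
    image: str
    attributes: dict  # hypothetical example: {"type": "top wear", "color": "blue"}


@dataclass
class BoxLabel:
    """Object detection: a class name plus a bounding box in pixel coordinates."""
    image: str
    name: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int


@dataclass
class SpanLabel:
    """NER: a category plus the character range it covers in the text."""
    category: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def make_span(text: str, phrase: str, category: str) -> SpanLabel:
    """Locate `phrase` in `text` and tag it (raises ValueError if absent)."""
    start = text.index(phrase)
    return SpanLabel(category, start, start + len(phrase))


def span_text(text: str, span: SpanLabel) -> str:
    """Return the piece of text that a span label points at."""
    return text[span.start:span.end]
```

For instance, `make_span(review, "battery life", "pro")` would tag the phrase "battery life" in a review as a pro, and `span_text` recovers the tagged words from the stored offsets.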
#16 There are 2 types of data labeling, based on who does it and how it is done. The first type is what we refer to as collaborator-based data labeling. This is the simplest labeling approach, where a human is employed to do the data labeling. You basically assign tasks to employees within your organization who are the subject matter experts; these experts know how to label data and what exactly needs to be labeled. Or you could hire a data labeling company, which would manage all aspects of a data labeling project and is usually paid on an hourly basis. How this works is that you provide the collaborators with your raw unlabeled data, like images or text, along with a set of instructions on what needs to be labeled and how. The second approach is automated data labeling. In this process we automate the data labeling through machine learning algorithms. We can use unsupervised learning, in which we cluster the unlabeled data into categories and then assign it to a human labeler to validate this semi-labeled dataset. Or we can use active learning, where the algorithm learns as you label and automatically labels the next item, and the data labeler validates and modifies the annotations or labels. An example of active learning is labeling video frames, where a human labels the first few frames of a video and the AI system then learns from these labeled frames and suggests labels for the upcoming frames. Automated data labeling can be useful particularly where there is a significant amount of unlabeled data, as with video frames, or for data that would otherwise be extremely expensive or time-consuming to label. An automated labeling system speeds up the process of labeling, and human labelers validate or correct the labeled dataset accordingly.
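One simple way to picture the automated approach is confidence-based routing: the model proposes a label for each unlabeled item, and only low-confidence items go to a human for labeling from scratch. The sketch below is a simplification of the loop described above, not Skyl.ai's implementation; the function name, the input format and the 0.9 threshold are assumptions.

```python
def route_predictions(predictions, threshold=0.9):
    """Split model predictions into suggested labels and items for humans.

    predictions: dict mapping item id -> dict of label -> probability
                 (hypothetical output of a model trained on earlier labels).
    Returns (suggested, needs_human): a dict of item id -> suggested label
    for confident predictions, and a list of item ids below the threshold.
    """
    suggested = {}
    needs_human = []
    for item_id, probabilities in predictions.items():
        # Pick the most probable label and its confidence.
        label, confidence = max(probabilities.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            suggested[item_id] = label
        else:
            needs_human.append(item_id)
    return suggested, needs_human
```

Suggested labels would still be shown to a labeler for validation, as the notes describe; the threshold only decides how much of the work arrives pre-filled.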
#17 Great, now we understand what data labeling is and how it's done. Now let's try to understand 3 aspects of data labeling that you need to consider when taking up a data labeling project to build a quality labeled dataset, which ultimately leads to a high-quality ML model. The first aspect is to have the right team in place to carry out the data labeling project. This involves data managers or project leads, whose responsibility is to make sure the data labeling projects run smoothly, which includes having the right data sources in place and having visibility of project progress. Then there are the data labelers, who are responsible for carrying out the labeling tasks based on the provided instructions, and the Quality Control reviewers, who are responsible for reviewing the labeled dataset and making sure that the job carried out by the data labelers follows the instructions. Finally, there is the data scientist or Machine Learning engineer, who validates and consumes these labeled datasets and builds the AI model. The second aspect is having a data labeling process or workflow, which involves defining the data labeling tasks and having the right checks in place so that you catch and flag any errors or low-quality labels, so as not to compromise your dataset. The final aspect is having the right data labeling tools in place: the software or labeling platform which your team uses to configure the data labeling workflow that suits your needs. With it, your human collaborators or data labeling partner can carry out data annotation effectively in a secured environment. There should be a mechanism to carry out QC activities with the right quality metrics in place, as well as visibility into the progress of labeling tasks. Lastly, the tool should provide easy and secure access to the labeled dataset for your data scientist, who will use it to train the ML model.
#18 Alright, now let's explore how we can ensure a labeled dataset is of high quality and what some of the best practices are for doing so. We shall go through 3 key practices which you must implement while carrying out a data labeling project. First is conducting mock labeling tests to qualify the right data labelers or collaborators. Second is measuring data labeling consistency to ensure labeling is reliable and consistent. Third is a quality review of the labeled data periodically, as it gets labeled, so that any anomalies you find can be corrected going forward. So let's take a closer look at these practices.
#19 Okay, first of all, your models will only be as good as your labeled dataset, so it's important to have the right collaborators or data labelers perform your data labeling. A key thing to understand is that when we are preparing the labeled dataset, the machine learning model will pick up the nuances of these collaborators, that is, how they perceive the data based on their age, gender, demographic and knowledge of the subject. Most of the time this is where bias is created in the labeled dataset, which may ultimately lead to a biased model. So now the question is, how do we qualify a data labeler? You need to try to have collaborators with diverse personalities, ages and demographics. These collaborators need to understand the subject well and should have no bias towards it. In order to know the capability of your data labelers and how well versed they are in the particular data labeling job you are going to assign to them, as a best practice have a mock data labeling test where a set of labeling tasks is served to all the candidates. You can then evaluate the candidates' results to qualify them for the data labeling job. Mock data labeling tests help build a pool of qualified collaborators whose judgement will help build a high-quality labeled dataset.
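A mock labeling test can be scored by comparing each candidate's answers with a small set of reference ("gold") labels. The following Python sketch assumes such a gold set exists and uses a hypothetical 80% accuracy cut-off; neither detail comes from the webinar.

```python
def score_candidate(answers, gold):
    """Fraction of gold tasks the candidate labeled correctly.

    answers, gold: dicts mapping task id -> label. Tasks the candidate
    skipped count as wrong.
    """
    if not gold:
        raise ValueError("gold set must not be empty")
    correct = sum(1 for task, label in gold.items() if answers.get(task) == label)
    return correct / len(gold)


def qualify(candidates, gold, min_accuracy=0.8):
    """Return the names of candidates whose mock-test accuracy meets the bar."""
    return sorted(
        name
        for name, answers in candidates.items()
        if score_candidate(answers, gold) >= min_accuracy
    )
```

Accuracy against a gold set is only one possible criterion; a project could equally weigh speed or per-category accuracy when qualifying collaborators.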
#20 The next best practice is measuring data labeling consistency: in simple terms, measuring how consistently collaborators agree with each other. As humans we may disagree with each other's opinions, and it's no different while performing data labeling tasks. There could be tasks where collaborators do not label the data as the other collaborators have. This could be due to various reasons, as mentioned earlier: differences in age, personality, demographic or knowledge of the subject. Consider this example: 3 collaborators have been assigned the same task of tagging the sentiment of a tweet, and all 3 of them have tagged it differently, which shows that they don't agree with each other. In a data labeling job we need to measure this degree of agreement, and one of the metrics to do so is the IRR, or Inter-Rater Reliability, score. The IRR score is calculated by having collaborators label the same data, measuring how many are in agreement and assigning a score to that group of collaborators. The higher the IRR score of your collaborators, the more consistent, and therefore the more reliable, your labeled data will be.
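The webinar does not say which IRR formula is used, so the sketch below shows the simplest variant, average pairwise percent agreement, as an assumption. Chance-corrected statistics such as Cohen's kappa or Fleiss' kappa are common alternatives.

```python
from itertools import combinations


def pairwise_agreement(labels_per_item):
    """Average pairwise agreement across items, between 0.0 and 1.0.

    labels_per_item: list of lists; each inner list holds the labels that
    different collaborators gave to the same item.
    """
    agreeing_pairs = 0
    total_pairs = 0
    for labels in labels_per_item:
        # Compare every pair of collaborators on this item.
        for first, second in combinations(labels, 2):
            total_pairs += 1
            if first == second:
                agreeing_pairs += 1
    return agreeing_pairs / total_pairs if total_pairs else 0.0
```

In the tweet example from the notes, three collaborators give three different sentiments, so none of the three pairs agree and the score for that item is 0.0.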
#21 Another best practice for data labeling quality is reviewing the labeled dataset. Data labeling is the most time-consuming and resource-intensive part of ML development, and these labeled datasets will be used by data scientists and ML engineers to build ML models. So it is important to review how well the data is annotated and whether it is in fact good enough to build a model. Consider this example, where a data labeling job is carried out for detecting pedestrians in video frames. If you look at this particular video frame, the bounding box does not cover the complete area of the pedestrian. This may lead to a poor machine learning model which detects pedestrians inaccurately. So it is important for ML engineers and data scientists to review the labeled dataset. It is always recommended to do so while the data is being labeled, to flag labeled data that is incorrect, and to provide feedback to the annotator for corrective measures.
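When a reviewer has a reference box for the same object, a partially covering box like the pedestrian example can be caught numerically with intersection over union (IoU). This is a common check rather than something the webinar prescribes; the box format `(x_min, y_min, x_max, y_max)` and the 0.7 threshold are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union else 0.0


def needs_review(labeled_box, reference_box, min_iou=0.7):
    """Flag a labeled box whose overlap with the reference box is too low."""
    return iou(labeled_box, reference_box) < min_iou
```

A box that covers only the top half of a pedestrian has an IoU of 0.5 against the full-body reference, so it would be flagged for feedback to the annotator.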
#22 Alright, now we approach a very important topic: the data security aspects of secure data labeling. There are 4 points to consider in a data labeling process. 1. Access control: regulating and controlling who has access to which aspects of data labeling. 2. Audit log: understanding who did what and when in a data labeling project. 3. Data encryption: securing the data when it is at rest, in motion and in use. 4. Data sources behind firewalls: in this case the raw data might be behind a firewall to adhere to business compliance requirements. So let's take a quick look at each one of these points.
#23 First off, access control. Access control is a security technique that regulates who or what can view or use resources in a given environment. It is very important to set up the right access control around a data labeling process, so that only authenticated and authorised users have access to data. Different types of access can be set for different user profiles in a data labeling process. A Project Lead or Data Manager can have access to set up the data sources or data assets that require labeling, and access to the progress of a labeling job (manage job). Data Labelers can have read-only access to the data that is assigned to them for labeling. Data Labeling Job Reviewers can also have read-only access to the data that is assigned to them for review. Data scientists can have access, via a secure API, to the labeled datasets they require.
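The role split described above maps naturally onto a role-to-permission table. The sketch below is illustrative only: the four roles come from the slide, but the permission names are hypothetical and do not describe Skyl.ai's actual access model.

```python
from enum import Enum


class Role(Enum):
    PROJECT_LEAD = "project_lead"
    LABELER = "labeler"
    REVIEWER = "reviewer"
    DATA_SCIENTIST = "data_scientist"


# Hypothetical permission names, chosen to mirror the notes above.
PERMISSIONS = {
    Role.PROJECT_LEAD: {"configure_data_sources", "manage_job", "view_progress"},
    Role.LABELER: {"read_assigned_data"},
    Role.REVIEWER: {"read_review_data"},
    Role.DATA_SCIENTIST: {"read_labeled_dataset_via_api"},
}


def is_allowed(role, action):
    """Deny by default: an action is allowed only if listed for the role."""
    return action in PERMISSIONS.get(role, set())
```

Denying by default means a newly added action is unavailable to every role until someone grants it explicitly, which is usually the safer failure mode.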
#24 Audit logs, also referred to as audit trails, are an important security requirement in data labeling processes. With an audit log you can gain insight into who did what and when in data labeling activities, such as accessing the data to be labeled, downloading or viewing the labeled dataset, or accessing the data to outsource it to third-party data labeling agencies. Audit trails are also important for meeting an organization's data security or industry compliance needs.
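An audit entry only needs to answer the three questions in the notes: who, what and when, plus the resource touched. Here is a minimal sketch using the Python standard library; the field names and the JSON-lines style are assumptions, not a description of any product's log format.

```python
import json
from datetime import datetime, timezone


def audit_record(user, action, resource, when=None):
    """Build one audit entry as a JSON string with a UTC timestamp.

    `when` can be passed explicitly (useful in tests); otherwise the
    current time in UTC is used.
    """
    timestamp = when if when is not None else datetime.now(timezone.utc)
    entry = {
        "who": user,
        "what": action,
        "resource": resource,
        "when": timestamp.isoformat(),
    }
    return json.dumps(entry, sort_keys=True)
```

In practice each entry would be appended to storage that labelers and reviewers cannot modify, so the trail remains trustworthy for compliance reviews.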
#25 Now, encryption of data. As mentioned earlier, your data needs to be secured at all points in your labeling projects: when it is at rest, when it is being moved to other platforms, and when it is being used or transformed. Encryption helps protect private information and sensitive data, like corporate confidential information, medical records, government classified information, etc. It also enhances the security of communication between client apps and servers: data in motion is secured using the Transport Layer Security (TLS) protocol, the successor to the Secure Sockets Layer (SSL) protocol. Encryption standards such as the Advanced Encryption Standard (AES) are used worldwide, and when you're using a data labeling platform you have to ensure that your data is secured whether it is at rest or in transit.
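For the data-in-motion part, a client talking to a labeling service can enforce certificate checks and a modern protocol version. The sketch below uses Python's standard `ssl` module; the TLS 1.2 minimum is an assumption chosen for illustration, not a requirement stated in the webinar.

```python
import ssl


def make_client_tls_context():
    """Create a client-side TLS context with certificate verification enabled.

    ssl.create_default_context() turns on hostname checking and requires a
    valid server certificate; the minimum version below additionally
    refuses legacy SSL and early TLS versions.
    """
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

The returned context can be passed to standard-library clients, for example as the `context` argument of `urllib.request.urlopen`, so every request to the labeling service is encrypted in transit.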
#26 The final security measure is a firewall, which prevents unauthorised access to or from a private computer network. It is an extra security layer between your private network devices and untrusted access from the Internet, securing your data from malicious attacks. In this instance there may be reservations about moving the data beyond the firewall, in which case you would want to use a platform that provides data labeling functionality on premises, meaning where the data resides. So now we have covered all of the data security aspects of data labeling.
#27 Okay that was the end of the first part of the webinar, thank you so much for listening to me, and now Shruti will present to you live demonstrations of how secure data labeling projects can be managed and executed. Thank you and over to you Shruti.
#29 5 minutes intro - 10 industry awareness - 15 min demo - 20 minutes QnA Define problem - Features model - How this model is built using skyl.ai
#30 TODO
#31 Thank you Fahid and Shruti, for the wonderful presentation and demo. I’d like to mention that Skyl.ai is dedicated to helping people with their Machine Learning journey by offering consulting services. Services such as: AI Adoption Assessment, Skyl will help find key areas in your organisation where AI is beneficial. AI Systems Integration, Skyl will help find the best ways to integrate AI models with your current software systems AI Performance Evaluation, Skyl will assess your AI workflow and help find ways to improve your AI system’s performance And AI-Enabled Software Development, The team at Skyl can develop highly customized, AI-enabled software solutions catered towards your organisation’s needs. If you’d like to find out more, please check out the skyl.ai website or you can send an email directly to contact@skyl.ai.
#32 Skyl also has special offers for those of you that are curious about incorporating Machine Learning to your business. Skyl offers a free trial, plus Proof of Concept. You’ll be able to interact with real data on the screen, just like we showed in the demo. You’ll experience the process of going from collecting & labeling the data… all the way to deploying a model! Skyl also offers a complimentary 30 min consultation and an AI Implementation Playbook to go along. This is a great opportunity to see how Skyl can provide Machine Learning solutions to your challenges.
#33 Alright, now it’s Q&A time! As a reminder, if you have any questions, go to the question box in your control panel, located on the bottom of your Zoom screen. We’ll try to answer as many questions as possible in the time that we have left. So let’s answer some questions. Sample questions: Fahid. Ques: How do you price your product? Ans: We price our labeling tool on a pay-as-you-use basis, so it depends on the size of the data that you want labeled, but you can check out all our plans on the Skyl.ai/plans page on our website. Shruti. Ques: Would Labelwise also provide the labeling workforce for data labeling, or does that need to be taken care of by the users / customers? Ok, that’s all the time we have for questions today, but feel free to contact us with your specific questions and we’ll make sure to get them answered.
#34 All right, so we have reached the end of the webinar. We hope you enjoyed it. We have a lot more webinars coming up on different machine learning topics and how they can be implemented in different businesses and industries, so don’t miss out, and make sure you sign up for upcoming webinars as well. Thank you for joining, and I hope you have a wonderful day.
