Data Analytics (BCS-052)
Unit 1
Introduction to Data Analytics
Syllabus
Introduction to Data Analytics: Sources and nature of data, classification of data
(structured, semi-structured, unstructured), characteristics of data, introduction to Big Data
platform, need of data analytics, evolution of analytic scalability, analytic process and
tools, analysis vs reporting, modern data analytic tools, applications of data analytics.
Data Analytics Lifecycle: Need, key roles for successful analytic projects, various phases
of data analytics lifecycle – discovery, data preparation, model planning, model building,
communicating results, operationalization.
Data
• Data is a collection of information that can be used to answer questions
and solve business challenges.
• Data can be organized in the form of charts, tables, or graphs.
• It can be made up of facts, numbers, names, figures, or descriptions of
things.
Analytics
• Analytics is the process of using math and machine learning to find
patterns in data sets and gain insights.
• Data analytics is a broad field that includes analytics, as well as other
processes like collecting and storing data.
Data Analytics
• Data analytics is the process of analyzing raw data to find patterns, draw
conclusions, and make informed decisions.
• Data analytics is the collection, transformation, and organization of data to
draw conclusions, make predictions, and drive informed decision making.
• It's a broad field that uses tools, technologies, and processes to transform
data into actionable insights.
Sources of Data
Data is collected in the following ways:
• Primary sources
• Secondary sources
Sources of Data (Contd…)
1. Primary Sources Data
• Data that is raw, original, and extracted directly from official sources is known as primary data.
• Data collected for the first time by an individual, a group of individuals, an institution, or an organisation is known as a primary source of data.
• This type of data is collected directly through techniques such as questionnaires, interviews, and surveys.
Sources of Data (Contd…)
1.1. Interview Method
• In this method, data is collected by interviewing the target audience; the person who conducts the interview is called the interviewer, and the person who answers is known as the interviewee.
• Some basic business- or product-related questions are asked and noted down in the form of notes, audio, or video, and this data is stored for processing.
• Interviews can be both structured and unstructured, such as personal interviews or formal interviews conducted by telephone, face to face, email, etc.
Sources of Data (Contd…)
1.2. Survey Method
• The survey method is a research process in which a list of relevant questions is asked and the answers are noted down in the form of text, audio, or video.
• The survey can be conducted in both online and offline modes, for example through website forms and email.
• The survey answers are then stored for data analysis.
• Examples are online surveys and social media polls.
Sources of Data (Contd…)
1.3. Observation Method
• The observation method is a method of data collection in which the researcher
observes the behavior and practices of the target audience using some data
collecting tool and stores the observed data in the form of text, audio, video, or
any raw formats.
• In this method, the data is collected directly by posing a few questions to the participants.
• For example, observing a group of customers and their behavior towards the
products.
Sources of Data (Contd…)
1.4. Experimental Method
• The experimental method is the process of collecting data through
performing experiments, research, and investigation.
• The most frequently used experimental designs are CRD (completely randomized design), RBD (randomized block design), LSD (Latin square design), and FD (factorial design).
Sources of Data (Contd…)
2. Secondary Sources of Data
• Secondary data is data which has already been collected and is reused for some valid purpose.
• This type of data is previously recorded from primary data, and it has two
types of sources: Internal Source and External source.
• Secondary sources of data consist of published and unpublished records
which include government publications, documents and reports.
Sources of Data (Contd…)
2.1. Internal source
• These types of data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc.
• The cost and time required to obtain data from internal sources are low.
Sources of Data (Contd…)
2.2. External Source
• Data which cannot be found within internal organizations and can be gained only through external third-party resources is external-source data.
• The cost and time required are greater because such sources contain huge amounts of data.
• Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, syndicate services, and other non-governmental publications.
Nature of Data
• The nature of data can be understood on the basis of the class to which it
belongs.
• By nature, data are either quantitative or qualitative.
1. Qualitative Data: a group of non-numerical data such as words and sentences, mostly focusing on the behavior and actions of the group.
2. Quantitative Data: data in numerical form that can be calculated using different scientific tools and sampling methods.
Nature of Data (Contd…)
With reference to the types of data their nature is as follows:
1. Numerical Data
2. Descriptive Data
3. Graphic and Symbolic Data
4. Enumerative Data
5. Descriptive Data
Nature of Data (Contd…)
1. Numerical Data: All data in science are derived by measurement and stated in
numerical values. Most of their nature is numerical.
2. Descriptive Data: Science is not known for descriptive data. However, qualitative
data in sciences are expressed in terms of definitive statements concerning objects.
These may be viewed as descriptive data. Here, the nature of data is descriptive.
3. Graphic and Symbolic Data: Graphic and symbolic data are modes of
presentation. They enable users to grasp data by visual perception. The nature of
data, in these cases, is graphic.
Nature of Data (Contd…)
4. Enumerative Data: Most data in social sciences are enumerative in nature.
However, they are refined with the help of statistical techniques to make them
more meaningful. They are known as statistical data.
5. Descriptive Data: All qualitative data in social sciences can be descriptive in
nature. These can be in the form of definitive statements. However, if necessary,
numerical values can be assigned to descriptive statements, which may be
reduced to numerical data.
Classification of Data
• Data classification is the process of organising data into categories that make
it easy to retrieve, sort and store for future use.
• The classification of data makes it easy for the user to retrieve it.
• Data classification is important for data security and for fulfilling different
types of business or personal objectives.
Purpose of Data Classification
• Systematic classification of data helps organisations to manipulate, track and
analyse individual pieces of data.
• Data professionals have a specific goal when categorising data.
• The goal affects the approach they take and the classification levels they use.
Why is Data Classification Important?
• Data classification is used to categorise structured data, but it is especially important for getting the most out of unstructured data.
• Data categorisation also helps to identify duplicate copies of data.
• Eliminating redundant data contributes to efficient use of storage and
maximises data security measures.
Types of Data Classification
Three types of data classification:
• Structured Data
• Semi-structured Data
• Unstructured Data
Structured Data
• Data that has a pre-defined structure, is well organized, and can be categorized as quantitative data is defined as structured data.
• Because of this pre-defined structure, the data can be organized into tables of columns and rows, just like in spreadsheets.
• When the data has relations and cannot be stored in spreadsheets due to its large size, structured data is stored in relational database tables.
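To make the idea concrete, here is a minimal sketch of structured data using Python's built-in sqlite3 module (the sales table and its columns are illustrative, not part of the syllabus): every record occupies fixed, typed fields, so querying is efficient.

```python
import sqlite3

# An in-memory relational database for illustration.
conn = sqlite3.connect(":memory:")

# A pre-defined schema: every row has the same, explicitly typed fields.
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, quantity INTEGER, price REAL)"
)

# Each record resides in fixed fields within the table.
conn.executemany(
    "INSERT INTO sales (product, quantity, price) VALUES (?, ?, ?)",
    [("pen", 10, 5.0), ("notebook", 3, 40.0), ("pen", 7, 5.0)],
)

# Because data elements are addressable, analysis such as grouping is efficient.
for row in conn.execute("SELECT product, SUM(quantity) FROM sales GROUP BY product"):
    print(row)  # ('notebook', 3) then ('pen', 17)
```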
Characteristics of Structured Data
• Data conforms to a data model and has an easily identifiable structure.
• Data is stored in the form of rows and columns.
• Data is well organised, so the definition, format, and meaning of the data are explicitly known.
• Data resides in fixed fields within a record or file.
• Similar entities are grouped together to form relations or classes.
• Entities in the same group have the same attributes.
• Data elements are addressable, so efficient to analyse and process.
Sources of Structured Data
• SQL Databases
• Spreadsheets such as Excel
• OLTP System
• Online forms
• Sensors such as GPS
• Network and Web server logs
• Medical devices
Advantages of Structured Data
• Structured data has a well-defined structure that helps in easy storage and access of
data.
• Data mining is easy, i.e., knowledge can be easily extracted from data.
• Operations such as updating and deleting are easy due to the well-structured form of the data.
• Business Intelligence operations such as data warehousing can be easily undertaken.
• Easily scalable in case there is an increment of data.
• Ensuring security to data is easy.
Unstructured Data
• Unstructured data is typically categorized as qualitative rather than
quantitative.
• It doesn’t have a pre-defined structure or specific format.
• Data in this category includes audio, video, images, and text file contents, which have properties that make them hard to prepare for analysis and which cannot be stored in relational database tables.
• So, such data is stored in its raw format, and analysis is done by applying image processing, natural language processing, and machine learning.
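As a small illustration (a sketch that assumes plain text as the unstructured input), even a basic analysis of raw text means imposing structure on it first:

```python
from collections import Counter
import re

# Raw, unstructured text: no rows, columns, or pre-defined fields.
raw_text = "Big data is big. Data analytics turns raw data into insights."

# A tiny natural-language-processing step: tokenize into lowercase words.
words = re.findall(r"[a-z]+", raw_text.lower())

# Impose structure (word -> frequency) so the content can be analysed.
print(Counter(words).most_common(3))  # [('data', 3), ('big', 2), ...]
```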
Characteristics of Unstructured Data
• Data neither conforms to a data model nor has any structure.
• Data cannot be stored in the form of rows and columns as in databases.
• Data does not follow any semantics or rules.
• Data lacks a particular format or sequence.
• Data has no easily identifiable structure.
• Due to the lack of identifiable structure, it cannot be used by computer programs easily.
Sources of Unstructured Data
• Web pages
• Images (JPEG, GIF, PNG, etc)
• Videos
• Reports
• Word documents
• Surveys
Advantages of Unstructured Data
• It supports data that lacks a proper format or sequence.
• The data is not constrained by a fixed schema.
• Very flexible due to the absence of schema.
• Data is portable.
• It is very scalable.
• It can deal easily with the heterogeneity of sources.
• These types of data have a variety of business intelligence and analytics
applications.
Disadvantages of Unstructured Data
• It is difficult to store and manage unstructured data due to lack of schema and
structure.
• Ensuring security to data is a difficult task.
• Indexing the data is difficult and error-prone due to its unclear structure and lack of pre-defined attributes, so search results are not very accurate.
Semi-Structured Data
• Semi-structured data contains elements of both structured and unstructured data: its schema is not fixed as in structured data, but with the help of metadata (which enables users to define some partial structure or hierarchy) it can be organized to some extent, so it is not as unorganized as unstructured data.
• Metadata includes tags and other markers, just like in JSON, XML, or CSV, which separate the elements and enforce the hierarchy; however, the size of each element can vary, and order is not important.
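For instance, here is a minimal sketch of semi-structured records in JSON (the field names are hypothetical): tags mark the hierarchy, but entities in the same group need not share the same attributes.

```python
import json

# Two records in the same collection: attributes and their sizes differ.
records = json.dumps([
    {"name": "Asha", "email": "asha@example.com"},
    {"name": "Ravi", "phones": ["+91-11111", "+91-22222"], "city": "Delhi"},
])

for person in json.loads(records):
    # Access fields defensively, because the schema is not fixed.
    print(person["name"], person.get("email", "no email on record"))
```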
Characteristics of Semi-structured Data
• Data does not conform to a data model but has some structure.
• Data cannot be stored in the form of rows and columns as in databases.
• Similar entities are grouped together and organised in a hierarchy.
• Entities in the same group may or may not have the same attributes or properties.
• Size and type of the same attribute in a group may differ.
• Due to the lack of a well-defined structure, it cannot be used by computer programs easily.
Sources of Semi-structured Data
• Emails
• XML and other markup languages
• TCP/IP packets
• Zipped files
• Integration of data from different sources
• Web pages
Advantages of Semi-structured Data
• The data is not constrained by a fixed schema.
• Flexible, i.e., Schema can be easily changed.
• Data is portable.
• It is possible to view structured data as semi-structured data.
• It supports users who cannot express their needs in SQL.
• It can deal easily with the heterogeneity of sources.
Disadvantages of Semi-structured Data
• Lack of a fixed schema makes it difficult to store the data.
• Interpreting the relationship between data is difficult as there is no separation
of the schema and the data.
• Queries are less efficient as compared to structured data.
Difference Between Structured, Semi-structured, and
Unstructured Data
Data Structure
• Structured: The information and data have a predefined organization.
• Semi-structured: The contained data and information have organizational properties, but these differ from predefined structured data.
• Unstructured: There is no predefined organization for the available data and information in the system or database.

Technology Used
• Structured: Works based on relational database tables.
• Semi-structured: Works based on the Resource Description Framework (RDF) or XML.
• Unstructured: Works based on binary data and the available characters.

Flexibility
• Structured: The data depends a lot on the schema, so there is less flexibility.
• Semi-structured: Comparatively less flexible than unstructured data but far more flexible than structured data.
• Unstructured: Schema is totally absent, so it is the most flexible of all.

Management of Transactions
• Structured: Mature transaction management, with various concurrency techniques.
• Semi-structured: Adapts transactions from the DBMS; not of a mature type.
• Unstructured: No transaction management or concurrency.

Management of Versions
• Structured: It is possible to version over tables, rows, and tuples.
• Semi-structured: It is possible to version over graphs or tuples.
• Unstructured: It is possible to version the data only as a whole.

Scalability
• Structured: Scaling a database schema is very difficult, so a structured database offers lower scalability.
• Semi-structured: Scaling semi-structured data is comparatively much more feasible.
• Unstructured: Unstructured data is the most scalable in nature.

Performance of Queries
• Structured: Structured queries make complex joins possible.
• Semi-structured: Queries over various (anonymous) nodes are definitely possible.
• Unstructured: Unstructured data only allows textual types of queries.
Characteristics of Data
Data quality has several characteristics, such as:
1. Accuracy: The data must conform to actual, real-world scenarios and reflect real-world objects and events. Analysts should use verifiable sources to confirm the measure of accuracy, which is determined by how close the values are to the verified correct information sources.
2. Completeness : Completeness measures the data's ability to deliver all the
mandatory values that are available successfully.
Characteristics of Data(Contd…)
3. Consistency: Data consistency describes the data’s uniformity as it moves
across applications and networks and when it comes from multiple sources.
Consistency also means that the same datasets stored in different locations
should be the same and not conflict. Note that consistent data can still be wrong.
4. Timeliness: Timely data is information that is readily available whenever it’s
needed. This dimension also covers keeping the data current; data should
undergo real-time updates to ensure that it is always available and accessible.
Characteristics of Data(Contd…)
5. Uniqueness: Uniqueness means that no duplications or redundant information are
overlapping across all the datasets. No record in the dataset exists multiple times.
Analysts use data cleansing and deduplication to help address a low uniqueness
score.
6. Validity: Data must be collected according to the organization’s defined business
rules and parameters. The information should also conform to the correct, accepted
formats, and all dataset values should fall within the proper range.
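As an illustration, here is a minimal sketch (the records and business rules are hypothetical) of measuring three of these dimensions, namely completeness, uniqueness, and validity, over a small dataset:

```python
# Hypothetical customer records; the 'age' range rule is illustrative.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},             # incomplete: missing email
    {"id": 1, "email": "a@example.com", "age": 34},  # duplicate of record 1
    {"id": 3, "email": "c@example.com", "age": 210}, # invalid: impossible age
]

# Completeness: share of records with all mandatory values present.
complete = [r for r in records if r["email"] is not None]
print("completeness:", len(complete) / len(records))

# Uniqueness: no record should exist multiple times (here, keyed by id).
unique_ids = {r["id"] for r in records}
print("uniqueness:", len(unique_ids) / len(records))

# Validity: values must fall within the proper, business-defined range.
valid = [r for r in records if 0 <= r["age"] <= 120]
print("validity:", len(valid) / len(records))
```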
Introduction to Big Data Platform
Big Data
• Big Data is a collection of large datasets that cannot be processed using traditional
computing techniques.
• It is not a single technique or a tool, rather it has become a complete subject, which
involves various tools, techniques and frameworks.
• Big Data is the technical term used in reference to the vast quantity of heterogeneous
datasets.
• Examples of big data include cell phone details, social media content, health records, transactional data, web searches, financial documents, and weather information.
Introduction to Big Data Platform (Contd…)
• Data which is very large in size is called Big Data.
• Normally we work on data of size MB (Word docs, Excel files) or at most GB (movies, code), but data measured in petabytes is called Big Data.
• The size of large data can range from several terabytes (a terabyte is about 1 trillion bytes) to petabytes and even exabytes.
• It is the concept of gathering useful insights from such voluminous amounts of
structured, semi-structured and unstructured data that can be used for effective
decision-making in the business environment.
Sources of Big Data
These data come from many sources like:
• Social networking sites: Facebook, Google, and LinkedIn all generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
• E-commerce sites: Sites like Amazon and Flipkart generate huge amounts of logs from which users' buying trends can be traced.
• Weather stations: All the weather stations and satellites give very large volumes of data, which are stored and manipulated to forecast the weather.
Sources of Big Data (Contd…)
• Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly, and for this they store the data of their millions of users.
• Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
Applications of Big Data
• Banking and Securities
• Communications, Media and Entertainment
• Healthcare Providers
• Education
• Manufacturing and Natural Resources
• Government
• Insurance
• Retail and Wholesale trade
• Transportation
• Energy and Utilities
Uses of Big Data
• Location Tracking
• Fraud Detection and Handling
• Advertising
• Entertainment and Media
Real World Big Data Examples
• Discovering consumer shopping habits
• Personalised marketing
• Fuel optimisation tools for the transportation industry
• Monitoring health conditions through data from wearables
• Live road mapping for autonomous vehicles
• Streamlined media streaming
• Predictive inventory ordering
Issues with Big Data
There are three issues with Big Data, and they are as follows:
• Low Quality and Inaccurate Data: Low-quality or inaccurate data may lead to inaccurate results or predictions, which does nothing apart from wasting the time and effort of the individuals involved.
• Processing Large Data Sets: Because of the sheer amount of data, no traditional data management tool or software can process it directly; these datasets are usually terabytes in size, which makes them difficult to process.
Issues with Big Data (Contd…)
• Integrating Data from a Variety of Sources: Data comes from various
types of sources like social media, different websites, captured images/
videos, customer logs, reports created by individuals, newspapers, e-mails,
etc. Collecting and integrating various data which are of different types is
basically a challenging task.
Big Data Characteristics
1. Volume (Huge Amount of Data)
2. Veracity (Inconsistencies & uncertainty in data)
3. Variety (Different formats of data from various sources)
4. Value (Extract Useful Data)
5. Velocity (High speed of accumulation of data)
Big Data Characteristics (Contd…)
1. Volume
• Big Data refers to the vast volumes of data generated daily from many sources, such as business processes, social media platforms, networks, human interactions, and many more.
• Facebook, for example, generates approximately a billion messages and more than 350 million new posts each day, and the "Like" button is recorded about 4.5 billion times a day. Big data technologies can handle such large amounts of data.
Big Data Characteristics (Contd…)
2. Variety
• Big Data can be structured, unstructured, or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
Big Data Characteristics (Contd…)
3. Veracity
• Veracity means how reliable the data is. There are many ways to filter or translate the data; veracity is about being able to handle and manage data efficiently.
4. Value
• Value is an essential characteristic of big data. What matters is not simply the data that we process or store, but the valuable and reliable data that we store, process, and analyze.
Big Data Characteristics (Contd…)
5. Velocity
• Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which the data is created in real time.
• Big data velocity deals with the speed at which data flows in from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
What is a big data platform?
• A big data platform is an integrated computing solution that combines
numerous software systems, tools, and hardware for big data management.
• Big data Platform workflow is divided into the following stages:
1. Data Collection
2. Data Storage
3. Data Processing
4. Data Analytics
What is a big data platform? (Contd…)
5. Data Management and Warehousing
6. Data Catalog and Metadata Management
7. Data Observability
8. Data Intelligence
Characteristics of a Big Data Platform
1. Ability to accommodate new applications and tools depending on the evolving business needs.
2. Support for several data formats.
3. Ability to accommodate large volumes of streaming data.
4. A wide variety of conversion tools to transform data to different preferred formats.
5. Capacity to accommodate data at any speed.
6. The ability for quick deployment.
7. Tools for data analysis and reporting requirements.
Different Types of Big Data Platforms and Tools
1. Hadoop Delta Lake Migration Platform
2. Data Catalog and Data Observability Platform
3. Data Ingestion and Integration Platform
4. Big Data and IoT Analytics Platform
5. Data Discovery and Management Platform
6. Cloud ETL Data Transformation Platform
Need/Importance of Big Data
• Reduction in cost
• Time reduction
• New product development and optimized offers
• Better-informed decision making
Challenges of Big Data
• Rapid data growth: Data growing at such a high velocity makes it hard to extract insights from it. There is no 100% efficient way to filter out the relevant data.
• Storage: The generation of such a massive amount of data needs space for
storage, and organizations face challenges to handle such extensive data
without suitable tools and technologies.
Challenges of Big Data (Contd…)
• Unreliable data: It cannot be guaranteed that the big data collected and
analyzed are totally (100%) accurate.
• Data security: Firms and organizations storing such massive data (of users)
can be a target of cybercriminals, and there is a risk of data getting stolen.
Hence, encrypting such data is also a challenge for firms and organizations.
Data analytics
• Data analytics is the process of examining datasets to find trends and draw
conclusions about the information they contain.
• Data analytics technologies and techniques are widely used in commercial
industries to enable organisations to make more-informed business decisions.
• Scientists and researchers also use analytics tools to verify or disprove
scientific models, theories and hypotheses.
Need of Data Analytics
Data analytics is important for many reasons, including:
• Informed decision-making: Data analytics helps businesses make better
decisions by providing a holistic view of their performance and identifying
opportunities for improvement.
• Improved customer experience: Data analytics can help businesses understand
their customers' preferences and needs, which can lead to personalized
experiences and better customer outcomes.
• Fraud detection and security: Data analytics can help businesses identify
suspicious activity and minimize risk.
• Healthcare: Data analytics can help healthcare professionals make evidence-
based decisions about patient care, disease diagnosis, and treatment optimization.
Analytic Scalability
• Analytic scalability is the ability to use data to understand and solve a large
variety of problems. And because problems come in many forms, analytics
must be flexible enough to address problems in different ways. This might
include the use of statistical tools and forecasting.
Evolution of Analytic Scalability
• The amount of data organizations process continues to increase.
• For instance, it is estimated that Walmart collects more than 2.5 petabytes of data every hour from its customer transactions.
• A petabyte is the equivalent of about 20 million filing cabinets' worth of text.
• An exabyte is 1,000 times that amount, or one billion gigabytes.
Data Analytic Process
• The collection, transformation, and organization of data to draw conclusions, make predictions for the future, and make informed data-driven decisions is called data analysis.
• The professional who handles data analysis is called a data analyst.
Data Analytic Process (Contd…)
Six steps of data analytic process:
1. Define the Problem or Research Question
2. Collect Data
3. Data Cleaning
4. Analyzing the Data
5. Data Visualization
6. Presenting Data
Data Analytic Process (Contd…)
1. Define the Problem or Research Question
• The data analyst is given a problem or business task.
• The analyst has to understand the task and the stakeholders' expectations for the solution.
• A stakeholder is a person who has invested money and resources in a project.
• Questions to ask yourself for the Ask phase are: 1. What are the problems that are being
mentioned by my stakeholders? (The analyst must find the root cause of the problem to fully
understand the problem) 2. What are their expectations for the solutions? (The analyst must
be able to ask different questions to find the right solution to their problem.)
Data Analytic Process (Contd…)
2. Collect Data
• The data has to be collected from various sources like Internal or External Sources.
• Internal data is the data available in the organization that you work for while external
data is the data available in sources other than your organization.
• The data that is collected by an individual from their own resources is called First-
Party Data.
• The data that is collected and sold is called Second-Party Data.
• Data that is collected from outside sources is called Third-Party Data.
• The common sources from where the data is collected are Interviews, Surveys and
Questionnaires.
• The collected data can be stored in a spreadsheet or SQL database.
• The best tools to store the data are MS Excel or Google Sheets in the case of spreadsheets, and there are many databases, such as Oracle or Microsoft SQL Server, to store the data.
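As a small sketch (the file name and column are hypothetical), survey responses collected into a spreadsheet-style CSV file can be loaded and tallied with Python's built-in csv module:

```python
import csv

# Hypothetical survey export: one row per respondent.
with open("survey_responses.csv", newline="") as f:
    rows = list(csv.DictReader(f))  # each row becomes a dict keyed by column name

# Count responses to a hypothetical 'satisfaction' question.
counts = {}
for row in rows:
    answer = row["satisfaction"]
    counts[answer] = counts.get(answer, 0) + 1
print(counts)
```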
Data Analytic Process (Contd…)
3. Data Cleaning (Clean and Process Data)
• Clean data means data that is free from misspellings and redundancies.
• There are different functions provided by SQL and Excel to clean the data and formatted
data helps in finding trends and solutions.
• The most important part of the Process phase is to check whether your data is biased or not.
• Bias is an act of favoring a particular group/community while ignoring the rest. Biasing is a
big no-no as it might affect the overall data analysis.
• The data analyst must make sure to include every group while the data is being collected.
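For example, here is a minimal cleaning sketch (the values and the misspelling table are illustrative) that removes redundancies and normalizes inconsistent entries:

```python
# Raw responses with misspellings, inconsistent case, and duplicates.
raw = ["Delhi", "delhi ", "Dehli", "Mumbai", "mumbai", "Delhi"]

# Normalize case/whitespace, then map known misspellings to the correct form.
corrections = {"dehli": "delhi"}  # illustrative misspelling table
cleaned = [corrections.get(v.strip().lower(), v.strip().lower()) for v in raw]

# Remove redundancies while preserving the original order.
deduped = list(dict.fromkeys(cleaned))
print(deduped)  # ['delhi', 'mumbai']
```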
Data Analytic Process (Contd…)
4. Analyzing the Data
• The cleaned data is used for analysis and for identifying trends.
• The analyst also performs calculations and combines data for better results.
• The tools used for performing calculations are Excel or SQL.
• Using Excel, we can create pivot tables and perform calculations, while SQL can create temporary tables to perform calculations.
• Programming languages such as R and Python are another way of solving data analysis problems.
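As an illustration, here is a pivot-table-style aggregation in plain Python (the sales records are made up):

```python
from collections import defaultdict

# Hypothetical cleaned records: (region, product, revenue).
sales = [
    ("North", "pen", 120.0),
    ("North", "notebook", 300.0),
    ("South", "pen", 90.0),
    ("South", "pen", 60.0),
]

# Pivot: total revenue per (region, product), like a spreadsheet pivot table.
pivot = defaultdict(float)
for region, product, revenue in sales:
    pivot[(region, product)] += revenue

for (region, product), total in sorted(pivot.items()):
    print(f"{region:<6} {product:<9} {total:8.2f}")
```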
Data Analytic Process (Contd…)
5. Data Visualization
• The transformed data now has to be turned into a visual (a chart or graph).
• The reason for making data visualizations is that much of the audience, especially stakeholders, may be non-technical.
• Visualizations are made for a simple understanding of complex data.
• Tableau and Looker are the two popular tools used for data visualizations.
• Tableau is a simple drag and drop tool that helps in creating visualizations.
• Looker is a data viz tool that directly connects to the database and creates visualizations.
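Charts can also be produced in code; here is a minimal sketch using Python's matplotlib library (assuming matplotlib is installed; the figures are illustrative):

```python
import matplotlib.pyplot as plt

# Hypothetical aggregated results to present to stakeholders.
products = ["pen", "notebook", "stapler"]
revenue = [270.0, 300.0, 150.0]

plt.bar(products, revenue)  # a simple bar chart of revenue by product
plt.xlabel("Product")
plt.ylabel("Revenue")
plt.title("Revenue by product (illustrative data)")
plt.savefig("revenue_by_product.png")  # save the chart for a report or slide
```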
Data Analytic Process (Contd…)
6. Presenting the Data
• Presenting the data involves transforming raw information into a format that
is easily comprehensible and meaningful for various stakeholders.
• This process encompasses the creation of visual representations, such as
charts, graphs, and tables, to effectively communicate patterns, trends, and
insights gleaned from the data analysis.
• The goal is to facilitate a clear understanding of complex information, making
it accessible to both technical and non-technical audiences.
Types/Levels/Methods of Data Analytics
1. Descriptive Data Analysis
• Descriptive analytics looks at past performance, mining historical data to understand the causes of success or failure in the past.
• Almost all management reporting such as sales, marketing, operations, and finance uses this
type of analysis.
• Descriptive analytics is used when the organisation has a large dataset on past events or
historical events.
“What happened?”
Or
“What was the trend?”
Types/Levels/Methods of Data Analytics (Contd…)
2. Diagnostic Analytics
It is the process of using data to understand the underlying reasons behind past events, trends, and outcomes, to answer:
“Why did this happen?”
3. Predictive Analytics
It is the process of applying statistical and machine learning techniques to historical data to make predictions, to answer:
“What might happen in the future?”
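As a tiny illustration of predictive analytics (the monthly sales figures are made up), a least-squares trend line can be fitted to historical data and extrapolated to a future period:

```python
# Hypothetical historical sales: (month index, units sold).
months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 125, 130, 145, 150]

# Ordinary least squares for y = a + b*x, computed from first principles.
n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales))
     / sum((x - mean_x) ** 2 for x in months))
a = mean_y - b * mean_x

# "What might happen in the future?": predict month 7.
print(f"forecast for month 7: {a + b * 7:.1f} units")
```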
Types/Levels/Methods of Data Analytics (Contd…)
4. Prescriptive Analytics
It is the process of using data to recommend actions in response to a given forecast, in order to optimize desired outcomes, to answer:
“What should we do?”
Data Analytic Tools
Data analytics offers many types of tools:
1. Tableau Public
2. OpenRefine
3. KNIME
4. RapidMiner
5. Google Fusion Tables
6. NodeXL
7. Wolfram Alpha
8. Google Search Operators
9. Solver
10. Dataiku DSS
Data Analytic Tools (Contd…)
1. Tableau Public
• Tableau, one of the top 10 Data Analytics tools, is a simple tool which offers
data visualization.
• With Tableau’s visuals, you can investigate a hypothesis, explore the data, and
cross-check your insights.
Data Analytic Tools (Contd…)
• Uses of Tableau Public
1. You can publish interactive data visualizations to the web for free.
2. No programming skills required.
3. Visualizations published to Tableau Public can be embedded into blogs and web pages and shared through email or social media. The shared content can be made available for download.
• Limitations of Tableau Public
1. Data size limitation
Data Analytic Tools (Contd…)
2. OpenRefine
Formerly known as Google Refine, OpenRefine is data cleaning software that helps you clean up data for analysis. It operates on rows of data which have cells under columns, quite like relational database tables.
• Uses of OpenRefine
1. Cleaning messy data
2. Transformation of data
Data Analytic Tools (Contd…)
Limitations of OpenRefine
• Refine does not work very well with big data.
3. KNIME
KNIME, ranked among the top Data Analytics tools, helps you to manipulate, analyze, and model data through visual programming. It is used to integrate
various components for data mining and machine learning via its modular data
pipelining concept.
Data Analytic Tools (Contd…)
• Uses of KNIME
1. Rather than writing blocks of code, you just have to drag and drop connection points between activities.
2. This data analysis tool supports various programming languages.
3. In fact, analysis tools like these can be extended to run text mining, python,
and R.
• Limitation of KNIME
1. Poor data visualization
Data Analytic Tools (Contd…)
4. RapidMiner
• RapidMiner provides machine learning procedures and data mining including
data visualization, processing, statistical modeling, deployment, evaluation,
and predictive analytics.
• RapidMiner, counted among the top 10 Data Analytics tools, is written in Java and is fast gaining acceptance.
Uses of RapidMiner
• It provides an integrated environment for business analytics, predictive
analysis, text mining, data mining, and machine learning.
• Along with commercial and business applications, RapidMiner is also used
for application development, training, education, and research.
Data Analytic Tools (Contd…)
• Limitations of RapidMiner
1. RapidMiner has size constraints with respect to the number of rows.
2. For RapidMiner, you need more hardware resources.
5. Google Fusion Tables
An incredible tool for data analysis, mapping, and large-dataset visualization, Google Fusion Tables earned its place on business analytics tool lists. Ranked among the top 10 Data Analytics tools, it gained wide popularity (note that Google discontinued Fusion Tables in 2019).
Data Analytic Tools (Contd…)
• Uses of Google Fusion Tables
1. Visualize bigger table data online.
2. Filter and summarize across hundreds of thousands of rows.
• Limitations of Google Fusion Tables
1. Only the first 100,000 rows of data in a table are included in query results or
mapped.
2. The total size of the data sent in one API call cannot be more than 1MB.
Data Analytic Tools (Contd…)
6. NodeXL
NodeXL is a free and open-source network analysis and visualization software.
Ranked among the top 10 Data Analytics tools, it is one of the best statistical
tools for data analysis which includes advanced network metrics, access to
social media network data importers, and automation.
• Uses of NodeXL
This is one of the best data analysis tools in Excel and helps in:
1. Data Import
2. Graph Visualization
3. Graph Analysis
4. Data Representation
Data Analytic Tools (Contd…)
Limitations of NodeXL
1. Multiple seeding terms are required for a particular problem.
2. Need to run the data extractions at slightly different times.
7. Wolfram Alpha
Wolfram Alpha, one of the top 10 Data Analytics tools, is a computational knowledge engine founded by Stephen Wolfram. With Wolfram Alpha, you get answers to factual queries directly, computed from externally sourced data, instead of a list of documents or web pages.
Data Analytic Tools (Contd…)
• Uses of Wolfram Alpha
1. Provides detailed responses to technical searches and solves calculus
problems.
2. Helps business users with information charts and graphs, and helps in creating topic overviews and high-level pricing history.
• Limitations of Wolfram Alpha
1. Wolfram Alpha can only deal with publicly known numbers and facts, not with viewpoints.
2. It limits the computation time for each query.
Data Analytic Tools (Contd…)
8. Google Search Operators
It is a powerful resource that helps you filter Google results instantly to get the
most relevant and useful information.
• Uses of Google Search Operators
1. Fast filtering of Google results.
2. As a powerful data analysis resource, Google can help you discover new information or conduct market research.
Data Analytic Tools (Contd…)
9. Solver
The Solver add-in is a Microsoft Office Excel add-in program that is available when you install Microsoft Excel or Office. Ranked among the best-known data analytics tools, it is a linear programming and optimization tool in Excel. It is an advanced optimization tool that helps in quick problem solving.
Data Analytic Tools (Contd…)
• Uses of Solver
It uses a variety of methods, from nonlinear optimization and linear programming to genetic algorithms, to find solutions.
• Limitations of Solver
1. Poor scaling is one of the areas where Excel Solver lacks.
2. It can affect solution time and quality.
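For comparison, the same kind of linear programming problem that Solver handles can be sketched in Python with SciPy (assuming SciPy is installed; the product-mix numbers are made up):

```python
from scipy.optimize import linprog

# Hypothetical product mix: maximize profit 3x + 5y subject to resource limits.
# linprog minimizes, so the objective coefficients are negated.
c = [-3, -5]
A_ub = [[1, 2],   # machine hours:  x + 2y <= 14
        [3, 1]]   # labour hours:  3x +  y <= 15
b_ub = [14, 15]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)  # optimal quantities and the maximized profit
```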
Data Analytic Tools (Contd…)
10. Dataiku DSS
Ranked among the top 10 Data Analytic tools, Dataiku is a collaborative data
science software platform that helps the team build, prototype, explore, and
deliver their own data products more efficiently.
• Uses of Dataiku DSS
• It provides an interactive visual interface.
• This data analytics tool lets you draft data preparation and modeling in seconds.
Data Analytic Tools (Contd…)
• Limitation of Dataiku DSS
1. Limited visualization capabilities
2. UI hurdles: Reloading of code/datasets
3. Inability to easily compile entire code into a single document/notebook
Analysis vs Reporting
• Analysis involves interpreting data, whereas reporting involves presenting factual, accurate data.
• Analysis answers why something is happening based on the data, whereas
reporting tells what’s happening.
• Analysis delivers recommendations, but reporting is more about organizing
and summarizing data.
Analysis vs Reporting (Contd…)
• Analytics is the method of examining and analyzing summarized data to make business decisions, whereas reporting is an action that gathers all the needed information and data and puts it together in an organized way.
• Questioning the data, understanding it, investigating it, and presenting it to the end users are all part of analytics; identifying business events, gathering the required information, and organizing, summarizing, and presenting existing data are all part of reporting.
• The purpose of analytics is to draw conclusions based on data; the purpose of reporting is to organize the data into meaningful information.
• Analytics is used by data analysts, scientists, and business people to make effective decisions; reporting is provided to the appropriate business leaders so they can perform effectively and efficiently within a firm.
Modern Data Analytic Tools
1. Apache Hadoop
2. KNIME
3. Open Refine
4. Orange
5. Splunk
6. Talend
7. Power BI
8. Tableau
9. RapidMiner
10. R-programming
11. Datawrapper
Modern Data Analytic Tools (Contd…)
1. Apache Hadoop
• Apache Hadoop is a Big Data analytics tool and a Java-based free software framework.
• It helps in the effective storage of huge amounts of data in a storage place known as a cluster.
• Hadoop's storage system, popularly known as the Hadoop Distributed File System (HDFS), splits large volumes of data and distributes them across the many nodes present in a cluster.
Modern Data Analytic Tools (Contd…)
2. KNIME
• KNIME Analytics Platform is open-source software for doing data science.
• The KNIME Analytics Platform is one of the leading open solutions for data-driven innovation.
• This tool helps in discovering the potential hidden in huge volumes of data; it can also mine for fresh insights or predict new futures.
Modern Data Analytic Tools (Contd…)
3. Open Refine
• The OpenRefine tool is one of the most efficient tools for working with messy, large volumes of data.
• It includes cleansing data and transforming that data from one format to another.
• It helps to explore large datasets easily.
Modern Data Analytic Tools (Contd…)
4. Orange
• Orange is a famous data visualisation tool and helps with data analysis for beginners as well as experts.
• It provides a clean, open-source platform.
5. Splunk
• It is a platform used to search, analyse, and visualise the machine-generated data
gathered from applications, websites, etc.
• Splunk has evolved products in various fields such as IT, Security, Analytics.
Modern Data Analytic Tools (Contd…)
6. Talend
• It is one of the most powerful data integration tools available in the market and
is developed in the Eclipse graphical development environment.
• This tool lets you easily manage all the steps involved in the process and aims
to deliver compliant, accessible and clean data for everyone.
Modern Data Analytic Tools (Contd…)
7. Power BI
• It is a Microsoft product used for business analytics.
• It provides interactive visualizations with self-service business intelligence
capabilities, where end users can create dashboards and reports by themselves,
without having to depend on anybody.
Modern Data Analytic Tools (Contd…)
8. Tableau
• It is a market-leading Business Intelligence tool used to analyse and visualise
data in an easy format.
• Tableau allows you to work on live datasets and spend more time on data analysis.
Modern Data Analytic Tools (Contd…)
9. RapidMiner
• The RapidMiner tool operates using visual programming and is capable of manipulating, analysing, and modelling data.
• RapidMiner makes data science teams more productive through an open-source platform for all their jobs, such as machine learning, data preparation, and model deployment.
Modern Data Analytic Tools (Contd…)
10. R-programming
• R is a free open-source software programming language and a software
environment for statistical computing and graphics.
• It is used by data miners for developing statistical software and data analysis.
• It has become a highly popular tool for Big Data in recent years.
Modern Data Analytic Tools (Contd…)
11. Datawrapper
• Datawrapper is an online data visualisation tool for making interactive charts.
• It accepts data files in CSV, PDF, or Excel format.
• Datawrapper generates visualisations in the form of bar charts, line charts, maps, etc.
Applications of Data Analytics
Data analytics finds applications across various industries and sectors,
transforming the way organizations operate and make decisions. Here are some
examples of how data analytics is applied in different domains:
1. Healthcare
2. Finance
3. E-commerce
4. Cyber security
Applications of Data Analytics (Contd…)
5. Supply Chain Management
6. Banking
7. Logistics
8. Retail
9. Manufacturing
10. Internet Searching
11. Risk Management
Applications of Data Analytics (Contd…)
1. Healthcare
• Data analytics is transforming the healthcare industry by enabling better patient care, disease prevention, and resource optimization. For example, hospitals can analyze patient data to identify high-risk individuals and provide personalized treatment plans.
• Data analytics can also help detect disease outbreaks, monitor the effectiveness
of treatments, and improve healthcare operations.
Applications of Data Analytics (Contd…)
2. Finance
• In the financial sector, data analytics plays a crucial role in fraud detection, risk
assessment, and investment strategies.
• Banks and financial institutions analyze large volumes of data to identify
suspicious transactions and optimize investment portfolios.
• Data analytics also enables personalized financial advice and the development
of creative financial products and services.
Applications of Data Analytics (Contd…)
3. E-commerce
• E-commerce platforms utilize data analytics to understand customer behavior,
personalize shopping experiences, and optimize marketing campaigns.
• By analyzing customer preferences, purchase history, and browsing patterns, e-
commerce companies can offer personalized product recommendations, target
specific customer segments, and improve customer satisfaction and retention.
Applications of Data Analytics (Contd…)
4. Cyber security
• Data analytics plays a vital role in cyber security by detecting and preventing
cyber threats and attacks.
• Security systems analyze network traffic, user behavior, and system logs to
identify anomalies and potential security breaches.
Applications of Data Analytics (Contd…)
5. Supply Chain Management
• Data analytics improves supply chain management by optimizing inventory
levels, reducing costs, and enhancing overall operational efficiency.
• Organizations can identify bottlenecks, forecast demand, and improve
logistics and distribution processes by analyzing supply chain data.
• Data analytics also enables better supplier management and enhances
transparency throughout the supply chain.
Applications of Data Analytics (Contd…)
6. Banking
• Banks use data analytics to gain insights into customer behavior, manage risks,
and personalize financial services.
• Banks can tailor their offerings, identify potential fraud, and assess credit
worthiness by analyzing transaction data, customer demographics, and credit
histories.
• Data analytics also helps banks to improve regulatory compliance.
Applications of Data Analytics (Contd…)
7. Logistics
• In the logistics industry, data analytics plays a crucial role in optimizing transportation
routes, managing fleet operations, and improving overall supply chain efficiency.
• Logistics companies can minimize costs, reduce delivery times, and enhance customer
satisfaction by analyzing data on routes, delivery times, and vehicle performance.
• Data analytics also enables better demand forecasting and inventory management.
Applications of Data Analytics (Contd…)
8. Retail
• Data analytics transforms the retail industry by providing insights into customer
preferences, optimizing pricing strategies, and improving inventory management.
• Retailers analyze sales data, customer feedback, and market trends to identify
popular products, personalize offers, and forecast demand.
• Data analytics also helps retailers enhance their marketing efforts, improve
customer loyalty, and optimize store layouts.
Applications of Data Analytics (Contd…)
9. Manufacturing
• Data analytics is transforming the manufacturing sector by enabling predictive maintenance, optimizing production processes, and improving product quality.
• Manufacturers can predict equipment failures, minimize downtime, and ensure
efficient operations by analyzing sensor data, machine performance, and
historical maintenance records.
• Data analytics also enables real-time monitoring of production lines, leading to
higher productivity and cost savings.
Applications of Data Analytics (Contd…)
10. Internet Searching
• Data analytics powers internet search engines, enabling users to find relevant information
quickly and accurately.
• Search engines analyze vast amounts of data, including web pages, user queries, and click-
through rates, to deliver the most relevant search results.
• Data analytics algorithms continuously learn and adapt to user behavior, providing accurate
and personalized search results.
Applications of Data Analytics (Contd…)
11. Risk Management
• Data analytics plays a crucial role in risk management across various industries,
including insurance, finance, and project management.
• Organizations can assess risks, develop strategies, and make informed decisions
by analyzing historical data, market trends, and external factors.
• Data analytics helps organizations identify potential risks and quantify their
impact.
Key Roles in a Data Analytics Project
There are certain key roles required for the complete and successful functioning of the data science team when executing analytics projects. The key roles are:
1. Business User
2. Project Sponsor
3. Project Manager
4. Business Intelligence Analyst
5. Database Administrator
6. Data Engineer
7. Data Scientist
Key Roles in a Data Analytics Project (Contd…)
1. Business User
• The business user is the one who understands the main domain area of the project and ultimately benefits from the results.
• This user advises and consults the team working on the project about the value of the results obtained and how the outputs will be used in operations.
• A business manager, line manager, or deep subject matter expert in the project domain fulfills this role.
Key Roles in a Data Analytics Project (Contd…)
2. Project Sponsor
• The project sponsor is the one responsible for initiating the project. The project sponsor provides the actual requirements for the project and presents the basic business issue.
• He generally provides the funds and measures the degree of value delivered by the final output of the team working on the project.
• This person introduces the prime concern and frames the desired output.
Key Roles in a Data Analytics Project (Contd…)
3. Project Manager
This person ensures that the key milestones and purpose of the project are met on time and with the expected quality.
4. Business Intelligence Analyst
• The Business Intelligence Analyst provides business domain expertise based on a detailed and deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting point of view.
• This person generally creates reports and knows about the data feeds and sources.
Key Roles in a Data Analytics Project (Contd…)
5. Database Administrator (DBA)
• The DBA facilitates and arranges the database environment to support the analytics needs of the team working on the project.
• His responsibilities may include providing permissions to key databases or tables and making sure that the appropriate security levels are in place for the data repositories.
Key Roles in a Data Analytics Project (Contd…)
6. Data Engineer
• The data engineer brings deep technical skills to assist with SQL queries for data management and data extraction, and provides support for data intake into the analytic sandbox.
• The data engineer works jointly with the data scientist to help shape data in the correct ways for analysis.
Key Roles in a Data Analytics Project (Contd…)
7. Data Scientist
• The data scientist provides subject matter expertise for analytical techniques and data modelling, applying the correct analytical techniques to the given business issues.
• He ensures that the overall analytical objectives are met.
• Data scientists outline and apply analytical methods to the data available to the project.
Data Analytics Lifecycle
In today’s digital-first world, data is of central importance. It undergoes various stages throughout its life: creation, testing, processing, consumption, and reuse. The Data Analytics Lifecycle maps out these stages for professionals working on data analytics projects. Primarily it has 6 phases.
• Phase 1: Data Discovery
• Phase 2: Data Preparation
• Phase 3: Model Planning
• Phase 4: Model Building
• Phase 5: Communication and Publication of Results
• Phase 6: Operationalize/Measuring of Effectiveness
Data Analytics Lifecycle (Contd…)
Phase 1: Data Discovery
• The data science team learns about and investigates the problem.
• Create context and gain understanding.
• Learn about the data sources that are needed and accessible to the project.
• The team produces an initial hypothesis, which can be later confirmed with
evidence.
Data Analytics Lifecycle (Contd…)
Phase 2: Data Preparation
• The team investigates the possibilities for pre-processing, analysing, and preparing the data before analysis and modelling.
• An analytic sandbox is required. The team extracts, loads, and transforms data to bring information into the data sandbox.
• Data preparation tasks can be repeated and are not performed in a predetermined sequence.
• Some of the tools commonly used for this process include Hadoop, OpenRefine, etc.
Data Analytics Lifecycle (Contd…)
Phase 3: Model Planning
• The team studies the data to discover the connections between variables. It then selects the most significant variables as well as the most effective models.
• In this phase, the data science team creates datasets that can be used for training, testing, and production goals.
• The team builds and implements models based on the work completed in the model planning phase.
• Some of the tools commonly used for this stage are MATLAB and STATISTICA.
Data Analytics Lifecycle (Contd…)
Phase 4: Model Building
• The team creates datasets for training, testing as well as production use.
• The team is also evaluating whether its current tools are sufficient to run the
models or if they require an even more robust environment to run models.
• Commercial tools: MATLAB, STATISTICA.
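To illustrate the dataset creation step, here is a minimal sketch (pure Python, with made-up records) of splitting data into training and testing sets:

```python
import random

# Hypothetical modelling dataset: (feature, label) pairs.
data = [(x, 2 * x + 1) for x in range(100)]

# Shuffle reproducibly, then hold out 20% of the records for testing.
random.seed(42)
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

print(len(train), "training records;", len(test), "test records")
```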
Data Analytics Lifecycle (Contd…)
Phase 5: Communication Results
• After executing the model, team members will need to evaluate the outcomes of the
model to establish criteria for the success or failure of the model.
• The team considers how best to present the findings and outcomes to the various members of the team and other stakeholders, taking into account caveats and assumptions.
• The team should determine the most important findings, quantify their value to the
business and create a narrative to present findings and summarize them to all
stakeholders.
Data Analytics Lifecycle (Contd…)
Phase 6: Operationalize
• The team distributes the benefits of the project to a wider audience. It sets up a pilot
project that will deploy the work in a controlled manner prior to expanding the project
to the entire enterprise of users.
• This technique allows the team to gain insight into the performance and constraints
related to the model within a production setting at a small scale and then make
necessary adjustments before full deployment.
• The team produces the final reports, presentations, and code.
• Open-source or free tools such as WEKA, SQL, and MADlib are used.
Need of Data Analytics Life Cycle
• The Data Analytics Lifecycle outlines how data is created, gathered, processed,
used, and analyzed to meet corporate objectives.
• It provides a structured method of handling data so that it may be transformed
into knowledge that can be applied to achieve organizational and project
objectives.
• The process offers the guidance and techniques needed to extract information
from the data and move forward to achieve corporate objectives.
Need of Data Analytics Life Cycle (Contd…)
• Data analysts use the circular nature of the lifecycle to move forward or backward through the data analytics process.
• They can choose whether to continue with their current research and
conduct a fresh analysis considering the recently acquired insights. Their
progress is guided by the Data Analytics lifecycle.

  • 18.
    Nature of Data(Contd…) 4. Enumerative Data: Most data in social sciences are enumerative in nature. However, they are refined with the help of statistical techniques to make them more meaningful. They are known as statistical data. 5. Descriptive Data: All qualitative data in social sciences can be descriptive in nature. These can be in the form of definitive statements. However, if necessary, numerical values can be assigned to descriptive statements, which may be reduced to numerical data.
  • 19.
    Classification of Data •Data classification is the process of organising data into categories that make it easy to retrieve, sort and store for future use. • The classification of data makes it easy for the user to retrieve it. • Data classification is important for data security and for fulfilling different types of business or personal objectives.
  • 20.
    Purpose of DataClassification • Systematic classification of data helps organisations to manipulate, track and analyse individual pieces of data. • Data professionals have a specific goal when categorising data. • The goal affects the approach they take and classification levels, they use.
  • 21.
    Why is DataClassification • Data classification is used to categorise structured data, but it especially important for getting the most out of unstructured data. • Data categorisation also helps to identify duplicate copies of data. • Eliminating redundant data contributes to efficient use of storage and maximises data security measures.
  • 22.
    Types of DataClassification Three types of data classification: • Structured Data • Semi-structured Data • Unstructured Data
  • 23.
    Structured Data • Datahaving a pre-defined structure which can also be categorized as quantitative data and is well-organized defined as Structured Data. • Because of having a pre-defined structure-property, data can be organized into tables — columns and rows just like in spreadsheets. • Most of the time when data is having relations and can’t store in spreadsheets due to the large size in this case structured data stored in relational databases tables.
  • 24.
    Characteristics of StructuredData • Data conforms to a data model and has an easily identifiable structure. • Data is stored in the form of rows and columns. • Data is well-organised so, definition, format and meaning of data is explicitly known. • Data resides in fixed fields within a record or file. • Similar entities are grouped together to form relations or classes. • Entities in the same group have same attribute. • Data elements are addressable, so efficient to analyse and process.
  • 25.
    Sources of StructuredData • SQL Databases • Spreadsheets such as Excel • OLTP System • Online forms • Sensors such as GPS • Network and Web server logs • Medical devices
  • 26.
    Advantages of StructuredData • Structured data has a well-defined structure that helps in easy storage and access of data. • Data mining is easy, i.e., knowledge can be easily extracted from data. • Operations such as updating and deleting is easy due to well-structured from of data. • Business Intelligence operations such as data warehousing can be easily undertaken. • Easily scalable in case there is an increment of data. • Ensuring security to data is easy.
  • 27.
    Unstructured Data • Unstructureddata is typically categorized as qualitative rather than quantitative. • It doesn’t have a pre-defined structure or specific format. • Data that lies in this category are audio, video, images, and text files contents which have different properties for making these data available for analysis and can’t be stored in relational databases tables. • So, these are stored in their raw format and analysis is done by applying Image processing, Natural Language Processing, and Machine Learning.
  • 28.
    Characteristics of UnstructuredData • Data neither conforms to a data model nor has any structure. • Data cannot be stored in the form of rows and columns as in databases. • Data does not follow any semantic or rules. • Data lacks a particular format or sequence. • Data has no easily identifiable structure. • Due to lack of identifiable structure, it cannot be used by computer program easily.
  • 29.
    Sources of UnstructuredData • Web pages • Images (JPEG, GIF, PNG, etc) • Videos • Reports • Word documents • Surveys
  • 30.
    Advantages of UnstructuredData • Its supports the data which lacks a proper format or sequence. • The data is not constrained by a fixed schema. • Very flexible due to the absence of schema. • Data is portable. • It is very scalable. • It can deal easily with the heterogeneity of sources. • These types of data have a variety of business intelligence and analytics applications.
  • 31.
    Disadvantages of UnstructuredData • It is difficult to store and manage unstructured data due to lack of schema and structure. • Ensuring security to data is a difficult task. • Indexing the data is difficult and is error due to unclear structure and not having pre-defined attributes, due to which search results are not very accurate.
  • 32.
    Semi-Structured Data • Semi-Structureddata contains elements of both structured and unstructured, its schema is not fixed as structured data and with the help of metadata (which enables users to define some partial structure or hierarchy), it can be organized to some extent so not unorganized as unstructured data. • Metadata includes tags and other markers just like in JSON, XML, or CSV which separates the elements and enforces the hierarchy, but the size of the element varies, and order is not important.
  • 33.
    Characteristics of Semi-structuredData • Data does not conform to a data model but has some structure. • Data cannot be stored in the form of rows and columns as in databases. • Similar entities are grouped together and organised in a hierarchy. • Entities in the same group may or may not have the same attributes or properties. • Size and type of the same attribute in a group may differ. • Due to lack of a well-defined structure, it cannot use by computer programs easily.
  • 34.
    Sources of Semi-structuredData • Emails • XML and other markup languages • TCP/IP packets • Zipped files • Integration of data from different sources • Web pages
  • 35.
    Advantages of Semi-structuredData • The data is not constrained by a fixed schema. • Flexible, i.e., Schema can be easily changed. • Data is portable. • It is possible to view structured data as semi-structured data. • Its supports users who cannot express their need in SQL. • It can deal easily with the heterogeneity of sources.
  • 36.
    Disadvantages of Semi-structuredData • Lack of fixed, schema makes it difficult to store the data. • Interpreting the relationship between data is difficult as there is no separation of the schema and the data. • Queries are less efficient as compared to structured data.
  • 37.
    Difference Between Structured,Semi-structured, and Unstructured Data Parameters Structured Data Semi-Structured Data Unstructured Data Data Structure The information and data have a predefined organization. The contained data and information have organizational properties- but are different from predefined structured data. There is no predefined organization for the available data and information in the system or database. Technology Used Structured Data words based on relational database tables. Semi-Structured Data works based on Relational Data Framework (RDF) or XML. Unstructured data works based on binary data and the available characters. Flexibility The data depends a lot on the schema. Thus, there is less flexibility. The data is comparatively less flexible than unstructured data but way more flexible than the structured data. Schema is totally absent. Thus, it is the most flexible of all. Management of Transaction It has a mature type of transaction. Also, there are various techniques of concurrency. It adapts the transaction from DBMS. It is not of mature type. It consists of no management of transaction or concurrency. Management of Version It is possible to version over tables, rows, and tuples. It is possible to version over graphs or tuples. It is possible to version the data as a whole. Scalability Scaling a database schema is very difficult. Thus, a structured database offers lower scalability. Scaling a Semi-Structured type of data is comparatively much more feasible. An unstructured data type is the most scalable in nature. Performance of Query A structured type of query makes complex joining possible. Semi-structured queries over various nodes (anonymous) are most definitely possible. Unstructured data only allows textual types of queries.
  • 38.
    Characteristics of Data Severalcharacteristics in the data such as: 1. Accuracy : The data must conform to actual, real-world scenarios and reflect real-world objects and events. Analysts should use verifiable sources to confirm the measure of accuracy, determined by how close the values with the verified correct information sources. 2. Completeness : Completeness measures the data's ability to deliver all the mandatory values that are available successfully.
  • 39.
    Characteristics of Data(Contd…) 3.Consistency: Data consistency describes the data’s uniformity as it moves across applications and networks and when it comes from multiple sources. Consistency also means that the same datasets stored in different locations should be the same and not conflict. Note that consistent data can still be wrong. 4. Timeliness: Timely data is information that is readily available whenever it’s needed. This dimension also covers keeping the data current; data should undergo real-time updates to ensure that it is always available and accessible.
  • 40.
    Characteristics of Data(Contd…) 5.Uniqueness: Uniqueness means that no duplications or redundant information are overlapping across all the datasets. No record in the dataset exists multiple times. Analysts use data cleansing and deduplication to help address a low uniqueness score. 6. Validity: Data must be collected according to the organization’s defined business rules and parameters. The information should also conform to the correct, accepted formats, and all dataset values should fall within the proper range.
  • 41.
    Introduction to BigData Platform Big Data • Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. • It is not a single technique or a tool, rather it has become a complete subject, which involves various tools, techniques and frameworks. • Big Data is the technical term used in reference to the vast quantity of heterogeneous datasets. • Examples of big data includes Cell phone details, Social media content, Health records, Transactional data, Web searches, Financial documents, Weather information.
  • 42.
    Introduction to BigData Platform (Contd…) • Data which are very large in size are called Big Data. • Normally, we work on data of size MB (Word Doc, Excel) or maximum GB (Movies, Codes) but data which are Petabytes, is called Big Data. • The size of large data can range from several terabytes (1 trillion bytes) to petabytes and even Exabytes. • It is the concept of gathering useful insights from such voluminous amounts of structured, semi-structured and unstructured data that can be used for effective decision-making in the business environment.
  • 43.
    Sources of BigData These data come from many sources like: • Social networking sites: Facebook, Google, LinkedIn all these sites generate huge amount of data on a day-to-day basis as they have billions of users worldwide. • E-commerce site: Sites like Amazon, Flipkart generates huge amount of logs from which users buying trends can be traced. • Weather Station: All the weather station and satellite gives very huge data which are stored and manipulated to forecast weather.
  • 44.
    Sources of BigData (Contd…) • Telecom company: Telecom giants like Airtel, Vodafone study the user trends and accordingly publish their plans and for this they store the data of its million users. • Share Market: Stock exchange across the world generates huge amount of data through its daily transaction.
  • 45.
    Applications of BigData • Banking and Securities • Communications, Media and Entertainment • Healthcare Providers • Education • Manufacturing and Natural Resources • Government Insurance • Retail and Wholesale trade • Transportation • Energy and Utilities
  • 46.
    Uses of BigData • Location Tracking • Fraud Detection and Handling • Advertising • Entertainment and Media
  • 47.
    Real World BigData Examples • Discovering consumer shopping habits • Personalised marketing • Fuel optimisation tools for the transportation industry • Monitoring health conditions through data from wearables • Live road mapping for autonomous vehicles • Streamlined media streaming • Predictive inventory ordering
  • 48.
    Issues with BigData There are three issues with Big Data, and they are as follows: • Low Quality and Inaccurate Data: Low-quality data or inaccurate data quality may lead to inaccurate results or predictions which does nothing apart from wasting time and effort of the individuals. • Processing Large Data Sets: Due to a large amount of data, no traditional data management tool or software can directly process because the size of these large datasets are usually in Terabytes which is difficult to process.
  • 49.
    Issues with BigData (Contd…) • Integrating Data from a Variety of Sources: Data comes from various types of sources like social media, different websites, captured images/ videos, customer logs, reports created by individuals, newspapers, e-mails, etc. Collecting and integrating various data which are of different types is basically a challenging task.
  • 50.
    Big Data Characteristics 1.Volume (Huge Amount of Data) 2. Veracity (Inconsistencies & uncertainty in data) 3. Variety (Different formats of data from various sources) 4. Value (Extract Useful Data) 5. Velocity (High speed of accumulation of data)
  • 51.
    Big Data Characteristics(Contd…) 1. Volume • Big Data is a vast 'volumes' of data generated from many sources daily, such as business processes, social media platforms, networks, human interactions, and many more. • Facebook can generate approximately a billion messages, 4.5 billion times that the "Like" button is recorded, and more than 350 million new posts are uploaded each day. Big data technologies can handle large amounts of data.
  • 52.
    Big Data Characteristics(Contd…) 2. Variety • Big Data can be structured, unstructured, and semi-structured that are being collected from different sources. Data will only be collected from databases and sheets in the past, But these days the data will comes in array forms, that are PDFs, Emails, audios, SM posts, photos, videos, etc.
  • 53.
    Big Data Characteristics(Contd…) 3. Veracity • Veracity means how much the data is reliable. It has many ways to filter or translate the data. Veracity is the process of being able to handle and manage data efficiently. 4. Value • Value is an essential characteristic of big data. It is not the data that we process or store. It is valuable and reliable data that we store, process, and analyze.
  • 54.
    Big Data Characteristics(Contd…) 5. Velocity • Velocity plays an important role compared to others. Velocity creates the speed by which the data is created in real-time. • Big data velocity deals with the speed at the data flows from sources like application logs, business processes, networks, and social media sites, sensors, mobile devices, etc.
  • 55.
    What is abig data platform? • A big data platform is an integrated computing solution that combines numerous software systems, tools, and hardware for big data management. • Big data Platform workflow is divided into the following stages: 1. Data Collection 2. Data Storage 3. Data Processing 4. Data Analytics
  • 56.
    What is abig data platform? (Contd…) 5. Data Management and Warehousing 6. Data Catalog and Metadata Management 7. Data Observability 8. Data Intelligence
  • 57.
    Characteristics of aBig Data Platform 1. Ability to accommodate new applications and tools depending on the evolving business needs. 2. Support several data formats. 3. Ability to accommodate large volumes of streaming data. 4. Have a wide variety of conversion tools to transform data to different preferred formats. 5. Capacity to accommodate data at any speed. 7. The ability for quick deployment. 8. Have the tools for data analysis and reporting requirements.
  • 58.
    Different Types ofBig Data Platforms and Tools 1. Hadoop Delta Lake Migration Platform 2. Data Catalog and Data Observability Platform 3. Data Ingestion and Integration Platform 4. Big Data and IoT Analytics Platform 5. Data Discovery and Management Platform 6. Cloud ETL Data Transformation Platform
  • 59.
    Need/Importance of BigData • Reduction in cost • Time reduction • New product development in optimized offer • Well groomed decision making
  • 60.
    Challenges of BigData • Rapid data growth: The growth velocity at such a high rate creates a problem to look for insights using it. There no 100% efficient way to filter out relevant data. • Storage: The generation of such a massive amount of data needs space for storage, and organizations face challenges to handle such extensive data without suitable tools and technologies.
  • 61.
    Challenges of BigData (Contd…) • Unreliable data: It cannot be guaranteed that the big data collected and analyzed are totally (100%) accurate. • Data security: Firms and organizations storing such massive data (of users) can be a target of cybercriminals, and there is a risk of data getting stolen. Hence, encrypting such data is also a challenge for firms and organizations.
  • 62.
    Data analytics • Dataanalytics is the process of examining datasets to find trends and draw conclusions about the information they contain. • Data analytics technologies and techniques are widely used in commercial industries to enable organisations to make more-informed business decisions. • Scientists and researchers also use analytics tools to verify or disprove scientific models, theories and hypotheses.
  • 63.
    Need of DataAnalytics Data analytics is important for many reasons, including: • Informed decision-making: Data analytics helps businesses make better decisions by providing a holistic view of their performance and identifying opportunities for improvement. • Improved customer experience: Data analytics can help businesses understand their customers' preferences and needs, which can lead to personalized experiences and better customer outcomes • Fraud detection and security: Data analytics can help businesses identify suspicious activity and minimize risk. • Healthcare: Data analytics can help healthcare professionals make evidence- based decisions about patient care, disease diagnosis, and treatment optimization.
  • 64.
    Analytic Scalability • Analyticscalability is the ability to use data to understand and solve a large variety of problems. And because problems come in many forms, analytics must be flexible enough to address problems in different ways. This might include the use of statistical tools and forecasting.
  • 65.
    Evolution of AnalyticScalability • The amount of data, organizations process continues to increase. • For instance, it is estimated that Walmart collects more than 2.5 petabytes of data every hour from its customer transactions. • A petabyte is the equivalent of about 20 million filing cabinets ‘worth of text. • An exabyte is 1,000 times that amount, or one billion gigabytes.
  • 66.
    Data Analytic Process •The collection, transformation, and organization of data to draw conclusions make predictions for the future and make informed data- driven decisions is called Data Analysis. • The profession that handles data analysis is called a Data Analytic.
  • 67.
    Data Analytic Process(Contd…) Six steps of data analytic process: 1. Define the Problem or Research Question 2. Collect Data 3. Data Cleaning 4. Analyzing the Data 5. Data Visualization 6. Presenting Data
  • 68.
    Data Analytic Process(Contd…) 1. Define the Problem or Research Question • Data analyst is given a problem/business task. • The analyst has to understand the task and the stakeholder’s expectations for the solution. • A stakeholder is a person that has invested their money and resources to a project. • Questions to ask yourself for the Ask phase are: 1. What are the problems that are being mentioned by my stakeholders? (The analyst must find the root cause of the problem to fully understand the problem) 2. What are their expectations for the solutions? (The analyst must be able to ask different questions to find the right solution to their problem.)
  • 69.
    Data Analytic Process(Contd…) 2. Collect Data • The data has to be collected from various sources like Internal or External Sources. • Internal data is the data available in the organization that you work for while external data is the data available in sources other than your organization. • The data that is collected by an individual from their own resources is called First- Party Data. • The data that is collected and sold is called Second-Party Data. • Data that is collected from outside sources is called Third-Party Data. • The common sources from where the data is collected are Interviews, Surveys and Questionnaires. • The collected data can be stored in a spreadsheet or SQL database. • The best tools to store the data are MS Excel or Google Sheets in the case of Spreadsheets and there are so many databases like Oracle, Microsoft to store the data
  • 70.
    Data Analytic Process(Contd…) 3. Data Cleaning (Clean and Process Data) • Clean data means data that is free from misspellings and redundancies. • There are different functions provided by SQL and Excel to clean the data and formatted data helps in finding trends and solutions. • The most important part of the Process phase is to check whether your data is biased or not. • Bias is an act of favoring a particular group/community while ignoring the rest. Biasing is a big no-no as it might affect the overall data analysis. • The data analyst must make sure to include every group while the data is being collected.
  • 71.
    Data Analytic Process(Contd…) 4. Analyzing the Data • The cleaned data is used for analyzing and identifying trends. • It also performs calculations and combines data for better results. • The tools used for performing calculations are Excel or SQL. • Using Excel, we can create pivot tables and perform calculations while SQL creates temporary tables to perform calculations. • Programming languages are another way of solving problems for data analysis is R and Python.
  • 72.
    Data Analytic Process(Contd…) 5. Data Visualization • The data now transformed has to be made into a visual (chart, graph). • The reason for making data visualizations is that there might be people, mostly stakeholders that are non-technical. • Visualizations are made for a simple understanding of complex data. • Tableau and Looker are the two popular tools used for data visualizations. • Tableau is a simple drag and drop tool that helps in creating visualizations. • Looker is a data viz tool that directly connects to the database and creates visualizations.
  • 73.
    Data Analytic Process(Contd…) 6. Presenting the Data • Presenting the data involves transforming raw information into a format that is easily comprehensible and meaningful for various stakeholders. • This process encompasses the creation of visual representations, such as charts, graphs, and tables, to effectively communicate patterns, trends, and insights gleaned from the data analysis. • The goal is to facilitate a clear understanding of complex information, making it accessible to both technical and non-technical audiences.
  • 74.
    Types/Levels/Methods of DataAnalytics 1. Descriptive Data Analysis • Descriptive analytics looks at the past performance and understands the performance by mining historical data to understand the cause of success or failure in the past. • Almost all management reporting such as sales, marketing, operations, and finance uses this type of analysis. • Descriptive analytics is used when the organisation has a large dataset on past events or historical events. “What happened?” Or “What was the trend?”
  • 75.
    Types/Levels/Methods of DataAnalytics (Contd…) 2. Diagnostic Analytics It is the process of using the data to understand the underlying reasons behind past events, trends and outcomes to answer. “Why did this happen?” 3. Predictive Analytics It is the process of applying statistical and machine learning techniques on historical data to make prediction to answer. “What might happen in the future?”
  • 76.
    Types/Levels/Methods of DataAnalytics (Contd…) 4. Prescriptive Analytics It is the process of using the data to recommend actions in response to a given forecast to optimized desired outcomes. “What should we do?”
  • 77.
    Data Analytic Tools Dataanalytic have many types of tools: 1. Tableau Public 2. OpenRefine 3. KNIME 4. RapidMiner 5. Google Fusion Tables 6. NodeXL 7. Wolfram Alpha 8. Google Search Operators 9. Solver 10. Dataiku DSS
  • 78.
    Data Analytic Tools(Contd…) 1. Tableau Public • Tableau, one of the top 10 Data Analytics tools, is a simple tool which offers data visualization. • With Tableau’s visuals, you can investigate a hypothesis, explore the data, and cross-check your insights.
  • 79.
    Data Analytic Tools(Contd…) • Uses of Tableau Public 1. You can publish interactive data visualizations to the web for free. 2. No programming skills required. 3. Visualizations published to Tableau Public can be embedded into blogs and web pages and be shared through email or social media. The shared content can be made availables for downloads. • Limitations of Tableau Public 1. Data size limitation
  • 80.
    Data Analytic Tools(Contd…) 2. OpenRefine Formerly known as GoogleRefine, the data cleaning software that helps you clean up data for analysis. It operates on a row of data which have cells under columns, quite like relational database tables. • Uses of OpenRefine 1. Cleaning messy data 2. Transformation of data
  • 81.
    Data Analytic Tools(Contd…) Limitations of OpenRefine • Refine does not work very well with big data. 3. KNIME KNIME, ranked among the top Data Analytics tools helps you to manipulate, analyze, and model data through visual programming. It is used to integrate various components for data mining and machine learning via its modular data pipelining concept.
  • 82.
    Data Analytic Tools(Contd…) • Uses of KNIME 1. Rather than writing blocks of code, you just have to drop and drag connection points between activities. 2. This data analysis tool supports programming languages. 3. In fact, analysis tools like these can be extended to run text mining, python, and R. • Limitation of KNIME 1. Poor data visualization
  • 83.
    Data Analytic Tools(Contd…) 4. RapidMiner • RapidMiner provides machine learning procedures and data mining including data visualization, processing, statistical modeling, deployment, evaluation, and predictive analytics. • RapidMiner, counted among the top 10 Data Analytics tools, is written in the Java and fast gaining acceptance. Uses of RapidMiner • It provides an integrated environment for business analytics, predictive analysis, text mining, data mining, and machine learning. • Along with commercial and business applications, RapidMiner is also used for application development, training, education, and research.
  • 84.
    Data Analytic Tools(Contd…) • Limitations of RapidMiner 1. RapidMiner has size constraints with respect to the number of rows. 2. For RapidMiner, you need more hardware resources. 5. Google Fusion Tables An incredible tool for data analysis, mapping, and large dataset visualization, Google Fusion Tables can be added to business analytics tools list. Ranked among the top 10 Data Analytics tools, Google Fusion Tables is fast gaining popularity.
  • 85.
    Data Analytic Tools(Contd…) • Uses of Google Fusion Tables 1. Visualize bigger table data online. 2. Filter and summarize across hundreds of thousands of rows. • Limitations of Google Fusion Tables 1. Only the first 100,000 rows of data in a table are included in query results or mapped. 2. The total size of the data sent in one API call cannot be more than 1MB.
  • 86.
    Data Analytic Tools(Contd…) 6. NodeXL NodeXL is a free and open-source network analysis and visualization software. Ranked among the top 10 Data Analytics tools, it is one of the best statistical tools for data analysis which includes advanced network metrics, access to social media network data importers, and automation. • Uses of NodeXL This is one of the best data analysis tools in Excel that helps in: 1. Data Import 2. Graph Visualization 3. Graph Analysis 4. Data Representation
  • 87.
    Data Analytic Tools(Contd…) Limitations of NodeXL 1. Multiple seeding terms are required for a particular problem. 2. Need to run the data extractions at slightly different times. 7. Wolfram Alpha Wolfram Alpha, one of the top 10 Data Analytics tools is a computational knowledge engine founded by Stephen Wolfram. With Wolfram Alpha, you get answers to factual queries directly by computing the answer from externally sourced instead of providing a list of documents or web pages.
  • 88.
    Data Analytic Tools(Contd…) • Uses of Wolfram Alpha 1. Provides detailed responses to technical searches and solves calculus problems. 2. Helps business users with information charts, graphs and helps in creating topic overviews and high-level pricing history. • Limitations of Wolfram Alpha 1. Wolfram Alpha can only deal with the publicly known number and facts, not with viewpoints. 2. It limits the computation time for each query.
  • 89.
    Data Analytic Tools(Contd…) 8. Google Search Operators It is a powerful resource that helps you filter Google results instantly to get the most relevant and useful information. • Uses of Google Search Operators 1. Fast filtering of Google results. 2. Google is powerful data analysis tool can help discover new information or market research.
  • 90.
    Data Analytic Tools(Contd…) 9. Solver The Solver Add-in is a Microsoft Office Excel add-in program that is available when you install Microsoft Excel or Office. Ranked among the best-known Data Analytic tools is a linear programming and optimization tool in excel. It is an advanced optimization tool that helps in quick problem solving.
  • 91.
    Data Analytic Tools(Contd…) • Uses of Solver It uses a variety of methods, from nonlinear optimization and linear programming and genetic algorithms, to find solutions. • Limitations of Solver 1. Poor scaling is one of the areas where Excel Solver lacks. 2. It can affect solution time and quality.
  • 92.
    Data Analytic Tools(Contd…) 10. Dataiku DSS Ranked among the top 10 Data Analytic tools, Dataiku is a collaborative data science software platform that helps the team build, prototype, explore, and deliver their own data products more efficiently. • Uses of Dataiku DSS • It provides an interactive visual interface. • This data analytics tool lets you draft data preparation and modulization in seconds.
  • 93.
    Data Analytic Tools(Contd…) • Limitation of Dataiku DSS 1. Limited visualization capabilities 2. UI hurdles: Reloading of code/datasets 3. Inability to easily compile entire code into a single document/notebook
  • 94.
    Analysis vs Reporting •Analysis involves data interpreting where reporting involving presenting factual, accurate data. • Analysis answers why something is happening based on the data, whereas reporting tells what’s happening. • Analysis delivers recommendations, but reporting is more about organizing and summarizing data.
  • 95.
    Analysis vs Reporting(Contd…) Analytics Reporting Analytics is the method of examining and analyzing summarized data to make business decisions. Reporting is an action that includes all the needed information and data and is put together in an organized way Questioning the data, understanding it, investigating it, and presenting it to the end users are all part of analytics. Identifying business events, gathering the required information, organizing, summarizing, and presenting existing data are all part of reporting. The purpose of analytics is to draw Conclusions based on data. The purpose of reporting is to organize the data into meaningful information. Analytics is used by data analysts, scientists, and business people to make effective decisions. Reporting is provided to the appropriate business leaders to perform effectively and efficiently within a firm.
  • 96.
    Modern Data AnalyticTools 1. Apache Hadoop 2. KNIME 3. Open Refine 4. Orange 5. Splunk 6. Talend 7. Power BI 8. Tableau 9. RapidMiner 10. R-programming 11. Data wrapper
  • 97.
    Modern Data AnalyticTools (Contd…) 1. Apache Hadoop • Apache Hadoop, a Big Data analytics tool which is a Java based free software framework. • It helps in effective storage of huge amount of data in a storage place known a cluster. • There is a storage system in Hadoop popularly known as the Hadoop Distributed File System (HDFS), which helps to splits the large volume of data and distribute across many nodes present in a cluster.
  • 98.
    Modern Data AnalyticTools (Contd…) 2. KNIME • KNIME Analytics Platform is the open-source software for creating data science. • KNIME analytics platform is one of the leading open solutions for data- driven innovation. • This tool helps in discovering the potential and hidden in a huge volume of data., it also performs mine for fresh insights or predicts the new futures.
  • 99.
    Modern Data AnalyticTools (Contd…) 3. Open Refine • Open Refine tool is one of the efficient tools to work on the messy and large volume of data. • It includes cleansing data, transforming that data from one format another. • It helps to explore large datasets easily.
  • 100.
    Modern Data AnalyticTools (Contd…) 4. Orange • Orange is famous data visualisation and helps in data analysis for beginner and as well to the expert. • It provides a clean, open-source platform. 5. Splunk • It is a platform used to search, analyse, and visualise the machine-generated data gathered from applications, websites, etc. • Splunk has evolved products in various fields such as IT, Security, Analytics.
  • 101.
    Modern Data AnalyticTools (Contd…) 6. Talend • It is one of the most powerful data integration tools available in the market and is developed in the Eclipse graphical development environment. • This tool lets you easily manage all the steps involved in the process and aims to deliver compliant, accessible and clean data for everyone.
  • 102.
    Modern Data AnalyticTools (Contd…) 7. Power BI • It is a Microsoft product used for business analytics. • It provides interactive visualizations with self-service business intelligence capabilities, where end users can create dashboards and reports by themselves, without having to depend on anybody.
  • 103.
    Modern Data AnalyticTools (Contd…) 8. Tableau • It is a market-leading Business Intelligence tool used to analyse and visualise data in an easy format. • Tableau allows you to work on live data-set and spend more time on Data Analysis.
  • 104.
    Modern Data AnalyticTools (Contd…) 9. RapidMiner • RapidMiner tool operates using visual programming and also it is much capable of manipulating, analysing and modeling the data. • RapidMiner tools make data science teams easier and productive by using an open-source platform for all their jobs like machine learning, data preparation, and model deployment.
  • 105.
    Modern Data AnalyticTools (Contd…) 10. R-programming • R is a free open-source software programming language and a software environment for statistical computing and graphics. • It is used by data miners for developing statistical software and data analysis. • It has become a highly popular tool for Big Data in recent years.
  • 106.
    Modern Data AnalyticTools (Contd…) 11. Data wrapper • It is an online data visualisation tool for making interactive charts. • It uses data file in a csv, pdf or excel format. • Data wrapper generate visualisation in the form of bar, line, map etc.
  • 107.
    Applications of DataAnalytics Data analytics finds applications across various industries and sectors, transforming the way organizations operate and make decisions. Here are some examples of how data analytics is applied in different domains: 1. Healthcare 2. Finance 3. E-commerce 4. Cyber security
  • 108.
    Applications of DataAnalytics (Contd…) 5. Supply Chain Management 6. Banking 7. Logistics 8. Retail 9. Manufacturing 10. Internet Searching 11. Risk Management
  • 109.
    Applications of DataAnalytics (Contd…) 1. Healthcare • Data analytics is the healthcare industry by enabling better patient care, disease prevention, and resource optimization. For example, hospitals can analyze patient data to identify high-risk individuals and provide personalized treatment plans. • Data analytics can also help detect disease outbreaks, monitor the effectiveness of treatments, and improve healthcare operations.
  • 110.
    Applications of DataAnalytics (Contd…) 2. Finance • In the financial sector, data analytics plays a crucial role in fraud detection, risk assessment, and investment strategies. • Banks and financial institutions analyze large volumes of data to identify suspicious transactions and optimize investment portfolios. • Data analytics also enables personalized financial advice and the development of creative financial products and services.
  • 111.
    Applications of DataAnalytics (Contd…) 3. E-commerce • E-commerce platforms utilize data analytics to understand customer behavior, personalize shopping experiences, and optimize marketing campaigns. • By analyzing customer preferences, purchase history, and browsing patterns, e- commerce companies can offer personalized product recommendations, target specific customer segments, and improve customer satisfaction and retention.
  • 112.
    Applications of DataAnalytics (Contd…) 4. Cyber security • Data analytics plays a vital role in cyber security by detecting and preventing cyber threats and attacks. • Security systems analyze network traffic, user behavior, and system logs to identify anomalies and potential security breaches.
  • 113.
    Applications of DataAnalytics (Contd…) 5. Supply Chain Management • Data analytics improves supply chain management by optimizing inventory levels, reducing costs, and enhancing overall operational efficiency. • Organizations can identify bottlenecks, forecast demand, and improve logistics and distribution processes by analyzing supply chain data. • Data analytics also enables better supplier management and enhances transparency throughout the supply chain.
  • 114.
    Applications of DataAnalytics (Contd…) 6. Banking • Banks use data analytics to gain insights into customer behavior, manage risks, and personalize financial services. • Banks can tailor their offerings, identify potential fraud, and assess credit worthiness by analyzing transaction data, customer demographics, and credit histories. • Data analytics also helps banks to improve regulatory compliance.
  • 115.
    Applications of DataAnalytics (Contd…) 7. Logistics • In the logistics industry, data analytics plays a crucial role in optimizing transportation routes, managing fleet operations, and improving overall supply chain efficiency. • Logistics companies can minimize costs, reduce delivery times, and enhance customer satisfaction by analyzing data on routes, delivery times, and vehicle performance. • Data analytics also enables better demand forecasting and inventory management.
  • 116.
    Applications of DataAnalytics (Contd…) 8. Retail • Data analytics transforms the retail industry by providing insights into customer preferences, optimizing pricing strategies, and improving inventory management. • Retailers analyze sales data, customer feedback, and market trends to identify popular products, personalize offers, and forecast demand. • Data analytics also helps retailers enhance their marketing efforts, improve customer loyalty, and optimize store layouts.
  • 117.
    Applications of DataAnalytics (Contd…) 9. Manufacturing • Data analytics is the manufacturing sector by enabling predictive maintenance, optimizing production processes, and improving product quality. • Manufacturers can predict equipment failures, minimize downtime, and ensure efficient operations by analyzing sensor data, machine performance, and historical maintenance records. • Data analytics also enables real-time monitoring of production lines, leading to higher productivity and cost savings.
  • 118.
    Applications of DataAnalytics (Contd…) 10. Internet Searching • Data analytics powers internet search engines, enabling users to find relevant information quickly and accurately. • Search engines analyze vast amounts of data, including web pages, user queries, and click- through rates, to deliver the most relevant search results. • Data analytics algorithms continuously learn and adapt to user behavior, providing accurate and personalized search results.
  • 119.
    Applications of DataAnalytics (Contd…) 11. Risk Management • Data analytics plays a crucial role in risk management across various industries, including insurance, finance, and project management. • Organizations can assess risks, develop strategies, and make informed decisions by analyzing historical data, market trends, and external factors. • Data analytics helps organizations identify potential risks and quantify their impact.
  • 120.
    Key Role ofData Analytics Project There are certain key roles that are required for the complete and fulfilled functioning of the data science team to execute projects on analytics successfully. The key roles are: 1. Business User 2. Project Sponsor 3. Project Manager 4. Business Intelligence Analyst 5. Database Administrator 6. Data Engineer 7. Data Scientist
  • 121.
    Key Role ofData Analytics Project (Contd…) 1. Business User • The business user is the one who understands the main area of the project and is also basically benefited from the results. • This user gives advice and consults the team working on the project about the value of the results obtained and how the operations on the outputs are done. • The business manager, line manager, or deep subject matter expert in the project mains fulfills this role.
  • 122.
    Key Role ofData Analytics Project (Contd…) 2. Project Sponsor • The Project Sponsor is the one who is responsible to initiate the project. Project Sponsor provides the actual requirements for the project and presents the basic business issue. • He generally provides the funds and measures the degree of value from the final output of the team working on the project. • This person introduces the prime concern and brooms the desired output.
  • 123.
    Key Role ofData Analytics Project (Contd…) 3. Project Manager This person ensures that key milestone and purpose of the project is met on time and of the expected quality. 4. Business Intelligence Analyst • Business Intelligence Analyst provides business domain perfection based on a detailed and deep understanding of the data, key performance indicators (KPIs), key matrix, and business intelligence from a reporting point of view. • This person generally creates reports and knows about the data feeds and sources.
  • 124.
    Key Role ofData Analytics Project (Contd…) 5. Database Administrator (DBA) • DBA facilitates and arranges the database environment to support the analytics need of the team working on a project. • His responsibilities may include providing permission to key databases or tables and for making sure that the appropriate security stages are in their correct places related to the data repositories or not.
  • 125.
    Key Role ofData Analytics Project (Contd…) 6. Data Engineer • Data engineer grasps deep technical skills to assist with SQL queries for data management and data extraction and provides support for data intake into the analytic sandbox. • The data engineer works jointly with the data scientist to help build data in correct ways for analysis.
  • 126.
    Key Role ofData Analytics Project (Contd…) 7. Data Scientist • Data scientist facilitates with the subject matter expertise for analytical techniques, data modelling and applying correct analytical techniques for a given business issues. • He ensures overall analytical objectives are met. • Data scientists outline and apply analytical methods and proceed towards the data available to the project.
  • 127.
    Data Analytics Lifecycle Intoday’s digital-first world, data is importance. It undergoes various stages throughout its life, during its creation, testing, processing, consumption, and reuse. Data Analytics Lifecycle maps out these stages for professionals working on data analytics projects. Primarily it has 6 stages. • Phase 1: Data Discovery • Phase 2: Data Preparation • Phase 3: Model Planning • Phase 4: Model Building • Phase 5: Communication and Publication of Results • Phase 6: Operationalize/Measuring of Effectiveness
  • 128.
  • 129.
    Data Analytics Lifecycle(Contd…) Phase 1: Data Discovery • The data science team is learns and investigates the problem. • Create context and gain understanding. • Learn about the data sources that are needed and accessible to the project. • The team produces an initial hypothesis, which can be later confirmed with evidence.
  • 130.
    Data Analytics Lifecycle(Contd…) Phase 2: Data Preparation • Methods to investigate the possibilities of pre-processing, analysing, and preparing data before analysis and modelling. • It is required to have an analytic sandbox. The team performs, loads, and transforms to bring information to the data sandbox. • Data preparation tasks can be repeated and not in a predetermined sequence. • Some of the tools used commonly for this process include - Hadoop, Open Refine, etc.
  • 131.
    Data Analytics Lifecycle(Contd…) Phase 3: Model Planning • The team studies data to discover the connections between variables. Later, it selects the most significant variables as well as the most effective models. • In this phase, the data science teams create data sets that can be used for training for testing, production, and training goals. • The team builds and implements models based on the work completed in the modelling planning phase. • Some of the tools used commonly for this stage are MATLAB and STASTICA.
  • 132.
    Data Analytics Lifecycle(Contd…) Phase 4: Model Building • The team creates datasets for training, testing as well as production use. • The team is also evaluating whether its current tools are sufficient to run the models or if they require an even more robust environment to run models. • Commercial tools - MATLAB, STASTICA.
  • 133.
    Data Analytics Lifecycle(Contd…) Phase 5: Communication Results • After executing the model, team members will need to evaluate the outcomes of the model to establish criteria for the success or failure of the model. • The team is considering how best to present findings and outcomes to the various members of the team and other stakeholders while taking into account warning and assumptions. • The team should determine the most important findings, quantify their value to the business and create a narrative to present findings and summarize them to all stakeholders.
  • 134.
    Data Analytics Lifecycle(Contd…) Phase 6: Operationalize • The team distributes the benefits of the project to a wider audience. It sets up a pilot project that will deploy the work in a controlled manner prior to expanding the project to the entire enterprise of users. • This technique allows the team to gain insight into the performance and constraints related to the model within a production setting at a small scale and then make necessary adjustments before full deployment. • The team produces the last reports, presentations, and codes. • Open source or free tools such as WEKA, SQL and MADlib.
  • 135.
    Need of DataAnalytics Life Cycle • The Data Analytics Lifecycle outlines how data is created, gathered, processed, used, and analyzed to meet corporate objectives. • It provides a structured method of handling data so that it may be transformed into knowledge that can be applied to achieve organizational and project objectives. • The process offers the guidance and techniques needed to extract information from the data and move forward to achieve corporate objectives.
  • 136.
    Need of DataAnalytics Life Cycle (Contd…) • Data analysts use the circular nature of the lifecycle to go ahead or backward with data analytics. • They can choose whether to continue with their current research and conduct a fresh analysis considering the recently acquired insights. Their progress is guided by the Data Analytics lifecycle.