From Science to Data
Following a principled path
to Data Science
Dr. Anestis Fachantidis
CEO & Chief Data Scientist
Medoid AI
Anestis Fachantidis – From Science to Data 2
A Data Scientist's Ikigai
• How to become good at DS
• What kind of DS does the world need?
• What to love in DS?
• Which industries pay for DS now?
What you are good at
Become good at some of the tools used in real projects
Become good at
• Next slides:
• Reasons that the theoretical background is mandatory
• No one will ever master all these skills! This is a maximal set; we are just looking for your entry point as a beginner.
• This could be the pair of your weakest skill and your strongest one: one to make you a better DS and one to give you early gratification.
• The field is leaning more and more towards specialization.
Theoretical Background
Accessing Data Sources
Data Preparation
Learning Models
Data Visualization & Reporting
Web Apps & Services for DS
Cluster Computation
Git & Documentation
Business Understanding
Team Work & Project Management
Theoretical Background
Minimum set of keywords for beginners
*External .png file available
Accessing Data
• Reading data 101: read .csv and Excel files (read.xlsx)
• Connect to a database
• Set up MySQL or SQL Server Express locally and download a sample DB (e.g., the Employees DB)
• Connect and read through R using packages like DBI and dbplyr
• You will still need basic knowledge of SQL!
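The bullets above can be tried end to end without installing a database server. The following is only a minimal sketch, assuming the DBI and RSQLite packages are installed; an in-memory SQLite database stands in for MySQL or SQL Server Express, and the `employees` table with its three rows is made up for illustration (it is not the real sample DB).

```r
library(DBI)

# Hypothetical sample data (illustrative only, not the real Employees DB)
employees <- data.frame(
  name   = c("Ada", "Grace", "Linus"),
  dept   = c("R&D", "R&D", "Ops"),
  salary = c(5200, 6100, 4800)
)

# Reading data 101: write a .csv to a temporary file and read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(employees, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)

# Connect to a database: throwaway in-memory SQLite instance
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "employees", from_csv)

# Basic SQL is still needed: average salary per department
avg_salary <- dbGetQuery(
  con,
  "SELECT dept, AVG(salary) AS avg_salary
   FROM employees
   GROUP BY dept
   ORDER BY dept"
)
print(avg_salary)

dbDisconnect(con)
```

Swapping the `dbConnect()` call for a MySQL or SQL Server driver should leave the rest of the code unchanged, which is the main reason to go through DBI.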
Accessing Data
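dbplyr example:
library(dplyr)  # tbl(), select(), mutate(), %>%; dbplyr must also be installed
# In-memory SQLite database, loaded with the nycflights13 sample data
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, nycflights13::flights, "flights")
flights <- tbl(con, "flights")
# The pipeline is translated to SQL; show_query() prints it without running it
flights %>%
  select(distance, air_time) %>%
  mutate(speed = distance / (air_time / 60)) %>%
  show_query()
#> <SQL>
#> SELECT `distance`, `air_time`, `distance` / (`air_time` / 60.0) AS `speed`
#> FROM (SELECT `distance`, `air_time`
#> FROM `flights`)
# Note: the exact SQL text (e.g. the nested subquery) depends on the dbplyr version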
Data Preparation
• Data preparation typically takes 60% to 80% of the time spent on the whole analytical pipeline in a real DS project
• In R, packages like dplyr and data.table make data handling and processing easier
• There are numerous packages and libraries for data cleaning
• Learning magrittr pipelines will make your code readable and will make the data manipulation process clearer
• Understand feature extraction and feature selection, and how they connect to the business context at hand.
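A pipeline of the kind described above might look as follows. This is a minimal sketch, assuming the dplyr package is installed (it re-exports the magrittr `%>%` pipe); it uses the `mtcars` data set that ships with R, and the derived column and the grouping are illustrative choices rather than anything prescribed by the talk.

```r
library(dplyr)

# Each step reads top to bottom, like a sentence
fuel_by_cyl <- mtcars %>%
  filter(!is.na(mpg)) %>%                    # cleaning: drop rows with a missing target
  mutate(km_per_litre = mpg * 0.425144) %>%  # derive a feature: US mpg -> km per litre
  select(cyl, km_per_litre) %>%              # keep only the columns that are needed
  group_by(cyl) %>%
  summarise(mean_kpl = mean(km_per_litre), cars = n())

print(fuel_by_cyl)
```

Without the pipe, the same logic would be five nested function calls read from the inside out, which is much harder to review.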
Learning Models
• Frameworks like scikit-learn (Python) and
caret (R) will make your first ML
experimentation steps much easier.
• They provide a standardized interface to
training, testing and hyper-parameter tuning.
• Try them on a Kaggle dataset!
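The standardized interface mentioned above can be sketched with caret as follows. This is only an illustration, assuming the caret package is installed; the built-in `iris` data stands in for a Kaggle dataset, and the choice of k-nearest neighbours, the 80/20 split and the candidate values of `k` are arbitrary assumptions.

```r
library(caret)

set.seed(42)  # make the split and the resampling reproducible

# Hold out roughly 20% of the data for testing, stratified by class
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

# Training and hyperparameter tuning: 5-fold CV over three values of k
fit <- train(
  Species ~ .,
  data      = train_set,
  method    = "knn",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid  = data.frame(k = c(3, 5, 7))
)

# Testing: accuracy on the held-out set
pred     <- predict(fit, newdata = test_set)
accuracy <- mean(pred == test_set$Species)

print(fit$bestTune)
print(accuracy)
```

Changing `method` to another model name supported by caret keeps the rest of the code the same, which is what makes early experimentation cheap.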
Git & Documentation
Git is the most popular version control system today. Among other reasons, you need it in DS too because:
• Results are paired with {parameters, features selected, code}, which together comprise an (almost) deterministic state. Capture that state for all your results!
• Make a Bitbucket, GitLab or GitHub account to:
– Create a small DS “portfolio” of personal projects
– Collaborate on open source projects
• Comment your code and always consider packaging it for reusability
• Comment your objects:
comment(object) <- "…"
• Don’t give their files names like:
Results23-5withoutsumofmoney2.rds
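One possible way to capture that state next to a result is sketched below, using base R only. The parameter list, the `params` attribute and the file naming scheme are assumptions made for illustration, not a convention from the talk.

```r
# Parameters and selected features that produce the result (illustrative values)
params <- list(seed = 42, features = c("wt", "hp"), target = "mpg")

set.seed(params$seed)
fit <- lm(reformulate(params$features, response = params$target), data = mtcars)

# Comment the object and attach the parameters to it
comment(fit) <- "Linear model of mpg on wt + hp, mtcars, seed 42"
attr(fit, "params") <- params

# A descriptive file name says what is inside without opening the file
out_file <- file.path(tempdir(), "mtcars_lm_mpg_wt-hp_seed42.rds")
saveRDS(fit, out_file)

# The comment and the parameters travel with the saved object
restored <- readRDS(out_file)
print(comment(restored))
```

Committing the script that produced the file to Git then covers the remaining part of the state, the code.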
Data Visualization & Reporting
• Significant DS tasks in their own right, and the most powerful communication tools of a DS
• Learn at least one plotting “language”
• In R, the best-known plotting grammar is that of ggplot2
• For interactive charts you can also use libraries like Plotly and rCharts
• The fully reproducible paradigm of a compiled report: “Code and text in one document side by side”
• Learn tools like R Markdown (R) and Jupyter (Python), which use an easy Markdown syntax
Web Apps & Services for DS
Analytics APIs and ML web services:
• For small teams, a platform-as-a-service solution like Heroku makes it easy to deploy a data product
• You need a basic understanding of HTTP and of data formats such as JSON
Data products as web apps:
• Web app frameworks like Shiny (R) or Django (Python) can help you deliver analytics or ML results with only limited knowledge of web development
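To make the role of JSON concrete, the sketch below shows what an ML web service does with a request body, leaving out the HTTP layer itself. It assumes the jsonlite package is installed; the request payload, the field names and the model are hypothetical and chosen only for illustration.

```r
library(jsonlite)

# A model that a web service might wrap (fitted on the built-in mtcars data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Body of a hypothetical HTTP POST request, as a client would send it
request_body <- '{"wt": 2.5, "hp": 110}'

# JSON -> R list -> one-row data frame -> prediction
new_car    <- as.data.frame(fromJSON(request_body))
prediction <- unname(predict(fit, newdata = new_car))

# Serialise the answer back to JSON for the HTTP response
response_body <- toJSON(list(mpg = round(prediction, 2)), auto_unbox = TRUE)
print(response_body)
```

A web framework would only add routing around these few lines: receive the request, run this code, and send `response_body` back.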
Cluster Computation
• SparkR and sparklyr make it easy to connect R to a distributed computation cluster (Apache Spark)
• Both give access to Spark's MLlib library
• The h2o package gives access to the open source H2O engine for analytics and ML
• Set them up locally and try them on sample data!
Business Understanding
• More generally: domain understanding
• Requirements analysis = following a structured analytical process + communication skills
• Beyond understanding the meaning of each business variable, understanding:
– Whether the variable is directly controlled or not
– How it influences other variables
Team Work & Project Management
• Learn about software development processes like Scrum and Kanban
• Understand how DS projects differ from software development processes
• Data mining process model: CRISP-DM
• Dedicated, sophisticated platforms like Atlassian Jira
• For a small team or a small DS project, a Trello board is more than enough!
What the world needs
It should not be only about “performance” and revenue streams
Insights
• Open Insights
– People need to make sense of data
– Think of any NGO whose purpose you love, or even the shop around the corner whose products you just love:
• Wouldn't you want to give them relevant insights about their business environment?
• Insights that will make them act accordingly and become sustainable
– Data Scientists also have the responsibility to educate people on the interpretation of results and on how to identify bad data journalism
…Openness…
Open Source
• The open source communities need support, and people who care about their projects, too
– Don’t just use them, think of ways to contribute
• Write your own R package or Python library
– Share it on a Git web platform and it might be the next big thing in
open source!
Open Data, which are of critical value for the following reasons:
– Transparency and democratic control
– Improved or new private products and services
– Improved efficiency and effectiveness of government services
– New knowledge from combined data sources and patterns in large
data volumes
…which all need a Data Scientist!
The reverse pyramid of Technology Innovation
[Figure, built up over seven slides: a reverse pyramid whose layers are, from top to bottom, Business; Technology; Computational Methods & Algorithms; Maths & Pure Sciences. The pyramid itself is labeled “Known” and the areas on either side “Unknown”. Annotations added step by step: “An innovation just happened here!”, “Innovation Opportunity Area”, “Innovation propagation time”, “Innovation ‘angle’ - magnitude”, “Another Innovation!”]
“The vast amount of products is based on just a handful of theorems”
“A DS can act as innovation facilitator”
What you can be paid for
Get hired in industries that have already adopted DS and ML
Hiring, per Industry
KDnuggets survey 2019: 1,001 data scientists’ LinkedIn profiles
Skills/Industry Comparison
DS Importance, per Industry
Technology/software
source: O’Reilly Media - spring 2017 - 875 respondents
Financial Services
source: O’Reilly Media - spring 2017 - 875 respondents
What you love
A reason to wake up happily in the morning
“Love at first result”
You will probably love DS when:
• You produce your first insight on actual business or real-life data
• You see a learning process reducing its error on test data
• Your result dazzles a business stakeholder and…
…actually leads them to take a decision…
…and a successful one, as later measured by some KPI.
The Hype Tree
Trunk = Solid understanding of the underlying technology
[Figure: the trunk is illustrated with mathematical symbols such as sin, an identity matrix and e^x]
Thank You!
• Follow us on LinkedIn!
• Contact Details and Talk Notes in:
j.mp/fromsciencetodata