Docker, Monitoring and SLURM Specific Visualisations
QNIBTerminal @ work
Agenda
• Docker in a Nutshell
• QNIBx (Terminal, Monitoring, Inventory)
• SLURM Autogenerated Dashboards
About Me
• Christian Kniep
@CQnib, christian@qnib.org
• >10y Iteration
SysAdmin, SysOps, SysEngineer, R&D Engineer
DevOps @Locafox (hyper-scale web-service)
• Founder of QNIB Solutions
Holistic System Management
Containerization of SysOps and Workload
Consultancy / Software Design & Development
Docker in a Nutshell
Multiple Guests
[Diagram: two servers side by side]
Traditional Virtualisation: SERVER, HOST KERNEL, Userland, HYPERVISOR (Type II), on top of it three guests, each with its own KERNEL, Userland and SERVICE
Containerisation (Docker): SERVER, HOST KERNEL, Userland, on top of it Userland (#1) and Userland (#2), each running a SERVICE on the single host kernel
Docker Internal View
• Containers are ‘grouped processes’
isolated by Kernel Namespaces (PID, network, mount, …)
resource restrictions applicable through CGroups
[Diagram: HOST running container1 (bash, ls -l), container2 (apache), container3 (mysqld), container4 (slurmd, ssh)]
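The ‘grouped processes’ view can be checked from inside any Linux process, because the kernel exposes namespace and cgroup membership under /proc. The following is a minimal sketch, assuming a Linux /proc filesystem; the helper names `parse_cgroup_line` and `list_namespaces` are made up for illustration and are not part of Docker.

```python
import os


def parse_cgroup_line(line):
    """Split one line of /proc/<pid>/cgroup into (hierarchy_id, controllers, path).

    The kernel writes each line as 'hierarchy-ID:controller-list:cgroup-path'.
    """
    hierarchy_id, controllers, path = line.strip().split(":", 2)
    return int(hierarchy_id), [c for c in controllers.split(",") if c], path


def list_namespaces(pid="self"):
    """Return {namespace: identifier} for a process, or {} where /proc is unavailable."""
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):
        return {}
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            # Each entry is a symlink such as 'pid:[4026531836]'.
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Entry not readable for this process; skip it.
            continue
    return result
```

Two processes that report the same identifiers from `list_namespaces` share those namespaces; processes in different containers would typically differ in at least the PID, network and mount entries.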
Docker Workshop
• 1/2 Day, July 16th @ISC High Performance
Deep dive into the talking points
How Docker might impact System Operations & HPC Applications
Further discussion beyond what I am talking about today
Docker Workshop #2
• Full Day, September 28th @ISC Cloud&BigData
QNIBTerminal
• Framework of system containers to spin up stacks
SLURM
QNIBMonitoring
• Current monitoring systems do not connect
overlaying metrics with log events
use/build inventory system to provide connections usually hidden
user's perspective and scope/context/background
• QNIBMonitoring provides
open metrics system (system / application metrics, log aggregates)
log event framework to consume, process and visualise events
auto-discovery / configuration through Consul
• Logstash (Log/Event Monitoring)
• Grafana (Performance Monitoring)
• Overlay Metrics w/ Events
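To make the idea of overlaying metrics with events concrete, here is a minimal sketch that attaches nearby metric samples to each log event. It is an illustration under stated assumptions, not how QNIBMonitoring is implemented: the function name `overlay_events`, the 30-second window and the sample data are all made up.

```python
from bisect import bisect_left, bisect_right


def overlay_events(metric_points, events, window=30):
    """Attach to each event the metric samples within +/- window seconds.

    metric_points: list of (timestamp, value), sorted by timestamp.
    events: list of (timestamp, message).
    """
    timestamps = [ts for ts, _ in metric_points]
    overlay = []
    for ev_ts, message in events:
        lo = bisect_left(timestamps, ev_ts - window)
        hi = bisect_right(timestamps, ev_ts + window)
        overlay.append({"time": ev_ts, "message": message,
                        "samples": metric_points[lo:hi]})
    return overlay


# Hypothetical data: a load metric sampled every 60 s and one log event.
cpu_load = [(0, 0.2), (60, 0.3), (120, 3.9), (180, 4.1)]
log_events = [(110, "job 42 started on compute0")]
annotated = overlay_events(cpu_load, log_events, window=30)
```

With this data the event at t=110 is paired with the sample taken at t=120, which is the kind of connection a dashboard annotation makes visible.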
QNIBInventory
• Network Topology
• Installed Software
• SLURM Cluster
• Enrich Log/Events
• Help visualise connections
• Build up history
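Enriching log events from an inventory can be sketched as a simple lookup and merge. This is a minimal sketch under assumptions: a plain dictionary stands in for the inventory backend, and the field names (`partition`, `switch`), the `inv_` prefix and the function `enrich_event` are hypothetical.

```python
# Hypothetical inventory facts per host.
INVENTORY = {
    "compute0": {"partition": "batch", "switch": "sw1"},
    "compute1": {"partition": "batch", "switch": "sw2"},
}


def enrich_event(event, inventory):
    """Return a copy of the event with inventory facts about its host merged in."""
    enriched = dict(event)
    facts = inventory.get(event.get("host"), {})
    for key, value in facts.items():
        # Prefix keys so inventory facts never overwrite fields of the original event.
        enriched[f"inv_{key}"] = value
    return enriched


event = {"host": "compute0", "message": "link down"}
enriched = enrich_event(event, INVENTORY)
```

Events from hosts the inventory does not know pass through unchanged, so enrichment never drops data.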
Cluster Use-Case
Context Sensitive Dashboards
• Multiple backgrounds have to be considered
Enduser (Engineer, Software Developer, Scientist)
Operations Personnel
Management
• Psychology plays an important role
Local rationality / context
10,000 ft overview vs. verifying hypotheses vs. reporting
Empower users to extend their domain knowledge by providing a toolset
Cluster Usecase
• Small SLURM cluster
couple of nodes, two user groups, couple of users
script & MPI workload
[Diagram: container stack grouped by function, labels as shown on the slide]
srv backend: consul
Compute: slurmctld, compute0 (slurmd), compute<N> (slurmd)
Performance: carbon, graphite-api, grafana
Log/Events: elasticsearch, logger (logstash), kibana, kopf (es-kopf)
Inventory: neo4j, inventory (QNIBInv)
galaxy: postgres, galaxy
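In a stack like this, a compute node could hand metrics to the carbon service using Carbon's plaintext protocol, which takes one `<path> <value> <timestamp>` line per sample. The sketch below is an assumption about how such a feed might look, not the setup used in the talk: the host name `carbon`, the metric path and both function names are hypothetical, and 2003 is Carbon's default plaintext port.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one sample for Carbon's plaintext protocol."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="carbon", port=2003, timestamp=None):
    """Send a single sample over TCP; host and port are assumptions about the deployment."""
    payload = carbon_line(path, value, timestamp).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

A call such as `send_metric("cluster.compute0.load", 0.42)` would then show up as a series that graphite-api can serve to Grafana.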
Management Context
• Live cluster status
Utilisation per cluster / user / user-group
SLA met by SysOps
Most common jobs, misbehaving endusers
• Reports
per day / user / job-type / …
• Capacity Planning
utilisation over time, comparison of HW generations, global FS capacity
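Utilisation per user or user-group boils down to summing core-time over job records. The following is a minimal sketch with made-up job records; the field names (`user`, `group`, `cores`, `runtime_s`) and both functions are assumptions for illustration, not the schema of SLURM accounting.

```python
from collections import defaultdict


def core_hours_by(jobs, key):
    """Sum core-hours (cores * runtime in hours) of jobs grouped by the given field."""
    totals = defaultdict(float)
    for job in jobs:
        totals[job[key]] += job["cores"] * job["runtime_s"] / 3600.0
    return dict(totals)


def utilisation(jobs, total_cores, period_s):
    """Fraction of the available core-time that was consumed by the jobs."""
    used = sum(job["cores"] * job["runtime_s"] for job in jobs)
    return used / (total_cores * period_s)


# Hypothetical job records for a 64-core cluster.
jobs = [
    {"user": "alice", "group": "cfd", "cores": 16, "runtime_s": 7200},
    {"user": "bob", "group": "cfd", "cores": 8, "runtime_s": 3600},
    {"user": "carol", "group": "bio", "cores": 32, "runtime_s": 1800},
]
per_group = core_hours_by(jobs, "group")
```

Grouping by `"user"` instead of `"group"` gives the per-user report, and `utilisation(jobs, 64, 7200)` gives the cluster-wide figure for a two-hour period.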
SLURM Dashboard
SLURM Inventory
• Nodes are connected to Partitions
• Jobs are connected to both
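The relationships named here form a small graph: nodes belong to partitions, and jobs link to a partition and to the nodes they run on. A minimal in-memory sketch follows; the class `SlurmInventory` and its methods are hypothetical and stand in for whatever graph store the inventory actually uses.

```python
from collections import defaultdict


class SlurmInventory:
    """Tiny in-memory graph: nodes belong to partitions, jobs link to both."""

    def __init__(self):
        self.partition_nodes = defaultdict(set)   # partition -> nodes
        self.job_partition = {}                   # job id -> partition
        self.job_nodes = defaultdict(set)         # job id -> nodes

    def add_node(self, node, partition):
        self.partition_nodes[partition].add(node)

    def add_job(self, job_id, partition, nodes):
        # A job may only use nodes that are part of its partition.
        unknown = set(nodes) - self.partition_nodes[partition]
        if unknown:
            raise ValueError(f"nodes not in partition {partition}: {sorted(unknown)}")
        self.job_partition[job_id] = partition
        self.job_nodes[job_id].update(nodes)

    def jobs_on(self, node):
        """All jobs connected to a node, e.g. to tag that node's metrics."""
        return sorted(j for j, nodes in self.job_nodes.items() if node in nodes)
```

Given such a structure, a dashboard generator could ask `jobs_on("compute1")` to decide which job panels to show next to that node's metrics.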
SLURM Dashboard
Enduser Context
• Live progress of SLURM job
Monitor iteration speed to estimate workload behaviour
Get to know the job while it’s running (instead of post mortem)
Introduce application profiling / log events (enhance feedback)
• Post Mortem
Get detailed report after job has finished
• MDO jobs
depending on outcome and progression submit next iteration(s)
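Monitoring iteration speed can be as simple as timestamping each finished iteration and extrapolating. This is a minimal sketch, assuming the application emits one event per completed iteration; the function `estimate_remaining` and its sliding-window choice are illustrative assumptions.

```python
def estimate_remaining(iteration_times, total_iterations, window=5):
    """Estimate seconds left from the timestamps at which iterations completed.

    Uses the mean duration of the last `window` iterations, so a job that
    slows down or speeds up is reflected quickly.
    """
    if len(iteration_times) < 2:
        return None  # not enough data to measure a speed yet
    recent = iteration_times[-(window + 1):]
    seconds_per_iteration = (recent[-1] - recent[0]) / (len(recent) - 1)
    remaining = total_iterations - len(iteration_times)
    return max(remaining, 0) * seconds_per_iteration
```

For example, four iterations completed at 10 s, 20 s, 30 s and 40 s out of ten planned yield an estimate of 60 s for the remaining six.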
SLURM Dashboard
SysOps Context
• Live cluster status
USE method overviews (Utilisation/Saturation/Errors)
Anomaly detection (w/ and w/o humans)
Spotting abnormal behaviour
• Drill into monitoring
verify hypotheses about incidents/problems
correlate events, metrics and inventory
• Guide through ‘known problems’
closed feedback loops provide confidence
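Anomaly detection without humans can start with a robust outlier test across nodes that should behave alike. The sketch below uses the modified z-score based on the median absolute deviation; this technique, the threshold of 3.5 and the function name `find_outliers` are choices made for illustration, not something the slides prescribe.

```python
from statistics import median


def find_outliers(samples, threshold=3.5):
    """Return node names whose value deviates strongly from the cluster median.

    Uses the modified z-score based on the median absolute deviation (MAD),
    which a single misbehaving node cannot skew as easily as mean/stddev.
    """
    values = list(samples.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the nodes report the same value; the score is undefined.
        return []
    return sorted(node for node, v in samples.items()
                  if 0.6745 * abs(v - med) / mad > threshold)


# Hypothetical load values for five compute nodes.
load = {"compute0": 1.0, "compute1": 1.2, "compute2": 0.9,
        "compute3": 1.1, "compute4": 9.5}
suspects = find_outliers(load)
```

Here only `compute4` is flagged, which gives an operator a starting point for drilling into that node's events and metrics.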
Central Logging
Galaxy
Galaxy Use-Cases
[Diagram labels: SLURM, Log Events, Metrics, Inventory, WORKFLOW]
• Model Assess Workflow in Galaxy
Easy to grasp (in contrast to Hadoop, Spark, …)
Event triggered, Cronjob?
Using idle compute resources
Thank you!
• Contact
christian@qnib.org
@CQnib, @_qnib
• Web
www.qnib.org (blog)
doc.qnib.org (Paper)
• Feel free…
…ask questions (now / later)
…ask for a Demo
