Docker, Monitoring and SLURM Specific Visualisations

QNIBTerminal @ work

• Docker in a Nutshell
• QNIBx
Terminal
Monitoring
Inventory
• SLURM Autogenerated Dashboards
2
Agenda

3-5
About Me
• Christian Kniep
@CQnib, christian@qnib.org
• >10y Iteration
SysAdmin, SysOps, SysEngineer, R&D Engineer
DevOps @Locafox (hyper-scale web-service)
• Founder of QNIB Solutions
Holistic System Management
Containerization of SysOps and Workload
Consultancy / Software Design & Development

7-16
Multiple Guests
[Diagram, built up step by step: two servers side by side.
Traditional Virtualisation: SERVER, HOST KERNEL, host Userland, HYPERVISOR (Type II), three guests, each with its own KERNEL, Userland and SERVICE.
Containerisation: SERVER, HOST KERNEL, host Userland, Docker, Userland (#1) and Userland (#2), each running a SERVICE without a guest kernel.]

17
Docker Internal View
• Containers are ‘grouped processes’
isolated by Kernel Namespaces (PID, network, mount, …)
resource restrictions applicable through CGroups
[Diagram: HOST running container1 (bash, ls -l), container2 (apache), container3 (mysqld), container4 (slurmd, ssh)]
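
To make the 'grouped processes' point tangible, here is a minimal sketch that lists the namespaces and cgroups a process belongs to. It assumes a Linux host with /proc mounted; the function names are made up for illustration. Two processes in the same container would show identical namespace identifiers.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace_name: identifier} for a process.

    Reads the symlinks in /proc/<pid>/ns (Linux only). Returns an empty
    dict if the directory is missing or not readable.
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        entries = sorted(os.listdir(ns_dir))
    except OSError:
        return result
    for name in entries:
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry vanished or is not readable
    return result


def list_cgroups(pid="self"):
    """Return the raw lines of /proc/<pid>/cgroup, or [] if unavailable."""
    try:
        with open(f"/proc/{pid}/cgroup") as fh:
            return [line.strip() for line in fh if line.strip()]
    except OSError:
        return []


for ns_name, ns_id in list_namespaces().items():
    print(ns_name, ns_id)
```

Comparing the output for a process on the host with one inside a container shows which namespaces differ.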

• 1/2 Day, July 16th @ISC High Performance
Deep dive into the talking points
How Docker might impact System Operations & HPC Applications
Further discussion beyond what I am talking about today
18
Docker Workshop

• Full Day, September 28th @ISC Cloud&BigData
19
Docker Workshop #2

• Framework of system containers to spin up stacks
SLURM
21
QNIBTerminal

• Current monitoring systems do not connect
overlaying metrics with log events
using/building an inventory system to expose connections that are usually hidden
the user's perspective and scope/context/background
• QNIBMonitoring provides
open metrics system (system / application metrics, log aggregates)
log event framework for consuming, processing and visualising events
auto-discovery / configuration through Consul
27
QNIBMonitoring
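
As an illustration of auto-discovery through Consul, the sketch below builds a service definition of the kind accepted by the Consul agent's service registration endpoint (PUT /v1/agent/service/register). The helper name, the tag and the choice of a TCP check are assumptions for this example, not taken from QNIBMonitoring; 6818 is the default slurmd port.

```python
import json


def consul_service_definition(name, port, tags=(), check_interval="10s"):
    """Build a Consul service definition with a simple TCP health check.

    Hypothetical helper: the field names follow the Consul agent API,
    the values are illustrative only.
    """
    return {
        "ID": f"{name}-{port}",
        "Name": name,
        "Tags": list(tags),
        "Port": port,
        "Check": {
            "TCP": f"localhost:{port}",
            "Interval": check_interval,
        },
    }


# Example: a compute container announcing its slurmd service.
payload = json.dumps(consul_service_definition("slurmd", 6818, tags=["compute"]))
print(payload)
```

Once each container registers itself this way, other services can look up peers by name instead of relying on static configuration.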

28
QNIBMonitoring
• Logstash (Log/Event Monitoring)

29
QNIBMonitoring
• Grafana (Performance Monitoring)

30
QNIBMonitoring
• Overlay Metrics w/ Events

32
QNIBInventory
• Network Topology

33
QNIBInventory
• Installed Software

34
QNIBInventory
• SLURM Cluster

• Enrich Log/Events
35
QNIBInventory

• Help visualise connections
• Build up history
37
QNIBInventory

• Multiple backgrounds have to be considered
Enduser (Engineer, Software Developer, Scientist)
Operations Personnel
Management
• Psychology plays an important role
Local rationality / context
10,000 ft overview vs. verifying hypotheses vs. reporting
Empower users to extend their domain knowledge by providing a toolset
39
Context Sensitive Dashboards

• Small SLURM cluster
couple of nodes, two user groups, couple of users
script & MPI workload
40
Cluster Usecase

41-46
Cluster Usecase
[Diagram, built up step by step; container name followed by the service it runs]
srv backend consul
Compute
slurmctld: slurmctld
compute0 … compute<N>: slurmd
Performance
carbon: carbon
graphite-api: graphite-api
grafana: grafana
Log/Events
elasticsearch: elasticsearch
logger: logstash
kibana: kibana
kopf: es-kopf
Inventory
neo4j: neo4j
inventory: QNIBInv
galaxy
postgres: postgres
galaxy: galaxy
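
For the performance part of the stack, metrics reach carbon as plain text lines of the form "path value timestamp". A minimal sketch, assuming carbon's plaintext listener on its default port 2003; the metric path and host name are made up for illustration:

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric in carbon's plaintext protocol."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{path} {value} {ts}\n"


def send_lines(lines, host, port=2003):
    """Send already formatted lines to a carbon plaintext listener."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


line = graphite_line("cluster.compute0.load", 0.5, timestamp=1436000000)
print(line, end="")
# send_lines([line], host="carbon")  # hypothetical host name, needs a running carbon
```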

• Live cluster status
Utilisation per cluster / user / user-group
SLA met by SysOps
Most common jobs, misbehaving enduser
• Reports
per day / user / job-type / …
• Capacity Planning
utilisation over time, comparison of HW generations, global FS capacity
47-49
Management Context
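
A management report like 'utilisation per user / user-group' boils down to aggregating job records. The sketch below assumes job records are available as dictionaries with hypothetical keys (user, group, cpus, elapsed_s); in practice they would come from the scheduler's accounting data.

```python
from collections import defaultdict


def core_hours_by(jobs, key):
    """Sum core-hours per value of `key` (e.g. 'user' or 'group')."""
    totals = defaultdict(float)
    for job in jobs:
        totals[job[key]] += job["cpus"] * job["elapsed_s"] / 3600.0
    return dict(totals)


def utilisation(jobs, total_cores, window_s):
    """Fraction of available core-seconds consumed within the window."""
    used = sum(job["cpus"] * job["elapsed_s"] for job in jobs)
    return used / (total_cores * window_s)


# Illustrative records, not real data.
jobs = [
    {"user": "alice", "group": "cfd", "cpus": 4, "elapsed_s": 3600},
    {"user": "bob", "group": "cfd", "cpus": 2, "elapsed_s": 1800},
    {"user": "alice", "group": "cfd", "cpus": 1, "elapsed_s": 7200},
]
print(core_hours_by(jobs, "user"))
print(utilisation(jobs, total_cores=8, window_s=3600))
```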

• Nodes are connected to Partitions
• Jobs are connected to both
52-53
SLURM Inventory
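
The relationships above (nodes belong to partitions, jobs are linked to both) can be sketched as a tiny in-memory graph. This is an illustration only; the class and method names are hypothetical and not part of QNIBInventory.

```python
from collections import defaultdict


class SlurmInventory:
    """Minimal graph: partitions -> nodes, jobs -> partition and nodes."""

    def __init__(self):
        self.partition_nodes = defaultdict(set)
        self.job_nodes = defaultdict(set)
        self.job_partition = {}

    def add_node(self, node, partition):
        self.partition_nodes[partition].add(node)

    def add_job(self, job_id, partition, nodes):
        self.job_partition[job_id] = partition
        self.job_nodes[job_id].update(nodes)

    def partitions_of(self, node):
        return sorted(p for p, ns in self.partition_nodes.items() if node in ns)

    def jobs_on(self, node):
        return sorted(j for j, ns in self.job_nodes.items() if node in ns)


inv = SlurmInventory()
inv.add_node("compute0", "batch")
inv.add_node("compute1", "batch")
inv.add_job(42, "batch", ["compute0", "compute1"])
inv.add_job(43, "batch", ["compute1"])
print(inv.jobs_on("compute1"))
```

A query such as "which jobs are affected if compute1 fails" then becomes a single lookup.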

• Live progress of SLURM job
Monitor iteration speed to estimate workload behaviour
Get to know job while it’s running (instead of postmortem)
Introduce application profiling / log events (enhance feedback)
55
Enduser Context
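
Monitoring iteration speed could look like the following sketch. It assumes the job reports (timestamp, iteration count) samples, for example through an application metric; the function names and the sample values are made up.

```python
def iteration_rate(samples):
    """Average iterations per second between the first and last sample.

    `samples` is a list of (timestamp_s, iteration_count), oldest first.
    """
    (t0, i0), (t1, i1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    return (i1 - i0) / (t1 - t0)


def estimate_remaining_s(samples, total_iterations):
    """Estimate remaining runtime in seconds, or None if the job is stalled."""
    rate = iteration_rate(samples)
    if rate <= 0:
        return None
    done = samples[-1][1]
    return max(total_iterations - done, 0) / rate


samples = [(0, 0), (10, 20), (20, 40)]
print(iteration_rate(samples))
print(estimate_remaining_s(samples, total_iterations=100))
```

A sudden drop in the rate while the job is still running is the kind of signal that would otherwise only surface in the post-mortem.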

• Post Mortem
Get detailed report after job has finished
• MDO jobs
depending on outcome and progression submit next iteration(s)
57
Enduser Context

USE method overviews (Utilisation/Saturation/Errors)
Anomaly detection (w/ and w/o humans)
Spotting abnormal behaviour
60
SysOps Context
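
One simple way to spot abnormal behaviour without a human in the loop is to flag nodes whose metric deviates strongly from the rest of the cluster. The sketch below uses a plain z-score; this is a stand-in technique chosen for illustration, not necessarily what the talk's stack uses, and the threshold and values are assumptions.

```python
import statistics


def find_outliers(values_by_node, threshold=2.0):
    """Return nodes whose value is more than `threshold` standard deviations
    away from the mean across all nodes."""
    values = list(values_by_node.values())
    if len(values) < 2:
        return []
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all nodes behave identically
    return sorted(
        node
        for node, value in values_by_node.items()
        if abs(value - mean) / stdev > threshold
    )


# Illustrative load values: nine quiet nodes and one busy one.
loads = {f"compute{i}": 1.0 for i in range(9)}
loads["compute9"] = 9.0
print(find_outliers(loads))
```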

• Drill into monitoring
verify hypothesis about incidents/problems
correlate events, metrics and inventory
• Guide through ‘known problems’
closed feedback loops provide confidence
62
SysOps Context
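
Correlating events with metrics can be sketched as selecting the metric samples around an event's timestamp. The data layout (sorted lists of (timestamp, value) tuples) and the function name are assumptions for this example.

```python
import bisect


def metrics_around(event_ts, samples, window_s=60):
    """Return the samples within +/- window_s seconds of an event.

    `samples` must be a list of (timestamp, value) sorted by timestamp.
    """
    timestamps = [ts for ts, _ in samples]
    lo = bisect.bisect_left(timestamps, event_ts - window_s)
    hi = bisect.bisect_right(timestamps, event_ts + window_s)
    return samples[lo:hi]


# Illustrative load samples and a log event at t=60.
samples = [(0, 1.0), (30, 1.2), (60, 5.0), (90, 5.1), (120, 1.1)]
print(metrics_around(60, samples, window_s=30))
```

Adding the node's inventory entry to the result would complete the events / metrics / inventory triple.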

66
Galaxy Use-Cases
[Diagram labels: SLURM, Log/Events, Metrics, Inventory, WORKFLOW]
• Model Assess Workflow in Galaxy
Easy to grasp (in contrast to Hadoop, Spark, …)
Event triggered, Cronjob?
Using idle compute resources

67
Thank you!
• Contact
christian@qnib.org
@CQnib, @_qnib
• Web
www.qnib.org (blog)
doc.qnib.org (Paper)
• Feel free…
…ask questions (now / later)
…ask for a Demo
