graph analysis beyond linear algebra
E. Jason Riedy
DMML, 24 October 2015
HPC Lab, School of Computational Science and Engineering
Georgia Institute of Technology
motivation and applications
(insert prefix here)-scale data analysis
Cyber-security: Identify anomalies, malicious actors
Health care: Finding outbreaks, population epidemiology
Social networks: Advertising, searching, grouping
Intelligence: Decisions at scale, regulating algorithms
Systems biology: Understanding interactions, drug design
Power grid: Disruptions, conservation
Simulation: Discrete events, cracking meshes
• Graphs are a motif / theme in data analysis.
• Changing and dynamic graphs are important!
outline
1. Motivation and background
2. Linear algebra leads to a better graph algorithm:
incremental PageRank
3. Sparse linear algebra techniques lead to a scoop:
community detection
4. And something else: connected components
why graphs?
Another tool, like dense and sparse linear algebra.
• Combine things with pairwise
relationships
• Smaller, more generic than raw data.
• Taught (roughly) to all CS students...
• Semantic attributions can capture
essential relationships.
• Traversals can be faster than filtering
DB joins.
• Provide clear phrasing for queries
about relationships.
potential applications
• Social Networks
  • Identify communities, influences, bridges, trends, anomalies (trends before they happen)...
  • Potential to help social sciences, city planning, and others with large-scale data.
• Cybersecurity
  • Determine if new connections can access a device or represent new threat in < 5ms...
  • Is the transfer by a virus / persistent threat?
• Bioinformatics, health
  • Construct gene sequences, analyze protein interactions, map brain interactions
• Credit fraud forensics ⇒ detection ⇒ monitoring
  • Integrate all the customer’s data, identify in real-time
streaming graph data
Network data rates:
• Gigabit ethernet: 81k – 1.5M packets per second
• Over 130 000 flows per second on 10 GigE (< 7.7 µs per flow)
Person-level data rates:
• 500M posts per day on Twitter (6k / sec) [1]
• 3M posts per minute on Facebook (50k / sec) [2]
We need to analyze only the changes and not the entire graph.
Throughput & latency trade off and expose different levels of concurrency.
[1] www.internetlivestats.com/twitter-statistics/
[2] www.jeffbullas.com/2015/04/17/21-awesome-facebook-facts-and-statistics-you-need-to-check-out/
incremental pagerank
pagerank
Everyone’s “favorite” metric: PageRank.
• Stationary distribution of the random surfer model.
• Eigenvalue problem can be re-phrased as a linear system
  $(I - \alpha A^T D^{-1})\, x = k v$,
  with
  α: teleportation constant, much < 1
  A: adjacency matrix
  D: diagonal matrix of out degrees, with x/0 = x (self-loop)
  v: personalization vector, here 1/|V|
  k: irrelevant scaling constant
• Amenable to analysis, etc.
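A minimal sketch of the linear-system view, assuming a small dense NumPy adjacency matrix, α = 0.85 (a commonly used value, not taken from the slides), and k = 1 − α so that x sums to one. The name `pagerank_linear` is hypothetical, and giving zero out-degree vertices a self-loop is one reading of the x/0 = x convention.

```python
import numpy as np

def pagerank_linear(A, alpha=0.85, tol=1e-12, max_iter=1000):
    """Fixed-point iteration for (I - alpha * A^T D^{-1}) x = k v."""
    n = A.shape[0]
    A = A.astype(float).copy()
    # Assumption: a vertex with out-degree 0 is given a self-loop.
    for i in np.flatnonzero(A.sum(axis=1) == 0):
        A[i, i] = 1.0
    deg = A.sum(axis=1)              # out-degrees, the diagonal of D
    v = np.full(n, 1.0 / n)          # uniform personalization vector
    k = 1.0 - alpha                  # assumed scaling, makes sum(x) == 1
    M = alpha * (A.T / deg)          # alpha * A^T D^{-1}: column j scaled by 1/deg[j]
    x = v.copy()
    for _ in range(max_iter):
        x_new = M @ x + k * v
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

# Usage: a directed 3-cycle, where symmetry makes every vertex equally ranked.
cycle = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
x_cycle = pagerank_linear(cycle)
```

Because α < 1, the iteration is a contraction in the 1-norm, which is why a plain fixed-point loop is enough for this illustration.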
incremental pagerank
• Streaming data setting, update PageRank without
touching the entire graph.
• Existing methods maintain databases of walks, etc.
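• Let $A_\Delta = A + \Delta A$, $D_\Delta = D + \Delta D$ for the new graph, want to solve for $x + \Delta x$.
• Simple algebra:
  $(I - \alpha A_\Delta^T D_\Delta^{-1})\, \Delta x = \alpha (A_\Delta^T D_\Delta^{-1} - A^T D^{-1})\, x$,
  and the right-hand side is sparse.
• Re-arrange for Jacobi,
  $\Delta x^{(k+1)} = \alpha A_\Delta^T D_\Delta^{-1} \Delta x^{(k)} + \alpha (A_\Delta^T D_\Delta^{-1} - A^T D^{-1})\, x$,
  iterate, ...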
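The "simple algebra" step can be spelled out as follows. This is a sketch that writes the transposes explicitly and assumes x solves the original system exactly, so the bracketed term vanishes.

```latex
\begin{align*}
(I - \alpha A_\Delta^T D_\Delta^{-1})(x + \Delta x) &= k v \\
(I - \alpha A_\Delta^T D_\Delta^{-1})\,\Delta x
  &= k v - x + \alpha A_\Delta^T D_\Delta^{-1} x \\
  &= \underbrace{\bigl(k v - x + \alpha A^T D^{-1} x\bigr)}_{=\,0}
     + \alpha \bigl(A_\Delta^T D_\Delta^{-1} - A^T D^{-1}\bigr) x .
\end{align*}
```

The last term is nonzero only in rows reachable in one step from vertices whose edges or degrees changed, which is why the right-hand side is sparse.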
incremental pagerank: whoops
[Figure: val against iteration k (0 to 10000), y axis between 1e−12 and 1e−10; panel labels 1000, 100, 10; graphs caidaRouterLevel, coPapersCiteseer, coPapersDBLP, great_britain.osm, PGPgiantcompo, power.]
• And fail. The updated solution wanders away from
the true solution. Top rankings stay the same...
incremental pagerank: think instead
• The old solution x is an ok, not exact, solution to the
original problem, now a nearby problem.
• How close? Residual:
  $r' = k v - x + \alpha A_\Delta^T D_\Delta^{-1} x = r + \alpha (A_\Delta^T D_\Delta^{-1} - A^T D^{-1})\, x$.
• Solve $(I - \alpha A_\Delta^T D_\Delta^{-1})\, \Delta x = r'$.
• Cheat by not refining all of $r'$, only region growing around the changes.
• (Also cheat by updating r rather than recomputing at the changes.)
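The residual view can be sketched as follows, assuming dense NumPy matrices, α = 0.85, and kv = (1 − α)/|V|. The helper names `transition`, `jacobi`, and `update_pagerank` are hypothetical. This sketch refines all of r′ instead of growing a region around the changes, so it illustrates the algebra and not the cheats.

```python
import numpy as np

def transition(A):
    """Return A^T D^{-1}; a vertex with no out-edges gets a self-loop (assumption)."""
    A = A.astype(float).copy()
    for i in np.flatnonzero(A.sum(axis=1) == 0):
        A[i, i] = 1.0
    return A.T / A.sum(axis=1)

def jacobi(P, alpha, b, tol=1e-14, max_iter=10000):
    """Iterate y <- alpha P y + b, converging to the solution of (I - alpha P) y = b."""
    y = np.zeros_like(b)
    for _ in range(max_iter):
        y_new = alpha * (P @ y) + b
        if np.abs(y_new - y).sum() < tol:
            return y_new
        y = y_new
    return y

def update_pagerank(A_new, x_old, alpha=0.85):
    """Correct an old solution x_old so that it solves the problem on A_new."""
    n = A_new.shape[0]
    P_new = transition(A_new)
    kv = np.full(n, (1.0 - alpha) / n)
    r_new = kv - x_old + alpha * (P_new @ x_old)  # residual r' of x_old on the new problem
    dx = jacobi(P_new, alpha, r_new)              # solve (I - alpha P_new) dx = r'
    return x_old + dx

# Usage: start from a directed 4-cycle, then add the edge 0 -> 2.
alpha = 0.85
A_old = np.array([[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]])
A_new = A_old.copy()
A_new[0, 2] = 1
kv = np.full(4, (1.0 - alpha) / 4)
x_old = jacobi(transition(A_old), alpha, kv)
x_updated = update_pagerank(A_new, x_old, alpha)
x_restart = jacobi(transition(A_new), alpha, kv)  # recompute from scratch for comparison
```

Because the correction is driven by the residual of the old solution on the new problem, any inexactness in x_old is folded into r′ and corrected rather than carried forward.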
incremental pagerank: works
[Figure: relative 1-norm-wise backward error v. restarted PageRank against number of edges added (0 to 20000), y axis between 1e−04 and 1e−02; panels belgium.osm, caidaRouterLevel, coPapersDBLP, luxembourg.osm; panel labels 100, 1000, 10000; algorithms dpr, dprheld, pr, pr_restart.]
• Thinking about the numerical linear algebra issues
can lead to better graph algorithms.
community detection
graphs: big, nasty hairballs
Yifan Hu’s (AT&T) visualization of the in-2004 data set
http://www2.research.att.com/~yifanhu/gallery.html
but no shortage of structure...
Protein interactions, Giot et al., “A Protein
Interaction Map of Drosophila melanogaster”,
Science 302, 1722-1736, 2003.
Jason’s network via LinkedIn Labs
• Locally, there are clusters or communities.
• Until 2011, no parallel method for community
detection.
• But, gee, the problem looks familiar...
community detection
• Partition a graph’s vertices into disjoint communities.
• A community locally optimizes some metric, NP-hard.
• Trying to capture that vertices are more similar within one community than between communities.
[Figure: Jason’s network via LinkedIn Labs]
common community metric: modularity
• Modularity: Deviation of connectivity in the
community induced by a vertex set S from some
expected background model of connectivity.
• Newman’s uniform model, modularity of a cluster is
  fraction of edges in the community − fraction expected from uniformly sampling graphs with the same degree sequence:
  $Q_S = (m_S - x_S^2 / 4m) / m$
• Modularity: sum of cluster contributions
• “Sufficiently large” modularity ⇒ some structure
• Known issues: Resolution limit, NP, etc.
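A small sketch of the per-community formula, under assumptions the slide does not state: m is the total number of edges, m_S counts edges with both endpoints in S, and x_S is the sum of the degrees of the vertices in S, for an undirected, unweighted graph without self-loops. The function name `modularity` is hypothetical.

```python
from collections import defaultdict

def modularity(edges, community):
    """Per-community modularity Q_S = (m_S - x_S^2 / (4 m)) / m.

    edges: list of undirected (u, v) pairs; community: dict vertex -> label.
    """
    m = 0
    m_S = defaultdict(int)   # edges inside each community
    x_S = defaultdict(int)   # total degree (volume) of each community
    for u, v in edges:
        m += 1
        x_S[community[u]] += 1
        x_S[community[v]] += 1
        if community[u] == community[v]:
            m_S[community[u]] += 1
    return {S: (m_S[S] - x_S[S] ** 2 / (4 * m)) / m for S in x_S}

# Usage: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
lumped = {v: "all" for v in range(6)}
Q_split = modularity(edges, split)
Q_lumped = modularity(edges, lumped)
```

Summing the per-community values gives the modularity of the whole partition. Placing everything in one community scores zero, while splitting at the bridge scores positive.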
sequential agglomerative method
[Figure: example graph with vertices A through G]
• A common method (e.g. Clauset, Newman, & Moore, 2004) agglomerates vertices into communities.
• Each vertex begins in its own community.
• An edge is chosen to contract.
• Merging maximally increases modularity.
• Priority queue.
• Known often to fall into an $O(n^2)$ performance trap with modularity (Wakita & Tsurumi ’07).
parallel agglomerative method
[Figure: example graph with vertices A through G]
• Use a matching to avoid the queue.
• Compute a heavy weight matching.
• Simple greedy, maximal algorithm.
• Within factor of 2 from heaviest.
• Merge all communities at once.
• Maintains some balance.
• Produces different results.
• Agnostic to weighting, matching
• Up until 2011, no one tried this...
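One agglomeration step can be sketched as below. This is a sequential stand-in, not the parallel matching used in the talk: it sorts edges by weight and greedily takes any edge whose endpoints are both unmatched, which is a standard way to get a maximal matching within a factor of 2 of the heaviest. The names `greedy_matching` and `contract` are hypothetical, and the edge weights stand in for whatever merge score is being optimized.

```python
def greedy_matching(weighted_edges):
    """weighted_edges: list of (weight, u, v). Returns matched pairs, heaviest first."""
    matched = set()
    matching = []
    for w, u, v in sorted(weighted_edges, key=lambda e: e[0], reverse=True):
        if u != v and u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching

def contract(weighted_edges, matching):
    """Merge every matched pair at once and sum the weights of parallel edges."""
    rep = {v: u for u, v in matching}        # v is absorbed into u
    merged = {}
    for w, u, v in weighted_edges:
        a, b = rep.get(u, u), rep.get(v, v)
        if a == b:
            continue                          # edge is now internal to a community
        key = (min(a, b), max(a, b))
        merged[key] = merged.get(key, 0) + w
    return [(w, a, b) for (a, b), w in merged.items()]

# Usage: a weighted path A - B - C - D - E.
path = [(3, "A", "B"), (5, "B", "C"), (4, "C", "D"), (1, "D", "E")]
pairs = greedy_matching(path)
coarse = contract(path, pairs)
```

On this path the greedy choice has total weight 6 while the heaviest matching (A-B and C-D) has weight 7, which stays inside the factor-of-2 bound. Repeating match-then-contract until no edge has a positive score gives the agglomerative loop without a priority queue.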
parallel agglomerative community detection
Graph             |V|          |E|            Reference
soc-LiveJournal1  4 847 571    68 993 773     “SNAP”
uk-2007-05        105 896 555  3 301 876 564  Ubicrawler

Peak processing rates in edges/second:
Platform  Mem     soc-LiveJournal1  uk-2007-05
E7-8870   256GiB  6.90 × 10^6       6.54 × 10^6
XMT2      2TiB    1.73 × 10^6       3.11 × 10^6

Clustering: Sufficiently good. Won the 10th DIMACS Implementation Challenge’s mix category in 2012.
Later: Fagginger Auer & Bisseling (2012), add star detection. LaSalle and Karypis (2014), add m-l “refinement.”
what about streaming?
Data and plots from Pushkar Godbolé.
Preliminary experiments...
Simple re-agglomeration: Fast,
decreasing modularity.
“Backtracking” appears to work,
but carries more data (see also
Görke, et al. at KIT).
Clusterings are very sensitive.
connected components
sensitivity and components
• Ok, clusterings are optimizing over a bumpy surface, of course they’re sensitive... (Streaming exacerbates.)
• Pick a clean problem: connected components
• Where could errors occur?
  • Streaming: Dropped, forgotten information
  • Computing: Stop for energy or time, thresholds
  • Real-life: Surveys not returned
• How do you even measure errors?
  • Pairwise co-membership counts
  • Empirical distributions from vertex membership
  • No one measure...
  • All need the true solution...
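As one concrete reading of "pairwise co-membership counts", the sketch below compares a true component labeling with an approximate one by classifying every vertex pair. The function name `comembership_counts` and the four-way breakdown are assumptions for illustration, and the loop is quadratic in the number of vertices, so it only suits small examples.

```python
from itertools import combinations

def comembership_counts(truth, approx):
    """truth, approx: dict vertex -> component label over the same vertex set.

    Returns (both, only_truth, only_approx, neither) counts over vertex pairs.
    """
    both = only_truth = only_approx = neither = 0
    for u, v in combinations(sorted(truth), 2):
        together_true = truth[u] == truth[v]
        together_approx = approx[u] == approx[v]
        if together_true and together_approx:
            both += 1
        elif together_true:
            only_truth += 1      # pair wrongly split, e.g. by a dropped edge
        elif together_approx:
            only_approx += 1     # pair wrongly joined
        else:
            neither += 1
    return both, only_truth, only_approx, neither

# Usage: the true components are {0, 1, 2} and {3}; the approximation lost
# the edge that attached vertex 2, so it sees {0, 1}, {2}, {3}.
truth = {0: "p", 1: "p", 2: "p", 3: "q"}
approx = {0: "p", 1: "p", 2: "r", 3: "q"}
counts = comembership_counts(truth, approx)
```

The two disagreement counts are the error signal here, and computing them requires the true labeling, which is the limitation the last bullet points at.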
sensitivity of connected components
[Figure: error (− to +) against fraction of graph remaining, kinda (− to +).]
From Zakrzewska & Bader, “Measuring the Sensitivity of Graph Metrics to Missing Data,” PPAM 2013
questions / future
Can graph analysis learn from linear algebra and
numerical analysis?
• Are there relevant concepts of backward error?
• Don’t need the true solution to evaluate (or estimate)
some distance.
• Should graph analysis look more in the statistical
direction?
• Moderate graphs hit converging limits.
• Are there other easy analogies / low-hanging fruit?
• Environments for playing with large graphs?
• Sane threading and atomic operations (not data)
Feel free to join in...
acknowledgements
hpc lab people
Faculty:
• David A. Bader
• Oded Green (was
student)
Data:
• Pushkar Godbolé
• Anita Zakrzewska
STINGER:
• Robert McColl,
• James Fairbanks,
• Adam McLaughlin,
• Daniel Henderson,
• David Ediger (now
GTRI),
• Jason Poovey (GTRI),
• Karl Jiang, and
• feedback from users in
industry, government,
academia
stinger: where do you get it?
Home: www.cc.gatech.edu/stinger/
Code: git.cc.gatech.edu/git/u/eriedy3/stinger.git/
Gateway to
• code,
• development,
• documentation,
• presentations...
Remember: Academic code, but maturing
with contributions.
Users / contributors / questioners:
Georgia Tech, PNNL, CMU, Berkeley, Intel,
Cray, NVIDIA, IBM, Federal Government,
Ionic Security, Citi, ...