JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course

PostgreSQL – Tomasz Borek
Database for next project?
Why, PostgreSQL of course!
@LAFK_pl
Consultant @

About me
@LAFK_pl
Consultant @
Tomasz Borek

What will I tell you?
● Colourful history of PostgreSQL
– So, DB wars
● Chosen features
● Architecture and internals
● Query path and optimization (no hinting)
● Multithreading (very briefly, too little time)

My Faves
● Error reporting / logging
● PL/xSQL – feel free to use Perl, Python, Ruby, Java,
LISP...
● XML and JSON handling
● Foreign Data Wrappers (FDW)
● Windowing functions
● Common table expressions (CTE) and recursive queries
● Power of Indexes

Will DB eat your cake?
● Thanks @anandology

Will DB eat your cake?
● Thanks @anandology
Consider password VARCHAR(8)

Logging, ‘gotchas’
● Default is to stderr only
●
Set on CLI or in config, not through sets
● Where is it?
●
How to log queries… or turning log_collector on

Where is it?
● Default
– data/pg_log
● Launchers can set it (Mac Homebrew/plist)
● Version and config dependent

Logging, turn it on
● Default is to stderr only
● In PG:
logging_collector = on
log_filename = strftime-patterned filename
[log_destination = [stderr|syslog|csvlog] ]
log_statement = [none|ddl|mod|all] // all
log_min_error_statement = ERROR
log_line_prefix = '%t %c %u ' # time sessionid user

PL/pgSQL
● Stored procedure dilemma
– Where to keep your logic?
– How your logic is NOT in your SCM

PL/pgSQL
● Over dozen of options:
– Perl, Python, Ruby,
– pgSQL, Java,
– TCL, LISP…

PL/pgSQL
● Over dozen of options:
– Perl, Python, Ruby,
– pgSQL, Java,
– TCL, LISP…
● DevOps, SysAdmins, DBAs… ETLs etc.

Perl function example
CREATE FUNCTION perl_max (integer, integer) RETURNS integer AS $$
my ($x, $y) = @_;
if (not defined $x) {
return undef if not defined $y;
return $y;
}
return $x if not defined $y;
return $x if $x > $y;
return $y;
$$ LANGUAGE plperl;

XML or JSON support
● Parsing and retrieving XML (functions)
● Valid JSON checks (type)
● Careful with encoding!
– PG allows only one server encoding per database
– Specify it to UTF-8 or weep
● Document database instead of OO or rel
– JSON, JSONB, HSTORE – noSQL fun welcome!

HSTORE?
CREATE TABLE example (
id serial PRIMARY KEY,
data hstore);

HSTORE?
data hstore);
INSERT INTO example (data) VALUES
('name => "John Smith", age => 28, gender => "M"'),
('name => "Jane Smith", age => 24');

HSTORE?
data hstore);
INSERT INTO example (data)
VALUES
('name => "John Smith", age => 28,
gender => "M"'),
('name => "Jane Smith", age => 24');
SELECT id,
data->'name'
FROM example;
SELECT id, data->'age'
FROM example
WHERE data->'age' >=
'25';

XML and JSON datatype
CREATE TABLE test (
...,
xml_file xml,
json_file json,
...
);

XML functions example
XMLROOT (
XMLELEMENT (
NAME gazonk,
XMLATTRIBUTES (
’val’ AS name,
1 + 1 AS num
),
XMLELEMENT (
NAME qux,
’foo’
)
),
VERSION ’1.0’,
STANDALONE YES
)
<?xml version=’1.0’
standalone=’yes’ ?>
<gazonk name=’val’
num=’2’>
<qux>foo</qux>
</gazonk>
xml '<foo>bar</foo>'
'<foo>bar</foo>'::xml

Foreign Data Wrappers (FDW)
● Stop ETL, start FDW
● Read AND write
● FS, Mongo, Hadoop, Redis…
● You can write your own!

Windowing functions
● Replacement for procedures (somewhat)
● In a nutshell:
– Take row,
– find related rows,
– compute things over related rows,
– return result along with the row
● Ranking, averaging, growth per time...
http://www.craigkerstiens.com/2014/02/26/Tracking-MoM-growth-in-SQL/
https://www.postgresql.org/docs/9.1/static/tutorial-window.html

CTEs and recursive queries
● Common table expressions (CTE) and
recursive queries

Index power
● Geo and spherical indexes
● Partial indexes (email like @company.com)
● Function indexes
● JSON(B) has index support
● You may create your own index

Check out processes
●
pgrep -l postgres
●
htop > filter: postgres
● Whatever you like / use usually
●
Careful with kill -9 on connections
– kill -15 better

Query path and optimization (no hinting)

Query Path
http://www.slideshare.net/SFScon/sfscon15-peter-moser-the-path-of-a-query-postgresql-internals

Parser
● Syntax checks, like FRIM is not a keyword
– SELECT * FRIM myTable;
● Catalog lookup
– MyTable may not exist
● In the end query tree is built
– Query tokenization: SELECT (keyword)
employeeName (field id) count (function call)...

Planner
● Where Planner Tree is built
● Where best execution is decided upon
– Seq or index scan? Index or bitmap index?
– Which join order?
– Which join strategy (nested, hashed, merge)?
– Inner or outer?
– Aggregation: plain, hashed, sorted…
● Heuristic, if finding all plans too costly

Example to explain EXPLAIN
EXPLAIN SELECT * FROM tenk1;
QUERY PLAN
------------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..458.00
rows=10000 width=244)

Explaining EXPLAIN - what
QUERY PLAN
------------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..458.00 rows=10000
width=244)
● Startup cost – time before output phase begins
● Total cost – in page fetches, may change, assumed to
run node to completion
●
Rows – estimated number to scan (but LIMIT etc.)
● Estimated average width of output from that node (in
bytes)

Explaining EXPLAIN - how
QUERY PLAN
------------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1'; //358|10k
●
No WHERE, no index
● Cost = disk pages read * seq page cost + rows scanned
* cpu tuple cost
● 358 * 1.0 + 10000 * 0.01 = 458 // default values

Analyzing EXPLAIN ANALYZE
EXPLAIN ANALYZE SELECT *
FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=4.65..118.62 rows=10 width=488) (actual time=0.128..0.377 rows=10 loops=1)
-> Bitmap Heap Scan on tenk1 t1 (cost=4.36..39.47 rows=10 width=244) (actual time=0.057..0.121 rows=10 loops=1)
Recheck Cond: (unique1 < 10)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36 rows=10 width=0) (actual time=0.024..0.024 rows=10 loops=1)
Index Cond: (unique1 < 10)
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.91 rows=1 width=244) (actual time=0.021..0.022 rows=1 loops=10)
Index Cond: (unique2 = t1.unique2)
Planning time: 0.181 ms
Execution time: 0.501 ms
● Actually runs the query
● More info: actual times, rows removed by filter,
sort method used, disk/memory used...

Analyzing EXPLAIN ANALYZE
EXPLAIN ANALYZE SELECT *
FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=4.65..118.62 rows=10 width=488) (actual time=0.128..0.377 rows=10 loops=1)
-> Bitmap Heap Scan on tenk1 t1 (cost=4.36..39.47 rows=10 width=244) (actual time=0.057..0.121 rows=10
loops=1)
Recheck Cond: (unique1 < 10)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36 rows=10 width=0) (actual time=0.024..0.024
rows=10 loops=1)
Index Cond: (unique1 < 10)
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.91 rows=1 width=244) (actual time=0.021..0.022
rows=1 loops=10)
Index Cond: (unique2 = t1.unique2)
Planning time: 0.181 ms
Execution time: 0.501 ms

Battle-tested
● Could mature since 1987
● Comes in many flavours (forks)
● Largest cluster – PBs in Yahoo
● Skype, NASA, Instagram
● Stable:
– Many years on one version
– Good version support
– Every year something new
– Follows ANSI SQL standards
https://www.postgresql.org/about/users/

Great features
● Java, Perl, Python for stored procedures
● Add CTEs and FDWs => great ETL or µservice
● Handles XMLs and JSONs
● Error reporting / logging
● MVCC built-in
● Windowing functions
● ...

Solid internals
● Well-thought out processes
● Built-in security (dozen of solutions)
● WAL, stats collector, vacuum
● Good rule engine and clear query optimization
– No hinting will bother some people
● Plethora of data types

Disadvantages
● More like Python then Perl/PHP
● Some learning curve
● Some say:
– replication(‘s performance)
● I can’t think of more, doesn’t mean none are
present :-)

JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course

More Related Content

What's hot

Viewers also liked

Similar to JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course

Recently uploaded

JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course