Exploring Open Data with BigQuery
Jenny Tong
Developer Advocate
Google Cloud Platform
@MimmingCodes
Agenda
● Origin story
● Count stuff
● How it works
● Some cool open data
● Do something useful
Google Research Publications → Managed Cloud Versions
● Bigtable → Bigtable
● Flume → Dataflow
● Dremel → BigQuery
Google BigQuery
Let's count some stuff
SELECT count(word)
FROM [publicdata:samples.shakespeare]
Words in Shakespeare
SELECT sum(requests) as total
FROM [fh-bigquery:wikipedia.pagecounts_20150511_05]
Wikipedia hits over 1 hour
SELECT sum(requests) as total
FROM [fh-bigquery:wikipedia.pagecounts_201505]
Wikipedia hits over 1 month
Several years of Wikipedia data
SELECT sum(requests) as total
FROM
[fh-bigquery:wikipedia.pagecounts_201105],
[fh-bigquery:wikipedia.pagecounts_201106],
[fh-bigquery:wikipedia.pagecounts_201107],
...
SELECT
SUM(requests) AS total
FROM
TABLE_QUERY(
[fh-bigquery:wikipedia],
'REGEXP_MATCH(
table_id,
r"pagecounts_2015[0-9]{2}$")')
Several years of Wikipedia data
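To see which tables a TABLE_QUERY pattern would pick up, it can help to try the regular expression on a few table names first. The sketch below does that in plain Python. The table ids are made up for illustration, select_tables is a hypothetical helper, and it assumes REGEXP_MATCH succeeds when the pattern matches anywhere in the string. Note that the pattern from the slide only selects the monthly 2015 tables; the second pattern is one assumed way to widen it to 2011 through 2015.

```python
import re

# Hypothetical table ids for illustration; the real dataset listing is not in the slides.
table_ids = [
    "pagecounts_201105",
    "pagecounts_201412",
    "pagecounts_201501",
    "pagecounts_201505",
    "pagecounts_20150511_05",  # hourly table, should not match a monthly pattern
]


def select_tables(pattern, ids):
    """Return the ids where the pattern matches anywhere in the string."""
    compiled = re.compile(pattern)
    return [table_id for table_id in ids if compiled.search(table_id)]


# Pattern copied from the slide: monthly tables for 2015 only.
matched_2015 = select_tables(r"pagecounts_2015[0-9]{2}$", table_ids)

# Assumed variant (not from the slides): monthly tables for 2011 through 2015.
matched_multi_year = select_tables(r"pagecounts_201[1-5][0-9]{2}$", table_ids)

print(matched_2015)
print(matched_multi_year)
```

The trailing `$` is what keeps the hourly table out of both results.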
How about a RegExp
SELECT
SUM(requests) AS total
FROM
TABLE_QUERY(
[fh-bigquery:wikipedia],
'REGEXP_MATCH(
table_id,
r"pagecounts_2015[0-9]{2}$")')
WHERE
(REGEXP_MATCH(title, '.*[dD]inosaur.*'))
How did it do that?
o_O
Qualities of a good RDBMS
● Inserts & locking
● Indexing
● Cache
● Query planning
Storing data
[Diagram: a Table is split into Columns, which are spread across Disks]
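A rough way to picture column-oriented storage: instead of keeping each row together, keep each column together, so a query that touches one column never has to read the others. The sketch below is a toy in-memory version. The rows, the to_columns helper, and the sum_column helper are all hypothetical and only illustrate the idea, not how BigQuery lays data out on disk.

```python
# Hypothetical rows shaped like the Wikipedia pagecounts data used in the slides.
rows = [
    {"title": "Jenny", "language": "en", "requests": 3},
    {"title": "Dinosaur", "language": "en", "requests": 10},
    {"title": "Jenga", "language": "de", "requests": 4},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column name."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_column(columns, name):
    """Aggregate a single column without touching any other column."""
    return sum(columns[name])


columns = to_columns(rows)
total_requests = sum_column(columns, "requests")
print(total_requests)
```

A query like SELECT sum(requests) only needs the "requests" list here; the "title" and "language" lists are never read.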
Reading data: Life of a BigQuery
SELECT sum(requests) as sum
FROM (
SELECT requests, title
FROM [fh-bigquery:wikipedia.pagecounts_201501]
WHERE
(REGEXP_MATCH(title, '[Jj]en.+'))
)
Life of a BigQuery
[Diagram: the Query enters at the Root Mixer and fans out through Mixers (M) to Leaves (L), which read from Storage; results flow back up the tree]
● Storage: SELECT requests, title (5.4 Bil)
● Leaf: WHERE (REGEXP_MATCH(title, '[Jj]en.+')) (5.8 Mil)
● Mixer: SELECT sum(requests)
● Root Mixer: SELECT sum(requests)
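The tree in the diagram can be approximated in a few lines: each leaf scans its own shard, applies the filter, and returns a partial sum; mixers add partial sums together until a single number reaches the root. This is only a sketch under those assumptions. The shard contents, the function names leaf, mixer, and run_query, and the fan-in of 2 are all hypothetical, and the real execution engine is far more involved.

```python
import re

# Same pattern as the slide's WHERE clause, assumed to match anywhere in the title.
PATTERN = re.compile(r"[Jj]en.+")


def leaf(shard):
    """Leaf: scan (title, requests) pairs, keep matching titles, return a partial sum."""
    return sum(requests for title, requests in shard if PATTERN.search(title))


def mixer(partial_sums):
    """Mixer: combine the partial sums from the level below."""
    return sum(partial_sums)


def run_query(shards, fan_in=2):
    """Run every leaf, then merge results level by level until one value remains."""
    partials = [leaf(shard) for shard in shards]
    while len(partials) > 1:
        partials = [
            mixer(partials[i:i + fan_in])
            for i in range(0, len(partials), fan_in)
        ]
    return partials[0] if partials else 0


# Hypothetical shards standing in for slices of the pagecounts table.
shards = [
    [("Jenny", 5), ("Dinosaur", 7)],
    [("jenga", 2), ("Jen", 9)],  # "Jen" does not match: the pattern needs a character after "en"
    [("Main_Page", 100)],
    [("Jennifer_Lopez", 4), ("Citizen_Kane", 1)],
]

total = run_query(shards)
print(total)
```

Because a sum can be split into partial sums, each level only passes a single number upward, which is why the leaves can filter billions of rows while the mixers handle very little data.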
Open Data
Finding Open Data
opendata.stackexchange.com
Finding Open Data
reddit.com/r/dataisbeautiful
Time to explore
GSOD
Weather in Half Moon Bay
SELECT DATE(year+mo+da) day, min, max
FROM [fh-bigquery:weather_gsod.gsod2013]
WHERE stn IN (
SELECT usaf FROM [fh-bigquery:weather_gsod.stations]
WHERE name = 'HALF MOON BAY AIRPOR')
AND max < 200
ORDER BY day;
Global high temperatures
SELECT year, max(max) as max
FROM
TABLE_QUERY(
[fh-bigquery:weather_gsod],
'table_id CONTAINS "gsod"')
where max < 200
group by year order by year asc
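The max < 200 condition in these weather queries is presumably there to drop placeholder readings. The sketch below assumes that a missing maximum temperature is encoded as 9999.9, which is an assumption about the dataset rather than something stated in the slides. The readings and the yearly_max helper are hypothetical.

```python
# Assumption: a missing maximum temperature is stored as this sentinel value.
MISSING = 9999.9

# Hypothetical (year, max temperature) readings.
readings = [
    (2012, 98.6),
    (2012, MISSING),
    (2013, 101.3),
    (2013, 120.2),
    (2013, MISSING),
]


def yearly_max(readings, cutoff=200):
    """Highest reading per year, ignoring values at or above the cutoff."""
    best = {}
    for year, temperature in readings:
        if temperature < cutoff:
            best[year] = max(temperature, best.get(year, temperature))
    return dict(sorted(best.items()))


highs = yearly_max(readings)
print(highs)
```

Without the cutoff, every year with at least one missing reading would report the sentinel as its record high.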
GDELT
Stories per month - Massachusetts
SELECT DATE(STRING(MonthYear) + '01') month,
SUM(ActionGeo_ADM1Code='USMA') US
FROM [gdelt-bq:full.events]
WHERE MonthYear > 0
GROUP BY 1 ORDER BY 1
SELECT DATE(STRING(MonthYear) + '01') month,
SUM(ActionGeo_ADM1Code='USMA') / COUNT(*) newsyness
FROM [gdelt-bq:full.events]
WHERE MonthYear > 0
GROUP BY 1 ORDER BY 1
Stories per month, normalized
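The normalized query divides the number of matching events by the number of all events in each month, so growth in overall coverage does not look like growth in Massachusetts stories. A minimal sketch of that arithmetic follows. The event tuples and the newsyness helper are hypothetical, and the counting relies on a boolean adding as 0 or 1, which mirrors SUM over a condition in the query.

```python
from collections import Counter

# Hypothetical (MonthYear, ActionGeo_ADM1Code) pairs.
events = [
    (201501, "USMA"),
    (201501, "USCA"),
    (201501, "USMA"),
    (201501, "UK"),
    (201502, "USMA"),
    (201502, "USNY"),
    (201502, "USCA"),
    (201502, "USTX"),
]


def newsyness(events, code="USMA"):
    """Share of each month's events whose location code equals `code`."""
    totals = Counter()
    hits = Counter()
    for month, adm1_code in events:
        totals[month] += 1
        hits[month] += adm1_code == code  # True counts as 1, False as 0
    return {month: hits[month] / totals[month] for month in sorted(totals)}


shares = newsyness(events)
print(shares)
```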
https://developers.google.com/genomics/
Genomics
Genomics
SELECT Sample, SUM(single), SUM(double),
FROM (
SELECT call.call_set_name AS Sample,
SOME(call.genotype > 0) AND NOT EVERY(call.genotype > 0) WITHIN call AS single,
EVERY(call.genotype > 0) WITHIN call AS double,
FROM [genomics-public-data:1000_genomes.variants]
OMIT RECORD IF reference_name IN ("X","Y","MT"))
GROUP BY Sample ORDER BY Sample
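The SOME and EVERY aggregates behave much like Python's any and all applied to the genotype values inside one call. The sketch below classifies each call as "single" (some but not all alleles differ from the reference) or "double" (all alleles differ), then tallies per sample. The variant records, sample names, and helper names are hypothetical, and it assumes every call has at least one genotype value.

```python
from collections import defaultdict

SKIPPED = {"X", "Y", "MT"}  # mirrors OMIT RECORD IF reference_name IN (...)

# Hypothetical variant records: a reference name plus (sample, genotype) calls.
variants = [
    {"reference_name": "1", "calls": [("sampleA", [0, 1]), ("sampleB", [1, 1])]},
    {"reference_name": "X", "calls": [("sampleA", [1, 1])]},
    {"reference_name": "2", "calls": [("sampleA", [1, 1]), ("sampleB", [0, 0])]},
]


def classify(genotype):
    """Return (single, double) flags for one call's genotype values."""
    some = any(value > 0 for value in genotype)
    every = all(value > 0 for value in genotype)
    return (some and not every, every)


def tally(variants):
    """Count single and double calls per sample, skipping omitted references."""
    counts = defaultdict(lambda: [0, 0])
    for variant in variants:
        if variant["reference_name"] in SKIPPED:
            continue
        for sample, genotype in variant["calls"]:
            single, double = classify(genotype)
            counts[sample][0] += single
            counts[sample][1] += double
    return {sample: tuple(counts[sample]) for sample in sorted(counts)}


per_sample = tally(variants)
print(per_sample)
```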
Something useful:
Use Wikipedia data to pick a movie
1. Wikipedia edits
2. ???
3. Movie recommendation
Follow the edits
[Diagram label: Same editor]
select title, id, count(id) as edits
from [publicdata:samples.wikipedia]
where
title contains 'Hackers'
and title contains '(film)'
and wp_namespace = 0
group by title, id
order by edits
limit 10
Pick a great movie
select title, id, count(id) as edits
from [publicdata:samples.wikipedia]
where contributor_id in (
select contributor_id
from [publicdata:samples.wikipedia]
where
id=264176
and contributor_id is not null
and is_bot is null
and wp_namespace = 0
and title CONTAINS '(film)'
group by contributor_id)
and wp_namespace = 0
and id != 264176
and title CONTAINS '(film)'
group each by title, id
order by edits desc
limit 100
Find edits in common
Discover the most broadly popular films
select id from (
select id, count(id) as edits
from [publicdata:samples.wikipedia]
where
wp_namespace = 0
and title CONTAINS '(film)'
group each by id
order by edits desc
limit 20)
Edits in common, minus broadly popular
select title, id, count(id) as edits
from [publicdata:samples.wikipedia]
where contributor_id in (
select contributor_id
from [publicdata:samples.wikipedia]
where
id=264176
and contributor_id is not null
and is_bot is null
and wp_namespace = 0
and title CONTAINS '(film)'
group by contributor_id)
and wp_namespace = 0
and id != 264176
and title CONTAINS '(film)'
and id not in (
select id from (
select id, count(id) as edits
from [publicdata:samples.wikipedia]
where
wp_namespace = 0
and title CONTAINS '(film)'
group each by id
order by edits desc
limit 20
)
)
group each by title, id
order by edits desc
limit 100
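The same three steps (find the seed film's editors, count what else they edited, drop the films everybody edits) can be sketched on a toy list of edits. Everything here is hypothetical: the contributor ids, the film titles other than the Hackers example from the slides, and the recommend helper. The slides use the top 20 most edited films as the "broadly popular" set; the toy data uses the top 1 so the effect is visible.

```python
from collections import Counter

SEED = "Hackers (film)"

# Hypothetical (contributor_id, title) edits.
edits = [
    (1, SEED), (2, SEED),
    (1, "Sneakers (film)"), (2, "Sneakers (film)"),
    (1, "WarGames (film)"),
    (1, "Titanic (film)"), (2, "Titanic (film)"),
    (3, "Titanic (film)"), (4, "Titanic (film)"),
]


def recommend(edits, seed, top_popular=1, limit=100):
    """Rank films edited by the seed film's editors, minus broadly popular films."""
    # Step 1: contributors who edited the seed film.
    fans = {contributor for contributor, title in edits if title == seed}
    # Step 2: the most edited films overall, to be excluded.
    overall = Counter(title for _, title in edits)
    popular = {title for title, _ in overall.most_common(top_popular)}
    # Step 3: count the fans' edits on every other, not broadly popular, film.
    shared = Counter(
        title
        for contributor, title in edits
        if contributor in fans and title != seed and title not in popular
    )
    return shared.most_common(limit)


picks = recommend(edits, SEED)
print(picks)
```

With top_popular=0 the exclusion step is skipped, which corresponds to the earlier "Find edits in common" query.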
What we talked about
● Origin story
● Count stuff
● How it works
● Some cool open data
● Practical applications
● Try BigQuery
○ bigquery.cloud.google.com
● Queries we ran
○ github.com/mimming/snippets
● Me
○ @MimmingCodes
○ google.com/+mimming
The end