Exploring Open Data with BigQuery: Jenny Tong

Exploring Open Data with BigQuery

Jenny Tong
Developer Advocate
Google Cloud Platform
@MimmingCodes

Agenda
● Origin story
● Count stuff
● How it works
● Some cool open data
● Do something useful

Managed Cloud Versions
Bigtable → Bigtable
Flume → Dataflow
Dremel → BigQuery

Google BigQuery

SELECT count(word)
FROM publicdata:samples.shakespeare
Words in Shakespeare

SELECT sum(requests) as total
FROM [fh-bigquery:wikipedia.pagecounts_20150511_05]
Wikipedia hits over 1 hour

FROM [fh-bigquery:wikipedia.pagecounts_201505]
Wikipedia hits over 1 month

Several years of Wikipedia data
FROM
[fh-bigquery:wikipedia.pagecounts_201105],
...

SELECT
SUM(requests) AS total
FROM
TABLE_QUERY(
[fh-bigquery:wikipedia],
'REGEXP_MATCH(
table_id,
r"pagecounts_2015[0-9]{2}$")')
Several years of Wikipedia data
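
To see which tables a TABLE_QUERY filter like the one above would pick up, the same regular expression can be tried locally. This is a minimal sketch, not BigQuery itself: the list of table IDs is a hypothetical sample built from names that appear in these slides, and it assumes REGEXP_MATCH does a partial (search-style) match.

```python
import re

# Same pattern as the TABLE_QUERY filter on the slide.
TABLE_PATTERN = re.compile(r"pagecounts_2015[0-9]{2}$")


def matching_tables(table_ids):
    """Return the table IDs the filter would keep (assumes search-style matching)."""
    return [t for t in table_ids if TABLE_PATTERN.search(t)]


# Hypothetical sample of table IDs in the dataset.
candidate_ids = [
    "pagecounts_201105",
    "pagecounts_201501",
    "pagecounts_201505",
    "pagecounts_20150511_05",
]

selected = matching_tables(candidate_ids)
print(selected)
```

With this pattern only the monthly 2015 tables survive; the trailing `$` keeps the hourly table out.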

How about a RegExp
SELECT
SUM(requests) AS total
FROM
TABLE_QUERY(
[fh-bigquery:wikipedia],
'REGEXP_MATCH(
table_id,
r"pagecounts_2015[0-9]{2}$")')
WHERE
(REGEXP_MATCH(title, '.*[dD]inosaur.*'))

Qualities of a good RDBMS
● Inserts & locking
● Indexing
● Cache
● Query planning

Storing data
[Diagram: Table, Columns, Disks]
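
The idea of storing a table by column rather than by row can be sketched in a few lines. This is only an illustration under assumptions: the rows and column names below are hypothetical, and real columnar storage also compresses each column and spreads it across many disks.

```python
from collections import defaultdict


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


# Hypothetical rows shaped like a page-count table.
rows = [
    {"title": "Dinosaur", "language": "en", "requests": 120},
    {"title": "Jenny", "language": "en", "requests": 7},
    {"title": "Dinosaurier", "language": "de", "requests": 30},
]

columns = to_columns(rows)

# A query like SELECT sum(requests) only has to read one column.
total = sum(columns["requests"])
print(total)
```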

Reading data: Life of a BigQuery
SELECT sum(requests) as sum
FROM (
SELECT requests, title
FROM [fh-bigquery:wikipedia.pagecounts_201501]
WHERE
(REGEXP_MATCH(title, '[Jj]en.+'))
)

Life of a BigQuery
[Diagram, built up over several slides: a tree with the Query entering at the Root Mixer, then Mixers, then Leaf nodes, then Storage. Labels on the later slides: "5.4 Bil" and "WHERE", then "5.8 Mil" and "SELECT sum(requests)".]
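
The tree of leaves and mixers can be imitated with a toy example: each leaf scans its own shard, applies the WHERE filter, and returns a partial sum, and each mixer adds up what its children return. This is a rough sketch, not how BigQuery is implemented; the shards, titles, and request counts are made up, and the function names are hypothetical.

```python
import re

# Same filter as the example query: REGEXP_MATCH(title, '[Jj]en.+')
TITLE_PATTERN = re.compile(r"[Jj]en.+")


def leaf(shard):
    """Scan one shard, apply the WHERE filter, return a partial sum of requests."""
    return sum(requests for title, requests in shard if TITLE_PATTERN.search(title))


def mixer(partial_sums):
    """Combine the partial results from child nodes."""
    return sum(partial_sums)


# Hypothetical data: four shards of (title, requests) rows, one per leaf.
shards = [
    [("Jenny", 5), ("Dinosaur", 100)],
    [("Jenga", 3), ("Jen", 50)],  # "Jen" fails the filter: nothing follows "en"
    [("Cat", 9)],
    [("jenkins", 4)],
]

leaf_results = [leaf(shard) for shard in shards]

# Two mixers each cover two leaves; the root mixer combines the mixers.
mixer_results = [mixer(leaf_results[:2]), mixer(leaf_results[2:])]
root_total = mixer(mixer_results)
print(root_total)
```

Because a sum can be computed from partial sums, each level only passes a single number upward instead of the filtered rows.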

Finding Open Data
opendata.stackexchange.com

Finding Open Data
reddit.com/r/dataisbeautiful

Weather in Half Moon Bay
SELECT DATE(year+mo+da) day, min, max
FROM [fh-bigquery:weather_gsod.gsod2013]
WHERE stn IN (
SELECT usaf FROM [fh-bigquery:weather_gsod.stations]
WHERE name = 'HALF MOON BAY AIRPOR')
AND max < 200
ORDER BY day;

Global high temperatures
SELECT year, max(max) as max
FROM
TABLE_QUERY(
[fh-bigquery:weather_gsod],
'table_id CONTAINS "gsod"')
where max < 200
group by year order by year asc

Stories per month - Massachusetts
SELECT DATE(STRING(MonthYear) + '01') month,
SUM(ActionGeo_ADM1Code='USMA') US
FROM [gdelt-bq:full.events]
WHERE MonthYear > 0
GROUP BY 1 ORDER BY 1

SELECT DATE(STRING(MonthYear) + '01') month,
SUM(ActionGeo_ADM1Code='USMA') / COUNT(*) newsyness
FROM [gdelt-bq:full.events]
WHERE MonthYear > 0
GROUP BY 1 ORDER BY 1
Stories per month, normalized
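
The difference between the raw count and the normalized version can be checked on a handful of rows. This is a small sketch with invented events; only the region code 'USMA' comes from the queries above, and the function name is hypothetical.

```python
from collections import Counter


def stories_per_month(events, region):
    """For each month return (stories in region, share of all stories that month)."""
    totals = Counter()
    hits = Counter()
    for month, code in events:
        totals[month] += 1
        if code == region:
            hits[month] += 1
    return {m: (hits[m], hits[m] / totals[m]) for m in sorted(totals)}


# Hypothetical (MonthYear, ActionGeo_ADM1Code) pairs.
events = [
    (201301, "USMA"), (201301, "USCA"), (201301, "USNY"), (201301, "USTX"),
    (201304, "USMA"), (201304, "USMA"), (201304, "USMA"), (201304, "USCA"),
]

result = stories_per_month(events, "USMA")
print(result)
```

Dividing by the monthly total separates "more stories about this place" from "more stories overall".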

https://developers.google.com/genomics/
Genomics

Genomics
SELECT Sample, SUM(single), SUM(double),
FROM (
SELECT call.call_set_name AS Sample,
SOME(call.genotype > 0) AND NOT EVERY(call.genotype > 0) WITHIN call AS single,
EVERY(call.genotype > 0) WITHIN call AS double,
FROM [genomics-public-data:1000_genomes.variants]
OMIT RECORD IF reference_name IN ("X","Y","MT"))
GROUP BY Sample ORDER BY Sample

Something useful:
Use Wikipedia data to pick a movie

1. Wikipedia edits
2. ???
3. Movie recommendation

select title, id, count(id) as edits
from [publicdata:samples.wikipedia]
where
title contains 'Hackers'
and title contains '(film)'
and wp_namespace = 0
group by title, id
order by edits
limit 10
Pick a great movie

select title, id, count(id) as edits
from [publicdata:samples.wikipedia]
where contributor_id in (
select contributor_id
from [publicdata:samples.wikipedia]
where
id=264176
and contributor_id is not null
and is_bot is null
and title CONTAINS '(film)'
group by contributor_id)
and id != 264176
group each by title, id
order by edits desc
limit 100
Find edits in common

Discover the most broadly popular films
select id from (
select id, count(id) as edits
from [publicdata:samples.wikipedia]
where
wp_namespace = 0
group each by id
order by edits desc
limit 20)

Edits in common, minus broadly popular
select title, id, count(id) as edits
from [publicdata:samples.wikipedia]
where contributor_id in (
select contributor_id
from [publicdata:samples.wikipedia]
where
id=264176
and contributor_id is not null
and is_bot is null
group by contributor_id)
and id != 264176
and id not in (
select id from (
select id, count(id) as edits
from [publicdata:samples.wikipedia]
where
wp_namespace = 0
group each by id
order by edits desc
limit 20
)
)
group each by title, id
order by edits desc
limit 100
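
The logic of that last query (people who edited this page also edited these pages, minus the pages everyone edits) can be walked through on a tiny edit log. This is a simplified sketch: the edit log is invented apart from the page id 264176, the function and parameter names are hypothetical, and the bot and namespace filters are left out.

```python
from collections import Counter


def recommend(edits, seed_id, popular_cutoff=20, limit=100):
    """Rank pages edited by the seed page's contributors, skipping the most-edited pages."""
    # Contributors who edited the seed page (anonymous edits have no contributor id).
    fans = {c for c, page in edits if page == seed_id and c is not None}

    # The most-edited pages overall: too broadly popular to be a useful signal.
    overall = Counter(page for _, page in edits)
    popular = {page for page, _ in overall.most_common(popular_cutoff)}

    # Count the fans' edits on every other page.
    common = Counter(
        page
        for c, page in edits
        if c in fans and page != seed_id and page not in popular
    )
    return common.most_common(limit)


# Hypothetical (contributor_id, page_id) edit log. Page 1 is edited by everyone.
edits = [
    (10, 264176), (11, 264176),
    (10, 1), (11, 1), (12, 1), (13, 1), (14, 1),
    (10, 2), (11, 2), (12, 2), (None, 2),
    (10, 3),
]

picks = recommend(edits, seed_id=264176, popular_cutoff=1)
print(picks)
```

Page 1 would otherwise top the list simply because everyone edits it, which is what the `id not in (...)` subquery guards against.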

What we talked about
● Origin story
● Count stuff
● How it works
● Some cool open data
● Practical applications

● Try BigQuery
○ bigquery.cloud.google.com
● Queries we ran
○ github.com/mimming/snippets
● Me
○ @MimmingCodes
○ google.com/+mimming
The end
