NOT ALL DATA
NEEDS A STACK
Issues
- can be difficult to test ideas, especially locally
- tools are costly or complex to set up
- expectation to be production ready
Solution
Solution Playground
Using FuelWatch Data to find a good Servo
DuckDB
- Created ~2019
- Academic Origins
- Centrum Wiskunde & Informatica, Amsterdam NL
- Python was created here!
DuckDB is an in-process analytical database
- no daemon
- cross platform, cross arch
- no dependencies
- multi modal
- Command Line tool
- Library (e.g. Py, R, Java, Node)
- ODBC/JDBC Driver
What it do?
- work with complex data
- convert between data formats
- read and write data lakes
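As a sketch of format conversion (file names are invented for illustration), DuckDB can turn a CSV into Parquet in one statement:

```sql
-- hypothetical file names; DuckDB infers the CSV schema automatically
COPY (SELECT * FROM read_csv_auto('prices.csv'))
  TO 'prices.parquet' (FORMAT PARQUET);
```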
DuckDB Extensions
Non Features
- Access Control
- Write Concurrency
- Network Access
For some of these non-features, check out https://motherduck.com
Use Case #1 - supersize JSON
AEMO WEMDE Dispatch Solution
- single-document 25 MB JSON files
- deeply nested and complex
- one file every 5 minutes - over 6 GB per day
Using CLI or DBeaver with DuckDB to interrogate large JSON files
- use UNNEST to unpack JSON arrays
- use a CTE for each level of nesting
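A minimal sketch of that pattern, with invented field names (not the real WEMDE schema):

```sql
-- sketch only: 'dispatch', 'units', 'unit_id' and 'dispatch_mw' are
-- hypothetical field names, not the actual AEMO document structure
with doc as (
    select unnest(data.dispatch) d          -- first level of nesting
    from read_json_auto('DispatchSolution.json') data
),
units as (
    select unnest(d.units) u from doc       -- second level via a CTE
)
select u.unit_id, u.dispatch_mw from units;
```

Each CTE peels off one layer of nesting, which keeps the final SELECT flat and readable.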
Use Case #2 - GeoSpatial
Finding neighbouring postcodes using ST_Distance
install spatial;
load spatial;
with crossjoin as (
select
p1.postcode p1_pc,
p2.postcode p2_pc,
-- approximate conversion to km
111 * st_distance(
st_point(p1.long,p1.lat),
st_point(p2.long,p2.lat)
) home_distance
from postcodes p1
cross join postcodes p2
where p1.state='WA' and not (p1.postcode like '69%' or p2.postcode like '69%')
),
filtered as (
select * from crossjoin where home_distance <= 7
)
select
p1_pc postcode,
array_to_string(list_distinct(array_agg(p2_pc))[:10],',') neighbours
from filtered
group by 1 order by p1_pc;
Use Case #3 - Write Data Lake
LOAD httpfs; LOAD aws;
CALL load_aws_credentials();
CREATE TABLE fuelprices AS
SELECT * FROM 'FuelWatch**.json';
COPY fuelprices TO 's3://my-bucket/fuelwatch' (
FORMAT PARQUET,
COMPRESSION ZSTD,
OVERWRITE_OR_IGNORE,
PARTITION_BY(publish_date)
);
... or something like that
DuckDB — Sharp Edges
Avoiding memory issues
- write to a database file, not in-memory
- write intermediate results to a table
- set a bigger memory limit
- limit threads
- avoid window functions
- split the work up
- ask kindly for support in the DuckDB community
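Concretely, the first few tips look like this (the limits are illustrative, not recommendations):

```sql
-- work in a file-backed database instead of :memory:
ATTACH 'work.duckdb' AS work;
USE work;

-- persist an intermediate result rather than recomputing it
CREATE TABLE staged AS SELECT * FROM read_parquet('big/*.parquet');

-- illustrative limits; tune for your machine
SET memory_limit = '8GB';
SET threads = 4;
```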
What is Evidence.dev?
Data Application Frameworks
- Streamlit
- RShiny
- Panel
- Redash
- Cube.js
- Evidence.dev
_It shouldn't be so painful to build interactive Data Products_
Static Site Generators
- Docusaurus
- mkdocs
- gitbook
People like them because
- Easy to contribute to
- Easy to review
- Looks good by default
- Reduced complexity
How Evidence.dev works
Evidence.dev is built on Node.js, Svelte, Apache ECharts and DuckDB, but as a developer you rarely have to interact with those tools directly.
It has 3 modes of operation:
- npm run sources - gather data sources
- npm run dev - run in dev mode
- npm run build - build for production
Install Evidence.dev
- scaffold new project by cloning evidence template
- install node dependencies
- add a data source
- write a source query
Start Creating!
Use your favourite text editor; I like VS Code because Evidence.dev created a plugin for it. Then repeat the following steps:
What does a typical Evidence Markdown file consist of?
Input components
- Dropdown
- Date Selectors
- Slider
- Text Input
- Button Groups
Chart components
- Area, Line and Bar Charts
- Sankey Diagram
- Bubble Maps and BoxPlots
- Geo Maps
- Heatmaps
SQL Queries
- Inline queries can only reference source abstractions
- source abstractions are defined in SQL files in source directory
- common queries can be defined and re-used across multiple markdown files
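A source abstraction might look like this (the path, source name and columns are invented for illustration):

```sql
-- sources/fuelwatch/prices.sql (hypothetical source SQL file)
select
    publish_date,
    brand,
    avg(price) as avg_price
from read_parquet('s3://my-bucket/fuelwatch/*/*.parquet')
group by publish_date, brand
```

Inline queries in your markdown pages can then reference `prices` by name.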
Deploy your Evidence website to production
- npm run build
- upload build/ to a web server
- that's it
Thank you!
Questions?