NOT ALL DATA

NEEDS A STACK

Issues

  • can be difficult to test ideas, especially locally
  • tools are costly or complex to setup
  • expectation to be production ready

Solution

Solution Playground

Using FuelWatch Data to find a good Servo

Duck DB 🦆

Duck DB

  • Created ~2019
  • Academic Origins
    • Centrum Wiskunde & Informatica, Amsterdam NL
    • Python was created here!

DuckDB - What is it?

DuckDB is an in-process analytical database

  • no daemon
  • cross platform, cross arch
  • no dependencies
  • multi modal
    • Command Line tool
    • Library (e.g. Py, R, Java, Node)
    • ODBC/JDBC Driver

What it do?

  • work with complex data
  • convert between data formats
  • read and write data lakes

DuckDB Extensions

Non Features

  • Access Control
  • Write Concurrency
  • Network Access

For some of these non-features, check out https://motherduck.com

Use Case #1 - supersize JSON

AEMO WEMDE Dispatch Solution

  • Single document 25mb JSON file
  • Deeply nested and complex
  • One file every 5 minutes - over 6Gb per day

Using CLI or DBeaver with DuckDB to interrogate large JSON files

  • Use UNNEST to unpack JSON arrays
  • CTE for multiple levels of nesting

Use Case #2 - GeoSpatial

Finding neighbouring postcodes using ST_Distance

						install spatial;
load spatial;
with crossjoin as (
    select
     p1.postcode p1_pc,
     p2.postcode p2_pc,
    --  approximate conversion to km
     111 * st_distance(
        st_point(p1.long,p1.lat),
        st_point(p2.long,p2.lat)
     ) home_distance 
     from postcodes p1     
     cross join postcodes p2
     where p1.state='WA' and not (p1.postcode like '69%' or p2.postcode  LIKE '69%')
),
filtered as (
    select * from crossjoin where home_distance <= 7
)

select
    p1_pc postcode,
    array_to_string(list_distinct(array_agg(p2_pc ))[:10],',') neighbours 
    from filtered
    group by 1 order by p1_pc;
						
					

Use Case #3 - Write Data Lake

						
							LOAD httpfs; LOAD aws;
							CALL load_aws_credentials();
							CREATE TABLE fuelprices AS
								SELECT * FROM 'FuelWatch**.json';
							COPY fuelprices TO 's3://my-bucket/fuelwatch' (
								FORMAT PARQUET,
								COMPRESSION ZSTD,
								OVERWRITE_OR_IGNORE,
								PARTITION_BY(publish_date)
							);
						
					

... or something like that

DuckDB — Sharp Edges

Avoiding memory issues

  • write to a database file, not in-memory
  • write intermediate results to a table
  • set a bigger memory limit
  • limit threads
  • avoid window functions
  • split the work up
  • ask kindly for support

What is Evidence.dev?

## Data Application Frameworks - Streamlit - RShiny - Panel - Redash - Cube.js - Evidence.dev _It shouldn't be so painful to build interactive Data Products_

Static Site Generators

  • Docusaurus
  • mkdocs
  • gitbook

People like them because

  • Easy to contribute to
  • Easy to review
  • Looks good by default
  • Reduced complexity

How Evidence.dev works

Toolset built in Node.js, Svelte, Apache ECharts and Duckdb - but as a developer, you rarely have to interact directly with those tools.

It has 3 modes of operation:

  • npm run sources - gather data sources
  • npm run dev - run in dev mode
  • npm run build - build for production

Connection Sources

from https://docs.evidence.dev/

Install Evidence.dev

  • scaffold new project by cloning evidence template
  • install node dependencies
  • add a data source
  • write a source query

Start Creating!

Use your favourite text editor; I like vscode because Evidence.dev create a plugin for it. Then repeat the following steps:

What does a typical Evidence Markdown file consist of?


Input components

  • Dropdown
  • Date Selectors
  • Slider
  • Text Input
  • Button Groups

What does a typical Evidence Markdown file consist of?


Chart components

  • Area, Line and Bar Charts
  • Sankey Diagram
  • Bubble Maps and BoxPlots
  • Geo Maps
  • Heatmaps

What does a typical Evidence Markdown file consist of?


SQL Queries

  • Inline queries can only reference source abstractions
  • source abstractions are defined in SQL files in source directory
  • common queries can be defined and re-used across multiple markdown files

Deploy your evidence website to production

  • npm run build
  • upload build/ to web server
  • that's it

Thank you!

QUestions?