NOT ALL DATA
NEEDS A STACK
Issues
- can be difficult to test ideas, especially locally
- tools are costly or complex to set up
- expectation to be production ready
Solution
Solution Playground
Using FuelWatch Data to find a good Servo
DuckDB
- Created ~2019
- Academic Origins
- Centrum Wiskunde & Informatica, Amsterdam NL
- Python was created here!
DuckDB is an in-process analytical database
- no daemon
- cross platform, cross arch
- no dependencies
- multi modal
- Command Line tool
- Library (e.g. Py, R, Java, Node)
- ODBC/JDBC Driver
What it do?
- work with complex data
- convert between data formats
- read and write data lakes
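As a sketch of format conversion (file names are invented for illustration), DuckDB can turn a CSV into Parquet in one statement:

```sql
-- hypothetical file names; DuckDB infers the CSV schema automatically
COPY (SELECT * FROM read_csv_auto('prices.csv'))
  TO 'prices.parquet' (FORMAT PARQUET);
```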
DuckDB Extensions
Non Features
- Access Control
- Write Concurrency
- Network Access
For some of these non-features, check out https://motherduck.com
Use Case #1 - supersize JSON
AEMO WEMDE Dispatch Solution
- single-document 25 MB JSON files
- deeply nested and complex
- one file every 5 minutes - over 6 GB per day
Using CLI or DBeaver with DuckDB to interrogate large JSON files
- use UNNEST to unpack JSON arrays
- use a CTE for each level of nesting
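A minimal sketch of that pattern, with invented field names (not the real WEMDE schema):

```sql
-- sketch only: 'dispatch', 'units', 'unit_id' and 'dispatch_mw' are
-- hypothetical field names, not the actual AEMO document structure
with doc as (
    select unnest(data.dispatch) d          -- first level of nesting
    from read_json_auto('DispatchSolution.json') data
),
units as (
    select unnest(d.units) u from doc       -- second level via a CTE
)
select u.unit_id, u.dispatch_mw from units;
```

Each CTE peels off one layer of nesting, which keeps the final SELECT flat and readable.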
Use Case #2 - GeoSpatial
Finding neighbouring postcodes using ST_Distance
install spatial;
load spatial;
with crossjoin as (
select
p1.postcode p1_pc,
p2.postcode p2_pc,
-- approximate conversion to km
111 * st_distance(
st_point(p1.long,p1.lat),
st_point(p2.long,p2.lat)
) home_distance
from postcodes p1
cross join postcodes p2
where p1.state='WA' and not (p1.postcode like '69%' or p2.postcode like '69%')
),
filtered as (
select * from crossjoin where home_distance <= 7
)
select
p1_pc postcode,
array_to_string(list_distinct(array_agg(p2_pc))[:10],',') neighbours
from filtered
group by 1 order by p1_pc;
Use Case #3 - Write Data Lake
LOAD httpfs; LOAD aws;
CALL load_aws_credentials();
CREATE TABLE fuelprices AS
SELECT * FROM 'FuelWatch**.json';
COPY fuelprices TO 's3://my-bucket/fuelwatch' (
FORMAT PARQUET,
COMPRESSION ZSTD,
OVERWRITE_OR_IGNORE,
PARTITION_BY(publish_date)
);
... or something like that
DuckDB — Sharp Edges
Avoiding memory issues
- write to a database file, not in-memory
- write intermediate results to a table
- set a bigger memory limit
- limit threads
- avoid window functions
- split the work up
- ask kindly for support in the DuckDB community
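Concretely, the first few tips look like this (the limits are illustrative, not recommendations):

```sql
-- work in a file-backed database instead of :memory:
ATTACH 'work.duckdb' AS work;
USE work;

-- persist an intermediate result rather than recomputing it
CREATE TABLE staged AS SELECT * FROM read_parquet('big/*.parquet');

-- illustrative limits; tune for your machine
SET memory_limit = '8GB';
SET threads = 4;
```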
What is Evidence.dev?
Data Application Frameworks
- Streamlit
- RShiny
- Panel
- Redash
- Cube.js
- Evidence.dev
_It shouldn't be so painful to build interactive Data Products_
Static Site Generators
- Docusaurus
- mkdocs
- gitbook
People like them because
- Easy to contribute to
- Easy to review
- Looks good by default
- Reduced complexity
How Evidence.dev works
Evidence.dev is built on Node.js, Svelte, Apache ECharts and DuckDB, but as a developer you rarely have to interact with those tools directly.
It has 3 modes of operation:
- npm run sources - gather data sources
- npm run dev - run in dev mode
- npm run build - build for production
Install Evidence.dev
- scaffold new project by cloning evidence template
- install node dependencies
- add a data source
- write a source query
Start Creating!
Use your favourite text editor; I like VS Code because Evidence.dev created a plugin for it. Then repeat the following steps:
What does a typical Evidence Markdown file consist of?
Input components
- Dropdown
- Date Selectors
- Slider
- Text Input
- Button Groups
Chart components
- Area, Line and Bar Charts
- Sankey Diagram
- Bubble Maps and BoxPlots
- Geo Maps
- Heatmaps
SQL Queries
- Inline queries can only reference source abstractions
- source abstractions are defined in SQL files in source directory
- common queries can be defined and re-used across multiple markdown files
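A source abstraction might look like this (the path, source name and columns are invented for illustration):

```sql
-- sources/fuelwatch/prices.sql (hypothetical source SQL file)
select
    publish_date,
    brand,
    avg(price) as avg_price
from read_parquet('s3://my-bucket/fuelwatch/*/*.parquet')
group by publish_date, brand
```

Inline queries in your markdown pages can then reference `prices` by name.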
Deploy your Evidence website to production
- npm run build
- upload build/ to a web server
- that's it
Thank you!
Questions?