The goal of water-temp-bc is to document and serve out
water temperature data. Setup and wrangle scripts are located here https://github.com/NewGraphEnvironment/water-temp-bc
.
We scrape the Environment Canada (ECCC) web service for all realtime
temperature data for the province and serve it out from parquet files on
s3 storage. We are however limited to around 18 months of data or so -
so if we want current data we likely need to scrape on a schedule and
add the data to what we have. That said - ECCC has provided us with a
ton of historic data so in scripts/extract-eccc.R we
wrangle that together into one parquet file and serve on the cloud (s3
storage).
Currently we have more than 1 file so we we will need to put them all together soon. TO DO. Here is a list of the files that we have currently with the date stamp corresponding to the latest date for water temperature data (there is also discharge and water level, air temp mixed in for some sites.)
fs::dir_ls("data", glob = "*.parquet")
## data/realtime_raw_20240119.parquet data/realtime_raw_20250728.parquet data/realtime_raw_eccc_20221213.parquet
## data/stations_realtime.parquet
Using duckdb for
R we are able to connect directly to the parquet files
stored on a S3 bucket and query around to explore the provincial
realtime data for the province of British Columbia. The beauty of duckdb and
the parquet file format (provided we install the
httpfs extension) is that we don’t need a database. We just
create a connection to a “virtual database” with the line below and we
can query the files directly in the s3 buckets…. Neeto.
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbExecute(con, "INSTALL httpfs; LOAD httpfs;")
Currently the data directory of this repo is mirrored at s3://water-temp-bc/data so punch in any of the urls below into your browser and grab the files yourself.
In the code chunks below we connect to duckdb load the
httpfs extension.
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbExecute(con, "INSTALL httpfs; LOAD httpfs;")
[1] 0
In the next chunks we perform a couple of queries and present the information about the data currently available.
tab <- DBI::dbGetQuery(
con,
"SELECT *
FROM 's3://water-temp-bc/data/stations_realtime.parquet'"
)
range <- DBI::dbGetQuery(con, "
SELECT
STATION_NUMBER,
MIN(Date) AS min_date,
MAX(Date) AS max_date
FROM 's3://water-temp-bc/data/realtime_raw_20250521.parquet'
WHERE Parameter = '5'
GROUP BY STATION_NUMBER;
")
# save local so need not run every time
saveRDS(range, "data/result.rds")
Realtime station information stored on AWS s3. NOTE: To view all columns in the table - please click on one of the sort arrows within column headers before scrolling to the right.
Below we query for data from a particular site. Note that Parameter = ‘5’ seems to be a better query than Code = “TW” since not all events are currently labelled with a Code…
tab <- DBI::dbGetQuery(con, "
SELECT *
FROM 's3://water-temp-bc/data/realtime_raw_20250521.parquet'
WHERE STATION_NUMBER IN ('07EA004')
AND Parameter = '5'
LIMIT 100
")
Example of output generated by querying parquet file stored on AWS s3. NOTE: To view all columns in the table - please click on one of the sort arrows within column headers before scrolling to the right.
DBI::dbDisconnect(conn = con)