User friendly tools for downloading and manipulating gen3 metadata.
git clone https://github.com/AustralianBioCommons/gen3-metadata.git
bash build.shA full usage notebook is available in
example_notebook.ipynb. Make sure to select.venvas the kernel.
fetch_all_metadata is the primary entry point. It walks the data dictionary in dependency order and fetches data for every node, returning a dot-accessible object of JSON dicts. Call .to_df() on the result to get pandas DataFrames instead.
from gen3_metadata.gen3_metadata_parser import fetch_all_metadata
key_file = "path/to/credentials.json"
result = fetch_all_metadata(key_file, "program1", "project1")
# Access each node as raw JSON
result.subject # dict
result.demographic # dict
# Or get DataFrames
dfs = result.to_df()
dfs.subject # pandas DataFrame
dfs.demographic # pandas DataFramefetch_all_metadata accepts a data_release argument that filters each node's records by release. The default is "latest" — each node is inspected for a data_release_date field and only records matching the max ISO date are returned. The selected version and date are logged per node.
# Default: per node, keep records with the max data_release_date
result = fetch_all_metadata(key_file, "program1", "project1")
# node 'subject': selected data_release_date=2024-06-01 data_release='v2.3' (123/22494 records)
# node 'demographic': selected data_release_date=2024-06-01 data_release='v2.3' (123/22494 records)
# ...
# Pin to a specific release (exact, case-sensitive match on data_release field)
result = fetch_all_metadata(key_file, "program1", "project1", data_release="v2.3")
# Disable filtering — return every record, no filter logs
result = fetch_all_metadata(key_file, "program1", "project1", data_release=None)Behavior per node:
data_release value |
Behavior |
|---|---|
"latest" (default) |
Keep records with the max data_release_date (ISO 8601). Log the selected date and version. |
| any other string | Keep records where data_release equals that string exactly. Log the selected version and date. |
None |
No filtering. No filter log lines emitted. |
Nodes that have neither a data_release nor a data_release_date field (for example lookup/link nodes like program or project) are passed through unchanged, with an info log noting they were not filtered. Unparseable ISO dates in "latest" mode are skipped with a warning.
The same data_release argument is also available on Gen3MetadataParser.fetch_data and fetch_data_json for single-node fetches.
fetch_all_metadata prints per-node progress to stdout by default (verbose=True). If you want richer diagnostic output — filter decisions, authentication steps, debug messages — call configure_logging() once in your REPL or notebook to attach a stderr handler to the gen3_metadata logger:
import logging
import gen3_metadata
gen3_metadata.configure_logging() # INFO by default
# gen3_metadata.configure_logging(logging.DEBUG) # or more verboseEach module logs under its own gen3_metadata.<module> name, so you can filter with standard logging machinery.
Network calls inside fetch_all_metadata are timeout-guarded (30s per request) and wrapped with friendly error messages. A VPN/connectivity failure produces a clean one-liner instead of a urllib3 traceback:
fetch_all_metadata: starting for program1/CDAH
fetch_all_metadata: fetching data dictionary from cad.staging.biocommons.org.au...
RuntimeError: Could not reach cad.staging.biocommons.org.au to fetch the Gen3
data dictionary. Check VPN / network connectivity. Underlying error: ...
Individual nodes that time out or return an HTTP error during the fetch loop are logged and skipped; the overall call completes with a final summary of which nodes succeeded and which failed.
get_node_order returns a topologically sorted list of node names from the data dictionary (parents before children).
from gen3_metadata.gen3_metadata_parser import get_node_order
nodes = get_node_order("path/to/credentials.json")
# ['program', 'project', 'subject', 'sample', 'demographic', ...]If you only need one node, use Gen3MetadataParser + fetch_data_json. It returns the raw JSON response as a dict. Convert to a DataFrame yourself if you need one.
import pandas as pd
from gen3_metadata.gen3_metadata_parser import Gen3MetadataParser
key_file = "path/to/credentials.json"
parser = Gen3MetadataParser(key_file)
parser.authenticate()
# Default: filters to latest data_release_date
json_data = parser.fetch_data_json("program1", "project1", node_label="medical_history")
json_data # {'data': [...]}
# Pin to a specific release, or pass data_release=None to disable filtering
json_data = parser.fetch_data_json(
"program1", "project1", node_label="medical_history", data_release="v2.3"
)
# Convert to DataFrame if desired:
df = pd.json_normalize(json_data["data"])pytest -vv tests/As of v1.3.0 the R package is fully standalone — no Python interpreter, no reticulate, and no Python gen3_metadata package required. A single remotes::install_github(...) is all you need, which makes containerized RStudio deployments significantly simpler.
Always-latest from main:
if (!require("remotes")) install.packages("remotes")
remotes::install_github("AustralianBioCommons/gen3-metadata", subdir = "gen3metadata-R")Pinned to a specific release (recommended for reproducible environments such as Docker images):
if (!require("remotes")) install.packages("remotes")
remotes::install_github(
"AustralianBioCommons/gen3-metadata",
subdir = "gen3metadata-R",
ref = "v1.4.2"
)CRAN dependencies are installed automatically with the package. If your environment requires installing them manually:
install.packages(c("httr", "jsonlite", "jose", "glue"))Add a single layer to your image:
RUN R -e 'if (!require("remotes")) install.packages("remotes", repos="https://cloud.r-project.org"); \
remotes::install_github("AustralianBioCommons/gen3-metadata", subdir = "gen3metadata-R", ref = "v1.4.2")'No Python or pip step is required.
library("gen3metadata")fetch_all_metadata is the primary entry point. It walks the data dictionary in dependency order and fetches data for every node, returning a metadata_collection object where nodes are accessible via $. Call to_df() on it to get data.frames instead.
result <- fetch_all_metadata("path/to/credentials.json", "program1", "AusDiab")
# Access each node as raw JSON (nested list)
result$subject
result$demographic
# Or get data.frames
dfs <- to_df(result)
dfs$subject # data.frame
dfs$demographic # data.framefetch_all_metadata accepts a data_release argument that mirrors the Python API. The default is "latest" — each node is inspected for a data_release_date field and only records matching the max ISO date are kept. The selected version and date are emitted per node via message().
# Default: per node, keep records with the max data_release_date
result <- fetch_all_metadata("path/to/credentials.json", "program1", "AusDiab")
# node 'subject': selected data_release_date=2024-06-01 data_release='v2.3' (123/22494 records)
# node 'demographic': selected data_release_date=2024-06-01 data_release='v2.3' (123/22494 records)
# ...
# Pin to a specific release (exact, case-sensitive match on data_release field)
result <- fetch_all_metadata(
"path/to/credentials.json", "program1", "AusDiab",
data_release = "v2.3"
)
# Disable filtering — return every record, no filter messages
result <- fetch_all_metadata(
"path/to/credentials.json", "program1", "AusDiab",
data_release = NULL
)
# Suppress the per-node message output while keeping the filter active
result <- suppressMessages(
fetch_all_metadata("path/to/credentials.json", "program1", "AusDiab")
)Behavior per node:
data_release value |
Behavior |
|---|---|
"latest" (default) |
Keep records with the max data_release_date (ISO 8601). Emit selected date and version. |
| any other string | Keep records where data_release equals that string exactly. Emit selected version and date. |
NULL |
No filtering. No filter messages emitted. |
Nodes that have neither a data_release nor a data_release_date field are passed through unchanged with a message. Unparseable ISO dates in "latest" mode are skipped with a warning.
The same data_release argument is also available on fetch_data() for single-node fetches.
fetch_all_metadata emits per-node progress via message() (visible in interactive sessions and captured by knitr/RMarkdown by default). A DNS/VPN failure produces a clean stop() error rather than an httr traceback, e.g.:
fetch_all_metadata: starting for program1/AusDiab
fetch_all_metadata: fetching data dictionary...
Error: Could not fetch Gen3 data dictionary. Check VPN / network connectivity. Underlying error: ...
All network calls (authentication, dictionary fetch, per-node fetch) have a 30-second timeout, so a flaky network surfaces quickly instead of hanging indefinitely. To silence the progress output:
result <- suppressMessages(fetch_all_metadata(...))get_node_order returns a topologically sorted character vector of node names from the data dictionary.
nodes <- get_node_order("path/to/credentials.json")
# [1] "program" "project" "subject" "sample" "demographic" ...If you only need one node, use Gen3MetadataParser + fetch_data. It returns the raw JSON data as a nested list. Convert to a data.frame yourself if you need one.
key_file_path <- "path/to/credentials.json"
gen3 <- Gen3MetadataParser(key_file_path)
gen3 <- authenticate(gen3)
# Default: filters to latest data_release_date
data <- fetch_data(gen3,
program_name = "program1",
project_code = "AusDiab",
node_label = "subject")
# Pin to a specific release, or pass data_release = NULL to disable filtering
data <- fetch_data(gen3,
program_name = "program1",
project_code = "AusDiab",
node_label = "subject",
data_release = "v2.3")
# data is a list of records
# Convert to a data.frame if desired:
df <- do.call(rbind, lapply(data, as.data.frame))