Merged
1 change: 1 addition & 0 deletions README.md
@@ -212,6 +212,7 @@ Runnable examples:
### Data and I/O

- Built-in loaders: MNIST, Fashion-MNIST, CIFAR-10
- URI-backed data sources: `file://`, `https://`, `hf+https://`, and `hf://...`
- Formats: GGUF, ONNX, SafeTensors, JSON, Image (JPEG, PNG)
- Type-safe transform DSL: resize, crop, normalize, toTensor

3 changes: 2 additions & 1 deletion build.gradle.kts
@@ -151,6 +151,7 @@ dependencies {

// skainet-data
dokka(project(":skainet-data:skainet-data-api"))
dokka(project(":skainet-data:skainet-data-source"))
dokka(project(":skainet-data:skainet-data-transform"))
dokka(project(":skainet-data:skainet-data-simple"))
dokka(project(":skainet-data:skainet-data-media"))
@@ -178,4 +179,4 @@ tasks.register<Copy>("bundleDokkaIntoSite") {
dependsOn("dokkaGenerate")
from(layout.buildDirectory.dir("dokka/html"))
into(layout.projectDirectory.dir("docs/build/site/api"))
- }
+ }
1 change: 1 addition & 0 deletions docs/modules/ROOT/nav.adoc
@@ -5,6 +5,7 @@
* Tutorials
** xref:tutorials/kotlin-getting-started.adoc[Kotlin getting started]
** xref:tutorials/java-getting-started.adoc[Java getting started]
** xref:tutorials/data-sources-getting-started.adoc[Data sources and Hugging Face]
** xref:tutorials/image-data-getting-started.adoc[Image and data API]
** xref:tutorials/hlo-getting-started.adoc[StableHLO getting started]
** xref:tutorials/minerva-getting-started.adoc[Minerva getting started]
153 changes: 153 additions & 0 deletions docs/modules/ROOT/pages/tutorials/data-sources-getting-started.adoc
@@ -0,0 +1,153 @@
== Data sources and Hugging Face

SKaiNET separates artifact resolution from dataset parsing and preprocessing.
Use `skainet-data-source` when a dataset, tokenizer, model sidecar, or fixture
can live either on disk or behind a remote URI.

[cols="1,3",options="header"]
|===
| URI form | Meaning
| `file:///path/to/file`
| Read a local file.

| `https://host/path/file`
| Download and cache a generic remote artifact.

| `hf+https://huggingface.co/org/repo/resolve/main/file`
| Treat a Hugging Face resolve URL as a Hugging Face artifact.

| `hf://org/repo@main/path/file`
| Expand to a Hugging Face model repository resolve URL.

| `hf://datasets/org/repo@main/path/file`
| Expand to a Hugging Face dataset repository resolve URL.
|===
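
The `hf://` rows above describe a purely textual expansion, which can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `skainet-data-source`: `expandHfUri` is a hypothetical helper, it assumes the `@revision` part is always present and contains no `/`, and it assumes the `huggingface.co` host.

```kotlin
// Illustrative only: a standalone expansion of the hf:// shorthand.
// `expandHfUri` is a hypothetical name, not part of skainet-data-source.
fun expandHfUri(uri: String): String {
    require(uri.startsWith("hf://")) { "Not an hf:// URI: $uri" }
    val rest = uri.removePrefix("hf://")

    // Dataset repositories carry a "datasets/" prefix; model repositories do not.
    val isDataset = rest.startsWith("datasets/")
    val body = if (isDataset) rest.removePrefix("datasets/") else rest

    // body is expected to look like "org/repo@revision/path/to/file".
    val at = body.indexOf('@')
    require(at > 0) { "Missing @revision in: $uri" }
    val repo = body.substring(0, at)
    val afterAt = body.substring(at + 1)

    // Assumption: the revision has no '/', so the first '/' starts the file path.
    val slash = afterAt.indexOf('/')
    require(slash > 0 && slash < afterAt.length - 1) { "Missing file path in: $uri" }
    val revision = afterAt.substring(0, slash)
    val path = afterAt.substring(slash + 1)

    val prefix = if (isDataset) "datasets/" else ""
    return "https://huggingface.co/$prefix$repo/resolve/$revision/$path"
}
```

Under these assumptions, `hf://org/repo@main/path/file` becomes `https://huggingface.co/org/repo/resolve/main/path/file`, and the dataset form gains a `datasets/` segment after the host.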

=== Add the modules

For JVM consumers, add the source module beside the data loaders you use:

[source,kotlin]
----
dependencies {
    implementation(platform("sk.ainet:skainet-bom:0.32.4"))

    implementation("sk.ainet.core:skainet-data-source-jvm")
    implementation("sk.ainet.core:skainet-data-simple-jvm")
}
----

=== Resolve one artifact

`JvmDataSourceResolver` materializes remote artifacts into a cache and returns
a `DataSourceArtifact` that opens a `kotlinx.io.Source`. Public Hugging Face
files do not need credentials. For private files, pass an explicit
`DataSourceAuthToken` on the request or on the resolver. An existing
`Authorization` header still takes precedence over either token. On JVM, the
resolver can also read `HF_TOKEN` or `HUGGING_FACE_HUB_TOKEN` from the
environment, but only as an opt-in fallback.

[source,kotlin]
----
import sk.ainet.data.source.DataSourceAuthToken
import sk.ainet.data.source.DataSourceRequest
import sk.ainet.data.source.JvmDataSourceResolver

val resolver = JvmDataSourceResolver(
    huggingFaceToken = DataSourceAuthToken.from("hf_...")
)
val artifact = resolver.resolve(
    DataSourceRequest(
        uri = "hf+https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct/resolve/main/tokenizer.json"
    )
)

println(artifact.filename)
println(artifact.localPath)

val source = artifact.openSource()
try {
    // Pass the source to a parser/loader for model-sized artifacts.
} finally {
    source.close()
}

// Convenience for small sidecars and tests.
val bytes = artifact.readBytes()
----

For per-request credentials, pass the token directly on `DataSourceRequest`.
This is useful when one resolver works with more than one private repository:

[source,kotlin]
----
val privateArtifact = resolver.resolve(
    DataSourceRequest(
        uri = "hf://datasets/your-org/private-dataset@main/data/train.bin",
        huggingFaceToken = DataSourceAuthToken.from("hf_...")
    )
)
----

To opt into JVM environment fallback:

[source,kotlin]
----
val resolver = JvmDataSourceResolver(
    useEnvironmentHuggingFaceToken = true
)
----
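
The credential rules in this section can be read as a small precedence chain. The sketch below is one possible reading, not the resolver's actual logic: `chooseAuthorization` and its parameters are hypothetical, and the choice to let a per-request token win over a resolver-wide token is an assumption, since the text above only states that an existing `Authorization` header wins and that the environment is an opt-in fallback.

```kotlin
// Hypothetical helper; the request-before-resolver ordering is an assumption.
fun chooseAuthorization(
    headers: Map<String, String>,
    requestToken: String? = null,
    resolverToken: String? = null,
    useEnvironmentToken: Boolean = false,
    environment: Map<String, String> = System.getenv(),
): String? {
    // 1. An Authorization header that is already present always wins.
    headers.entries
        .firstOrNull { it.key.equals("Authorization", ignoreCase = true) }
        ?.let { return it.value }

    // 2. Explicit tokens: per-request first, then resolver-wide (assumed order).
    val explicit = listOf(requestToken, resolverToken).firstOrNull { !it.isNullOrBlank() }
    if (explicit != null) return "Bearer $explicit"

    // 3. Environment variables are consulted only after opting in.
    if (useEnvironmentToken) {
        val fromEnv = listOf("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN")
            .mapNotNull { environment[it] }
            .firstOrNull { it.isNotBlank() }
        if (fromEnv != null) return "Bearer $fromEnv"
    }

    // No credentials: suitable for public files only.
    return null
}
```

Passing the environment as a map keeps the function easy to test without touching real process variables.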

=== Use sources with built-in loaders

MNIST and Fashion-MNIST expose per-file URI overrides. CIFAR-10 exposes an
archive URI override. Defaults still point to the historical public dataset
locations, so existing code keeps working.

[source,kotlin]
----
import sk.ainet.data.mnist.MNIST
import sk.ainet.data.mnist.MNISTLoaderConfig

val token = "hf_..."
val train = MNIST.loadTrain(
    MNISTLoaderConfig(
        trainImagesUri = "file:///datasets/mnist/train-images-idx3-ubyte",
        trainLabelsUri = "hf+https://huggingface.co/your-org/mnist-idx/resolve/main/train-labels-idx1-ubyte.gz",
        huggingFaceTokenProvider = { token }
    )
)

val batches = train.batchIterator<sk.ainet.lang.types.Int8, Byte>(batchSize = 64)
----
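
The `huggingFaceTokenProvider = { token }` argument works because `DatasetHuggingFaceTokenProvider` is a Kotlin `fun interface`, so a lambda converts to it. The sketch below redeclares the interface locally so that it compiles on its own. `authorizationFor` is a hypothetical consumer, and the idea that a loader would consult the provider only for `hf://` and `hf+https://` URIs is an assumption made for illustration.

```kotlin
// Local mirror of the interface added in this change, so the sketch is standalone.
fun interface DatasetHuggingFaceTokenProvider {
    fun token(): String?
}

// Hypothetical consumer: asks for the token only when the URI looks like a
// Hugging Face artifact, so local files never trigger a token lookup.
fun authorizationFor(uri: String, provider: DatasetHuggingFaceTokenProvider?): String? {
    val isHuggingFace = uri.startsWith("hf://") || uri.startsWith("hf+https://")
    if (!isHuggingFace) return null
    return provider?.token()?.takeIf { it.isNotBlank() }?.let { "Bearer $it" }
}
```

Because the provider is a function rather than a string, the token can be fetched lazily, for example from a secrets store, each time a private artifact is requested.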

=== Cache behavior

Use `CachePolicy.Use` for normal operation, `Refresh` to re-download,
`Offline` to require a cached copy, and `Bypass` to avoid writing the cache.
Built-in JVM loaders map `useCache = true` to `Use` and `useCache = false`
to `Refresh`.

[source,kotlin]
----
import sk.ainet.data.source.CachePolicy
import sk.ainet.data.source.DataSourceRequest

val refreshed = resolver.resolve(
    DataSourceRequest(
        uri = "hf://datasets/your-org/your-dataset@main/data/train-00000.parquet",
        cachePolicy = CachePolicy.Refresh
    )
)
----
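
The four policies can be pictured as a single decision over "is there a cached copy?" and "may the cache be written?". The following is a minimal in-memory sketch, not the library's cache: `SketchCachePolicy`, `SketchCache`, and `policyFor` are hypothetical names, and treating `Bypass` as "download without reading or writing the cache" is an assumption, since the text above only says that it avoids writing.

```kotlin
// Hypothetical stand-ins for illustration; not the types in sk.ainet.data.source.
enum class SketchCachePolicy { Use, Refresh, Offline, Bypass }

class SketchCache(private val download: (String) -> ByteArray) {
    private val entries = mutableMapOf<String, ByteArray>()

    fun fetch(uri: String, policy: SketchCachePolicy): ByteArray = when (policy) {
        // Serve the cached copy if present, otherwise download and store it.
        SketchCachePolicy.Use -> entries.getOrPut(uri) { download(uri) }
        // Always download, then overwrite the cached copy.
        SketchCachePolicy.Refresh -> download(uri).also { entries[uri] = it }
        // Never touch the network; fail when nothing is cached.
        SketchCachePolicy.Offline -> entries[uri] ?: error("No cached copy of $uri")
        // Download without storing the result (assumed to skip cache reads too).
        SketchCachePolicy.Bypass -> download(uri)
    }

    fun isCached(uri: String): Boolean = uri in entries
}

// Mirrors the mapping described above for the built-in JVM loaders.
fun policyFor(useCache: Boolean): SketchCachePolicy =
    if (useCache) SketchCachePolicy.Use else SketchCachePolicy.Refresh
```

Note that under this mapping `useCache = false` still leaves a fresh copy in the cache; it only forces a new download.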

=== Keep preprocessing separate

After bytes are parsed into a dataset, continue using the existing transform
DSL for image/tensor preprocessing:

[source,kotlin]
----
import sk.ainet.data.transform.mnistPreprocessing

val preprocessing = mnistPreprocessing(ctx)
----
1 change: 1 addition & 0 deletions settings.gradle.kts
@@ -48,6 +48,7 @@ include("skainet-backends:benchmarks:jvm-cpu-publish")

// ====== DATA
include("skainet-data:skainet-data-api")
include("skainet-data:skainet-data-source")
include("skainet-data:skainet-data-transform")
include("skainet-data:skainet-data-simple")
include("skainet-data:skainet-data-media")
1 change: 1 addition & 0 deletions skainet-data/skainet-data-simple/build.gradle.kts
@@ -62,6 +62,7 @@ kotlin {
}

jvmMain.dependencies {
implementation(project(":skainet-data:skainet-data-source"))
implementation(libs.ktor.client.cio)
implementation(libs.ktor.client.plugins)
implementation(libs.ktor.client.logging)
@@ -6,6 +6,7 @@ import sk.ainet.context.DefaultDataExecutionContext
import sk.ainet.context.ExecutionContext
import sk.ainet.data.DataBatch
import sk.ainet.data.Dataset
import sk.ainet.data.common.DatasetHuggingFaceTokenProvider
import sk.ainet.lang.tensor.Shape
import sk.ainet.lang.tensor.Tensor
import sk.ainet.lang.types.DType
@@ -144,7 +145,10 @@ public data class CIFAR10Dataset(
*/
public data class CIFAR10LoaderConfig(
val cacheDir: String = "cifar10-data",
- val useCache: Boolean = true
+ val useCache: Boolean = true,
+ val archiveUri: String = CIFAR10Constants.DOWNLOAD_URL,
+ val huggingFaceTokenProvider: DatasetHuggingFaceTokenProvider? = null,
+ val useEnvironmentHuggingFaceToken: Boolean = false
)

/**
@@ -0,0 +1,9 @@
package sk.ainet.data.common

/**
* Supplies a Hugging Face token for built-in dataset loaders when their source
* URIs point at private Hugging Face artifacts.
*/
public fun interface DatasetHuggingFaceTokenProvider {
public fun token(): String?
}
@@ -6,6 +6,7 @@ import sk.ainet.context.DefaultDataExecutionContext
import sk.ainet.context.ExecutionContext
import sk.ainet.data.DataBatch
import sk.ainet.data.Dataset
import sk.ainet.data.common.DatasetHuggingFaceTokenProvider
import sk.ainet.lang.tensor.Shape
import sk.ainet.lang.tensor.Tensor
import sk.ainet.lang.types.DType
@@ -146,7 +147,13 @@ public data class FashionMNISTDataset(
*/
public data class FashionMNISTLoaderConfig(
val cacheDir: String = "fashion-mnist-data",
- val useCache: Boolean = true
+ val useCache: Boolean = true,
+ val trainImagesUri: String = FashionMNISTConstants.TRAIN_IMAGES_URL,
+ val trainLabelsUri: String = FashionMNISTConstants.TRAIN_LABELS_URL,
+ val testImagesUri: String = FashionMNISTConstants.TEST_IMAGES_URL,
+ val testLabelsUri: String = FashionMNISTConstants.TEST_LABELS_URL,
+ val huggingFaceTokenProvider: DatasetHuggingFaceTokenProvider? = null,
+ val useEnvironmentHuggingFaceToken: Boolean = false
)

/**
@@ -16,11 +16,11 @@ public abstract class FashionMNISTLoaderCommon(public val config: FashionMNISTLo
*/
override suspend fun loadTrainingData(): FashionMNISTDataset {
val imagesBytes = downloadAndCacheFile(
- FashionMNISTConstants.TRAIN_IMAGES_URL,
+ config.trainImagesUri,
FashionMNISTConstants.TRAIN_IMAGES_FILENAME
)
val labelsBytes = downloadAndCacheFile(
- FashionMNISTConstants.TRAIN_LABELS_URL,
+ config.trainLabelsUri,
FashionMNISTConstants.TRAIN_LABELS_FILENAME
)

@@ -34,11 +34,11 @@ public abstract class FashionMNISTLoaderCommon(public val config: FashionMNISTLo
*/
override suspend fun loadTestData(): FashionMNISTDataset {
val imagesBytes = downloadAndCacheFile(
- FashionMNISTConstants.TEST_IMAGES_URL,
+ config.testImagesUri,
FashionMNISTConstants.TEST_IMAGES_FILENAME
)
val labelsBytes = downloadAndCacheFile(
- FashionMNISTConstants.TEST_LABELS_URL,
+ config.testLabelsUri,
FashionMNISTConstants.TEST_LABELS_FILENAME
)

@@ -6,6 +6,7 @@ import sk.ainet.context.DefaultDataExecutionContext
import sk.ainet.context.ExecutionContext
import sk.ainet.data.DataBatch
import sk.ainet.data.Dataset
import sk.ainet.data.common.DatasetHuggingFaceTokenProvider
import sk.ainet.lang.tensor.Shape
import sk.ainet.lang.tensor.Tensor
import sk.ainet.lang.types.DType
@@ -124,7 +125,13 @@ public data class MNISTDataset(
*/
public data class MNISTLoaderConfig(
val cacheDir: String = "mnist-data",
- val useCache: Boolean = true
+ val useCache: Boolean = true,
+ val trainImagesUri: String = MNISTConstants.TRAIN_IMAGES_URL,
+ val trainLabelsUri: String = MNISTConstants.TRAIN_LABELS_URL,
+ val testImagesUri: String = MNISTConstants.TEST_IMAGES_URL,
+ val testLabelsUri: String = MNISTConstants.TEST_LABELS_URL,
+ val huggingFaceTokenProvider: DatasetHuggingFaceTokenProvider? = null,
+ val useEnvironmentHuggingFaceToken: Boolean = false
)

/**
@@ -164,4 +171,4 @@ public interface MNISTLoader {
* @return The MNIST test dataset.
*/
public suspend fun loadTestData(): MNISTDataset
- }
+ }
@@ -14,11 +14,11 @@ public abstract class MNISTLoaderCommon(public val config: MNISTLoaderConfig) :
*/
override suspend fun loadTrainingData(): MNISTDataset {
val imagesBytes = downloadAndCacheFile(
- MNISTConstants.TRAIN_IMAGES_URL,
+ config.trainImagesUri,
MNISTConstants.TRAIN_IMAGES_FILENAME
)
val labelsBytes = downloadAndCacheFile(
- MNISTConstants.TRAIN_LABELS_URL,
+ config.trainLabelsUri,
MNISTConstants.TRAIN_LABELS_FILENAME
)

@@ -32,11 +32,11 @@ public abstract class MNISTLoaderCommon(public val config: MNISTLoaderConfig) :
*/
override suspend fun loadTestData(): MNISTDataset {
val imagesBytes = downloadAndCacheFile(
- MNISTConstants.TEST_IMAGES_URL,
+ config.testImagesUri,
MNISTConstants.TEST_IMAGES_FILENAME
)
val labelsBytes = downloadAndCacheFile(
- MNISTConstants.TEST_LABELS_URL,
+ config.testLabelsUri,
MNISTConstants.TEST_LABELS_FILENAME
)
