
Error when reading Parquet files from local S3 #29

@temminks


I'm trying to extract a Parquet file from a bucket, but Sling fails when it tries to load the file (using DuckDB, I guess). This only happens when using the Python wrapper together with a local setup based on LocalStack. Perhaps the endpoint URL isn't passed to DuckDB, but that's just a wild guess. On the other hand, writing a Parquet file works just fine. It also works with the non-Python CLI.

I'm running Sling as part of Dagster, which might also be relevant since the Python wrapper seems to check for that (although with a non-local bucket this works).

Run LocalStack

You can run LocalStack using Docker or Podman.
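For completeness, here is a minimal sketch of how the container could be started. It assumes the public `docker.io/localstack/localstack` image and LocalStack's default edge port 4566 (the port used in the rest of this report). The `LOCALSTACK_RUNTIME` variable is a hypothetical name introduced here only to switch between Docker and Podman.

```shell
# Sketch only: start LocalStack in the background on port 4566.
# LOCALSTACK_RUNTIME is a hypothetical variable; it defaults to docker.
start_localstack() {
  local runtime="${LOCALSTACK_RUNTIME:-docker}"
  "$runtime" run --rm -d --name localstack -p 4566:4566 docker.io/localstack/localstack
}
```

Call `start_localstack` for Docker, or `LOCALSTACK_RUNTIME=podman start_localstack` for Podman.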

Create a local bucket

aws s3 mb s3://my-test-bucket --endpoint-url=http://localhost:4566

Define a new Sling Connection for that bucket (in your env.yaml):

  AWS_S3:
    type: s3
    bucket: my-test-bucket
    region: eu-central-1
    endpoint: http://localhost:4566
    access_key_id: localstack
    secret_access_key: localstack

Create a CSV file

printf 'Hello,World\nHello,World\n' > test.txt

Define a new Local Connection to access that CSV file (in your env.yaml):

  LOCAL:
    type: local
    url: file://<root/of/text/file>

Load the file as parquet into your bucket using a replication YAML

This is only to create a Parquet test file:

source: LOCAL
target: AWS_S3

defaults:
  mode: full-refresh
  target_options:
    format: parquet

streams:
  test.txt:
    object: test.parquet
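To confirm that the upload step produced the test file, something like the following could be used. This is a sketch under assumptions: the replication YAML above is saved as `upload.yaml` (a hypothetical file name), and the Sling and AWS CLIs are on the `PATH`.

```shell
# Sketch only: run the upload replication, then list the bucket contents.
# "upload.yaml" is a hypothetical file name for the replication shown above.
upload_and_list() {
  local replication_file="${1:-upload.yaml}"
  sling run -r "$replication_file" || return 1
  aws s3 ls s3://my-test-bucket/ --endpoint-url=http://localhost:4566
}
```

If the replication succeeded, `upload_and_list` should show `test.parquet` in the listing.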

Extract the parquet file from the bucket to local storage

This fails:

source: AWS_S3
target: LOCAL

defaults:
  mode: full-refresh
  source_options:
    format: parquet

streams:
  test.parquet:
    object: result.txt
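To narrow down whether the endpoint is the problem, one option is to read the file with DuckDB directly, with the LocalStack endpoint set explicitly. The sketch below only prints the SQL. It assumes the DuckDB CLI and its `httpfs` extension are available, and it reuses the bucket, region, and credentials from the connection above. It is not how Sling itself configures DuckDB, only a way to test the guess independently.

```shell
# Sketch only: emit DuckDB SQL that reads the Parquet file via LocalStack.
# The endpoint host defaults to localhost:4566 and can be passed as $1.
duckdb_s3_query() {
  local endpoint_host="${1:-localhost:4566}"
  cat <<SQL
INSTALL httpfs;
LOAD httpfs;
SET s3_endpoint='${endpoint_host}';
SET s3_url_style='path';
SET s3_use_ssl=false;
SET s3_region='eu-central-1';
SET s3_access_key_id='localstack';
SET s3_secret_access_key='localstack';
SELECT * FROM read_parquet('s3://my-test-bucket/test.parquet');
SQL
}
```

Piping the output into the CLI (`duckdb_s3_query | duckdb`) runs the query. If this works while the Sling extraction fails, that would support the suspicion that the endpoint is not reaching DuckDB.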
