# Import data
Move existing data into a Tonbo Artifacts workspace from a Mac folder, an S3 bucket, or local disk.
A native `artifacts migrate` command is on the v1 list; for v0, use the existing battle-tested S3 / rsync tooling. Pick the recipe that matches where your data lives now.
## From a Mac folder
The v0 mount client is Linux-only, so Mac → Tonbo Artifacts is a two-hop: stage the data to your bucket first, then read it into the mount from a Linux host.
- On the Mac: stage to your Tigris bucket

  Install rclone:

  ```shell
  brew install rclone
  ```

  Configure it to talk to Tigris:

  ```shell
  rclone config create tigris s3 \
    provider=Other \
    access_key_id=<tigris-access-key> \
    secret_access_key=<tigris-secret-key> \
    endpoint=https://fly.storage.tigris.dev \
    region=auto
  ```

  Upload, with parallel transfers + progress:

  ```shell
  rclone copy /Users/<you>/cases tigris:panta-cases-staging/ \
    --transfers 32 --progress
  ```

  For 6 GB / 44k files this typically takes 10–30 minutes, depending on uplink bandwidth.
- On a Linux host with the workspace mounted: pull from staging

  ```shell
  aws s3 cp s3://panta-cases-staging/ /mnt/work/ \
    --recursive \
    --endpoint-url https://fly.storage.tigris.dev
  ```

  Or, with s5cmd:

  ```shell
  s5cmd --endpoint-url https://fly.storage.tigris.dev \
    cp 's3://panta-cases-staging/*' /mnt/work/
  ```

- Validate, then drop the staging bucket

  Pick a representative file and verify cold + warm reads (see Validation below). Once you're satisfied, delete the staging bucket; the data now lives in your Artifacts workspace's bucket plus Tonbo's metadata service.
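Before deleting the staging bucket, a quick structural check is cheap: compare file counts and total bytes between the staged source tree and the mount. A minimal sketch — the `tree_stats` helper is ours, not part of any Tonbo or rclone tooling, and it assumes GNU find (standard on Linux) for `-printf`:

```shell
# tree_stats ROOT: print "<file-count> <total-bytes>" for every regular
# file under ROOT.
tree_stats() {
  find "$1" -type f -printf '%s\n' \
    | awk '{ n++; s += $1 } END { printf "%d %d\n", n + 0, s + 0 }'
}

# Run against both trees; the two outputs should match exactly:
#   tree_stats /path/to/staged/source
#   tree_stats /mnt/work
```

This catches missed files and truncated transfers, but not bit flips; for byte-level verification, a checksum pass (e.g. `rclone check`) is the stronger tool.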
## From an S3 source
If your data already lives in another S3-compatible bucket (Tigris, R2, AWS S3, MinIO), it's a single hop:

```shell
aws s3 cp s3://<source-bucket>/ /mnt/work/ \
  --recursive \
  --endpoint-url https://<source-endpoint>
```

Or, with s5cmd:

```shell
s5cmd --endpoint-url https://<source-endpoint> \
  cp 's3://<source-bucket>/*' /mnt/work/
```

Or, with rclone:

```shell
rclone copy <source-remote>:<bucket> /mnt/work/ \
  --transfers 32 --progress
```

The Tonbo workspace's bucket gets the chunks (/mnt/work is the FUSE mount, so writes flow through to your configured bucket). The metadata (inode tree, chunk pointers) lands in Tonbo's Redis.
## From a local disk / NFS / etc.
```shell
rsync -avP --info=progress2 /local/source/ /mnt/work/
```
For directory trees with lots of small files, parallelise per file:

```shell
cd /local/source
find . -type f \
  | parallel -j 32 rsync -aR --info=progress2 {} /mnt/work/
```

(Requires GNU parallel. Running from inside the source directory lets rsync's -R flag preserve the relative paths, so the structure mirrors under /mnt/work; feeding absolute paths to -R would recreate the full /local/source prefix on the destination.)
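If GNU parallel isn't installed, `xargs -P` gives similar fan-out. A sketch that shards one rsync job per top-level directory instead of per file — the `parallel_rsync` helper and the job count of 8 are our choices, not Tonbo's:

```shell
# parallel_rsync SRC DST: run one rsync per top-level directory under SRC,
# up to 8 jobs at a time, then copy any files sitting directly in SRC.
parallel_rsync() {
  local src=$1 dst=$2
  find "$src" -mindepth 1 -maxdepth 1 -type d -print0 \
    | xargs -0 -P 8 -I {} rsync -a {} "$dst"/
  # top-level files (non-recursive final pass)
  find "$src" -mindepth 1 -maxdepth 1 -type f -print0 \
    | xargs -0 -r -I {} rsync -a {} "$dst"/
}
```

Per-directory sharding keeps each rsync's delta computation cheap and avoids spawning one process per file, which matters when the tree has tens of thousands of entries.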
## Validation
Always sanity-check after a bulk import:
```shell
# Pick a file that's in your real workload (not synthetic).
TARGET=/mnt/work/<some-real-file>
ls -la "$TARGET"

# Cold read: the first read pulls the chunk from object storage.
time cat "$TARGET" >/dev/null

# Warm read: should be single-digit ms.
time cat "$TARGET" >/dev/null
```
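If you want a number you can log rather than eyeballing `time` output, a small helper works — `read_ms` is ours, not part of the mount tooling, and it relies on GNU date's `%N` nanosecond field (available on Linux):

```shell
# read_ms FILE: read FILE once and print the elapsed wall time in whole ms.
read_ms() {
  local t0 t1
  t0=$(date +%s%N)
  cat "$1" > /dev/null
  t1=$(date +%s%N)
  echo $(( (t1 - t0) / 1000000 ))
}

# Cold then warm; the second number should drop to single digits:
#   read_ms "$TARGET"; read_ms "$TARGET"
```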
After 60 s of idle, confirm zero object-storage errors via the mount's stats file:
```shell
sleep 60
grep -E '^(juicefs_fuse_ops_io_errors_EIO|juicefs_object_request_errors|juicefs_staging_blocks)' \
  /mnt/work/.stats
```
All values should be 0 (or stable at zero) after the idle window.
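For unattended imports, the grep can be turned into a pass/fail check. A sketch assuming the stats file uses one `name value` pair per line (the `assert_zero_errors` helper is ours, not part of the mount client):

```shell
# assert_zero_errors STATS_FILE: exit 0 if every matched counter is 0;
# otherwise print the offending lines and exit 1.
assert_zero_errors() {
  awk '/^(juicefs_fuse_ops_io_errors_EIO|juicefs_object_request_errors|juicefs_staging_blocks)/ && $2 + 0 != 0 { bad = 1; print } END { exit bad }' "$1"
}

# Usage after the idle window:
#   sleep 60 && assert_zero_errors /mnt/work/.stats || echo "import needs a look"
```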
## What about an `artifacts migrate` command?
A first-class wrapper that handles all three patterns above with progress, validation, and resume in one CLI is on the v1 list. v0 intentionally leans on the existing tools because they're mature and your Linux host already has them.
If your migration ergonomics are blocking your benchmark or production cutover, ping us. We'll prioritise based on what's actually painful.