ISC-tutorial/02_datalad_version_control.md
2026-05-07 10:08:36 +02:00

The goal is to have a self-contained lesson page that can be built into a complete lesson.

DataLad version control

  • The git-annex extension and external storages for large data
  • The DataLad tool on top of git and its sub-commands
  • Hands-on: Get to know the tutorial repository
  • Hands-on: Add new data to the tutorial repository

TODO The git-annex extension and external storages for large data

TODO The DataLad tool on top of git and its sub-commands

TODO Hands-on: Get to know the tutorial repository

Let's work with a DataLad repository for the ESA/Hubble Pictures of the Week at https://esahubble.org/images/potw/ (recently changed to ESA/Hubble Pictures of the Month).

  1. Look around in the GitLab project
  2. Install DataLad
  3. Clone the data repository
  4. Look at existing datasets

The GitLab Repository -- TODO replace with Forgejo

Look at the public GitLab repository https://codebase.helmholtz.cloud/knue/esa-hubble-picture-of-the-week.datalad/project

Install DataLad

Install DataLad and git-annex in your HPC account without sudo, using a Python virtual environment:

python3 -m venv .venv
. .venv/bin/activate
pip install datalad git-annex

See the DataLad handbook for more ways to install: https://handbook.datalad.org/en/latest/intro/installation.html
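To confirm that the installation worked, it can help to check that all three tools are reachable from the activated environment. The following is a minimal sketch; the function name check_tools is a hypothetical helper, not part of DataLad.

```shell
# Report whether git, git-annex and datalad are available on PATH.
# check_tools is a hypothetical helper name used only for this sketch.
check_tools() {
    for tool in git git-annex datalad; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found at $(command -v "$tool")"
        else
            echo "$tool: NOT found"
        fi
    done
}

check_tools
```

If a tool is reported as not found, make sure the virtual environment is activated in the current shell.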

Clone the repository

TODO update with Forgejo

  • Make sure you are a project member in the GitLab repository
  • Make sure to have your SSH key registered in GitLab
  • Copy the "Clone with SSH" link
    • More convenient for later when contributing to the repository
# The last argument (the target directory) is optional
datalad clone git@codebase.helmholtz.cloud:knue/\
esa-hubble-picture-of-the-week.datalad/project.git \
    esa-hubble-picture-of-the-week.datalad
  • You can ignore the warnings about "Remote origin does not have git-annex installed" and "access to 2 dataset siblings ... not auto-enabled, enable with ...". They are expected because the annexed content is kept in multiple separate annex storages (TODO not needed with Forgejo)

Look at existing datasets

Look around

cd 10__/1001/
ls -alh
md5sum *.tif
  • Broken symlinks indicate annexed files whose content is not yet present locally
  • Run either datalad get . or git annex get . to fetch the content
    • The -J <n> option uses up to n parallel streams
ls -alh
md5sum *.tif
  • Run either datalad drop . or git annex drop . to remove the local content again
ls -alh
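The broken symlinks seen before datalad get can also be listed explicitly. The sketch below uses plain shell tests only; list_missing is a hypothetical helper name. git-annex can report the same information itself, for example with git annex find --not --in=here.

```shell
# List symlinks in a directory whose target does not exist,
# i.e. annexed files whose content has not been fetched yet.
# list_missing is a hypothetical helper name used only for this sketch.
list_missing() {
    dir="${1:-.}"
    for f in "$dir"/*; do
        # -L: the entry is a symlink; ! -e: its target is missing
        if [ -L "$f" ] && [ ! -e "$f" ]; then
            echo "$f"
        fi
    done
}

list_missing .
```

After datalad get . the list should be empty, and after datalad drop . the files should show up again.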

Hands-on Step 1 finished

This is how DataLad and git-annex manage large binary files: git tracks a small symlink, and the file content is fetched or dropped on demand.

Congratulations, you have mastered the first step!

More details

  • You can configure which files become annexed files, for example according to type, size, ...
  • You can have one or multiple annex storages
  • It works with GitHub, GitLab, or similar git forges combined with extra annex storages
  • Forgejo-Aneksajo (https://codeberg.org/forgejo-aneksajo/forgejo-aneksajo) makes it even easier with built-in support for annexed files
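As an illustration of the first point, annexing rules are commonly expressed through the annex.largefiles attribute in .gitattributes. The sketch below is only an example: the 100kb threshold, the *.md pattern, and the helper name set_largefiles_rules are assumptions, not settings taken from the tutorial repository.

```shell
# Append example annex.largefiles rules to a dataset's .gitattributes.
# set_largefiles_rules is a hypothetical helper; the rules are illustrative.
set_largefiles_rules() {
    ds="${1:-.}"
    cat >> "$ds/.gitattributes" <<'EOF'
* annex.largefiles=(largerthan=100kb)
*.md annex.largefiles=nothing
EOF
}
```

With these rules, files larger than 100 kB would go to the annex, while Markdown files would always stay in plain git because the later, more specific line takes precedence. In a DataLad dataset, the changed .gitattributes would then be recorded with datalad save.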

TODO Hands-on: Add new data to the tutorial repository

Adding new data to the tutorial repository

In this hands-on session, we will add new data to our DataLad repository. This includes downloading images from the ESA/Hubble website and integrating them into our dataset using DataLad's tools.

Steps:

  1. Download new images:

    • Visit the ESA/Hubble Pictures of the Month website.
    • Select a few images to add to your dataset.
    • Download these images to a local directory (e.g., downloads/).
  2. Add images to the dataset:

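    • Use datalad save to include the downloaded images in the dataset (the older datalad add command is no longer available in current DataLad releases).
    • datalad save also commits the changes to the dataset in the same step.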
  3. Verify the addition:

    • Check that the images are properly tracked by DataLad, for example with datalad status.
    • Ensure that the metadata is correctly updated.

This process demonstrates how DataLad handles large files efficiently, using git-annex under the hood to manage file storage and versioning.

Example commands:
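A minimal sketch of steps 2 and 3, assuming the images were already downloaded and that the commands run from the root of the cloned dataset. The helper name add_images and the directory names in the usage line are hypothetical; datalad save is the current DataLad command for adding and committing content.

```shell
# Copy downloaded images into the dataset and record them.
# Usage: add_images <source-dir> <target-dir-in-dataset> <commit-message>
# add_images is a hypothetical helper name used only for this sketch.
add_images() {
    src="$1"
    dest="$2"
    msg="$3"
    if [ ! -d "$src" ]; then
        echo "source directory not found: $src" >&2
        return 1
    fi
    mkdir -p "$dest"
    cp "$src"/* "$dest"/
    datalad save -m "$msg" "$dest"    # adds the files and commits in one step
    datalad status                    # a clean dataset has nothing left to save
    git annex whereis "$dest"         # shows where annexed file content is stored
}
```

For example, add_images ~/downloads new-images "Add new ESA/Hubble images" would copy the files into a new-images folder of the dataset and save them; both directory names are placeholders.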