ISC-tutorial/02_datalad_version_control.md
2026-05-07 10:08:36 +02:00

The goal is to have a self-contained lesson page that can be built into a complete lesson.
# DataLad version control
- The git-annex extension and external storages for large data
- The DataLad tool on top of git and its sub-commands
- **Hands-on**: Get to know the tutorial repository
- **Hands-on**: Add new data to the tutorial repository
# TODO The git-annex extension and external storages for large data
# TODO The DataLad tool on top of git and its sub-commands
# TODO **Hands-on**: Get to know the tutorial repository
Let's work with a DataLad repository for the [ESA/Hubble Pictures of the Week](https://esahubble.org/images/potw/) (the series recently changed to [ESA/Hubble Pictures of the Month](https://esahubble.org/images/potm/)).
<div style="display: flex; gap: 20px; vertical-align: top;">
<div style="flex: 0 0 50%;">
1. Look around in the GitLab project
1. Install DataLad
1. Clone the data repository
1. Look at existing datasets
</div>
<div style="flex: 0 0 50%; vertical-align: top;">
<div style="position: relative; width: 300px; height: 300px;">
<img src="images/potw2438a.jpg.png" style="position: absolute; top: 0; left: 0; width: 200px; z-index: 1; border: 2px solid #333;">
<img src="images/potw2439a.jpg.png" style="position: absolute; top: 60px; left: 60px; width: 200px; z-index: 2; border: 2px solid #333;">
<img src="images/potw2440a.jpg.png" style="position: absolute; top: 120px; left: 120px; width: 200px; z-index: 3; border: 2px solid #333;">
</div>
</div>
</div>
## The GitLab Repository -- TODO replace with Forgejo
Look at the public GitLab repository
https://codebase.helmholtz.cloud/knue/esa-hubble-picture-of-the-week.datalad/project
## The GitLab Repository -- TODO replace with Forgejo
<img src="images/J1.png" data-preview-image>
## The GitLab Repository -- TODO replace with Forgejo
<img src="images/J2.png" data-preview-image>
## The GitLab Repository -- TODO replace with Forgejo
<img src="images/J3.png" data-preview-image>
## Install DataLad
Install `DataLad` and `git-annex` in your HPC account without `sudo`, using a Python virtual environment:
```bash
python3 -m venv .venv
. .venv/bin/activate
pip install datalad git-annex
```
See the DataLad handbook for more ways to install: https://handbook.datalad.org/en/latest/intro/installation.html
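After installing, it can help to confirm that both tools are reachable from the activated environment. The following is a minimal sketch; the helper name `check_tool` is hypothetical and not part of DataLad.

```bash
# Report whether a command is available on PATH.
# check_tool is a hypothetical helper name used only for this sketch.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

check_tool datalad
check_tool git-annex
```

If both are reported as found, `datalad --version` and `git annex version` print the installed versions.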
## Clone the repository
TODO update with Forgejo
* Make sure you are a project member in the GitLab repository
* Make sure to have your SSH key registered in GitLab
* Copy the "Clone with SSH" link
  * SSH access is more convenient later when contributing to the repository
```bash
# The last argument (the target directory) is optional
datalad clone git@codebase.helmholtz.cloud:knue/\
esa-hubble-picture-of-the-week.datalad/project.git \
esa-hubble-picture-of-the-week.datalad
```
* You can ignore the warnings "Remote origin does not have git-annex installed" and "access to 2 dataset siblings ... not auto-enabled, enable with ...". They appear because the repository uses multiple annex storages (TODO not needed with Forgejo)
## Look at existing datasets
Look around inside the cloned dataset:
```bash
cd 10__/1001/
ls -alh
md5sum *.tif
```
* Broken symlinks indicate annexed files whose content is not yet present locally
* Run either `datalad get .` or `git annex get .` to fetch the content
* The `-J <n>` option uses up to n parallel streams
```bash
ls -alh
md5sum *.tif
```
* Run either `datalad drop .` or `git annex drop .` to remove the local content again
```bash
ls -alh
```
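To see at a glance which annexed files in a directory currently lack their content, one option is to list the dangling symlinks. This is a small sketch in plain shell; the function name `list_missing` is hypothetical, and it assumes the dataset uses symlinks for annexed files (the default on most Linux file systems).

```bash
# List symlinks in a directory whose target does not exist,
# i.e. annexed files whose content is not present locally.
# list_missing is a hypothetical helper name used only for this sketch.
list_missing() {
    dir="${1:-.}"
    for f in "$dir"/*; do
        # -L: the entry is a symlink; ! -e: its target cannot be resolved
        if [ -L "$f" ] && [ ! -e "$f" ]; then
            echo "${f##*/}"
        fi
    done
}

list_missing .
```

After `datalad get .` the list should be empty; after `datalad drop .` the files should show up again.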
## Hands-on Step 1 finished
This is how DataLad and git-annex manage large binary files.
_Congratulations, you have mastered the first step!_
More details
* You can configure which files to make annexed files according to type, size, ...
* You can have one or multiple annex storages
* It works with GitHub, GitLab, or similar git forges combined with extra annex storages
* Forgejo-Aneksajo (https://codeberg.org/forgejo-aneksajo/forgejo-aneksajo) makes it even easier with built-in support for annexed files
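As an illustration of the first point, git-annex reads `annex.largefiles` rules from `.gitattributes`. The sketch below appends a few example rules; the function name `write_largefiles_rules`, the 100 kB threshold, and the file patterns are assumptions for illustration, not the settings of the tutorial repository.

```bash
# Append example annex.largefiles rules to a dataset's .gitattributes.
# Function name, threshold, and patterns are illustrative assumptions.
write_largefiles_rules() {
    dataset_dir="${1:-.}"
    cat >> "$dataset_dir/.gitattributes" <<'EOF'
* annex.largefiles=(largerthan=100kb)
*.md annex.largefiles=nothing
*.tif annex.largefiles=anything
EOF
}
```

Later lines take precedence for files they match, so with these rules Markdown files always stay in git, `.tif` files are always annexed, and everything else is annexed only above the size threshold. Inside a dataset, one could call `write_largefiles_rules .` and then record the change with `datalad save`.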
# TODO **Hands-on**: Add new data to the tutorial repository
## Adding new data to the tutorial repository
In this hands-on session, we will add new data to our DataLad repository. This includes downloading images from the ESA/Hubble website and integrating them into our dataset using DataLad's tools.
### Steps:
1. **Download new images**:
   - Visit the [ESA/Hubble Pictures of the Month](https://esahubble.org/images/potm/) website.
   - Select a few images to add to your dataset.
   - Download these images to a local directory (e.g., `downloads/`).
2. **Add images to the dataset**:
   - Use `datalad save` to include the downloaded images in the dataset (the older `datalad add` command has been removed from current DataLad releases).
   - `datalad save` also commits the changes to the dataset.
3. **Verify the addition**:
   - Check that the images are properly tracked by DataLad.
   - Ensure that the metadata is correctly updated.
This process demonstrates how DataLad handles large files efficiently, using git-annex under the hood to manage file storage and versioning.
### Example commands:
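A minimal sketch of the add step, assuming it is run from inside the dataset and that the images were downloaded to `downloads/`. The function name `add_images` and the target folder `new-images/` are hypothetical. It uses `datalad save`, which in current DataLad releases stages new files and commits them in one step.

```bash
# Copy downloaded images into the dataset and record them in one commit.
# Assumptions: run from inside the dataset; add_images and the default
# folder names are hypothetical and only serve this sketch.
add_images() {
    src="${1:-downloads}"
    dest="${2:-new-images}"
    mkdir -p "$dest"
    cp "$src"/* "$dest"/
    # datalad save stages the new files (annexing large ones) and commits them
    datalad save -m "Add new ESA/Hubble images" "$dest"
}
```

Calling `add_images downloads new-images` performs the copy and the save. Afterwards, `datalad status` should report a clean dataset, and `git annex whereis new-images/` shows where the content of each annexed file is stored.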