The goal is to have a self-contained lesson page that can be built into a complete lesson.

# DataLad version control

- The git-annex extension and external storages for large data
- The DataLad tool on top of git and its sub-commands
- **Hands-on**: Get to know the tutorial repository
- **Hands-on**: Add new data to the tutorial repository



# TODO The git-annex extension and external storages for large data



# TODO The DataLad tool on top of git and its sub-commands



# TODO **Hands-on**: Get to know the tutorial repository

Let's work with a DataLad repository for the [ESA/Hubble Pictures of the Week](https://esahubble.org/images/potw/) (the series recently changed to [ESA/Hubble Pictures of the Month](https://esahubble.org/images/potm/)).

<div style="display: flex; gap: 20px; vertical-align: top;">
<div style="flex: 0 0 50%;">

1. Look around in the GitLab project
1. Install DataLad
1. Clone the data repository
1. Look at existing datasets

</div>
<div style="flex: 0 0 50%; vertical-align: top;">

<div style="position: relative; width: 300px; height: 300px;">
<img src="images/potw2438a.jpg.png" style="position: absolute; top: 0; left: 0; width: 200px; z-index: 1; border: 2px solid #333;">
<img src="images/potw2439a.jpg.png" style="position: absolute; top: 60px; left: 60px; width: 200px; z-index: 2; border: 2px solid #333;">
<img src="images/potw2440a.jpg.png" style="position: absolute; top: 120px; left: 120px; width: 200px; z-index: 3; border: 2px solid #333;">
</div>

</div>
</div>


## The GitLab Repository -- TODO replace with Forgejo

Look at the public GitLab repository
https://codebase.helmholtz.cloud/knue/esa-hubble-picture-of-the-week.datalad/project


## The GitLab Repository -- TODO replace with Forgejo

<img src="images/J1.png" data-preview-image>


## The GitLab Repository -- TODO replace with Forgejo

<img src="images/J2.png" data-preview-image>


## The GitLab Repository -- TODO replace with Forgejo

<img src="images/J3.png" data-preview-image>



## Install DataLad

Install `DataLad` and `git-annex` in your HPC account without sudo:

```bash
# Create and activate a Python virtual environment in the current directory
python3 -m venv .venv
. .venv/bin/activate
# Install both tools from PyPI into that environment
pip install datalad git-annex
```
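
Before moving on, it can help to confirm that both tools ended up on your `PATH`. The following is a minimal sketch, not part of the original lesson; the helper name `check_tool` is made up for illustration, and it assumes the virtual environment from the previous step is still active.

```bash
# Report whether a command is available on PATH (hypothetical helper).
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

check_tool datalad
check_tool git-annex

# If both are found, their versions can be printed with:
#   datalad --version
#   git annex version
```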

See the DataLad handbook for more ways to install: https://handbook.datalad.org/en/latest/intro/installation.html



## Clone the repository

TODO update with Forgejo

* Make sure you are a project member in the GitLab repository
* Make sure your SSH key is registered in GitLab
* Copy the "Clone with SSH" link
  * SSH is more convenient later, when contributing to the repository

```bash
# The last argument (the target directory) is optional
datalad clone git@codebase.helmholtz.cloud:knue/\
esa-hubble-picture-of-the-week.datalad/project.git \
esa-hubble-picture-of-the-week.datalad
```
* Ignore the warnings about "Remote origin does not have git-annex installed" and "access to 2 dataset siblings ... not auto-enabled, enable with ..."; they appear because there are multiple annex storages (TODO not needed with Forgejo)


## Look at existing datasets

Look around

```bash
cd 10__/1001/
ls -alh
md5sum *.tif
```
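
In the listing above, annexed files whose content is not present locally show up as dangling symlinks. As a rough sketch (not part of the original lesson), such links can be listed with plain `find`. The helper name `list_missing` is hypothetical, and the approach assumes the default mode in which annexed files are symlinks.

```bash
# List symlinks whose target does not exist, i.e. annexed files
# without local content (hypothetical helper).
list_missing() {
    # `test -e` follows symlinks, so it fails for dangling links;
    # only those links are printed.
    find "${1:-.}" -type l ! -exec test -e {} \; -print
}

# Usage, from inside a dataset directory:
#   list_missing .
```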

* Broken links indicate annexed files whose content is not present locally
* Run either `datalad get .` or `git annex get .` to fetch the content
* The `-J <n>` option uses up to `n` parallel streams

```bash
ls -alh
md5sum *.tif
```
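
To make the `-J` option concrete, here is a small sketch of a wrapper around `datalad get`. The function name `get_parallel` and the default of 4 jobs are arbitrary choices for illustration, not something the lesson prescribes.

```bash
# Fetch annexed content with a given number of parallel jobs
# (hypothetical helper; the default of 4 jobs is arbitrary).
get_parallel() {
    njobs=${1:-4}   # number of parallel jobs
    path=${2:-.}    # what to fetch, defaults to the current directory
    datalad get -J "$njobs" "$path"
}

# Example: get_parallel 4 .
# The plain git-annex equivalent would be: git annex get -J4 .
```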

* Run either `datalad drop .` or `git annex drop .` to remove the local content again

```bash
ls -alh
```


## Hands-on Step 1 finished

This is how DataLad and git-annex manage large binary files

_Congratulations, you have mastered the first step_

More details
* You can configure which files are annexed according to type, size, ...
* You can have one or multiple annex storages
* It works with GitHub, GitLab, or similar Git forges with extra annex storages
* Forgejo-Aneksajo (https://codeberg.org/forgejo-aneksajo/forgejo-aneksajo) makes it even easier with built-in support for annexed files



# TODO **Hands-on**: Add new data to the tutorial repository

## Adding new data to the tutorial repository

In this hands-on session, we will add new data to our DataLad repository. This includes downloading images from the ESA/Hubble website and integrating them into our dataset using DataLad's tools.

### Steps:

1. **Download new images**:
   - Visit the [ESA/Hubble Pictures of the Month](https://esahubble.org/images/potm/) website.
   - Select a few images to add to your dataset.
   - Download these images to a local directory (e.g., `downloads/`).

2. **Add images to the dataset**:
   - Copy the downloaded images into the dataset and run `datalad save` to include them (current DataLad releases use `datalad save` for this; the former `datalad add` command has been removed).
   - `datalad save` also commits the changes to the dataset, so no separate commit step is needed.

3. **Verify the addition**:
   - Check that the images are properly tracked by DataLad.
   - Ensure that the metadata is correctly updated.

This process demonstrates how DataLad handles large files efficiently, using git-annex under the hood to manage file storage and versioning.

### Example commands:
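
The steps above could look roughly like the following. This is a sketch under stated assumptions, not the lesson's official procedure: the helper name `add_images`, the commit message, and the paths `../downloads` and `new-images` are placeholders, and the commands are meant to be run from inside the cloned dataset.

```bash
# Copy downloaded images into the dataset and record them
# (hypothetical helper; all names and paths are placeholders).
add_images() {
    src_dir=$1    # where the downloaded images are, e.g. ../downloads
    dest_dir=$2   # target directory inside the dataset
    mkdir -p "$dest_dir" &&
        cp "$src_dir"/* "$dest_dir"/ &&
        datalad save -m "Add new Hubble images" "$dest_dir"
}

# Example call and follow-up checks:
#   add_images ../downloads new-images
#   datalad status                  # should report nothing left to save
#   git annex whereis new-images    # shows where the annexed content is stored
#   git log --oneline -n 3          # the new commit should be listed
```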