Repository files (latest commit first)
Filename	Latest commit message	Latest commit date
Adina Wagner d86b07a578 Re-register 24__/2444/potw2444a.jpg		2026-05-11 14:15:44 +02:00
.datalad	[DATALAD] new dataset	2024-10-01 15:20:40 +02:00
10__	Trigger webhook again, don't actually change something	2025-10-01 15:13:21 +02:00
16__	Trigger webhook	2025-08-19 15:27:03 +02:00
22__/2222	Trigger webhook again, don't actually change something	2025-10-01 15:13:21 +02:00
24__	Re-register 24__/2444/potw2444a.jpg	2026-05-11 14:15:44 +02:00
25__	[DATALAD RUNCMD] ./extract_metadata.py 25__/2551/ 25__/25...	2026-01-15 14:41:59 +01:00
26__/2601	Add 2601	2026-02-12 13:00:32 +01:00
code/histograms	chmod +x	2026-02-12 13:00:48 +01:00
nocodb_automation@aa805e2e82	Add 2550	2025-12-19 16:35:33 +01:00
.gitattributes	PNG goes to git as preview images	2024-10-02 10:05:07 +02:00
.gitignore	Ignore venv	2024-10-01 18:59:29 +02:00
.gitmodules	Add 2550	2025-12-19 16:35:33 +01:00
copyright.html	Add README and licensing information	2024-10-04 16:24:53 +02:00
extract_metadata.py	new dataset, adjusted extract script	2024-10-04 16:43:43 +02:00
LICENSE.md	Add README and licensing information	2024-10-04 16:24:53 +02:00
README.md	Update NocoDB link	2026-01-07 11:03:23 +01:00
requirements.txt	Add Python requirements file	2024-10-16 12:21:05 +02:00
SCHEMA.json	Improve extract script, new SCHEMA, new DATASET.json files	2024-10-03 11:36:28 +02:00

README.md

Datalad tutorial repository with ESA/Hubble Pictures of the Week

This is a tutorial data repository to teach F.A.I.R. Research Data Management (RDM) with datalad, git, and git-annex.

Prerequisites

You need a recent version of DataLad (>= 1.1.0 suggested) as well as git and git-annex. Suggested installation methods are via a package manager, e.g.

sudo apt install datalad git-annex git-annex-remote-rclone

if possible, or via Python in a venv like

python3 -m pip install datalad

or, outside a venv, as a per-user installation like

python3 -m pip install --user datalad

or via the uv tool as

uv tool install datalad

See the datalad handbook for more options.

Additional tools for this specific example

For this specific example, but not for datalad repositories in general, you need exiftool and some Python libraries. To get the exiftool command, install it via your favorite package manager, e.g. apt with

sudo apt install libimage-exiftool-perl

(or similar, or see its install instructions). For the Python packages pillow and PyExifTool use

pip install -r requirements.txt

(or the equivalent command of your preferred Python package tool).
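Before continuing, it can help to confirm that everything listed above is actually available. The following is a minimal sketch, not part of this repository: the function name `missing_prerequisites` is hypothetical, and the import names `PIL` (for pillow) and `exiftool` (for PyExifTool) are assumptions about how those packages are imported.

```python
import importlib.util
import shutil

# Command-line tools named in this README.
REQUIRED_COMMANDS = ["git", "git-annex", "datalad", "exiftool"]
# Assumption: pillow is imported as PIL and PyExifTool as exiftool.
REQUIRED_MODULES = ["PIL", "exiftool"]


def missing_prerequisites(commands=REQUIRED_COMMANDS, modules=REQUIRED_MODULES):
    """Return the names of commands and Python modules that cannot be found."""
    # shutil.which returns None when a command is not on the PATH.
    missing = [cmd for cmd in commands if shutil.which(cmd) is None]
    # find_spec returns None when a top-level module is not installed.
    missing += [mod for mod in modules if importlib.util.find_spec(mod) is None]
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    print("all prerequisites found" if not problems else f"missing: {problems}")
```

The check only tests for presence; it does not verify the suggested minimum DataLad version.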

Clone the repository

Clone this repository with datalad via ssh:

datalad clone git@codebase.helmholtz.cloud:knue/esa-hubble-picture-of-the-week.datalad/project.git

(assuming you have the proper access rights already so you were able to read this).

Look around

This data repository consists of many datasets. You see a directory tree like

... and probably more in the future. This layout organises datasets with a four-digit name abcd into a sub-directory ab__/ and, inside that, a sub-sub-directory abcd/. All files that belong to a dataset are found inside the latter. This scheme prevents any directory from accumulating a very large number of sub-directories. Even though a local Linux filesystem has no issues with many thousands of entries in a directory, at some point it becomes problematic, for example when listing them. On parallel HPC file systems this is more critical. Always use some scheme like this, however arbitrary, to limit the number of entries in a directory to at most several hundred or a few thousand.
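The naming scheme can be written down as a small function. This is only an illustrative sketch; the helper name `dataset_dir` is hypothetical and does not exist in this repository.

```python
from pathlib import PurePosixPath


def dataset_dir(dataset_id: str) -> PurePosixPath:
    """Map a four-digit dataset name 'abcd' to the relative path 'ab__/abcd'."""
    if len(dataset_id) != 4 or not (dataset_id.isascii() and dataset_id.isdigit()):
        raise ValueError(f"expected a four-digit name, got {dataset_id!r}")
    # The first two digits select the parent directory, e.g. '24__'.
    return PurePosixPath(f"{dataset_id[:2]}__") / dataset_id


print(dataset_dir("2422"))  # 24__/2422
```

With four-digit names, each ab__/ directory can hold at most 100 datasets, and there can be at most 100 such directories at the top level.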

Inside each dataset directory ab__/abcd/ you can find several text files and several binary (data) files. The text file DATASET.json is a special file, which is the representative file for the entire dataset. It contains the structured metadata for this dataset (see below). It should always be part of a finished dataset. In addition, there is a README.md file with some basic information and a preview of the data. This is for convenience for human users to get an overview of the dataset. It is also very handy when looking at the dataset in gitlab.
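A quick way to see whether a dataset directory is finished in the sense described above is to check for the two text files. The sketch below assumes only the file names given in the text; the helper `missing_files` is hypothetical.

```python
from pathlib import Path

# File names taken from the description above.
EXPECTED_FILES = ("DATASET.json", "README.md")


def missing_files(dataset_path):
    """Return the expected text files that are absent from one dataset directory."""
    dataset_path = Path(dataset_path)
    missing = []
    for name in EXPECTED_FILES:
        candidate = dataset_path / name
        # A symlink counts as present even if its content has not been fetched.
        if not (candidate.is_file() or candidate.is_symlink()):
            missing.append(name)
    return missing
```

For example, `missing_files("24__/2422")` returns an empty list for a complete dataset.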

The binary data files in this example are image files, some of them quite large. They are not the kind of files that are good for handling with git alone. This is what git-annex and datalad are good for and this is part of what this tutorial wants to teach.

But why are the binary data files broken?

After cloning the repository as above, the binary files *.jpg and *.tif are listed but not accessible. They are actually broken symlinks. Use datalad get <subdir> to get a subset or datalad get . to get all of them -- see also the datalad handbook. Datalad will download them for you from their original location on the web (see below). There is another source, s3.casus.science, to download them from, but this is not (yet) accessible without additional configuration and authentication information.
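To see which files still lack their content, one can look for symlinks whose target does not exist. The following is a minimal sketch using only the Python standard library; the function name `absent_content` is hypothetical, and the sketch assumes the default git-annex setup in which annexed files are symlinks (it would not detect pointer files in unlocked or adjusted-branch checkouts).

```python
import os
from pathlib import Path


def absent_content(root):
    """List symlinks below root whose target is missing (content not yet fetched)."""
    root = Path(root)
    absent = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into git's own bookkeeping directory.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            path = Path(dirpath) / name
            # is_symlink() inspects the link itself; exists() follows it.
            if path.is_symlink() and not path.exists():
                absent.append(path.relative_to(root))
    return sorted(absent)
```

After a successful datalad get on a directory, the same call on that directory should return an empty list.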

Source of data for this repository

The source of data for our example is ESA/Hubble Picture of the Week https://esahubble.org/images/potw/.

See https://esahubble.org/copyright/ and the file LICENSE.md.

Adding more datasets to this repository

The proper way to add more datasets to this data repository is as follows:

1. Point your web browser to https://esahubble.org/images/potw/ and pick one of their pictures of the week which is not yet in this data repository. Go to the specific page like https://esahubble.org/images/potw2422a/ ... note the 4-digit number near the end of the URL.
2. Create the proper subdirectory for the new dataset. If the 4-digit number was abcd, then run mkdir -p ab__/abcd/, then cd ab__/abcd/.
3. Download the web page to the dataset sub-dir, for example curl -O https://esahubble.org/images/potw2422a/ (in case of issues with older versions of curl use curl -o potw2422a https://esahubble.org/images/potw2422a/).
4. Use datalad to download the two images "Fullsize Original" (TIFF format) and "Large JPEG" (JPG format) like datalad download-url https://esahubble.org/media/archives/images/original/potw2422a.tif https://cdn.esahubble.org/archives/images/large/potw2422a.jpg
5. Go to the root dir of the data repository again: cd ../../
6. Now run ./extract_metadata.py ab__/abcd/ to automatically extract all metadata and produce DATASET.json, README.md, and a preview image in PNG format. Only with those extra files is the dataset complete. (You might need to install the specific dependencies used by the Python script extract_metadata.py in this example, either globally or in a venv or similar.)
7. Run datalad status to see what changed in the repository. Then run datalad save -m "<commit message>" to commit it locally. This is basically the same as git add plus git commit. Actually, you can use the corresponding git commands just as well.

Now, you have added a new dataset locally to the data collection in this data repository.
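The command sequence above can be generated for any picture name so that it can be reviewed before running it by hand. This is a sketch under assumptions: the helper `plan_new_dataset` is hypothetical, the name pattern (potw, four digits, one lowercase letter) is inferred from the single example potw2422a, and the three URL templates simply generalise the example URLs and may not hold for every picture.

```python
import re
import shlex

# Assumption: these templates generalise the potw2422a example URLs.
PAGE_URL = "https://esahubble.org/images/{name}/"
ORIGINAL_TIF_URL = "https://esahubble.org/media/archives/images/original/{name}.tif"
LARGE_JPG_URL = "https://cdn.esahubble.org/archives/images/large/{name}.jpg"


def plan_new_dataset(name, message):
    """Return the shell commands for adding one dataset as a list of strings."""
    match = re.fullmatch(r"potw(\d{4})[a-z]", name)
    if match is None:
        raise ValueError(f"unexpected picture name: {name!r}")
    abcd = match.group(1)
    subdir = f"{abcd[:2]}__/{abcd}/"
    return [
        f"mkdir -p {subdir}",
        f"cd {subdir}",
        # The explicit -o form also works with older versions of curl.
        f"curl -o {name} {PAGE_URL.format(name=name)}",
        "datalad download-url "
        f"{ORIGINAL_TIF_URL.format(name=name)} {LARGE_JPG_URL.format(name=name)}",
        "cd ../../",
        f"./extract_metadata.py {subdir}",
        "datalad status",
        f"datalad save -m {shlex.quote(message)}",
    ]


if __name__ == "__main__":
    for command in plan_new_dataset("potw2422a", "Add 2422"):
        print(command)
```

The function only prints commands and does not execute anything, so the result can be checked against the web page first.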

Comment 1: In step 3 you might want to use datalad download-url instead of curl for the HTML file, too. This would be more datalad-ish, of course. For the sake of the example, let's pretend that step 3 is an example of producing data yourself in some way, whereas step 4 explicitly downloads things from public web links.

Comment 2: For step 6, wrapping the call to extract_metadata.py in datalad run is the even more datalad-ish way.
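As a rough illustration of the datalad run variant, the sketch below builds the command line for one dataset directory. The helper `datalad_run_command` is hypothetical; declaring the dataset directory as `--input` (so that datalad fetches its files first) is an assumption about what the script needs, and `--output` declarations may be needed as well if the script rewrites annexed files.

```python
import shlex


def datalad_run_command(dataset_dir, message):
    """Build an argv list that wraps ./extract_metadata.py in `datalad run`."""
    return [
        "datalad", "run",
        "-m", message,
        # Assumption: the script reads the image files in this directory.
        "--input", dataset_dir,
        "./extract_metadata.py", dataset_dir,
    ]


if __name__ == "__main__":
    argv = datalad_run_command("24__/2422/", "Extract metadata for 2422")
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Because datalad run records the command in the commit, a separate datalad save is not needed for the files it produces.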

Work collaboratively with the data repository

To share your new datasets with the rest of your team, you need to push them to the gitlab repository. This is how you work collaboratively with the data repository.

To update your local repository from the central gitlab repository, run datalad update --how=merge or the equivalent git command git pull. (N.B., datalad update by itself is equivalent to git fetch). This is how you get new datasets that someone else added.
Run datalad push or datalad push --to <name-of-gitlab-sibling> to bring the gitlab version of this repository up to date. This is equivalent to git push.

The data inventory for the data repository

The gitlab version of this repository is automatically connected to a public dataset inventory. This is where all the metadata of all the datasets (from their DATASET.json files) is collected in a convenient, searchable, and filterable web browser view. There, one can search for specific properties and find the datasets that match those properties.
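A much simpler local counterpart of such an inventory can be sketched by loading all DATASET.json files from a clone. This is not how the public inventory is implemented; the helpers `collect_metadata` and `filter_datasets` are hypothetical, and the sketch assumes that each DATASET.json holds a JSON object. The actual property names are defined by SCHEMA.json and are not assumed here.

```python
import json
from pathlib import Path


def collect_metadata(root):
    """Load every ab__/abcd/DATASET.json below root, keyed by dataset directory."""
    root = Path(root)
    inventory = {}
    for path in sorted(root.glob("*__/*/DATASET.json")):
        with path.open(encoding="utf-8") as handle:
            inventory[path.parent.relative_to(root).as_posix()] = json.load(handle)
    return inventory


def filter_datasets(inventory, key, value):
    """Return the dataset directories whose top-level property key equals value."""
    return [
        name
        for name, metadata in inventory.items()
        if isinstance(metadata, dict) and metadata.get(key) == value
    ]
```

For example, `filter_datasets(collect_metadata("."), key, value)` lists the matching dataset directories for a property name `key` taken from SCHEMA.json.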