Datalad tutorial repository with ESA/Hubble Pictures of the Week
This is a tutorial data repository for teaching F.A.I.R. Research Data Management (RDM) with datalad, git, and git-annex.
Prerequisites
You need a recent version of DataLad (>= 1.1.0 suggested) as well as git and git-annex. The suggested installation method is via a package manager, for example
sudo apt install datalad git-annex git-annex-remote-rclone
if possible or via Python in a venv like
python3 -m pip install datalad
or
python3 -m pip install --user datalad
or via the uv tool as
uv tool install datalad
See the datalad handbook for more options.
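To confirm that the prerequisites are in place, you can look for the executables on your PATH. The following is a minimal Python sketch; the helper name missing_tools is hypothetical, and it only checks that the commands exist, not which version is installed.

```python
import shutil


def missing_tools(tools=("datalad", "git", "git-annex")):
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("datalad, git and git-annex were all found.")
```

To verify the suggested minimum version, run datalad --version separately.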
Additional tools for this specific example
This specific example, unlike datalad repositories in general, also needs exiftool and some Python libraries. Install the exiftool command via your favorite package manager, e.g. apt with
sudo apt install libimage-exiftool-perl
(or similar or see its install instructions). For the Python packages pillow and PyExifTool use
pip install -r requirements.txt
(or with a similar tool).
Clone the repository
Clone this repository with datalad via ssh:
datalad clone git@codebase.helmholtz.cloud:knue/esa-hubble-picture-of-the-week.datalad/project.git
(assuming you have the proper access rights already so you were able to read this).
Look around
This data repository consists of many datasets. You will see a directory tree like
./
  10__/
    1001/
    1002/
    1003/
  16__/
    1624/
  24__/
    2438/
    2439/
    2440/
... and probably more in the future. The layout places a dataset with the four-digit name abcd in the sub-directory ab__/ and, inside that, in the sub-sub-directory abcd/. All files that belong to a dataset are found in the latter. This scheme prevents any single directory from accumulating a very large number of sub-directories. A local Linux filesystem copes with many thousands of entries in a directory, but at some point this becomes problematic, for example when listing them. Parallel HPC file systems are more sensitive to this. Always use some scheme like this one to keep the number of entries per directory to at most several hundred or a few thousand.
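The naming scheme can be expressed in a few lines of Python. This is only an illustrative sketch; the function name dataset_dir is hypothetical and not part of this repository.

```python
from pathlib import Path


def dataset_dir(name: str, root: Path = Path(".")) -> Path:
    """Map a four-digit dataset name 'abcd' to root/ab__/abcd."""
    if len(name) != 4 or not (name.isascii() and name.isdigit()):
        raise ValueError(f"expected a four-digit name, got {name!r}")
    # The first two digits select the bucket directory, e.g. '24__'.
    return root / f"{name[:2]}__" / name


print(dataset_dir("2438").as_posix())  # 24__/2438
```

With two leading digits per bucket, each ab__/ directory holds at most 100 datasets and the root holds at most 100 buckets.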
Inside each dataset directory ab__/abcd/ you can find several text files and several binary (data) files. The text file DATASET.json is a special file, which is the representative file for the entire dataset. It contains the structured metadata for this dataset (see below). It should always be part of a finished dataset. In addition, there is a README.md file with some basic information and a preview of the data. This is for convenience for human users to get an overview of the dataset. It is also very handy when looking at the dataset in gitlab.
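Since DATASET.json and README.md are expected in every finished dataset, a small check can list the dataset directories that still lack one of them. The sketch below assumes the ab__/abcd/ layout described above; the function name incomplete_datasets is hypothetical, and the check does not validate the content of the files.

```python
from pathlib import Path

REQUIRED_FILES = ("DATASET.json", "README.md")


def incomplete_datasets(root: Path) -> dict:
    """Map each dataset dir (ab__/abcd) to the required files it lacks."""
    problems = {}
    for dataset in sorted(root.glob("[0-9][0-9]__/[0-9][0-9][0-9][0-9]")):
        if not dataset.is_dir():
            continue
        missing = [
            name
            for name in REQUIRED_FILES
            # Count a symlink as present even if its content is not fetched.
            if not ((dataset / name).exists() or (dataset / name).is_symlink())
        ]
        if missing:
            problems[dataset.relative_to(root).as_posix()] = missing
    return problems


print(incomplete_datasets(Path(".")))
```

Run it from the root of the data repository; an empty dictionary means that every dataset directory contains both files.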
The binary data files in this example are image files, some of them quite large. They are not the kind of files that are good for handling with git alone. This is what git-annex and datalad are good for and this is part of what this tutorial wants to teach.
But why are the binary data files broken?
After cloning the repository as described above, the binary files *.jpg and *.tif are listed but not accessible. They are actually broken symlinks. Use datalad get <subdir> to get a subset or datalad get . to get all of them; see also the datalad handbook. Datalad will download them for you from their original location on the web (see below). There is another source, s3.casus.science, to download them from, but it is not (yet) accessible without additional configuration and authentication information.
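To see which image files are still missing their content, you can look for symlinks whose target does not exist. This is a minimal sketch using only the Python standard library; the function name missing_content is hypothetical, and datalad status --annex gives a more complete picture.

```python
import fnmatch
import os


def missing_content(root, patterns=("*.jpg", "*.tif")):
    """List files under `root` that match `patterns` and are broken symlinks."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into git's internal directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            if not any(fnmatch.fnmatch(name, p) for p in patterns):
                continue
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it.
            if os.path.islink(path) and not os.path.exists(path):
                missing.append(path)
    return sorted(missing)


for path in missing_content("."):
    print("content not yet retrieved:", path)
```

After running datalad get on a path, it should no longer appear in this list.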
Source of data for this repository
The source of data for our example is ESA/Hubble Picture of the Week https://esahubble.org/images/potw/.
See https://esahubble.org/copyright/ and the file LICENSE.md.
Adding more datasets to this repository
The proper way to add more datasets to this data repository is as follows:
1. Point your web browser to https://esahubble.org/images/potw/ and pick one of the pictures of the week that is not yet in this data repository. Go to its specific page, for example https://esahubble.org/images/potw2422a/, and note the 4-digit number near the end of the URL.
2. Create the proper subdirectory for the new dataset. If the 4-digit number is abcd, run mkdir -p ab__/abcd/, then cd ab__/abcd/.
3. Download the web page to the dataset sub-dir, for example with curl -O https://esahubble.org/images/potw2422a/ (in case of issues with older versions of curl, use curl -o potw2422a https://esahubble.org/images/potw2422a/).
4. Use datalad to download the two images "Fullsize Original" (TIFF format) and "Large JPEG" (JPG format), like
   datalad download-url https://esahubble.org/media/archives/images/original/potw2422a.tif https://cdn.esahubble.org/archives/images/large/potw2422a.jpg
5. Go back to the root dir of the data repository with cd ../../
6. Now run ./extract_metadata.py ab__/abcd/ to automatically extract all metadata and produce DATASET.json, README.md, and a preview image in PNG format. The dataset is complete only with those extra files. (You might need to install the specific dependencies used by the Python script extract_metadata.py in this example, either globally or in a venv or similar.)
7. Run datalad status to see what changed in the repository. Then run datalad save -m "<commit message>" to commit it locally. This is basically the same as git add plus git commit. Actually, you can use the corresponding git commands just as well.
Now, you have added a new dataset locally to the data collection in this data repository.
Comment 1: In step 3 you might want to use datalad download-url instead of curl for the HTML file, too. This would be more datalad-ish, of course. For the sake of the example, let's pretend that step 3 is an example of producing data yourself in some way, whereas step 4 explicitly downloads things from public web links.
Comment 2: For step 6 again, wrapping the call to extract_metadata.py in datalad run is the even more datalad-ish way.
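As a rough illustration of Comment 2, the following Python sketch builds a datalad run command line around the metadata script. Only the -m (message) option of datalad run is used; the helper name datalad_run_command and the commit message are made up for this example, and you may want to declare inputs and outputs as described in the datalad handbook.

```python
import shlex


def datalad_run_command(dataset_dir: str, message: str) -> list:
    """Build the argument list for `datalad run` around extract_metadata.py."""
    # The wrapped command is passed as one string, as on a shell command line.
    script_call = f"./extract_metadata.py {shlex.quote(dataset_dir)}"
    return ["datalad", "run", "-m", message, script_call]


cmd = datalad_run_command("22__/2422/", "Extract metadata for 2422")
print(shlex.join(cmd))
```

To execute it instead of printing it, pass cmd to subprocess.run(cmd, check=True) from the root dir of the data repository.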
Work collaboratively with the data repository
To share your new datasets with the rest of your team, you need to push them to the gitlab repository. This is how you work collaboratively with the data repository.
- To update your local repository from the central gitlab repository, run datalad update --how=merge or the equivalent git command git pull. (N.B., datalad update by itself is equivalent to git fetch.) This is how you get new datasets that someone else added.
- Run datalad push or datalad push --to <name-of-gitlab-sibling> to bring the gitlab version of this repository up to date. This is equivalent to git push.
The data inventory for the data repository
The gitlab version of this repository is automatically connected to a public dataset inventory. This is where all the metadata of all the datasets (from their DATASET.json files) is collected in a convenient, searchable and filterable web browser view. Here one can search for specific properties and find the datasets that match them.