Update docs #40

Merged
mslw merged 21 commits from docs-rewrite into main 2024-01-10 17:38:36 +00:00
13 changed files with 468 additions and 244 deletions


@ -1,2 +1,2 @@
Sphinx >= 7.0, < 8.0
furo==2023.5.20
Sphinx == 7.2.*
furo == 2023.9.10


@ -1,152 +1,27 @@
Administrator docs
==================
The INM-ICF Utilities `Github repository`_ provides a set of
executable Python scripts which automate generation of deposits in the
ICF archive. To simplify deployment, these scripts and all their
dependencies are packaged as a `Singularity`_ v3 container
(`download`_).
.. _github repository: https://github.com/psychoinformatics-de/inm-icf-utilities
.. _singularity: https://docs.sylabs.io/guides/main/user-guide/
.. _download: https://ci.appveyor.com/api/projects/mih/inm-icf-utilities/artifacts/icf.sif
Archive generation
------------------
Containerized execution
^^^^^^^^^^^^^^^^^^^^^^^
With the Singularity image, ``icf.sif``, all scripts are made directly
available, either through ``singularity run``:
.. code-block:: console
$ singularity run <singularity options> icf.sif <script name> <script options>
or by making the image file executable.
The singularity image can also be installed as if it was a system
package. For this, fill in the placeholders in the following script,
and save it as ``icf-utils``:
.. code-block:: sh
#!/bin/sh
set -e -u
singularity run -B <absolute-path-to-data> <absolute-path-to-icf.sif-file> "$@" > icf-utils
The ``-B`` defines a bind path, making it accessible from within the
container.
Afterwards, install it under ``/usr/bin`` to make all functionality
available under an ``icf-utils`` command.
.. code-block::
$ sudo install -t /usr/bin icf-utils
Archival workflow
^^^^^^^^^^^^^^^^^
-----------------
The main part of visit archival is the creation of a TAR file.
The DataLad dataset can be generated and placed alongside the tarballs
without affecting them. Placement in the study folder guarantees the
same access permissions (authenticated https). The datasets are
generated based on file metadata -- the TAR archive remains the only
data source -- so storage overhead is minimal.
Optionally, the DataLad dataset can be generated and placed alongside
the tarballs without affecting them. Placement in the study folder
guarantees the same access permissions (authenticated https). The
datasets are generated based on file metadata -- the TAR archive
remains the only data source -- so storage overhead is minimal.
Four scripts, executed in the given order, capture the archival
process.
process. See :ref:`scripts` for usage details and :ref:`container` for
recommended deployment of the tools.
Script listing
^^^^^^^^^^^^^^
- ``make_studyvisit_archive``
- ``deposit_visit_metadata`` (optional)
- ``deposit_visit_dataset`` (optional)
- ``catalogify_studyvisit_from_meta`` (optional)
``make_studyvisit_archive``
"""""""""""""""""""""""""""
This utility generates a TAR archive from a directory containing DICOM files.
The input directory can have any number of files, with any organization or
naming. However, the DICOM files are assumed to come from a single "visit"
(i.e., the time between a person or sample entering and then leaving a
scanner). The input directory's content is copied into a TAR archive verbatim,
with no changes to filenames or organization.
In order to generate reproducible TAR archives, the file order, recorded
permissions and ownership, and modification times are standardized. All files
in the TAR archive are declared to be owned by root/root (uid/gid: 0/0) with
0644 permissions. The modification time of any DICOM file is determined
by its contained DICOM `StudyDate/StudyTime` timestamps. The modification time
for any non-DICOM file is set to the latest timestamp across all DICOM files.
.. code-block:: console
$ icf-utils make_studyvisit_archive --help
usage: make_studyvisit_archive [-h] [-o PATH] --id STUDY-ID VISIT-ID <input-dir>
``deposit_visit_metadata``
""""""""""""""""""""""""""
This command locates the DICOM tarball for a particular visit in a
study (given by their respective identifiers) in the data store, and
extracts a minimal set of metadata tags for each DICOM image, and the
TAR archive as a whole. These metadata are then deposited in two
files, in JSON format, in the study directory:
- ``{visit_id}_metadata_tarball.json``
JSON object with basic properties of the archive, such as 'size', and
'md5'.
- ``{visit_id}_metadata_dicoms.json``
JSON array with essential properties for each DICOM image file, such as
'path' (relative path inside the TAR archive), 'md5' (MD5 checksum of
the DICOM file), 'size' (in bytes), and select standard DICOM tags,
such as "SeriesDescription", "SeriesNumber", "Modality",
"MRAcquisitionType", "ProtocolName", "PulseSequenceName". The latter
enable a rough, technical characterization of the images in the TAR
archive.
.. code-block:: console
$ icf-utils getmeta_studyvisit -h
usage: getmeta_studyvisit [-h] [-o PATH] --id STUDY-ID VISIT-ID
``deposit_visit_dataset``
"""""""""""""""""""""""""
This command reads the metadata deposit from
``deposit_visit_metadata`` for a visit in a study (given by their
respective identifiers) from the data store, and generates a DataLad
dataset from it. This DataLad dataset provides versioned access to the
visit's DICOM data, up to single-image granularity. Moreover, all
DICOM files are annotated with basic DICOM tags that enable on-demand
dataset views for particular applications (e.g., DICOMs sorted by
image series and protocol name). The DataLad dataset is deposited in
two files in the study directory:
- ``{visit_id}_XDLRA--refs``
- ``{visit_id}_XDLRA--repo-export``
where the former enables `datalad/git clone` operations, and the latter
represents the actual dataset as a compressed archive.
.. code-block:: console
$ icf-utils dataladify_studyvisit_from_meta -h
usage: dataladify_studyvisit_from_meta [-h] [-o PATH] --id STUDY-ID VISIT-ID
``catalogify_studyvisit_from_meta``
"""""""""""""""""""""""""""""""""""
This command creates or updates a DataLad catalog -- a user-facing
html rendering of dataset contents. It is placed in the ``catalog``
folder in the study directory.
.. code-block:: console
$ icf-utils dataladify_studyvisit_from_meta --help
usage: dataladify_studyvisit_from_meta [-h] [-o PATH] --id STUDY-ID VISIT-ID
Creation of the TAR file needs to be done by the ICF. The remaining
three steps can be done by the ICF (with results deposited alongside
the TAR file), or by the ICF users who can access the data (on their
own infrastructure), and for this reason are marked as optional.


@ -16,6 +16,7 @@ individuals.
:caption: Contents:
user/index
reference/index
admin
developer


@ -0,0 +1,40 @@
.. _container:
Containerized execution
-----------------------
To simplify deployment, ICF utilities scripts and all their
dependencies are packaged as a `Singularity`_ v3 container
(`download`_).
.. _singularity: https://docs.sylabs.io/guides/main/user-guide/
.. _download: https://ci.appveyor.com/api/projects/mih/inm-icf-utilities/artifacts/icf.sif
With the Singularity image, ``icf.sif``, all scripts are made directly
available, either through ``singularity run``:
.. code-block:: console
$ singularity run <singularity options> icf.sif <script name> <script options>
or by making the image file executable.
The singularity image can also be installed as if it was a system
package. For this, fill in the placeholders in the following script,
and save it as ``icf-utils``:
.. code-block:: sh
#!/bin/sh
set -e -u
singularity run -B <absolute-path-to-data> <absolute-path-to-icf.sif-file> "$@"
The ``-B`` option defines a bind path, making that path accessible from
within the container.
Afterwards, install it under ``/usr/bin`` to make all functionality
available under an ``icf-utils`` command.
.. code-block::
$ sudo install -t /usr/bin icf-utils


@ -0,0 +1,19 @@
Reference
=========
The INM-ICF Utilities `Github repository`_ provides a set of
executable Python scripts which automate generation of deposits in the
ICF archive. To simplify deployment, these scripts and all their
dependencies are packaged as a `Singularity`_ v3 container
(`download`_).
.. _github repository: https://github.com/psychoinformatics-de/inm-icf-utilities
.. _singularity: https://docs.sylabs.io/guides/main/user-guide/
.. _download: https://ci.appveyor.com/api/projects/mih/inm-icf-utilities/artifacts/icf.sif
.. toctree::
:maxdepth: 2
:caption: Contents:
container
scripts


@ -0,0 +1,92 @@
.. _scripts:
Script listing
--------------
``make_studyvisit_archive``
^^^^^^^^^^^^^^^^^^^^^^^^^^^
This utility generates a TAR archive from a directory containing DICOM files.
The input directory can have any number of files, with any organization or
naming. However, the DICOM files are assumed to come from a single "visit"
(i.e., the time between a person or sample entering and then leaving a
scanner). The input directory's content is copied into a TAR archive verbatim,
with no changes to filenames or organization.
In order to generate reproducible TAR archives, the file order, recorded
permissions and ownership, and modification times are standardized. All files
in the TAR archive are declared to be owned by root/root (uid/gid: 0/0) with
0644 permissions. The modification time of any DICOM file is determined
by its contained DICOM `StudyDate/StudyTime` timestamps. The modification time
for any non-DICOM file is set to the latest timestamp across all DICOM files.
.. code-block:: console
$ icf-utils make_studyvisit_archive --help
usage: make_studyvisit_archive [-h] [-o PATH] --id STUDY-ID VISIT-ID <input-dir>
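The normalization described above can be illustrated with a minimal Python sketch. It is not the implementation of ``make_studyvisit_archive``: the function name is hypothetical, and a single fixed ``mtime`` stands in for the per-file timestamps that the real tool derives from the DICOM `StudyDate/StudyTime` tags.

```python
import tarfile
from pathlib import Path


def make_reproducible_tar(input_dir: Path, output: Path, mtime: int) -> None:
    """Pack all files under input_dir into a TAR with normalized metadata.

    Hypothetical helper for illustration only. The real tool reads the
    modification time from DICOM tags; here one fixed integer is used.
    """

    def normalize(info: tarfile.TarInfo) -> tarfile.TarInfo:
        # Declare every member as owned by root/root with 0644 permissions.
        info.uid = info.gid = 0
        info.uname = info.gname = "root"
        info.mode = 0o644
        info.mtime = mtime
        return info

    # A fixed (sorted) member order makes the archive independent of the
    # order in which the file system lists directory entries.
    names = sorted(
        p.relative_to(input_dir).as_posix()
        for p in input_dir.rglob("*")
        if p.is_file()
    )
    with tarfile.open(output, "w") as tar:
        for name in names:
            tar.add(input_dir / name, arcname=name, recursive=False,
                    filter=normalize)
```

With ownership, permissions, timestamps, and member order fixed, packing the same input twice should yield byte-identical archives, which is what makes checksums of the tarball meaningful.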
``deposit_visit_metadata``
^^^^^^^^^^^^^^^^^^^^^^^^^^
This command locates the DICOM tarball for a particular visit in a
study (given by their respective identifiers) in the data store, and
extracts a minimal set of metadata tags for each DICOM image, and the
TAR archive as a whole. These metadata are then deposited in two
files, in JSON format, in the study directory:
- ``{visit_id}_metadata_tarball.json``
JSON object with basic properties of the archive, such as 'size', and
'md5'.
- ``{visit_id}_metadata_dicoms.json``
JSON array with essential properties for each DICOM image file, such as
'path' (relative path inside the TAR archive), 'md5' (MD5 checksum of
the DICOM file), 'size' (in bytes), and select standard DICOM tags,
such as "SeriesDescription", "SeriesNumber", "Modality",
"MRAcquisitionType", "ProtocolName", "PulseSequenceName". The latter
enable a rough, technical characterization of the images in the TAR
archive.
.. code-block:: console
$ icf-utils deposit_visit_metadata -h
usage: deposit_visit_metadata [-h] [-o PATH] --id STUDY-ID VISIT-ID
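As a rough illustration of the per-file part of this metadata, the sketch below computes 'path', 'md5', and 'size' for every file in a TAR archive and writes them as a JSON array. The function name is hypothetical, the key names follow the description above, and DICOM tags are left out because reading them would require a DICOM library; the actual script may differ in all of these respects.

```python
import hashlib
import json
import tarfile


def deposit_file_metadata(tar_path: str, json_path: str) -> list[dict]:
    """Write path, MD5 checksum and size of each TAR member to a JSON file.

    Hypothetical helper; DICOM tag extraction is intentionally omitted.
    """
    records = []
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            digest = hashlib.md5()
            # Hash in chunks so large images are never fully held in memory.
            with tar.extractfile(member) as stream:
                for chunk in iter(lambda: stream.read(1 << 20), b""):
                    digest.update(chunk)
            records.append({
                "path": member.name,       # relative path inside the archive
                "md5": digest.hexdigest(),
                "size": member.size,       # in bytes
            })
    with open(json_path, "w") as out:
        json.dump(records, out, indent=2)
    return records
```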
``deposit_visit_dataset``
^^^^^^^^^^^^^^^^^^^^^^^^^
This command reads the metadata deposit from
``deposit_visit_metadata`` for a visit in a study (given by their
respective identifiers) from the data store, and generates a DataLad
dataset from it. This DataLad dataset provides versioned access to the
visit's DICOM data, up to single-image granularity. Moreover, all
DICOM files are annotated with basic DICOM tags that enable on-demand
dataset views for particular applications (e.g., DICOMs sorted by
image series and protocol name). The DataLad dataset is deposited in
two files in the study directory:
- ``{visit_id}_XDLRA--refs``
- ``{visit_id}_XDLRA--repo-export``
where the former enables `datalad/git clone` operations, and the latter
represents the actual dataset as a compressed archive.
.. code-block:: console
$ icf-utils deposit_visit_dataset -h
usage: deposit_visit_dataset [-h] --id STUDY-ID VISIT-ID [-o PATH] [--store-url URL]
``catalogify_studyvisit_from_meta``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This command creates or updates a DataLad catalog -- a user-facing
html rendering of dataset contents. It is placed in the ``catalog``
folder in the study directory.
.. code-block:: console
$ icf-utils catalogify_studyvisit_from_meta --help
usage: catalogify_studyvisit_from_meta [-h] [-o PATH] --id STUDY-ID VISIT-ID


@ -24,10 +24,10 @@ following:
Catalog-based browsing
======================
By entering the ``datalad_catalog`` directory, users will be able to
If a catalog has been generated for a given study, users will be able to
browse through the directory tree with additional annotations
of available metadata, and search for acquisitions based on keywords
or name.
or name, by entering the ``datalad_catalog`` directory.
Downloads
=========


@ -0,0 +1,85 @@
.. _dl-access:
Access data with DataLad
------------------------
This section describes accessing the ICF data by cloning DataLad
datasets which have already been created and made available, most
likely on local infrastructure. Dataset generation is described in
the previous section, :ref:`dl-generate`.
This workflow uses DataLad with the DataLad-Next extension (see
:ref:`dl-requirements`). DataLad datasets index data in their original
(ICF) location. Obtaining data hosted in the ICF store requires access
credentials for a given study, issued by the ICF. DataLad acts only as
client software. See :ref:`dl-credentials` for details.
Clone & get
^^^^^^^^^^^
If a visit dataset has been prepared and placed in an accessible
location, it can be cloned with DataLad from a URL containing the
following components:
* a set of configuration parameters, always constant
* store base URL (e.g., ``file:///data/group/groupname/local_dicom_store``) [1]_
* study ID (e.g., ``my-study``)
* visit ID (e.g., ``P000123``)
* a file name suffix / template, ``_{{annex_key}}`` (verbatim), always constant
The pattern for the URL is::
'datalad-annex::?type=external&externaltype=uncurl&encryption=none&url=<store base URL>/<study ID>/<visit ID>_{{annex_key}}'
Given the exemplary values above, the pattern would expand to:
.. code-block::
'datalad-annex::?type=external&externaltype=uncurl&encryption=none&url=file:///data/group/groupname/local_dicom_store/my-study/P000123_{{annex_key}}'
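For scripting, the same pattern can be assembled from its components. The following is a small sketch; the function name is hypothetical and not part of DataLad or the ICF utilities.

```python
# Constant parts of the clone URL, copied verbatim from the pattern above.
URL_PREFIX = (
    "datalad-annex::?type=external&externaltype=uncurl"
    "&encryption=none&url="
)
KEY_TEMPLATE = "_{{annex_key}}"  # must stay verbatim, braces included


def visit_clone_url(store_base_url: str, study_id: str, visit_id: str) -> str:
    """Assemble the clone URL for one visit (hypothetical helper)."""
    base = store_base_url.rstrip("/")
    return f"{URL_PREFIX}{base}/{study_id}/{visit_id}{KEY_TEMPLATE}"


print(visit_clone_url(
    "file:///data/group/groupname/local_dicom_store", "my-study", "P000123"
))
```

When the result is passed to ``datalad clone`` on a command line, it should be quoted so that the shell does not interpret ``&`` and ``?``.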
adswa commented 2024-01-08 13:42:25 +00:00 (Migrated from github.com)

I think it would be nice to have an actual fully clone example given here:


A full ``datalad clone`` command could then look like this:

.. code-block::
    datalad clone 'datalad-annex::?type=external&externaltype=uncurl&encryption=none&url=file:///tmp/local_dicom_store/dl-Z03/P000624_{{annex_key}}'  my_clone
    

adswa commented 2024-01-08 13:43:46 +00:00 (Migrated from github.com)

It may also be worth a note that this command essentially never fails. If I mistype the URL, cloning succeeds, but it tells me

[WARNING] You appear to have cloned an empty repository.                                                          
[WARNING] Cloned /tmp/my_clone but could not find a branch with commits 

Which makes it sound like its the dataset's issue, when it just stemmed from a non-existent URL

mslw commented 2024-01-10 11:32:32 +00:00 (Migrated from github.com)

Good point, clone from datalad-annex urls does that (related: https://github.com/datalad/datalad-next/issues/373). I'll add a note.

A full ``datalad clone`` command could then look like this:
.. code-block::
datalad clone 'datalad-annex::?type=external&externaltype=uncurl&encryption=none&url=file:///tmp/local_dicom_store/my-study/P000123_{{annex_key}}' my_clone
.. note::
The clone command will not fail if the ``datalad-annex::`` URL
points to a nonexisting target. If you see the following warning:
.. code-block:: none
[WARNING] You appear to have cloned an empty repository.
[WARNING] Cloned /path/to/my_clone but could not find a branch with commits
it is likely that the provided URL is mistyped or otherwise not correct.
.. note:: The URL is arguably a bit clunky. A convenience shortcut can be provided via the configuration item ``datalad.clone.url-substitute.<label>`` and a substitution rule based on regular expressions. For example, clone URLs can be shortened to require only an identifier (here, ``file:///data/group/groupname/local_dicom_store``), study ID, and visit ID (``inm-icf/<study-ID>/<visit-ID>``) with the following configuration:
.. code-block::
git config --global datalad.clone.url-substitute.inm-icf ',^file:///data/group/groupname/local_dicom_store/([^/]+)/(.*)$,datalad-annex::?type=external&externaltype=uncurl&encryption=none&url=file:///data/group/groupname/local_dicom_store/\1/\2_{{annex_key}}'
This configuration allows DataLad to take any URL of the form ``file:///data/group/groupname/local_dicom_store/<study-ID>/<visit-ID>`` and assemble the required ``datalad-annex::...`` URL on its own, so that a clone call shortens to ``datalad clone file:///data/group/groupname/local_dicom_store/my-study/P000123``.
You are free to adjust this configuration to your needs and preferences.
Further documentation on it can be found in the `DataLad Docs`_.
.. _DataLad Docs: http://docs.datalad.org/en/stable/design/url_substitution.html
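To check a substitution rule before committing it to the Git configuration, the rewrite can be emulated with Python's ``re`` module. This is only a sketch of the effect of the rule shown above: the leading ``,`` in the configuration value is the delimiter between the match expression and the replacement, and the assumption here is that Python regular expression semantics apply to both parts.

```python
import re

# The two halves of the configuration value, split at the "," delimiter.
MATCH = r"^file:///data/group/groupname/local_dicom_store/([^/]+)/(.*)$"
REPLACEMENT = (
    "datalad-annex::?type=external&externaltype=uncurl&encryption=none"
    r"&url=file:///data/group/groupname/local_dicom_store/\1/\2_{{annex_key}}"
)


def expand_clone_url(url: str) -> str:
    """Apply the substitution rule; non-matching URLs pass through unchanged."""
    return re.sub(MATCH, REPLACEMENT, url)


print(expand_clone_url(
    "file:///data/group/groupname/local_dicom_store/my-study/P000123"
))
```

Here ``\1`` captures the study ID and ``\2`` the visit ID, so the short URL expands to the full ``datalad-annex::`` form.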
Cloning will retrieve a lightweight dataset, which does not (yet)
contain file content. File content can be retrieved with ``datalad
get``. DataLad will handle download and unpacking of the tar file.
Take a look at the section :ref:`dl-advanced` to learn about useful
convenience features DataLad adds on top of this.
.. rubric:: Footnotes
.. [1] Examples use ``file://`` URLs, given that the datasets are most
likely to be generated on institute-local infrastructure. Other
protocols (e.g. ``https://`` or ``ssh://``) can be substituted
depending on the particular setup, without affecting the URL
structure.


@ -0,0 +1,28 @@
.. _dl-credentials:
Manage DataLad credentials
--------------------------
The ICF store is not publicly available, and ICF administrators will
provide user names and passwords on a per-study basis. DataLad will
store or retrieve these credentials using your operating system's
keyring service. In general, the first time you use DataLad to access
a project directory, you will be prompted for your credentials. If
content retrieval succeeds, you will be offered the option to save the
credential, so that it can be reused the next time you access a URL
from the same realm.
If you have access to multiple projects, you can have different sets
of credentials. You can use the `datalad credentials`_ command from
DataLad Next to manage (e.g. query, set or remove) credentials known
to DataLad.
.. admonition:: DataLad usage in the context of GDPR
DataLad is client-side software. Using DataLad with the ICF store
is technically equivalent to downloading tar archives with ``wget``
or with a web browser click-to-download: in either case, data
access happens over https, and the authorisation is performed by
the ICF server, not by the clients.
.. _datalad credentials: http://docs.datalad.org/projects/next/en/latest/generated/man/datalad-credentials.html


@ -0,0 +1,142 @@
.. _dl-generate:
adswa commented 2024-01-08 13:21:12 +00:00 (Migrated from github.com)

I believe it should go inside the local store?

   datalad download "https://data.inm-icf.de/<project-ID>/<visit-ID>_dicom.tar  local_dicom_store/<project-ID>/<visit-ID>_dicom.tar"
adswa commented 2024-01-08 13:22:43 +00:00 (Migrated from github.com)

?

A DataLad dataset is created based on the metadata extracted in the
adswa commented 2024-01-08 13:25:22 +00:00 (Migrated from github.com)

Re-reading this paragraph many times, I feel like I'm not 100% sure what it is telling me. Maybe one introductory sentence in addition helps. Is the gist something like this?

In order to deposit a DataLad dataset next to the original tarball in the remote data store, the following command creates a DataLad dataset  based on the metadata extracted in the
adswa commented 2024-01-08 13:27:21 +00:00 (Migrated from github.com)

I think the command also misses the --id parameter and placeholders? I'm getting this when running it:

(icf) adina@muninn in /tmp
❱ singularity run -B $STORE_DIR icf.sif deposit_visit_dataset \
  --store-dir $STORE_DIR --store-url https://data.inm-icf.de
usage: deposit_visit_dataset [-h] --id STUDY-ID VISIT-ID [-o PATH] [--store-url URL]
deposit_visit_dataset: error: the following arguments are required: --id
adswa commented 2024-01-08 13:28:32 +00:00 (Migrated from github.com)
   singularity run -B $STORE_DIR icf.sif deposit_visit_dataset \
     --id <Study ID> <Visit ID> dl-Z03 P000624 --store-dir $STORE_DIR --store-url <ICF STORE URL>

adswa commented 2024-01-08 13:32:39 +00:00 (Migrated from github.com)

sorry for the flood of comments, I'm realizing more and more things as I'm walking through - I was expecting this to generate a dataset based on the heading, but it doesn't create a standard dataset on my system - just the lightweight representation. Maybe we can reflect this in the heading and description, eg with by placing "dataset" in air quotes or calling it lightweight dataset representation already at the start?

mslw commented 2024-01-10 11:22:52 +00:00 (Migrated from github.com)

No need to apologize; thanks a lot for these comments. I agree with the points you make and will make changes accordingly (without using the suggestions directly).

mslw commented 2024-01-10 11:24:32 +00:00 (Migrated from github.com)

Will do that, but without mixing placeholders and values 😉

mslw commented 2024-01-10 11:50:52 +00:00 (Migrated from github.com)

Lol, the double space in the argument makes it download to ` local_dicom_store` (with a leading space) instead of `local_dicom_store`.

I am not a huge fan of how `datalad download` works with `<path>|<url>|<url-path-pair>` as an individual argument, but I guess it is a way to make it work with multiple pairs at once
adswa commented 2024-01-10 12:26:16 +00:00 (Migrated from github.com)

> Lol, the double space in the argument makes it download to ` local_dicom_store` instead of `local_dicom_store`.

oh no.... :o
Generate DataLad datasets
-------------------------
The ICF archive for a given project contains DICOM files packaged in
tar archives (DICOM tarballs). In this section we describe creating
DataLad datasets, which index content and location of these tarballs,
for DataLad-based access on institute-local infrastructure.
In principle, such datasets are *lightweight*, meaning that they only
index the content that can be retrieved from the ICF archive (all
access restrictions apply). Using DataLad can simplify local access,
allow raw data versioning, integrate with existing workflows, and
enable logical transformations of the DICOM folder structure - see
:ref:`dl-advanced` for examples of the latter.
The workflow described below uses DataLad with the DataLad-Next extension
for initial DICOM download and the INM-ICF tools packaged as a
Singularity container for subsequent steps (see
:ref:`dl-requirements`). ICF access credentials are required (see
:ref:`dl-credentials`).
Obtain the tarball
^^^^^^^^^^^^^^^^^^
First, create an empty directory to be the local dataset store. The
last path component must be the ``project-ID`` used by the ICF store,
because the following commands use project and visit IDs to determine
paths.
.. code-block:: bash
mkdir -p local_dicom_store/<project-ID>
Download the visit tarball, keeping the same relative path:
.. code-block:: bash
datalad download "https://data.inm-icf.de/<project-ID>/<visit-ID>_dicom.tar local_dicom_store/<project-ID>/<visit-ID>_dicom.tar"
The local copy of the tarball is required to index its contents. It
can be removed afterwards -- datasets will use the ICF store as the
content source.
Using ``datalad download`` for downloading the file has the benefit of
using DataLad's credential management. If this is the first time you
use DataLad to access the project directory, you will be asked to
provide your ICF credentials. See :ref:`dl-credentials` for details.
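Because the URL and the target path are passed to ``datalad download`` as a single argument, stray whitespace between them becomes part of the target path. When the command is generated from a script, a small helper such as the following sketch (the function name and its default store directory are assumptions for illustration) keeps the pair well-formed:

```python
def download_spec(base_url: str, project_id: str, visit_id: str,
                  store_dir: str = "local_dicom_store") -> str:
    """Return the '<url> <path>' pair for one visit tarball.

    Hypothetical helper. URL and path are joined by exactly one space,
    and the local path mirrors the relative path used in the ICF store.
    """
    relpath = f"{project_id}/{visit_id}_dicom.tar"
    return f"{base_url.rstrip('/')}/{relpath} {store_dir}/{relpath}"


spec = download_spec("https://data.inm-icf.de", "my-study", "P000123")
print(spec)
```

The resulting string is meant to be passed as one (quoted) argument to ``datalad download``.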
For the following steps, the ICF utility scripts packaged as a
Singularity container will be used, and executed with ``singularity
run`` (see :ref:`container` for download and usage details). The
*absolute path* to the local DICOM store will be represented by
``$STORE_DIR``:
.. code-block:: bash
export STORE_DIR=$PWD/local_dicom_store
Deposit visit metadata alongside tarball
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Information required to create a DataLad dataset needs to be extracted
from the tarball:
adswa commented 2024-01-08 13:17:56 +00:00 (Migrated from github.com)

Following through the docs sequentially, I don't think I've come across this singularity image before. I think it would make sense to link to its download page here.

mslw commented 2024-01-10 11:14:47 +00:00 (Migrated from github.com)

The 3rd paragraph of this page says:

The workflow described below uses DataLad with DataLad-Next extension for initial DICOM download and the INM-ICF tools packaged as a Singularity container for subsequent steps (see DataLad requirements).

where "DataLad requirements" is a link to a page that describes things in greater details (and is actually positioned earlier in the User Guide), and links to containerized execution page.

However, your comment makes it apparent that I didn't do a good enough job when trying to compartmentalize the docs (to avoid repetition), and I will add a sentence or two to make up for it.


By the way, this points to a small design issue with the tooling. Initially, the Singularity image was just for ICF. ICF would only use DataLad through the scripts in this image. Users would not need the Singularity image, they would clone datasets from ICF using DataLad.

Now, users who want to dataladify datasets using the Singularity image still need to download the tarballs somehow. I decided to suggest datalad download for the task, because it interacts with DataLad credentials, that would also be needed for any subsequent dataset content retrieval from ICF. Alternatively, we could recommend curl -u followed by Singularity (no need to install DataLad), or datalad download followed by running scripts from this repo (no need for Singularity). The former seems unsatisfactory, because any further dataset interaction would need to happen through DataLad anyway. The latter seems unsatisfactory because the Singularity image was introduced to make the ICF tooling independent of changes in DataLad.

.. code-block:: bash
singularity run -B $STORE_DIR icf.sif deposit_visit_metadata \
--store-dir $STORE_DIR --id <project-ID> <visit ID>
This will generate two files, ``<visit ID>_metadata_dicoms.json`` and
``<visit ID>_metadata_tarball.json``, and place them alongside the
tarball. The former contains metadata describing individual files
within the tarball (relative path, MD5 checksum, size, and a small
subset of DICOM headers describing acquisition type), and the latter
describes the tarball itself.
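The per-file metadata can be inspected with a few lines of Python, for example to see which files belong to which image series. This sketch assumes that ``<visit ID>_metadata_dicoms.json`` is a JSON array of objects with ``path`` and ``SeriesDescription`` keys, as described in the script reference; the function name is hypothetical.

```python
import json
from collections import defaultdict


def files_by_series(metadata_path: str) -> dict[str, list[str]]:
    """Group archive-relative file paths by their SeriesDescription."""
    with open(metadata_path) as f:
        records = json.load(f)
    series = defaultdict(list)
    for record in records:
        # Records without the tag are collected under a placeholder label.
        label = record.get("SeriesDescription", "unknown")
        series[label].append(record["path"])
    return dict(series)
```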
Deposit dataset representation alongside tarball
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The next step is to create a lightweight, clone-able representation of
a dataset in the local dataset store. This step relies on the metadata
extracted with the previous command. Additionally, the base URL of the
ICF store needs to be provided (represented here by ``<ICF STORE
URL>``); this base URL should not contain the study or visit ID. The URL,
combined with the respective IDs, will be registered in the dataset as the
source of the DICOM tarball, and used for retrieval by dataset clones.
.. code-block:: bash
singularity run -B $STORE_DIR icf.sif deposit_visit_dataset \
--store-dir $STORE_DIR --store-url <ICF STORE URL> --id <project-ID> <visit ID>
This will produce two files, ``<visit ID>_XDLRA--refs`` and ``<visit
ID>_XDLRA--repo-export`` (a text file and a zip archive,
respectively). Together, they are a representation of a (lightweight)
DataLad dataset, and contain the information necessary to retrieve the
data content with DataLad (but do not contain the data content
itself).
Create a catalog view (optional)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
A catalog page (an HTML+JS rendering of dataset contents generated with
`DataLad catalog`_) can be created for the visit dataset. This is
mostly useful when providing (internal) HTTPS access to the datasets.
The following command will create the catalog (or update its content)
and place it in the ``catalog`` folder in the study directory.
.. _DataLad catalog: https://docs.datalad.org/projects/catalog
.. code-block:: bash
singularity run -B $STORE_DIR icf.sif catalogify_studyvisit_from_meta \
--store-dir $STORE_DIR --id <project-ID> <visit ID>
adswa commented 2024-01-08 13:36:22 +00:00 (Migrated from github.com)

I think it would be nice to mention that this catalog needs to be subsequently served, or at least point to the README for further instructions - I naively expected the index.html page to display something and initially thought something was wrong.

This catalog needs to be subsequently served; a simple (possibly
local) http server is enough. See the generated README file in the
``catalog`` folder for details.
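As a minimal sketch of local serving, assuming Python 3.7 or later is
available on the machine, the standard library's ``http.server`` module can
be pointed at the ``catalog`` folder. The ``serve_catalog`` helper name, the
``my-study`` study ID, and port 8000 are placeholders chosen for
illustration and are not part of the ICF tooling; the generated README
remains the authoritative reference.

```shell
# Hypothetical helper (not part of the ICF tooling): serve a catalog folder
# on localhost only, using Python's built-in HTTP server.
# Arguments: path to the catalog folder, port (optional, defaults to 8000).
serve_catalog() {
    python3 -m http.server --bind 127.0.0.1 --directory "$1" "${2:-8000}"
}

# Example call with placeholder values (blocks until interrupted with Ctrl-C):
# serve_catalog "$STORE_DIR/my-study/catalog"
```

With the server running, the catalog would be reachable at
``http://127.0.0.1:8000`` in a web browser on the same machine.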
Remove the tarball
^^^^^^^^^^^^^^^^^^
Finally, the DICOM tarball can be safely removed.
.. code-block:: bash
rm $STORE_DIR/<project-ID>/<visit ID>_dicom.tar
Metadata files can be removed, too, leaving only the dataset
representation in ``*XDLRA*`` files.
.. code-block:: bash
rm $STORE_DIR/<project-ID>/<visit ID>_metadata_*.json
The local store can be used as a DataLad entry point for obtaining the
DICOM files from the ICF store (which would serve as the data source
for dataset clones); see :ref:`dl-access`.
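Because the removal is irreversible, it may be worth confirming that the
dataset representation exists before deleting anything. The following is a
sketch only: ``dataset_files_present`` is a hypothetical helper, and
``my-study`` and ``P000123`` are placeholder IDs. It checks for the two
``*XDLRA*`` files next to the tarball.

```shell
# Hypothetical helper (not part of the ICF tooling): succeed only if both
# dataset representation files exist in the study directory.
# Arguments: study directory, visit ID.
dataset_files_present() {
    study_dir=$1
    visit_id=$2
    [ -f "${study_dir}/${visit_id}_XDLRA--refs" ] &&
        [ -f "${study_dir}/${visit_id}_XDLRA--repo-export" ]
}

# Example with placeholder values: remove the tarball only if the check passes.
# dataset_files_present "$STORE_DIR/my-study" P000123 &&
#     rm "$STORE_DIR/my-study/P000123_dicom.tar"
```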

@ -0,0 +1,35 @@
.. _dl-requirements:
DataLad requirements
--------------------
Accessing the ICF store contents and cloning datasets generated with
the ICF tooling requires `DataLad`_ with the `DataLad-Next`_ extension
installed. You can find instructions for installing DataLad on your
operating system in the `DataLad Handbook`_. `DataLad-Next`_ can be
installed with `pip`_ [1]_.
Generating DataLad datasets based on the DICOMs in the ICF store
additionally requires the INM-ICF tools, which are packaged as a
`Singularity`_ container; see :ref:`container`. The tools are not
required for accessing already existing DataLad datasets.
Obtaining data hosted in the ICF store requires access credentials for
a given study, issued by the ICF. DataLad acts only as client
software. See :ref:`dl-credentials` for details.
.. rubric:: Footnotes
.. [1] To install software with pip, run a call such as the one below
in your favourite `virtual environment`_:
.. code-block:: bash
python -m pip install datalad-next
.. _datalad: https://www.datalad.org/
.. _datalad-next: https://docs.datalad.org/projects/next
.. _datalad handbook: https://handbook.datalad.org/intro/installation.html
.. _pip: https://pip.pypa.io/en/stable/
.. _virtual environment: https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/
.. _singularity: https://docs.sylabs.io/guides/main/user-guide/

@ -1,96 +0,0 @@
DataLad-based access
--------------------
Software requirements
^^^^^^^^^^^^^^^^^^^^^
Accessing the ICF store requires `DataLad`_ with the `DataLad-Next`_
extension installed.
You can find instructions for installing DataLad on your operating
system in the `DataLad Handbook`_.
`Datalad-Next`_ can be installed with `pip`_ [1]_.
.. _datalad: https://www.datalad.org/
.. _datalad-next: https://docs.datalad.org/projects/next
.. _datalad handbook: https://handbook.datalad.org/intro/installation.html
.. _pip: https://pip.pypa.io/en/stable/
Credentials
^^^^^^^^^^^
The ICF store is not publicly available, and ICF administrators will provide user names and passwords on a per-study basis.
DataLad will store or retrieve these credentials using your
operating system's keyring service. In general, the first time you use
DataLad to access a project directory, you will be prompted for your
credentials. If content retrieval succeeds, the credential will be
saved, and reused the next time you access a URL from the same realm.
If you have access to multiple projects, you can have different sets
of credentials. You can use the `datalad credentials`_ command from
DataLad Next to manage (e.g. query, set or remove) credentials known
to DataLad.
.. admonition:: DataLad usage in the context of GDPR
DataLad is client-side software. Using DataLad with the ICF store
is technically equivalent to downloading tar archives with ``wget``
or with a web browser click-to-download: in either case, data
access happens over https, and the authorisation is performed by
the ICF server, not by the clients.
.. _datalad credentials: http://docs.datalad.org/projects/next/en/latest/generated/man/datalad-credentials.html
Clone & get
^^^^^^^^^^^
A visit dataset can be cloned with DataLad from a URL containing the
following components:
* store base URL (e.g., ``https://data.inm-icf.de``)
* study ID (e.g., ``my-study``)
* visit ID (e.g., ``P000123``)
* a set of additional parameters, always constant
The pattern for the URL is::
'datalad-annex::?type=external&externaltype=uncurl&url=<store base URL>/<study ID>/<visit ID>_{{annex_key}}&encryption=none'
Given the example values above, the pattern would expand to:
.. code-block::
'datalad-annex::?type=external&externaltype=uncurl&url=https://data.inm-icf.de/my-study/P000123_{{annex_key}}&encryption=none'
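The expansion can also be sketched in shell, which makes it easier to see
which parts vary and which are constant. The variable names (``store_url``,
``study_id``, ``visit_id``, ``clone_url``) are chosen for illustration only;
the values are the example values listed above.

```shell
# Components of the clone URL (example values from the list above).
store_url="https://data.inm-icf.de"
study_id="my-study"
visit_id="P000123"

# Assemble the URL; everything outside the three variables is constant.
# The {{annex_key}} part is a literal placeholder and must be kept as is.
clone_url="datalad-annex::?type=external&externaltype=uncurl&url=${store_url}/${study_id}/${visit_id}_{{annex_key}}&encryption=none"

printf '%s\n' "$clone_url"
```

Quoting the URL matters when it is passed on the command line, because
``&`` and ``?`` would otherwise be interpreted by the shell.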
.. note:: The URL is arguably a bit clunky. A convenient shortcut can be provided via the configuration item ``datalad.clone.url-substitute.<label>`` and a substitution rule based on regular expressions. For example, clone URLs can be shortened to require only the store base URL (here, ``https://data.inm-icf.de``), the study ID, and the visit ID (``https://data.inm-icf.de/<study-ID>/<visit-ID>``) with the following configuration:
.. code-block::
git config --global datalad.clone.url-substitute.inm-icf ',^https://data.inm-icf.de/([^/]+)/(.*)$,datalad-annex::?type=external&externaltype=uncurl&url=https://data.inm-icf.de/\1/\2_{{annex_key}}&encryption=none'
This configuration allows DataLad to take any URL of the form ``https://data.inm-icf.de/<study-ID>/<visit-ID>`` and assemble the required ``datalad-annex::...`` URL on its own, so that a clone call shortens to ``datalad clone https://data.inm-icf.de/my-study/P000123``.
You are free to adjust this configuration to your needs and preferences.
Further documentation on it can be found in the `DataLad Docs`_.
.. _DataLad Docs: http://docs.datalad.org/en/stable/design/url_substitution.html
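To see what the substitution rule does, its regular expression can be
emulated outside of DataLad with ``sed``. This is only an illustration,
assuming a ``sed`` that supports ``-E``: DataLad applies the rule itself and
does not use ``sed``. Unlike in the DataLad configuration, ``&`` has to be
escaped in a ``sed`` replacement. The variable names are placeholders.

```shell
# Shortened URL, as it would be passed to `datalad clone`.
short_url="https://data.inm-icf.de/my-study/P000123"

# Emulate the substitution rule: \1 captures the study ID, \2 the visit ID.
expanded_url=$(printf '%s\n' "$short_url" | sed -E \
    's,^https://data.inm-icf.de/([^/]+)/(.*)$,datalad-annex::?type=external\&externaltype=uncurl\&url=https://data.inm-icf.de/\1/\2_{{annex_key}}\&encryption=none,')

printf '%s\n' "$expanded_url"
```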
Cloning will retrieve a lightweight dataset, which does not (yet)
contain file content. File content can be retrieved with
``datalad get``. DataLad will handle download and unpacking of the tar file.
Take a look at the section :ref:`dl-advanced` to learn about
useful convenience features DataLad adds on top of this.
Catalog-based clone URLs
^^^^^^^^^^^^^^^^^^^^^^^^
Instead of crafting clone URLs by hand, you can use the catalog in the
``datalad_catalog`` directory of the data store: clicking the
"Download with DataLad" button for an individual visit ID displays a
copy-paste URL for cloning.
.. rubric:: Footnotes
.. [1] To install software with pip, run a call such as the one below
in your favourite `virtual environment <https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/>`_::
python -m pip install datalad-next

@ -15,5 +15,8 @@ Please contact `ICF personnel`_ to get access and for any authentication-related
:caption: Contents:
browser
datalad
datalad-requirements
datalad-credentials
datalad-generate
datalad-access
datalad-advanced