Make generated datalad datasets reproducible #37

Open
opened 2023-06-01 05:23:52 +00:00 by mih · 1 comment
mih commented 2023-06-01 05:23:52 +00:00 (Migrated from github.com)

This requires a timestamp to be included in the tarball metadata.

It may also require deciding on an agent identity (committer), unless reproducibility is to be limited to a same-person scope.
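
A minimal sketch of how both points could be pinned when committing, assuming the timestamp has already been read from the tarball metadata. Git honours the `GIT_AUTHOR_*` and `GIT_COMMITTER_*` environment variables; the function name `reproducible_git_env` and the default agent name and email are hypothetical placeholders, not anything this project has decided on.

```python
import os
from datetime import datetime, timezone


def reproducible_git_env(timestamp, name="ICF agent",
                         email="icf-agent@example.org", base_env=None):
    """Return an environment that pins git identity and dates.

    `name` and `email` are hypothetical defaults standing in for
    whatever agent identity is eventually agreed on.
    """
    if timestamp.tzinfo is None:
        # A naive datetime would be interpreted in local time,
        # which defeats reproducibility across machines.
        raise ValueError("timestamp must be timezone-aware")
    env = dict(os.environ if base_env is None else base_env)
    # Git's internal date format: "<unix timestamp> <utc offset>"
    stamp = f"{int(timestamp.timestamp())} +0000"
    for role in ("AUTHOR", "COMMITTER"):
        env[f"GIT_{role}_NAME"] = name
        env[f"GIT_{role}_EMAIL"] = email
        env[f"GIT_{role}_DATE"] = stamp
    return env
```

The returned mapping could be passed as `env=` to `subprocess.run(["git", "commit", ...])`, so that two runs on the same input produce identical commit metadata regardless of who runs them.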

mih commented 2023-06-06 10:58:06 +00:00 (Migrated from github.com)

ATM datalad dataset IDs are also generated as UUID4 (random). In order to be reproducible, this must be changed.

It would make sense to generate a deterministic UUID5 and base it on another known identifier. A candidate is the tarball MD5. datalad-ebrains does something similar:

https://github.com/datalad/datalad-ebrains/blob/75acaae21e7daf4ea100ec7e8e8fa01774729e63/datalad_ebrains/fairgraph_query.py#L83-L88

Ping @jsheunis
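
A rough sketch of the UUID5 idea, assuming the tarball MD5 is used as the name and a fixed namespace UUID is agreed on. The namespace value and the function name `dataset_id_from_tarball` are hypothetical, and this is not copied from datalad-ebrains:

```python
import hashlib
import uuid

# Hypothetical namespace: any fixed UUID works, but it must never change,
# otherwise previously generated dataset IDs can no longer be reproduced.
ICF_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "icf.example.org")


def dataset_id_from_tarball(path, chunk_size=1 << 20):
    """Derive a deterministic dataset ID (UUID5) from a tarball's MD5."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large tarballs need not fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return str(uuid.uuid5(ICF_NAMESPACE, md5.hexdigest()))
```

The same tarball content always yields the same ID, while any change to the tarball yields a different one.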

Reference
inm7/inm-icf-utilities#37