# High throughput container workflows

Implementation of reproducible containerized processing pipelines for structural and functional MRI data, including quality assessment.
We propose a bootstrap approach for the reproducible setup of an entire processing workflow for a given dataset with a specific pipeline by executing a single shell script. This procedure capitalizes on the capabilities of the FAIRly big workflow, which relies on the following tools (sketched briefly after this list):

- DataLad as a well-tested, distributed data management tool
- Singularity as a reliable software hosting environment
- HTCondor and SLURM as powerful job scheduling systems for processing
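A minimal shell sketch of these building blocks in isolation; all URLs, image names, and file names below are placeholders, not taken from this repository:

```bash
# DataLad: distributed, version-controlled data management
datalad clone https://example.org/mri-dataset input-ds   # hypothetical URL
datalad -C input-ds get sub-01                           # fetch one subject's file content

# Singularity: containerized software environment
singularity exec mriqc.sif mriqc --version               # hypothetical image name

# HTCondor / SLURM: job scheduling
condor_submit code/process.submit                        # HTCondor submit file
sbatch code/process.sbatch                               # SLURM batch script
```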
The three ready-to-use pipelines implemented here cover the following basic MRI processing workflows:

- Quality assessment of MRI data for quality control (QC)
  - bootstrapQC
- Structural preprocessing for anatomical analysis and inference
  - bootstrapCAT
- Functional preprocessing for activity and connectivity analyses
  - bootstrapfMRIprep
The QC workflow is designed to inform and optimize subsequent statistical analyses of MRI datasets using a combination of established quality assessment tools: CAT12 and MRIQC. It provides a multitude of image quality metrics (IQMs) for the whole dataset, including flags for potential outliers with low image quality. The data are presented in machine-readable CSV tables, as well as in an interactive HTML report provided by MRIQC for browsing individual subjects' IQMs.
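As a hedged example of working with the tabular output, the following snippet lists subjects flagged as potential outliers; the file and column names (`group_IQMs.csv`, `bids_name`, `outlier`) are hypothetical and must be adjusted to the tables the workflow actually produces:

```bash
# Print the IDs of all rows whose (hypothetical) "outlier" column is set.
awk -F',' '
    NR==1 { for (i = 1; i <= NF; i++) col[$i] = i; next }  # map header names to column indices
    $col["outlier"] == 1 { print $col["bids_name"] }       # report flagged subjects
' group_IQMs.csv
```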
CAT12, the Computational Anatomy Toolbox (Gaser et al., 2024), performs full preparation of structural MRI data for volumetric and cortical surface-based analyses. CAT12 is an extension to SPM12 in Matlab/Octave, used here as a compiled standalone version in a Singularity container. The toolbox covers diverse morphometric analysis methods such as voxel-based morphometry (VBM), surface-based morphometry (SBM), and label- or region-based morphometry (RBM). The CAT workflow provides morphometric derivatives for subsequent statistical and predictive modeling of brain anatomy.
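A hedged sketch of a single containerized CAT12 call, assuming the image exposes the standalone entry point `cat_standalone.sh` with a segmentation batch script; the container name and batch script path are illustrative:

```bash
# Segment one T1-weighted image with the CAT12 standalone inside Singularity.
# "cat12.sif" and the batch script name are hypothetical placeholders.
singularity exec cat12.sif \
    cat_standalone.sh -b cat_standalone_segment.m sub-01_T1w.nii
```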
fMRIPrep is a flexible pipeline for preprocessing all commonly acquired flavors of functional magnetic resonance imaging (fMRI) data, including optional preparation of related structural data for surface-based workflows with FreeSurfer. It is designed to be robust to variations in scan acquisition protocols and, by relying on the Brain Imaging Data Structure (BIDS) convention, requires minimal user input while providing easily interpretable, comprehensive reporting in addition to analysis-ready derivatives.
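A hedged sketch of a containerized fMRIPrep call for a single participant; the image name and all paths are illustrative:

```bash
# Preprocess participant 01 from a BIDS dataset; --cleanenv isolates the
# container from the host environment.
singularity run --cleanenv fmriprep.sif \
    /data/bids /data/derivatives participant \
    --participant-label 01 \
    --fs-license-file /data/fs_license.txt
```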
For each workflow, a template and example bootstrap scripts are provided to set up all necessary parts for processing a whole MRI dataset in ephemeral clones. Only after successful completion of a compute job in the pipeline are its results pushed to a DataLad special remote, from where the processed data can be cloned as part of the generated dataset.
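For illustration, consuming such results could look as follows, assuming the special remote is a RIA store; the URL and dataset alias are hypothetical:

```bash
# Clone the generated results dataset from a (hypothetical) RIA store ...
datalad clone 'ria+ssh://hpc.example.org/data/store#~derivatives' derivatives
# ... and fetch the actual file content on demand.
datalad -C derivatives get .
```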
## Stages of pipeline execution
The computation of result files is executed in three stages:
1. Dataset preparation: All prerequisites for data processing are set up automatically, including the input dataset, the software pipeline, and the submission scripts that trigger job scheduling in high-throughput and high-performance computing environments (HTC and HPC). The bootstrap template script is tailored to a given pipeline's (I) input data, (II) storage setup for saving the pipeline's output, and (III) available job scheduling system for efficient, parallelized data processing. When executed, the bootstrap script creates an empty dataset that includes all the necessary scripts for processing the data, as well as links to the input dataset, the software containers, and the dataset repository that will gather all the processed data derivatives.
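   In DataLad terms, this preparation amounts to roughly the following steps, which the bootstrap script automates; all names and URLs are illustrative:

   ```bash
   datalad create analysis                 # empty analysis dataset
   cd analysis
   # link the input data and the container collection as subdatasets
   datalad clone -d . https://example.org/bids-dataset inputs/data
   datalad clone -d . https://example.org/containers code/pipelines
   # register a RIA store as the sibling that will gather the results
   datalad create-sibling-ria -s output --new-store-ok 'ria+file:///data/store'
   ```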
2. Job submission: Compute jobs for processing the full dataset are submitted with provenance tracking in DataLad. This captures a machine-readable, re-executable run record for every compute job, associated with each derivative file produced by the workflow. Executing the prepared job submission script for the available HTC/HPC environment triggers maximally parallel processing of the entire input dataset. The pipeline setup guarantees that data transfer to the desired remote location occurs only if data processing completes successfully.
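   A hedged sketch of this stage; the submit file, script, and container names are illustrative, and the provenance capture assumes the datalad-container extension:

   ```bash
   condor_submit code/process.condor_submit    # HTCondor environment
   # or: sbatch code/process.sbatch            # SLURM environment

   # inside each job, the pipeline call is wrapped in a re-executable run record
   datalad containers-run -n cat12 \
       --input inputs/data/sub-01 \
       --output sub-01 \
       'code/run_pipeline.sh sub-01'
   ```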
3. Dataset consolidation: As parallel data processing is only possible with independent job execution, the resulting data must be consolidated into a final dataset. Running the initially prepared merge script ensures that all derivatives are available in one place as a ready-to-use, cloneable DataLad dataset, including access control mechanisms for data privacy preservation.
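   At its core, the prepared merge script performs an octopus merge of the per-job result branches that the ephemeral clones pushed to the output store; the remote and branch naming below is illustrative:

   ```bash
   cd analysis
   git fetch output                      # fetch all per-job result branches
   git merge -m "Merge results from all compute jobs" \
       $(git branch -r --list 'output/job-*' | tr -d ' ')
   datalad push --to output              # publish the consolidated dataset
   ```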