Test-retest reliability of fMRI experiments during robot-assisted active and passive stepping

Background Brain activity has been shown to undergo cortical and sub-cortical functional reorganisation over the course of gait rehabilitation in patients suffering from a spinal cord injury or a stroke. These changes however, have not been completely elucidated by neuroimaging to date, mainly due to the scarcity of long-term, follow-up investigations. The magnetic resonance imaging (MRI) compatible stepper MARCOS was specifically developed to enable the investigation of the supraspinal adaptations in paretic patients undergoing gait-rehabilitation in a controlled and repeatable manner. In view of future clinical research, the present study aims at examining the test-retest reliability of functional MRI (fMRI) experiments using MARCOS. Methods The effect of repeated active and passive stepping movements on brain activity was investigated in 16 healthy participants from fMRI data collected in two separate imaging sessions six weeks apart. Root mean square errors (RMSE) were calculated for the metrics of motor performance. Regional overlap of brain activation between sessions, as well as an intra-class correlation coefficient (ICC) was computed from the single-subject and group activation maps for five regions of interest (ROI). Results Data from eight participants had to be excluded due to excessive head motion. Reliability of motor performance was higher during passive than active movements, as seen in 4.5- to 13-fold lower RMSE for passive movements. In contrast, ICC ranged from 0.48 to 0.72 during passive movements and from 0.77 to 0.85 during active movements. Regional overlap of activations was also higher during active than during passive movements. Conclusion These findings imply that an increased variability of motor performance during active movements of healthy participants may be associated with a stable neuronal activation pattern across repeated measurements. In contrast, a stable motor performance during passive movements may be accompanied by a confined reliability of brain activation across repeated measurements.


Background
Exercises for functional gait rehabilitation, such as walking on a treadmill or with the aid of a robotic gait orthosis, have a major positive impact on restoration of walking in patients suffering from spinal cord injury or stroke. Previous studies investigating the effect of such functional gait-rehabilitation exercises on brain activity indicate a promotional effect for supraspinal plasticity in the motor centres expected to be involved in locomotion [1,2]. However, the quality of this neural plasticity and its underlying physiological mechanisms have not been characterised in detail mainly due to the lack of standardised experimental conditions for follow-up studies. Longitudinal interventional studies combining extensive gait rehabilitation with a standardised controlled and measurable motor paradigm of the lower limbs during imaging of the brain might further disentangle the effect of gait training on brain activation. Newton et al. presented a motor paradigm of the lower limbs to investigate brain activity during simultaneous control of static moments around the hip, knee and ankle joint in one leg [3]. However, over ground walking involves the control of a dynamic and bilateral anti-phasic simultaneous movement of both legs under the transient influence of ground reaction forces. The magnetic resonance (MR) compatible stepper MARCOS has been developed to deliver and monitor repeated gait-like stepping movements in a standardised manner across task-related functional magnetic resonance imaging (fMRI) experiments [4]. The robot facilitates active (i.e. produced by the participant), as well as passive movements (i.e. imposed by the device). The investigation of passive movements can be meaningful in patients with no, or very limited, voluntary muscle activity in the legs as it is independent of performance ability, yet may be informative about the capacity for sensory adaptations to training [5]. Furthermore, the robot can impose loads against the soles of the feet along the cranio-caudal body axis mimicking ground reaction forces during stepping, thereby activating load-sensitive receptors in the lower limbs.
When investigating brain activation during lower limb motor control repeatedly over the period of rehabilitative intervention, knowledge on the test-retest reliability of the data is indispensable as activated brain areas have been shown to undergo test-retest effects between repeated imaging sessions [6]. In the context of interventional studies, information on the stability of a measurement serves as the basis for differentiating true effects caused by a therapy from those caused by variations in the experimental conditions.
In functional brain imaging, measures of reliability can be either calculated from single subject activation maps, or from activation maps derived from random effects group analyses, depending on whether conclusions shall be formulated for an individual participant or for a representative group of participants that was drawn from a particular population. In the context of interventional studies both are desirable, the former to judge on the effect of the intervention in a particular patient, the latter to generalise the findings of a study to a population.
A number of statistical tests have been proposed for judging the effects of repeated examination of brain activation. In motor control fMRI experiments in both healthy participants and patients, planned comparisons of activation maps, percent of signal intensity change, intra-class correlation coefficient (ICC), voxel count, overlap of activations between repeated sessions, coefficient of variation (CV) and the comparison of the location of the centre of gravity of activated clusters have been applied [3,[6][7][8][9][10][11][12]. Since all of these measures examine the retest-reliability of a given experiment from different perspectives, they have usually been combined to draw inferences.
An ICC, which is calculated from pairs of activation maps, appears to be the most appropriate measure of reliability for fMRI-data, since it is calculated from the variance components of the imaging data and does not depend on the magnitude of activations [13]. It has become the most widely used metric of reliability in fMRI studies.
Imaging studies using ICC as a measure of reliability of fMRI motor control experiments report in general good repeatability (values are deemed excellent above 0.75, good between 0.59 and 0.75, fair between 0.40 and 0.58 and poor for values below 0.40 [14]). Newton et al. reported the results of their study of unilateral singlejoint lower limb motor control in which ICC was calculated from pairs of activation maps for two particular regions of interest (ROI). The voluntary production of torques led to individual ICC ranging from poor to excellent across subjects both in the primary sensorimotor (S1M1) and premotor (Brodmann Area (BA) 6) cortex [3]. In a recent reliability study investigating active and passive flexion and extension of the elbow using an MRcompatible manipulandum, fair to excellent ICC for active and passive movements was estimated for all of the investigated ROIs [11]. However, the test-retest reliability of a given paradigm and hence its ICC depends on numerous parameters throughout data acquisition and analysis, such as imaging hardware, resolution or spatial smoothing of the data (for a review see [15]). The comparability of different studies of reliability is therefore limited, and test-retest reliability needs to be established for each particular paradigm.
In view of future robot-aided fMRI assessments in longitudinal interventional studies, the aim of the present study was to assess the test-retest reliability of experiments using the robot MARCOS. The stability of motor performance and brain activation during the execution of two stepping conditions were investigated. Passive stepping without foot load and active stepping against a foot load of 40 % body weight were assumed to represent the most reliable and the least reliable motor task respectively. The passive condition was expected to yield more reliable results than the active condition, since the passive condition is robot-driven by a strict position control algorithm, while in the active condition motor behaviour is controlled by the participant, and therefore more variable.

Stepping robot
The MR-compatible stepping robot MARCOS was employed to control active and passive stepping movements throughout the experiment [4,16]. When placed inside the robot, the knees and feet of the participant are each attached to a pneumatic cylinder (Fig. 1a). The arrangement of the cylinders allows for one-degree of freedom flexion and individual extension movements of each leg, and the resulting gait-like stepping movement in the sagittal plane resembles "marching-on-the-spot". In addition, external loads of up to 400 N along the cranio-caudal body axis can be rendered to the soles of the feet of the participants during movements by the foot cylinders in order to simulate ground reaction forces. The desired load at the foot is inversely proportional to the vertical position of the knee, such that highest force levels occur at full extension of the leg. Each of the four actuators is equipped with position and force sensors enabling accurate measurement of movement kinematics and kinetics. Data is sampled and stored at 80 Hz for off-line analysis of motor and robot performance. The robot is governed by two PCs: the sensor evaluation CPU runs on Linux and communicates via Ethernet with the control CPU running Matlab xPC real-time target. The control CPU executes control of the pneumatic valves as well as the control of the gait-pattern. Several redundant mechanisms were implemented in the robot to guarantee safety of the participants: 1) mechanical end stops prevent non-physiological postures of the lower limbs, 2) cylinder positions are measured, divergent positions cause the robot to shut down, 3) both CPUs are monitored by the control software and any error in the CPUs causes the robot to shut down, 4) watchdog circuits monitor the CPUs and the communication, 5) the operator can shut down the robot through emergency stops, and 6) upon an emergency stop, all cylinder chambers are set to atmospheric pressure and the cylinders can be moved freely.
A custom made hip and shoulder fixation as well as a custom made head bowl are combined with the inflatable Crania pillow (www.pearltec.ch) to prevent excessive task-related motion of the upper body and the head. MARCOS was built from materials of low magnetic susceptibility (i.e. aluminium, brass, polyvinyl chloride).  1 The investigational set-up as applied in the study. For the purpose of the present study, the MR-compatible robot MARCOS was mounted on the bench of the MR-scanner. The robot was used to measure and control the delivery of active and passive stepping movements (a). During task execution, the word "MOVE" was presented on the screen and participants conducted stepping movements in the rhythm of the concurrently presented auditory stimulus (metronome). This was followed by the acquisition of the BOLD-signal, while participants fixated on a white cross on the screen. Subsequently, the word "LISTEN" and the metronome were presented concurrently, again followed by the acquisition of the BOLD-signal (b) MR-compatibility of the system was established by Hollnagel et al. [4].

Experimental procedure
Sixteen healthy participants were investigated during two separate sessions (t 1 and t 2 ) six weeks apart. The chosen retest interval represents a common duration of rehabilitative gait interventions [17,18], is in line with previous studies investigating the plasticity of brain activity in response to motor rehabilitation [1,19] and corresponds to previous studies assessing reliability of fMRI signals [6,11]. Participants were eligible for inclusion in the study if they did not meet any of the following exclusion criteria: 1) diagnosed neurological, musculoskeletal or cardiac dysfunction at present or in the past, 2) cardiac pacemaker, neuro-stimulator or hearing aid and 3) drugabuse. The study was approved by the Ethics Committee of the Canton of Zurich (approval Nr. 856) and was conducted in accordance with the guidelines for research involving human subjects as outlined by the Declaration of Helsinki. All participants were informed about the aims and procedures of the study and gave their written consent for participation.
All participants performed passive stepping without foot load (passive) and active stepping with a load of 40 % of individual body weight acting against the foot soles (active40) in both imaging sessions. The leg movements of the participants (i.e. range of motion, stepping cadence and interaction forces with the robot) were measured online by the robot during both sessions. Movement conditions were tested in random order in separate runs of functional image acquisition in a blocked design. Each functional run consisted of 15 blocks of movement, interleaved with 15 blocks of a control condition. Each block lasted 10 s and was followed by 9.075 s of image acquisition. Movement cadence during both conditions was set to 0.5 Hz by the presentation of a metronome through the earphones [20,21], yielding five steps per leg in each trial. Although the cadence was imposed by the robot during passive movements, the metronome was also presented in this condition, as well as during the control condition, to match auditory stimulation. The control condition served two purposes: firstly as a reference condition against which brain activity during movement trials was compared and, secondly, to control for auditory activations elicited by listening to the metronome per se. Visual cues were projected onto a screen near the feet of the participants at the start and for the duration of each block. Participants could see the screen by means of a mirror mounted to the head coil of the scanner. The word "MOVE" was presented for movement trials, while "LISTEN" was presented during control trials (Fig. 1b). As passive movements were imposed on the participants they were instructed to relax the muscles of their lower limbs and to not voluntarily contribute to flexion and extension of their lower limbs. During active movements, participants were instructed to voluntarily produce flexion and extension of their lower limbs in the rhythm set by the metronome. Under these conditions, the cylinders attached to the knees limited the range of motion and the cylinders attached to the feet rendered the desired foot loads as participants cycled through the steps. Furthermore, participants were instructed to fixate on a white cross on the screen during image acquisition between the "MOVE" and "LISTEN" blocks and to not rehearse or imagine movement execution when listening to the metronome alone. Participants were familiarised with each movement condition before the start of the experiment and informed about the upcoming type of movement before the start of each functional run.

Image acquisition
Image acquisition from all participants was carried out on the same 1.5 T Philips Achieva scanner (Philips Medical Systems, Best, the Netherlands) at the University Hospital of Zurich using an 8-channel SENSE TM head coil. The sparse sampling imaging protocol consisted of clusters of image acquisition interleaved with silent gaps of 10 s length [22]. Each imaging cluster comprised of three consecutive volumes (TR = 3.025 s). The duration between the onsets of two imaging clusters was hence 19.075 s. 93 volumes in 31 clusters of 3 volumes were acquired, using a whole brain T2*-weighted, single-shot, echo planar imaging (EPI) sequence (TE = 50 ms, flip angle = 90°, SENSE factor = 1.6). 35 interleaved, angulated, transversal slices covering the whole brain were acquired in each volume (field of view = 220 mm × 220 mm, acquisition voxel size: 2.75 × 2.8 × 3.8 mm, resliced to 1.72 × 1.72 × 3.8 mm).

Data analysis Motor performance
Three metrics of motor performance were calculated for both stepping conditions at t 1 and t 2 : Knee amplitude was defined as the range of motion of the knee per step and stepping frequency was defined as the number of steps of one leg per second. Foot load was defined as the maximal interaction force between the foot and the robot per step. Position and force data were extracted using custom Matlab routines (Matlab 2012b, Mathworks Inc., Natick, MA, USA, www.mathworks.com). Position data was filtered with a low pass 1st-order Butterworth filter with a cut-off frequency of 4 Hz and the mean knee amplitude and stepping frequency were extracted from each leg in each block of movement. The mean foot load was extracted from the data recorded from the force sensors at the foot cylinders per block.
For each participant and condition, values were then averaged across all blocks. Data was further averaged over both legs, as foot load, knee amplitude and stepping frequency values of the left and the right leg were not significantly different (planned comparisons, all p-values > 0.1) in any of the conditions.
The left/right-averaged data of each performance metric of both sessions was then subject to an individual 2-way repeated measures ANOVA with the factors time and condition. This allowed testing for significant differences between time points and conditions. To test the hypothesis of no significant differences between t 1 and t 2 in all of the performance metrics within each condition, additional planned comparisons were applied in case of a significant main effect of time. The significance level for all statistical tests of motor performance was set to α = 0.05.
To further assess reliability of repeated test sessions, the root mean squared error (RMSE) of differences between t 1 and t 2 was calculated using the following formula: where n denotes the total number of measurements of the metric at each session (i.e., 15) and x 1,i & x 2,i are the i -th pair of values of the measurements at t 1 and t 2 .The RMSE is an indicator of the absolute reliability. The absolute difference in measurements of the same metric repeated in two different sessions is expected to lie within 2.77*RMSE in 95 % of the measurements (normal distribution of the measurements is assumed, this was verified by means of visual inspection of Q-Q-plots) [23]. In other words, a true effect due to an intervention is likely if the difference between repeated measurements is higher than the RMSE multiplied by 2.77. Small RMSE values indicate low variability between measurements at t 1 and t 2 . This indicator shall provide a practical of reliability for users of the robot.In order to compare the foot load variability to data in the literature, the CV was calculated across repeated measurement sessions using the following formula for each individual: where mean i denotes the mean of measurement i, SD i denotes the standard deviation of measurement i, n i denotes the number of trials at measurement i, and k denotes the total number of measurements.

Analysis of imaging data
All fMRI datasets were analysed using SPM8 (Wellcome Department of Cognitive Neurology, London, UK, www.fil.ion.ucl.ac.uk/spm) running on Matlab 2012b (Mathworks, Inc., Natick, MA, USA, www.mathworks. com). The first three volumes prior to the first taskblock were removed from each run. In spatial preprocessing the remaining 90 volumes were firstly realigned to their mean image and unwarped to remove residual head motion related variance and image distortions along air-tissue boundaries [24]. Secondly, all data from t 2 was coregistered to the mean image of the respective condition at t 1 . Thirdly, all images were normalised into standard MNI space using to the EPI-template provided by the Montreal Neurological Institute, re-sliced to a voxel size of 2 × 2 × 2 mm 3 , and finally all data was spatially smoothed (FWHM = 8 mm). The estimated realignment parameter data from the realignment step were filtered using the discrete cosine transform matrix filter (cut off at 128 s) incorporated in SPM8, to remove linear baseline drifts. Only data from participants whose estimated head motion parameters were below a stringent threshold of ½ voxel size after filtering in every spatial dimension in both conditions and at both experimental sessions were included in the subsequent statistical analysis. In the 1st-level analysis the data from t 1 and t 2 were modelled as two separate task regressors in the same general linear model (GLM) for each movement condition individually [25]. Two additional regressors were added to the model for each session to account for the T1-decay along consecutive volumes [26]. A high pass filter (cut off at 128 s) was used to remove slow signal drifts. To account for the sparsesampling fMRI scheme, data taken during each trial was modelled using a boxcar function (1st-order, window length 3 x TR (i.e., 9.075 s)) [27]. Contrast images for each task regressor were calculated to reveal task-related activation at t 1 and t 2 . To estimate the task-related effects at the group level, all contrast images of a specific task from the 1st-level analysis were subject to individual one-sample t-tests. Planned comparisons were computed, in order to test for significant differences between t 1 and t 2 . The resulting activation maps were limited to a cluster-corrected voxel threshold of p < 0.001 (spatial extent: k ≥ 42 contiguous voxels) [28,29]. The cluster threshold method was applied to control for the overall type I error. Anatomical correlates of clusters of activation were determined with the help of probabilistic cytoarchitectonic maps implemented in the Anatomy toolbox [30]. This toolbox was also used to define bilateral anatomical regions of interest (ROI) in the primary motor cortex (M1), primary somatosensory cortex (S1), secondary somatosensory cortex (S2) and the cerebellum Table 1. The ROI covering M1 was built by combining BA 4a and 4b [31]. BAs 1, 2, 3a and 3b served to create the ROI in S1 [32][33][34]. The ROI covering S2 was built by combining areas Operculum (OP) 1, OP2, OP3 and OP4 in the parietal operculum [35,36]. The ROI located in the cerebellum was created by combining the lobules I to X (lobes and vermis) included in the Anatomy toolbox [37]. A ROI covering SMA was built from the automated anatomical labeling atlas [38] using the WFU_pickatlas toolbox [39]. These specific ROIs were selected as these areas have repeatedly been reported to be involved in lower limb motor control in previous studies [21,22,[40][41][42].

Indices of reliability
To assess the reliability of activations in these specific ROIs the following indices were calculated for the individual data at the 1st-level as well as for the 2nd-level group data from pairs of activation maps: In order to gain insight into the spatial congruence of activations, the relative overlap of activations between t 1 and t 2 was determined by calculating the Sørensen-Dice index [43][44][45]: where V overlap represents the number of voxels commonly activated at t 1 and t 2 , and V 1 / V 2 represent the number of voxels that were activated at t 1 or t 2 respectively. This ratio of commonly activated voxels and the sum of activated voxels at the two sessions was calculated from activation maps that were limited to p ≤ 0.001, uncorrected for multiple comparisons. This index can range from 0 (no overlap) to 1 (perfect overlap) and is independent of the height of the t-values, once voxels have passed the threshold. However, a specific voxel with comparable activation at t 1 and t 2 might pass the threshold in one session, but only just fail to pass the threshold in the second session. In this case the denominator of the above ratio is increased, leading to an underestimation of the overlap between sessions.
To complement the results from R overlap , the ICC was calculated, a measure of reliability that is derived from activation maps without any statistical voxel threshold. A two-way mixed model for consistency between measurements, i.e. ICC (3,1) was applied in the current study [11,46]. In the case of two repeated measurements, the ICC coefficient was calculated using: where BMS denotes the Between voxel Mean Square variance, while EMS denotes the Error Mean Square variance. Using unthresholded data for the calculation of the ICC is legitimate, since the ICC is based purely on the variance of the data and does not depend on the level of activation itself. As such, voxels with low activation might exhibit high ICC coefficients, meaning they have consistent activation despite failing to pass significance in a t-test in the fMRI-analysis (i.e. in the case of voxels whose response to the stimulus poorly fits the modelled hemodynamic response function). However, at the same time the ICC might also include some voxels that were not involved in the task. The coefficient may range from 0 (low reliability) to 1 (perfect reliability). In the present study, ICCs were classified as excellent above 0.75, good between 0.59 and 0.75, fair between 0.40 and 0.58 and poor for values below 0.40, as proposed by [14]. The results calculated from the single subject data, R overlap and ICC single were then condensed by averaging across all participants during each movement condition, yielding mean values for R overlap and ICC single . Fisher's z-transform was applied to ICC single values before averaging.
To further test for statistically significant differences of R overlap or ICC single across ROIs and conditions, the results of each reliability index derived from the 1stlevel fMRI analyses were entered into a separate 2-way repeated measures ANOVA with the factors condition and ROI. To test the hypothesis of no significant differences within ROIs across conditions, additional planned comparisons were conducted in case of a significant main effect of condition (α = 0.05, with a Bonferroni correction for multiple comparisons). Normal distribution of the data was verified by visual inspection of Q-Q-plots.

Results
All participants performed both stepping conditions at t 1 and t 2 during functional image acquisition. The retest interval between t 1 and t 2 ranged between 42 and 48 days. In both stepping conditions all data from 8 of the 16 participants was excluded from the analysis due to head motion exceeding ½ voxel size during image acquisition either at t 1 or t 2 , or both. Most of the head motion occurred in the z-direction (inferior/superior), i.e. along the cranio-caudal body axis, probably reflecting the impact of the stepping movements of the legs. Characteristics of the study sample can be found in Table 2. The participants of the present study are a subset of those reported in [22].

Motor performance
The 2-way repeated measures ANOVA of the performance metric knee amplitude did not reveal a significant interaction effect between the factors of time and condition (F 1,7 = 4.057, p = 0.084). The factor condition showed a significant main effect (F 1,7 = 9.751. p = 0.017).
For the performance metric foot force, the 2-way repeated measures ANOVA did reveal a significant interaction effect between the factors time and condition (F 1,7 = 6.083, p = 0.043). Furthermore, a significant main effect was found for both factors time (F 1,7 = 9.974, p = 0.016) and condition (F 1,7 = 419.307, p < 0.001). Planned comparisons for the factor time revealed a significant difference in the foot force between t 1 and t 2 in condition active40 (T 7 = 2.968, p = 0.021), but not in the condition passive (T 7 = 0.615, p = 0.558) ( Table 3).
Finally, the 2-way repeated measures ANOVA for the performance metric stepping frequency did not reveal a significant interaction between the factors time and condition (F 1,7 = 0.554, p = 0.481). Furthermore, no significant main effect of time was detected (F 1,7 = 0.957, p = 0.361). However, the factor condition showed a significant main effect (F 1,7 = 12.805, p = 0.009).
During passive stepping, the mean (standard deviation) RMSE of differences between t 1 and t 2 for knee amplitude was 0.0033 (0.0027) m, 0.0036 (0.002) Hz for stepping frequency and 3.9922 (3.5492) N for the foot force. During active40 stepping, the mean RMSE of differences between t 1 and t 2 for knee amplitude was 0.027 (0.0152) m, 0.047 (0.0177) Hz for stepping frequency and 18.8534 (10.7748) N for foot force (Fig. 2). When comparing the RMSE values between the two movement conditions, values were about 8-fold higher for knee amplitude, about 13-fold higher for stepping frequency and 4.5-fold higher for foot force in the condition active40 than in passive. The higher RMSE for foot force in active40 is supported by a higher mean CV of 2.21 (0.45) % for ac-tive40 vs. 1.83 (1.94) % in passive.

Functional brain activation during stepping at t 1 and t 2
During passive stepping, overlapping activation across t 1 and t 2 was found in bilateral S1/M1, superior parietal lobe, S2, SMA proper and the cerebellar vermis. At both measurement sessions the middle cingulate gyrus was furthermore activated, albeit these clusters did not spatially overlap. During active40 stepping, overlapping cortical activation across t 1 and t 2 was found in bilateral S1/M1, superior parietal lobe, S2 and SMA proper. Overlapping subcortical activation in the anterior and posterior cerebellar vermis was furthermore found in this condition. At t 2 bilateral activation of the thalamus was additionally found during active40 (Table 4, Fig. 3a and b).

Repeatability of fMRI measurements
Planned comparisons between the activation maps at t 1 and t 2 for passive and active40 stepping revealed only a minor, yet significant, difference between measurements in both movement conditions. When compared to t 2 , passive stepping at t 1 led to significantly higher activation in the left supramarginal gyrus and in the cerebellar vermis (Fig. 3c). No area showed significantly higher activation at t 2 than at t 1 in the passive condition. During active40 stepping at t 2 , significantly higher activation was only found in the left middle temporal sulcus than at t 1 (Fig. 3d), while no area showed significantly higher activation at t 1 than at t 2 during active40.
These small group-level differences between measurements at t 1 and t 2 in both stepping conditions are not fully supported by the ROI-analysis. For passive stepping, only a small amount of overlapping activation between 2nd-level group data at t 1 and t 2 was found in the cerebellum and S2, while M1, S1 and SMA lacked any overlapping activation (i.e., R overlap = 0). For active40 stepping, overlapping activation was found in all of the investigated ROIs (Table 5).
Furthermore, the ROI analysis revealed higher ICCs for activations during active40 than during passive stepping in the cerebellum, S1, S2 and SMA, but not M1. ICC group calculated from the 2nd-level group data during passive stepping revealed fair repeatability for S1 and SMA and good repeatability for S2, the cerebellum and M1. During active40 stepping excellent repeatability was found for all of the ROIs (Table 4).
When calculating average repeatability from the individual 1st-level fMRI data during the passive condition,  M1 primary motor area, S1 primary somatosensory area, S2 secondary somatosensory area, BA brodmann area, OP operculum, SMA supplementary motor area, AAL automated anatomic labelling atlas [38] fair ICC single was found for the cerebellum, good ICC single for S1 and excellent ICC single for M1, S2 and SMA. During active40 stepping good reliability of activations was found in the cerebellum, S1 and S2, while excellent averaged ICC single was found in M1 and SMA (Fig. 4).
Individual ICC single during passive stepping ranged from poor to good in the cerebellum and from fair to excellent in S2 and SMA, while in M1 and S1, ICC single ranged from good to excellent. During active40 stepping ICC single ranged from fair to excellent in the cerebellum and S1 and from good to excellent M1, S2 and SMA ( Table 5). The 2-way repeated measures ANOVA for ICC single scores revealed a significant interaction effect of condition by ROI (F 4,28 = 16.173, p < 0.001). Furthermore, there was a significant main effect of ROI (F 4,28 = 16.923, p < 0.001), but no significant main effect of condition (F 1,7 = 3.538, p = 0.101).
The 2-way repeated measures ANOVA for the single subject R overlap scores revealed a significant interaction effect of condition and ROI (F 4,28 = 3.537, p = 0.019). Furthermore, there was a significant main effect of ROI (F 4,28 = 13.206, p < 0.001), but no significant main effect of condition (F 1,7 = 0.413, p = 0.541).

Discussion
The present study explored the test-retest reliability of motor performance and brain activation of a novel robot-aided experimental fMRI paradigm at the individual and group-level. The consistency of task-induced blood oxygenated level dependent (BOLD)-signal was compared between repeated measurements of active and passive gait-like stepping in the MR-compatible stepper MARCOS. To the authors' knowledge, this is the first fMRI reliability study of brain activation during bilateral multi-joint lower limb movements.

Motor performance
Passive stepping did not reveal any statistically significant differences in motor performance between the measurements at t 1 and t 2 . The absence of statistically significant differences indicates very stable performance by the robot across repeated measurement sessions during passive movements. High repeatability during this condition is further supported by a low RMSE between t 1 and t 2 for each of the three metrics. Since the robot was strictly governed by position control in this condition, it was expected that the performance metrics would exhibit very low variability.
The healthy participants in this study reported in general no difficulties in maintaining limb passivity during the movements that were imposed by the robot. This observation is supported by low values of foot force and negative interaction forces between the robot and the participants at the knees (not shown), meaning that their legs were indeed suspended by the fixations during passive steps. High interaction forces would indicate a lack of muscle relaxation. In experiments with neurologically impaired patients this could, for example, point towards the presence of spasticity in the involved muscles, i.e. an increased resistance to imposed movement. The low RMSE for foot force further indicates that the participants were able to maintain limb passivity at similar levels during both experiments.
The metric foot force was significantly smaller at t 2 than at t 1 during active40 stepping. However, the CV of the foot force during active40 in the present study is smaller than the CV of the vertical ground reaction force during ground level walking (7 % in [47], single subject measured 9 times over 3 days), or during walking on the treadmill (5-8 % in [48], 10 subjects over 40 steps). Therefore this difference may be attributed to the variability inherent in the human motor system and interpret this finding as not physiologically relevant. This rationale is furthermore supported by the fact that no influence of movement performance on activity in specific regions of the brain was found in previous work [22]. The smaller foot forces at t 2 were driven by the concurrent reduction in knee amplitude, since in active40 stepping the robot is governed by a controller generating foot forces in proportion to the position of the knees. Individual RMSE between t 1 and t 2 of all motor performance metrics were 4.5 to 13-fold above those during passive stepping, indicating higher variability of movements  between measurements at t 1 and t 2 in active40. High RMSE-values also indicate a limited sensitivity of the experimental set-up in this condition. In the context of an interventional study, it would hence be less likely to detect an actual rehabilitation related change in the motor performance during active movements (e.g. a reduction in movement variability). The stepping frequency of 0.5 Hz investigated in the present study was approximately three times lower than that of over-ground gait at a comfortable speed in healthy participants (approximately 1.75 Hz [47]). In view of future clinical work with neurological patients, a reduced stepping frequency was chosen for two reasons: Firstly, lower stepping frequencies induce less taskrelated head motion (results from pilot experiments not shown), a factor that positively contributes to data quality. Secondly, lower stepping frequencies are well tolerated by neurological patients who are prone to developing spasticity, in particular during passive movements (results from pilot experiments not shown).

Activated areas during stepping
Both stepping conditions led to significant BOLD signal increases at t 1 and t 2, as compared to the control condition, in areas which have been previously reported to be involved in supine gait-like stepping [22], in pedaling (See figure on previous page.) Fig. 2 Motor performance and its reliability during passive and active40 stepping. Motor performance at session 1 (t1) and 2 (t2) and root mean squared error (RMSE) of differences between t1 and t2 of the individual participants during passive (left column) and active40 (right column) stepping. a knee amplitude, b stepping frequency and (c) foot force. Rows 1, 3, 5: mean ± one standard deviation at t 1 and t 2 . Rows 2, 4, 6: RMSE of differences between t 1 and t 2  Coordinates indicate the location of the peak activation in each cluster. All coordinates are in MNI-space, voxel threshold was p ≤ 0.001, cluster-corrected, k = 42 voxels S2 secondary somatosensory cortex, SMA supplementary motor area, S1/M1 primary sensorimotor cortex, CMA cingulate motor area [21,40,41], as well as during isolated movements of the lower limbs [42,49].

Reliability of fMRI measurements at the group level
After the analysis of the estimated head motion parameters from the realignment step, eight participants had to be excluded from both stepping conditions. This high dropout rate exemplifies that the test-retest reliability of the presented paradigm is a priori and is limited by the task-induced head motion in many participants. This occurred despite extensive body fixation applied to the participants at the torso and the head.
In those participants included in the analysis, the comparison of activations elicited by passive stepping at t 1 and t 2 revealed only minor, yet significant, differences, accompanied by minor and statistically non-significant changes in motor performance. They specifically consisted of higher activity observed in the left supramarginal gyrus and the cerebellar vermis at t 1 . Participants were able to maintain limb passivity at equal levels during both sessions. However, many of the participants reported that passiveness of the limbs required considerable concentration and attention to the task. Due to the novelty and relatively unnatural character of the task at t 1 , imposed passivity might have caused a higher cognitive load than at t 2 , and this might be reflected by session-specific variations in the related cortical processes causing differences in test-retest outcome measurements. The supramarginal gyrus has been shown to be involved in motor attention [50], hence its differential activation between sessions may indicate an effect of habituation from t 1 to t 2 , despite the provision of rehearsal trials before image acquisition of each movement condition. Differences between activations induced by repetition of the same motor task have also been reported by Loubinoux et al. These authors argued that reduced levels of stress, arousal and attention may contribute to the differences between repeated measurements, as the component of novelty is attenuated in a second session [6]. Reductions in activation have also been associated with motor learning [51]. However, the design of the current study did not include a motor learning component. Regarding the activation differences between sessions it must also be noted that some signal clusters are located in the vicinity of the cerebellar tentorium, a region of the brain that is susceptible to motion artifacts due to the tissue boundaries in this area [52,53]. The findings of R overlap , the index assessing the overlap of activations between sessions at t 1 and t 2 , only partially support the findings from the voxel-wise whole group brain analysis. When calculating this metric from the group data, congruent activation during passive stepping was found in the cerebellum and S2, but not in any of the other ROIs. Overlapping activation in the paracentral lobule (including the medial aspects of M1 and S1) and SMA were found when thresholding the group t-maps during passive stepping more liberally at p ≤ 0.005 1(uncorrected). Cáceres et al. emphasised that high variability in the data together with the low number of participants, as it often is the case in neuroimaging experiments, hinders the power to detect effects [13]. Therefore, with the inclusion of a higher number of participants in passive stepping, an overlap of activation in these areas between t 1 and t 2 at the group level might be demonstrated at the current threshold.
Intriguingly, group R overlap was consistently, yet not significantly, lower during passive than during active40 movements in all of the investigated ROIs. Yet, this is in line with the fact that fewer significant differences between t 1 and t 2 were found at the whole brain level during active40 than passive stepping.
The low congruence of activations during repeated passive stepping is in line with the values of ICC group in these ROIs (i.e. fair to good). This indicates that the activations elicited by passive stepping in healthy participants are not overly reliable in S1 and SMA, but more robust in the cerebellum, M1 and S2, if a measurement is repeated after several weeks. In the latter three ROIs the activation of voxels was hence on similar levels at t 1 and t 2 (good ICC group ), but did not reach significance (p ≤ 0.001, uncorrected) at the group level at either one or both measurement sessions, and hence R overlap was low in these areas. This reflects the fact that voxels can have stable signals across sessions leading to good ICC but at the same time do not necessarily fit the HRF model very well as reflected by low t-values (Fig. 5) [13].
Despite the differences and higher variability of motor behaviour between sessions i.e. higher RMSE values of kinematic metrics during active40 stepping, widespread differences on the side of the supraspinal activations were absent, except for one small cluster of 55 voxels in the left posterior middle temporal lobe. The statistical comparison of activation maps at the whole brain level implicates that the generation and control of active movement induces more robust and consistent neural activation across sessions than the monitoring of passive movements. This is supported by the indices of testretest reliability of fMRI measurements that were computed from the 2nd-level group data. Higher levels of overlap (R overlap ) between activations at t 1 and t 2 were found during active40 than during passive stepping in all of the investigated ROIs. This finding is accompanied by excellent values of ICC group in all of the ROIs. The observation that activations were generally more robust during active than during passive movements is underpinned by a previous study of robot-assisted unilateral elbow movements [11]. The values of R overlap and ICC reported by Estévez et al. were slightly, yet consistently, higher across investigated ROIs for active than for passive movements regardless of whether the values were calculated from 1st-level or 2nd-level data. In contrast, Loubinoux et al. did not find differences in the reliability of activations between active and passive movements [6].

Reliability of fMRI measurements at the single subject level
In contrast to a complete lack of overlapping activations during passive movements in three of the five ROIs (M1, Fig. 4 Individual test-retest reliability of brain activation in the investigated regions of interest. Reliability of individual activations (ICCsingle) is given for the regions of interest (ROI) cerebellum, M1, S1, S2 and SMA between t1 and t2. Some participants demonstrate consistently higher ICC than others. T-values were extracted from each ROI and reliability was assessed during passive (left) and active 40 (right) stepping. M1 = primary motor cortex, S1 = primary somatosensory cortex, S2 = secondary somatosensory cortex, SMA = supplementary motor area S1 and SMA) at the group level, R overlap was different from zero in all of the ROIs at the single subject level. However, mean values were lower than those reported by [11]. The ranges of ICC single in the present study are slightly above those of the unilateral single-joint lower limb motor control experiment using ICC as an indicator of reliability by Newton et al. [3]. The comparison of ICCs from Newton et al. with those of the present study indicates that test-retest reliability can be increased to some extent by the use of a robot to standardise the motor task between participants. Mean values of ICC single are in the realm of those reported by Estévez et al. who also applied a robotic device to control and measure movements [11].
Considerable variability in both reliability indices calculated from pairs of individual t-maps (R overlap and ICC single , Table 5) was found in this study. This finding is compatible with a study of Wei et al., who reported that between subject variance is higher than within subject variance in fMRI experiments [54], a finding that was also reported by [10]. This group found that variation in imaging data can be largely explained by differences in the signal-to-noise ratio (SNR) between individual measurements and that good ICC is achieved if the SNR of a particular measurement alone is high [10]. In the present study ICC single was on a similar level in most participants across ROIs (except for participant 8 who had consistently lower values in all ROIs, Fig. 4).
Surprisingly, the SNR of this participant was not lower than that of the other participants (not shown).

Potential implications for patient studies
Several groups have evaluated the test-retest reliability of fMRI experiments in stroke patients. Kimberley et al. found that stroke patients had somewhat higher ICC of fMRI results than healthy controls in a drawing task [8], while Eaton et al. reported approximately equal reliability between aphasic stroke patients and healthy controls in a language task [55]. However, the study of Kimberley et al. investigated repeatability using the unaffected hand. It can therefore only be speculated about testretest reliability of experiments involving the paretic side of the body. Kimberley et al. discussed the possibility that increased between-subject variability artificially heightened their measures of reliability. Variability between study subjects may be increased in stroke patients due to heterogeneity in the study sample with regards to time since stroke, extent of recovery or lesion size and location.
The finding that reliability in stroke patients is comparable to that of healthy controls is somewhat surprising, since factors such as increased head motion [56] and age of patients [57,58] may decrease the SNR and in turn the reliability. As the mean age of stroke patients is usually higher than that of the healthy participants in the present study, test-retest reliability of the presented Fig. 5 Voxel-wise maps of intra-class correlation coefficients for repeated sessions of passive and active40 stepping. Maps of intra-class correlation coefficients (ICCgroup) for repeated sessions of passive (a) and active40 (b) stepping shown on different axial slices (the z-coordinate is indicated at the top of each slice). Bilateral S2 and the paracentral lobule show high ICC in both conditions, while occipital, posterior parietal and prefrontal regions show high ICC as well in active40. Areas with high ICC (red) are hence not necessarily congruent with areas of activation above threshold. Images are displayed in neurological convention (i.e. left is left). The scale on the right indicates the ICC paradigm could hence be lower when applied to a stroke cohort. Huettel et al. suggested to ameliorate the limitation of decreased SNR in the elderly by increasing the number of trials [57]. However, increasing the number of trials, and thereby the length of the fMRI experiment, will in turn very likely increase head motion, and this may again, to some extent, cancel out the gain in SNR. There is, therefore, a large potential in the combination of MARCOS with prospective motion correction during BOLD signal acquisition. These systems capture the movement of the head during an experimental run and adjust the pulse sequence in real-time such that the field-of-view remains in alignment with the brain tissue (for a review see [59]).

Conclusions
The results of the present study in healthy participants indicate that activations during passive movements are less robust over repeated measurement sessions than those during active movements despite lower variability of motor performance during passive movements. The high variability of ICC single between individual participants during both movement conditions renders the presented approach less suitable for making inferences at the single-subject level. The fact that half of the participants had to be excluded from image analysis due to excessive task-induced head motion implies a limited feasibility for studies with patients. The group results from the remaining participants, however, revealed fair to excellent test-retest reliability. This implies feasibility of the method for studies investigating basic neurophysiological principles and to draw conclusions that can be generalised to the populations from which the study participants were selected.