Assessing real-world gait with digital technology? Validation, insights and recommendations from the Mobilise-D consortium

Micó-Amigo, M. Encarna; Bonci, Tecla; Paraschiv-Ionescu, Anisoara; Ullrich, Martin; Kirk, Cameron; Soltani, Abolfazl; Küderle, Arne; Gazit, Eran; Salis, Francesca; Alcock, Lisa; Aminian, Kamiar; Becker, Clemens; Bertuletti, Stefano; Brown, Philip; Buckley, Ellen; Cantu, Alma; Carsin, Anne-Elie; Caruso, Marco; Caulfield, Brian; Cereatti, Andrea; Chiari, Lorenzo; D’Ascanio, Ilaria; Eskofier, Bjoern; Fernstad, Sara; Froehlich, Marcel; Garcia-Aymerich, Judith; Hansen, Clint; Hausdorff, Jeffrey M.; Hiden, Hugo; Hume, Emily; Keogh, Alison; Kluge, Felix; Koch, Sarah; Maetzler, Walter; Megaritis, Dimitrios; Mueller, Arne; Niessen, Martijn; Palmerini, Luca; Schwickert, Lars; Scott, Kirsty; Sharrack, Basil; Sillén, Henrik; Singleton, David; Vereijken, Beatrix; Vogiatzis, Ioannis; Yarnall, Alison J.; Rochester, Lynn; Mazzà, Claudia; Del Din, Silvia

doi:10.1186/s12984-023-01198-5

Research
Open access
Published: 14 June 2023

Journal of NeuroEngineering and Rehabilitation volume 20, Article number: 78 (2023)

A Correction to this article was published on 03 May 2024

This article has been updated

Abstract

Background

Although digital mobility outcomes (DMOs) can be readily calculated from real-world data collected with wearable devices and ad-hoc algorithms, technical validation is still required. The aim of this paper is to comparatively assess and validate DMOs estimated using real-world gait data from six different cohorts, focusing on gait sequence detection, foot initial contact detection (ICD), cadence (CAD) and stride length (SL) estimates.

Methods

Twenty healthy older adults, 20 people with Parkinson’s disease, 20 with multiple sclerosis, 19 with proximal femoral fracture, 17 with chronic obstructive pulmonary disease and 12 with congestive heart failure were monitored for 2.5 h in the real world, using a single wearable device worn on the lower back. A reference system combining inertial modules with distance sensors and pressure insoles was used for comparison of DMOs from the single wearable device. We assessed and validated three algorithms for gait sequence detection, four for ICD, three for CAD and four for SL by concurrently comparing their performances (e.g., accuracy, specificity, sensitivity, absolute and relative errors). Additionally, the effects of walking bout (WB) speed and duration on algorithm performance were investigated.

Results

We identified two cohort-specific top performing algorithms for gait sequence detection and CAD, and a single best for ICD and SL. Best gait sequence detection algorithms showed good performances (sensitivity > 0.73, positive predictive values > 0.75, specificity > 0.95, accuracy > 0.94). ICD and CAD algorithms presented excellent results, with sensitivity > 0.79, positive predictive values > 0.89 and relative errors < 11% for ICD and < 8.5% for CAD. The best identified SL algorithm showed lower performances than other DMOs (absolute error < 0.21 m). Lower performances across all DMOs were found for the cohort with most severe gait impairments (proximal femoral fracture).

Algorithms’ performances were lower for short walking bouts; slower gait speeds (< 0.5 m/s) resulted in reduced performance of the CAD and SL algorithms.

Conclusions

Overall, the identified algorithms enabled a robust estimation of key DMOs. Our findings showed that the choice of algorithm for gait sequence detection and CAD estimation should be cohort-specific (e.g., tailored to slow walkers and to people with gait impairments). Short walking bout length and slow walking speed worsened algorithms’ performances.

Trial registration ISRCTN – 12246987.

Introduction

The adverse consequences of physical mobility loss and the importance of preserving mobility to ensure healthy ageing are undeniable [1, 2]. For this reason, a variety of behavioural, nutritional, and pharmacological interventions aim to improve mobility in general, and more specifically target the preservation of an individual’s ability to walk independently and safely both within and outside their homes [3,4,5,6]. Evaluating the effectiveness of interventions by quantifying an improved gait pattern, however, remains a challenge when relying on traditional tools such as patient-reported outcomes or supervised gait tests in clinic or lab, as these typically lack ecological validity [7].

Therefore, there is a need for the development of accurate, reliable, and sensitive tools for the quantification of gait and mobility in real-life [8, 9]. Digital health technology, including body-worn or wearable devices, offers a way forward by providing digital outcomes to remotely measure and monitor gait [10, 11], a fundamental component of mobility [12, 13]. Nonetheless, due to several persisting challenges in this field, current tools and techniques are still in their infancy. These challenges need to be addressed before digital mobility outcomes can be confidently adopted in clinical trials and as part of standard healthcare, including a variety of technical, clinical, and regulatory aspects [9, 14].

Exciting technical advances in algorithms and data processing techniques have led to the deployment of a plethora of algorithms to extract digital mobility outcomes from gait data recorded using inertial measurement units embedded within wearable devices [15,16,17]. Even so, significant ongoing challenges exist, in particular establishing the technical validity of these algorithms. A thorough validation process must account for complex factors that simultaneously arise from multiple sources influencing digital mobility outcome measures, including disease characteristics, patient specific habits, and the context in which walking is recorded (i.e. indoors, outdoors, public vs. private domain) [18,19,20]. All these factors concur to potentially limit the generalizability of validation data recorded during traditional gait protocols such as those administered within a controlled clinical or laboratory setting in which participants are asked to walk along a straight path or just a few daily life activities are simulated [21, 22]. Only recently, ad-hoc wearable devices have been developed, which finally allow moving the validation to more complex and realistic real-life scenarios [19, 23]. However, published validation studies generally only target a subset of specific digital mobility outcomes as calculated from one or a reduced number of algorithms and/or include only a few cohorts, hence providing partial information about generalizability of the results [22, 24].

The aim of this paper is to identify, compare and rank the most promising algorithms that quantitatively characterize gait with digital mobility outcomes from continuous real-life monitoring in a diverse group of patients who present with different mobility challenges. Here we focus on detection of gait sequences (i.e., identified walking bouts), individual steps, and estimation of cadence and stride length from a single wearable device positioned on the lower back, an ergonomically easy-to-use position near the centre of mass, which is well accepted by participants [25, 26]. To establish generalizability, we independently compare algorithms in six cohorts: healthy older adults, Parkinson’s disease, multiple sclerosis, proximal femoral fracture, chronic obstructive pulmonary disease and congestive heart failure. Specifically, we aim to:

(a) Identify, compare and rank the best performing (i.e., most accurate and reliable) algorithms for each cohort;
(b) Describe the performance of the identified best algorithms;
(c) Analyse the influence of walking speed and walking bout duration on the algorithm performance;
(d) Provide recommendations to implement and select algorithms for real-world gait analysis tailored to different patient cohorts.

Methods

Participants

A convenience sample of 108 participants was recruited to represent five disease cohorts (chronic obstructive pulmonary disease, Parkinson’s disease, multiple sclerosis, proximal femoral fracture, and congestive heart failure), as well as healthy older adults, encompassing a broad range of mobility levels. Participants were recruited at five sites: The Newcastle upon Tyne Hospitals NHS Foundation Trust, UK, and Sheffield Teaching Hospitals NHS Foundation Trust, UK (ethics approval granted by London – Bloomsbury Research Ethics committee, 19/LO/1507); Tel Aviv Sourasky Medical Center, Israel (ethics approval granted by the Helsinki Committee, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel, 0551-19TLV); Robert Bosch Foundation for Medical Research, Germany (ethics approval granted by the ethical committee of the medical faculty of The University of Tübingen, 647/2019BO2); and University of Kiel, Germany (ethics approval granted by the ethical committee of the medical faculty of Kiel University, D438/18). All participants gave written informed consent to take part in the study. Inclusion and exclusion criteria and details about the technical validation study experimental protocol are described in [19].

Experimental protocol

Participants were monitored for 2.5 h as they went about their usual activities in their habitual environment (home/work/community/outdoor). To ensure diversity of walking activity, participants were also asked to perform some specific tasks: outdoor walking; walking up and down a slope and stairs; and moving from one room to another. Participants wore a single McRoberts Dynaport MM+ wearable device (sampling frequency 100 Hz, triaxial acceleration range: ± 8 g/resolution: 1 mg, triaxial gyroscope range: ± 2000 degrees per second (dps)/resolution: 70 mdps), secured to the lower back with an elasticated belt and Velcro fastening. A reference system was used to establish the accuracy of algorithms; it comprised a multicomponent system of INertial modules, DIstance Sensors and Pressure insoles (INDIP) [19, 23, 27]. The INDIP system and the associated algorithms to estimate digital mobility outcomes have been validated in previous studies in healthy and pathological cohorts (e.g., hemiparetic, Parkinson’s disease, Huntington’s disease and mild cognitive impairment) and in the participants of this study [23, 28,29,30,31,32]. The INDIP and the single wearable device on the lower back were synchronized using timestamps referred to a common clock [19].

Pre-selection of algorithms for further validation and ranking

In this paper we focused on key metrics of real-world walking that form the basis from which a variety of digital mobility outcomes, including walking speed, can then be quantified. These are: gait sequence detection, foot initial contact detection, cadence and stride length estimation. For each metric, we identified published algorithms from laboratory-based or semi-structured protocols [8, 33]. This yielded 14 algorithms for gait sequence detection, 21 for initial contact detection, 23 for cadence and 18 for stride length estimation. For each digital mobility outcome, a shortlist of up to four of the most promising algorithms was selected based on initial testing in pre-existing data from older adults and pathological cohorts, including Parkinson’s disease [28, 34,35,36], multiple sclerosis [37, 38], and stroke and chorea [28, 39]. Algorithm selection was based on the ranking methodology proposed in Bonci et al. [24]. The final subset of optimized algorithms (including detailed descriptions of implementation) is summarized in Table 1 and briefly outlined below.

Table 1 Description of algorithms for each metric: gait sequence detection (GSD), initial contact event detection (ICD), cadence estimation (CAD) and stride length estimation (SL)

Gait sequence detection (GSD) This metric identifies sections of the raw signal which correspond to walking/gait. Three algorithms were selected: GSD_A [40], GSD_B [16] and GSD_C [41].

Initial contact detection (ICD) This metric detects the foot initial contact within each gait sequence. Four algorithms were selected: ICD_A [16, 41, 42], ICD_B [44], ICD_C [16, 41, 42] and ICD_D [45].

Cadence estimation (CAD) This metric identifies strides as a cyclic pattern from which cadence [number of steps within a minute (min)] is estimated in each walking bout [17]. Three algorithms were selected: CAD_A [41, 42, 44], CAD_B [16, 46] and CAD_C [17, 45]. Cadence (steps/min) was derived from identified strides as follows:

$$\text{Cadence}_{i} = \text{StrideFrequency}_{i} \times 2,$$

(1)

where $i=1,\dots , n$ are the different walking bouts and $\text{StrideFrequency}_{i}$ is evaluated as:

$$\text{StrideFrequency}_{i} = \frac{\sum_{k=1}^{{n\_STRIDE}_{i}} \left( 60 / {STRIDE d}_{k} \right)}{{n\_STRIDE}_{i}},$$

(2)

where $i=1,\dots , n$ are the different walking bouts, ${n\_STRIDE}_{i}$ is the number of strides (including right and left steps) in the $i$-th walking bout, and ${STRIDE d}_{k}$ is the duration [seconds] of the $k$-th stride in the $i$-th walking bout.
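As a minimal sketch of Eqs. 1 and 2, the following Python functions compute cadence for one walking bout from its stride durations. The function names and the list-of-durations input are illustrative assumptions, not part of the published algorithms.

```python
from statistics import mean


def stride_frequency(stride_durations_s):
    """Mean stride frequency (strides/min) of one walking bout (Eq. 2).

    stride_durations_s: stride durations in seconds (hypothetical input format).
    """
    if not stride_durations_s:
        raise ValueError("a walking bout needs at least one stride")
    # Each stride contributes 60 / duration strides per minute; average them.
    return mean(60.0 / d for d in stride_durations_s)


def cadence(stride_durations_s):
    """Cadence (steps/min) of one walking bout (Eq. 1): two steps per stride."""
    return 2.0 * stride_frequency(stride_durations_s)


if __name__ == "__main__":
    # Three strides of 1.0 s, 1.2 s and 1.0 s in a single walking bout.
    print(round(cadence([1.0, 1.2, 1.0]), 2))
```

Note that Eq. 2 averages the per-stride frequencies rather than dividing the number of strides by the total bout duration, so the two approaches give slightly different values when stride durations vary.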

Stride length estimation (SL) This metric quantifies stride length, evaluated as the distance between two non-consecutive initial contacts. Four algorithms were selected based either on biomechanical or machine-learning models: SL_A [47, 48], SL_B [47, 48], SL_C [49, 50] and SL_D [17, 51].

Data and statistical analyses for validation and ranking of algorithms

All calculations and statistical analysis were performed using Matlab® R2021a (Mathworks, Natick, MA).

Performance measures to describe and establish algorithm validity

To ensure objective comparison between systems (INDIP and wearable device), walking bouts detected by the INDIP were given as a standardized input to all algorithms except for gait sequence detection where the full wearable device recording was given as input. A walking bout was defined as a walking sequence containing at least two consecutive strides of both feet (e.g., R–L–R–L–R–L or L–R–L–R–L–R, with R/L being the right/left foot contact with the ground) [18]. Criteria for inclusion of a stride were: (a) duration of 0.2–3 s, and (b) a minimum length of 0.15 m. A resting period/break of 3 s or more identified consecutive walking bouts [18], thus each walking bout could include resting periods/breaks ≤ 3 s. Each metric was determined by the algorithms implemented on the single wearable device and by the INDIP.
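The stride inclusion criteria and the 3-s break rule can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the `Stride` record and function names are hypothetical, foot laterality (the R–L alternation) is ignored, a bout is kept if it holds at least `min_strides` valid strides, and a break of exactly 3 s is treated as part of the bout.

```python
from dataclasses import dataclass


@dataclass
class Stride:
    """Hypothetical stride record: start/end time in seconds, length in metres."""
    start_s: float
    end_s: float
    length_m: float


def is_valid_stride(s, min_dur=0.2, max_dur=3.0, min_len=0.15):
    """Inclusion criteria from the text: duration 0.2-3 s and length >= 0.15 m."""
    duration = s.end_s - s.start_s
    return min_dur <= duration <= max_dur and s.length_m >= min_len


def split_into_bouts(strides, max_break_s=3.0, min_strides=2):
    """Group valid strides into walking bouts.

    A gap longer than max_break_s between consecutive valid strides starts a
    new bout (assumption: a gap of exactly 3 s stays within the bout).
    """
    valid = sorted((s for s in strides if is_valid_stride(s)),
                   key=lambda s: s.start_s)
    bouts, current = [], []
    for s in valid:
        if current and s.start_s - current[-1].end_s > max_break_s:
            if len(current) >= min_strides:
                bouts.append(current)
            current = []
        current.append(s)
    if len(current) >= min_strides:
        bouts.append(current)
    return bouts
```

In the study, walking bouts came from the INDIP reference system; this sketch only shows how the stated thresholds interact.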

Algorithm validation was established independently for each cohort by comparing digital mobility outcomes obtained from the selected algorithms applied to the wearable device with those from the INDIP using the following set of performance measures to describe and establish validity:

$${\text{Accuracy}} = \frac{TP + TN}{{TN + TP + FN + FP}}$$

(3)

$${\text{Sensitivity}} = \frac{TP}{{TP + FN}}$$

(4)

$${\text{Specificity}} = \frac{TN}{{TN + FP}}$$

(5)

$${\text{Positive\; Predictive\; Value}} = \frac{TP}{{TP + FP}}$$

(6)

where TP = True Positive events, TN = True Negative events, FP = False Positive events, FN = False Negative events.
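Equations 3 to 6 translate directly into code. The sketch below is illustrative only (the study used Matlab); the function name and the choice to return NaN for an empty denominator are assumptions.

```python
def classification_measures(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and PPV from event counts (Eqs. 3-6)."""

    def ratio(numerator, denominator):
        # Assumption: return NaN rather than raise when a measure is undefined.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tn + tp + fn + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    for name, value in classification_measures(80, 900, 10, 20).items():
        print(f"{name}: {value:.3f}")
```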

Intra class correlation coefficient (ICC_(2,1)) [52] was calculated to assess the association between the digital mobility outcomes of the two systems using all walking bouts collected from each cohort separately. Based on ICC estimates, values less than 0.5, between 0.5 and 0.75, between 0.75 and 0.9, and greater than 0.9 were deemed to be indicative of poor, moderate, good, and excellent agreement, respectively [53].
Absolute agreement was assessed by quantifying (i) absolute error, (ii) bias, and (iii) Limits of Agreement [54] between the wearable device and reference system digital mobility outcomes calculated for each walking bout.
Relative errors between the wearable device and INDIP digital mobility outcomes were determined for each walking bout.
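The agreement measures listed above can be sketched in Python with NumPy. Three assumptions go beyond the text: the limits of agreement are taken as bias ± 1.96 standard deviations of the differences, relative error is normalised by the reference (INDIP) value, and ICC(2,1) follows the standard two-way random-effects, absolute-agreement, single-measure formula. All function names are hypothetical.

```python
import numpy as np


def bland_altman(device, reference):
    """Bias and limits of agreement (assumed to be bias +/- 1.96 SD)."""
    diff = np.asarray(device, float) - np.asarray(reference, float)
    bias = diff.mean()
    half_width = 1.96 * diff.std(ddof=1)
    return bias, (bias - half_width, bias + half_width)


def absolute_and_relative_errors(device, reference):
    """Per-bout absolute error and relative error in % of the reference value."""
    device = np.asarray(device, float)
    reference = np.asarray(reference, float)
    abs_err = np.abs(device - reference)
    return abs_err, 100.0 * abs_err / reference


def icc_2_1(device, reference):
    """ICC(2,1) from two-way ANOVA mean squares (n walking bouts, k = 2 systems)."""
    x = np.column_stack([device, reference]).astype(float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between bouts
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between systems
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    bms = ss_rows / (n - 1)
    jms = ss_cols / (k - 1)
    ems = ss_err / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
```

Because ICC(2,1) measures absolute agreement, a constant offset between the two systems lowers it even when the values are perfectly correlated.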

Mean and 95% confidence intervals of all digital mobility outcomes were evaluated at a cohort level (i.e., quantified using all walking bouts across all participants belonging to that specific cohort). Subsets of relevant measures were then used for the different digital mobility outcomes and evaluated as detailed below.

For gait sequence detection algorithms, each window of 0.1 s from the complete 2.5-h recording was classified (see Fig. 1) as either true positive, false positive, true negative or false negative, and accuracy, sensitivity, specificity and positive predictive value were calculated. These measures were evaluated for each 2.5-h assessment. In addition, absolute errors and ICC_(2,1) for the total accumulated duration of all gait sequences identified in a 2.5-h recording were assessed and compared between the two systems, for each participant.
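The window-wise classification can be sketched as follows. This is an illustration under assumptions: gait sequences are given as (start, end) pairs in seconds, interval boundaries are rounded to the nearest 0.1-s window (the exact handling of partially covered windows is shown in Fig. 1 and is not reproduced here), and the function names are hypothetical.

```python
import numpy as np


def intervals_to_mask(intervals_s, total_s, window_s=0.1):
    """Boolean mask with one entry per window; True where gait was detected."""
    n_windows = int(round(total_s / window_s))
    mask = np.zeros(n_windows, dtype=bool)
    for start, end in intervals_s:
        first = max(int(round(start / window_s)), 0)
        last = min(int(round(end / window_s)), n_windows)
        mask[first:last] = True
    return mask


def window_counts(device_intervals, reference_intervals, total_s, window_s=0.1):
    """Count TP/FP/FN/TN windows, treating the reference system as ground truth."""
    dev = intervals_to_mask(device_intervals, total_s, window_s)
    ref = intervals_to_mask(reference_intervals, total_s, window_s)
    return {
        "tp": int(np.sum(dev & ref)),
        "fp": int(np.sum(dev & ~ref)),
        "fn": int(np.sum(~dev & ref)),
        "tn": int(np.sum(~dev & ~ref)),
    }
```

Summing the device mask and multiplying by the window length gives the total accumulated gait sequence duration that is compared between systems.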

In the case of initial contact detection, we classified each initial contact event within a walking bout as a true positive, false positive or false negative by comparing the initial contact events detected by the wearable device to the events detected by the INDIP within a tolerance window of 0.5 s (centred around the event identified by the INDIP, see Fig. 2), representative of a step duration [55]. This approach has been previously used and was adopted to take into account the potential mismatch in event time between the INDIP and the wearable device [56]. To assess initial contact detection, true negative events were not evaluated, since true negatives would correspond to all non-initial contact events identified as such by both systems.

For initial contact detection, we utilised the following measures: sensitivity, positive predictive values, absolute errors (which were estimated for each true positive initial contact (see Fig. 2)) and relative error (estimated by dividing all absolute errors, within a walking bout, by the average step duration estimated by the INDIP [55]).
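A minimal sketch of the event matching follows. The 0.5-s window centred on each reference event (that is, ± 0.25 s) comes from the text; the greedy nearest-neighbour, one-to-one matching and the function names are assumptions, since the matching procedure is not detailed here.

```python
def match_initial_contacts(device_ics, reference_ics, tolerance_s=0.5):
    """Match device initial contacts to reference events within the window.

    Returns (tp, fp, fn, matched_abs_errors). Assumption: each device event
    can be matched to at most one reference event, nearest event first.
    """
    half_window = tolerance_s / 2.0
    unmatched = sorted(device_ics)
    matched_abs_errors = []
    fn = 0
    for ref in sorted(reference_ics):
        candidates = [t for t in unmatched if abs(t - ref) <= half_window]
        if candidates:
            best = min(candidates, key=lambda t: abs(t - ref))
            unmatched.remove(best)
            matched_abs_errors.append(abs(best - ref))
        else:
            fn += 1
    return len(matched_abs_errors), len(unmatched), fn, matched_abs_errors


def relative_error_percent(matched_abs_errors, reference_ics):
    """Mean absolute error as a percentage of the mean reference step duration."""
    ref = sorted(reference_ics)
    step_durations = [b - a for a, b in zip(ref, ref[1:])]
    mean_step = sum(step_durations) / len(step_durations)
    mean_error = sum(matched_abs_errors) / len(matched_abs_errors)
    return 100.0 * mean_error / mean_step
```

Sensitivity and positive predictive value then follow from the returned counts as TP / (TP + FN) and TP / (TP + FP).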

For cadence and stride length algorithms, the measures used were: relative errors, absolute errors and ICC_(2,1).

Ranking algorithms using performance measures

A simplified version of the ranking methodology described in Bonci et al. [24] was applied to compare algorithm performance using a decision matrix. This was based on the weighted combination of performance measures described above assessing agreement between the single wearable device and the INDIP system (classified as benefit or cost). Performance measures considered as benefits were: accuracy, sensitivity, specificity, positive predictive value and ICC_(2,1) [52]. Performance measures considered as costs were absolute and relative errors. Each measure was weighted based on its relative importance to the algorithm’s validity assessment (see Bonci et al. [24] and Additional file 1 for further detail regarding the specific performance measures and assigned weights for gait sequence detection, initial contact detection, cadence and stride length algorithms). This information was combined to determine a performance index (0 = worst, 1 = best), calculated as a weighted mean of the selected benefit and/or cost analysis, which was subsequently used to compare and rank the algorithm performances, and thus, to select the top performing algorithms for each cohort independently.
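The weighted benefit/cost combination can be illustrated with a short sketch. It is not the methodology of Bonci et al. [24]: it assumes every measure has already been normalised to the range 0 to 1, converts a cost to a benefit as 1 minus its value, and uses made-up measure names and weights.

```python
def performance_index(measures, weights, costs=()):
    """Weighted mean of benefit scores in [0, 1] (0 = worst, 1 = best).

    measures: name -> value normalised to [0, 1] (assumption).
    weights:  name -> relative importance (hypothetical values).
    costs:    names of measures where lower is better (e.g. errors).
    """
    total_weight = sum(weights[name] for name in measures)
    score = 0.0
    for name, value in measures.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be normalised to [0, 1]")
        benefit = 1.0 - value if name in costs else value
        score += weights[name] * benefit
    return score / total_weight


def rank_algorithms(measures_by_algorithm, weights, costs=()):
    """Return (algorithm, index) pairs sorted from best to worst."""
    scored = {name: performance_index(m, weights, costs)
              for name, m in measures_by_algorithm.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

Applying `rank_algorithms` separately to the measures of each cohort mirrors the cohort-by-cohort selection described above.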

Influence of walking speed and walking duration on the algorithms’ performance

The performance of the top-ranked initial contact detection, cadence and stride length algorithms was then assessed by considering the impact that walking bout speed (calculated as the average stride speed by the INDIP system) and walking bout duration had on the relative error of each digital mobility outcome (i.e., step duration, cadence and stride length). Specifically, median relative errors for each digital mobility outcome were quantified over all walking bouts falling within specific walking speed and walking bout duration ranges: consecutive walking speed windows of 0.05 m/s [57] and consecutive walking bout duration windows of 2 s. For each digital mobility outcome, the resulting median errors were then used in a best-fit approach to determine the association between the relative errors and walking speed or walking bout duration, respectively. In the best-fit approach, median error values were weighted according to the number of observations in a given window relative to the total number of observations.
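The windowing and weighted fit can be sketched with NumPy. The 0.05 m/s and 2 s window widths come from the text; the functional form of the fit is not stated here, so a weighted polynomial fit is used purely as an assumed stand-in, and all function names are hypothetical.

```python
import numpy as np


def binned_medians(x, rel_err, width):
    """Median relative error in consecutive windows of the given width.

    x: walking speed (m/s) or bout duration (s) per walking bout.
    Returns window centres, median errors and the number of bouts per window.
    """
    x = np.asarray(x, float)
    rel_err = np.asarray(rel_err, float)
    bins = np.floor(x / width).astype(int)
    centres, medians, counts = [], [], []
    for b in np.unique(bins):
        selected = bins == b
        centres.append((b + 0.5) * width)
        medians.append(np.median(rel_err[selected]))
        counts.append(selected.sum())
    return np.array(centres), np.array(medians), np.array(counts)


def weighted_fit(centres, medians, counts, degree=1):
    """Weighted least-squares polynomial fit of median error vs. window centre.

    np.polyfit applies weights to the unsquared residuals, so the square root
    of the observation fraction makes the squared-error weight proportional
    to the number of bouts in each window.
    """
    w = np.sqrt(counts / counts.sum())
    return np.polyfit(centres, medians, degree, w=w)
```

For example, `binned_medians(speeds, errors, 0.05)` would be used for walking speed and `binned_medians(durations, errors, 2.0)` for bout duration.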

Results

Participant clinical and demographic characteristics per cohort are presented in Table 2.

Table 2 Demographic and clinical characteristics of the participants

The cohorts covered a wide range of mobility levels: the walking speed measured by the INDIP system during the 2.5-h assessment ranged from an average of 0.54 m/s (proximal femoral fracture) to 0.72 m/s (congestive heart failure), with a minimum measured walking speed of 0.10 m/s (in Parkinson’s disease) and a maximum of 1.63 m/s (in healthy older adults) (Table 2).

Nine participants (8%: three with congestive heart failure, two with multiple sclerosis, one with Parkinson’s disease and three with proximal femoral fracture) were excluded from subsequent analysis due to data unavailability.

Gait sequence detection

Performance measures and ranking

Table 3 reports the main performance measures of the gait sequence detection algorithms (all performance measures considered for the evaluation of the performance index are shown in Additional file 1: Table).

Table 3 Gait sequence detection (GSD) performance measures; gait sequence total duration obtained from the INDIP and the single wearable device, absolute error, bias and limits of agreement (LoA) and intra class correlation (ICC_(2,1)) for comparison between systems, and overall performance index for the GSD algorithms. Values are expressed as mean and 95% confidence intervals (CI) for each cohort. Recommended algorithms are shown in italic and boldface. An underlined performance index indicates the top-ranked algorithm for the cohort of that row

Across all cohorts, performance measures for the three gait sequence detection algorithms were good to excellent: sensitivity ranged between 0.60 and 0.92, specificity between 0.95 and 0.99, accuracy between 0.94 and 0.97 and positive predictive value between 0.74 and 0.91 [41] (Table 3, Additional file 1: Table). The lowest sensitivity was observed for the most impaired cohort (proximal femoral fracture) for all algorithms.

The absolute error between the wearable device and the INDIP for the total accumulated duration of the detected gait sequences ranged from 71.9 to 358.5 s across the three algorithms, corresponding to approximately 7 to 32% of the total duration estimated by the INDIP. Overall, except for the proximal femoral fracture cohort, GSD_A and GSD_B overestimated the total gait sequence duration, whereas GSD_C underestimated it. The ICC_(2,1) ranged from 0.68 to 1.00, with the lowest ICC_(2,1) found for the multiple sclerosis cohort, which also showed the largest disagreement (i.e., the widest limits of agreement [54]) among all cohorts and the three algorithms.

Algorithm GSD_A presented the overall best performance index for healthy older adults (0.819), congestive heart failure (0.853), chronic obstructive pulmonary disease (0.822), multiple sclerosis (0.735) and Parkinson’s disease (0.852) cohorts (see Additional file 1). Algorithm GSD_B presented the highest performance indexes for the proximal femoral fracture cohort (0.771) and similar good performances for multiple sclerosis (0.655) and Parkinson’s disease (0.726).

Initial contact detection

Performance measures and ranking

Table 4 presents performance measures of initial contact detection algorithms, which were very similar for the four algorithms. Across algorithms and cohorts, sensitivity ranged from 0.76 to 0.83 and positive predictive values from 0.81 to 0.93, whilst relative errors ranged from 7.6 to 21.2%.

Table 4 Initial contact detection (ICD) performance measures. Sensitivity, positive predictive value, absolute and relative errors, and overall performance index for the ICD algorithms. Values are expressed as mean and 95% confidence intervals (CI) for each cohort. Recommended algorithms are shown in italic. An underlined performance index indicates the top-ranked algorithm for the cohort of that row

Algorithm ICD_A presented the highest overall performance index across all cohorts: healthy older adults (0.804), congestive heart failure (0.771), chronic obstructive pulmonary disease (0.790), multiple sclerosis (0.805), Parkinson’s disease (0.798) and proximal femoral fracture (0.818) reflecting the lowest absolute and relative errors, highest sensitivity, and positive predictive values.

Effect of walking speed and bout duration

Relative errors for step duration, as extracted from the initial contacts, decreased with walking speed (R² = 0.86), with errors lower than 10% reached for walking speeds > 0.25 m/s (Fig. 3a) [58]. Median errors were lower than 10% for all walking bout durations, but an overall decrease in error was observed as walking bout duration increased (R² = 0.70, Fig. 4a). Overall, higher errors (> 50%) were observed in only 0.9% of the detected walking bouts; these bouts were characterised by a short duration (8.37 ± 4.71 s) and slow walking speed (0.44 ± 0.24 m/s).

Cadence estimation

Performance measures and ranking

Performance measures of the cadence algorithms are presented in Table 5, reflecting a slight (4.6–7.2 steps/min) overestimation of cadence by the wearable device with respect to the INDIP for all the cohorts with algorithms CAD_B and CAD_C (except for proximal femoral fracture with CAD_C, in which case there is a misestimation). Across the three algorithms, the absolute error ranged from 5.2 to 9.3 steps/min, the relative error from 6.6 to 11.8% and the ICC_(2,1) from 0.44 to 0.82.

Table 5 Cadence (CAD) estimation performance measures. Cadence obtained from the INDIP and the single wearable device, bias, limits of agreement (LoA) and intra class correlation (ICC_(2,1)) for comparison between systems, and overall performance index for the CAD algorithms. Recommended algorithms are shown in italic and boldface. An underlined performance index indicates the top-ranked algorithm for the cohort of that row

The highest absolute and relative errors, and the lowest ICC_(2,1), were found for the proximal femoral fracture cohort. CAD_C had the highest performance index for healthy older adults (0.653), congestive heart failure (0.720), chronic obstructive pulmonary disease (0.693), multiple sclerosis (0.644) and Parkinson’s disease (0.653). CAD_B presented the best performance for proximal femoral fracture (0.584), showing the lowest absolute error (7.2 steps/min), narrowest limits of agreement (− 10.1 to 24.2 steps/min), lowest relative error (8.5%) and highest ICC_(2,1) (0.66). Overall good performances were also found for CAD_B for multiple sclerosis and Parkinson’s disease.

Effect of walking speed and bout duration

For both CAD_B and CAD_C, as walking speed increased, the relative error decreased (Fig. 3b), with speeds above 0.3 m/s resulting in an error below a 10% threshold [58]. Generally, the highest errors were observed for the shortest and slowest bouts (Fig. 4b). The walking bouts with higher errors [> 50%, n = 25 (0.8%)] had a mean duration of 8.88 s (std: 5.97 s) and slow walking speed values (0.28 ± 0.09 m/s).

Stride length estimation

Performance measures and ranking

Table 6 shows an overall overestimation of stride length by the wearable device with respect to the INDIP. The absolute error between the wearable device and the INDIP outcomes ranged from 0.15 to 0.33 m across all algorithms.

Table 6 Stride length (SL) estimation performance measures. Stride length obtained from the INDIP and the single wearable device, bias, limits of agreement (LoA) and intra class correlation (ICC_(2,1)) for comparison between systems, and overall performance index for the SL algorithms. Recommended algorithms are shown in boldface. An underlined performance index indicates the top-ranked algorithm for the cohort of that row

The mean relative errors ranged from 25.3 to 34.1% for SL_A, and similarly from 27.4 to 35.8% for SL_B. These were larger for SL_C (ranging from 29.0 to 34.5%) and for SL_D (40.4 to 47.7%). The ICC_(2,1) for SL_A were the largest, ranging from 0.28 to 0.70, followed by SL_B with a range from 0.20 to 0.66. The ICC_(2,1) for SL_C were below 0.5, and below 0.15 for SL_D.

Overall, SL_A presented the highest performance indexes for all cohorts excluding multiple sclerosis, with the following values: healthy older adults (0.582), congestive heart failure (0.663), chronic obstructive pulmonary disease (0.381), Parkinson’s disease (0.607), and proximal femoral fracture (0.465). In the multiple sclerosis cohort, SL_B had the highest performance index (0.487).

Effect of walking speed and bout duration

Critical errors in the stride length estimate were observed for the slowest bouts, with values decreasing below 20% only for walking speeds > 0.5 m/s and below 10% only for walking speeds > 0.6 m/s (Fig. 3c). The highest errors were again associated with the shortest and slowest bouts (Fig. 4c); specifically, the shortest bouts (≤ 10 s) had a mean error of 32.6%, while the longest ones (> 60 s) had a mean error of 9.3%. Overall, errors higher than 50% were observed in about 17% of the total number of walking bouts. These bouts were short (13.03 ± 10.53 s), with slow walking speed (0.36 ± 0.13 m/s) and short stride length values (0.45 ± 0.17 m).

Discussion

This is the first study presenting a comprehensive comparative assessment of a broad range of algorithms applied to a single wearable device, for estimating key digital mobility outcomes pertaining to gait (i.e., gait sequences, individual steps, cadence and stride length) in heterogeneous diseases and using data from the real world. In this work, we have described algorithms’ performances, selected the best algorithm for each digital mobility outcome and cohort, analysed the influence of walking speed and walking bout duration on their performance, and provided recommendations for their selection and implementation for real-world gait analysis.