Participants
Balance data were collected from ten participants with self-reported balance concerns. Prospective participants were included if they could stand for 10 minutes without an assistive device and if they did not self-report a recent (within the past six months) fall resulting in serious injury or hospitalization, or a lower extremity injury (e.g., a fracture or sprain) that reduced strength or sensation in their legs. The average age of the participants was 65.7 (± 14.7) years; four were female and six were male. Each participant experienced some level of balance concern, either due to an existing vestibular diagnosis (four participants) or due to other circumstances (such as older age). In addition to the individuals with balance concerns, eight physical therapists (subsequently referred to as PTs) with specialization in treating individuals with balance disorders were recruited to assess participant balance. The study protocol was reviewed and approved by the University of Michigan Institutional Review Board. Informed consent was obtained from all participants and recruited PTs.
Study design
Each participant performed three 30 s repetitions of 15 standing balance exercises, drawn from a set of 54 exercises, while using an overhead harness for safety. The postural exercises were chosen from a standard set used for balance rehabilitation [12]. These exercises varied in difficulty and challenged balance in different (and possibly multiple) ways: they varied in terms of the surface on which they were performed (firm or foam), leg stance (feet apart, feet together, partial heel-to-toe, heel-to-toe, and single-leg), visual condition (eyes open/closed), and head movements (none, pitch head movements, and yaw head movements). Participant balance ability was assessed using a standard set of balance exercises prior to the start of data collection. Based on these baseline abilities, a PT with expertise in balance rehabilitation used a balance exercise progression framework [12] to select specific exercises for each participant. The cumulative distribution of exercises performed across all participants was considered on a rolling basis when selecting exercises for a given participant, ensuring that a broad range of exercises was performed across all participants. In total, we collected 450 exercise repetitions. Based on video recordings of the exercises, the recruited PTs rated each exercise repetition on a scale from one to five [16]. A rating of one represented an exercise performed independently with limited or reduced sway, while a rating of five represented an exercise for which the participant was unable to maintain position even with assistance. Of the eight PTs recruited for the study, between one and five rated each exercise, with an average of 4.28 (± 0.66) raters per exercise. We summarized the scores for each repetition by taking the mode of all PT ratings (see the sketch below). In addition, participants were asked to rate their own performance using a similar scale [16]. Both scales were purposefully designed to have five points and were adapted from previously published scales [19, 20]. In our analyses, we excluded a small number of exercise repetitions due to missing labels (n = 3) or premature termination of the recording (n = 4).
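For concreteness, a minimal sketch of the mode-based summarization follows. The tie-breaking behavior (toward the smaller rating) follows SciPy's convention and is our assumption; the text does not specify how ties between ratings were resolved.

```python
import numpy as np
from scipy import stats

def summarize_ratings(pt_ratings):
    """Summarize the PT ratings (1-5) for one exercise repetition by
    taking the mode. SciPy breaks ties toward the smaller value; the
    paper does not state how ties were actually resolved."""
    return int(stats.mode(np.asarray(pt_ratings), keepdims=False).mode)

# e.g., a repetition rated 2, 2, and 3 by three PTs is summarized as 2
assert summarize_ratings([2, 2, 3]) == 2
```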
Data collection
During the exercises, participants wore a single six-degree-of-freedom inertial measurement unit (IMU; MTx, XSens Inc, Enschede, Netherlands) on an elastic belt, positioned approximately over the L3 vertebral level, dorsal to the spine, to measure trunk sway relative to gravity in both the pitch and roll directions [21, 22]. For each balance exercise, only angular velocity data were considered, as linear acceleration data were not stored. Angular velocity data were sampled at 100 Hz. Although a subset of the exercises involved head movements about the pitch and yaw axes to further challenge participants' balance under certain stance conditions, only trunk movements about the pitch and roll axes were analyzed from the single trunk-based IMU. These data best capture postural stability in the anterior–posterior and medial–lateral directions and are conventionally reported in kinematic studies using IMUs [16]. We did not apply any preprocessing to the IMU data before using them as input to the models. We also collected ‘step-out’ information, where a step-out indicated any loss of balance resulting in hand contact with the spotter or a nearby chair for support, or the need to take a step to regain balance. Following a step-out, the participant was encouraged to continue the balance exercise until the full 30 s duration of the repetition had elapsed.
Machine learning techniques
Given these data, we aimed to learn a mapping \(f\) from a particular representation of the IMU data \(x \in {\mathcal {X}}\) to a summarized PT label \(y \in {\mathcal {Y}} = \{1,2,3,4,5\}\) (based on the mode). To learn potentially complex non-linear relationships between the IMU data and the summarized PT label, we used the IMU data as input to different machine learning models. Each model was trained to estimate the PT label on a set of training data before being applied to a held-out set of test data to assess generalization. Because there are many ways to represent the IMU data for a particular exercise repetition, we considered three different representations as inputs to a machine learning model (Fig. 1). Based on its input, each model produced an estimate of balance performance in \(\{1,2,3,4,5\}\), with the goal of matching the PT labels. First, we considered a multivariate time-series representation of the data with two channels (for pitch and roll, respectively). The time-series representation encoded the temporal dependencies present in the data. We used the time-series data as input to a 1-dimensional convolutional neural network (CNN), which has previously been shown to be effective for learning from time-series data [23,24,25]. Second, we considered an image representation of the IMU data. To create the image representation, we plotted pitch on the x-axis and roll on the y-axis, and transformed this plot into a 2-D \(60 \times 60\) image that could be used as input to a model. This image served as the input to a 2-dimensional CNN. Between these two representations, we hypothesized that the image representation would outperform the time-series representation, as the ability to assess balance likely relies more on the raw spatial pitch and roll information than on the temporal relationship between pitch and roll. Third, we represented each exercise by extracting 11 features previously shown to be useful in assessing balance [16, 26,27,28]. In particular, we calculated the kinematic features used by Bao et al. for each repetition: the root-mean-square of trunk sway in all directions, the path length of the trunk sway trajectory, and the elliptical fit area of the trunk sway [16]. We used this feature vector as input to a random forest model.
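To illustrate the latter two representations, the sketch below rasterizes a pitch-roll trajectory into a \(60 \times 60\) image and computes a subset of the Bao et al.-style kinematic features. The exact rasterization procedure and feature formulas are not given in the text, so the binary occupancy histogram, the 95% confidence-ellipse constant, and the function names are our assumptions; the full 11-feature set is defined in [16].

```python
import numpy as np

def to_image(pitch, roll, size=60, angle_lim=None):
    """Rasterize a pitch-roll trajectory into a size x size occupancy
    image. The paper's exact rasterization is unspecified; a binary 2-D
    histogram over a symmetric angular range is one plausible choice."""
    lim = angle_lim or max(np.abs(pitch).max(), np.abs(roll).max())
    counts, _, _ = np.histogram2d(pitch, roll, bins=size,
                                  range=[[-lim, lim], [-lim, lim]])
    return (counts > 0).astype(np.float32)

def sway_features(pitch, roll):
    """Illustrative subset of the kinematic features [16]: per-axis RMS
    sway, path length of the sway trajectory, and a 95% elliptical fit
    area (5.991 is the chi-square quantile with 2 degrees of freedom)."""
    rms_pitch = np.sqrt(np.mean(pitch ** 2))
    rms_roll = np.sqrt(np.mean(roll ** 2))
    path_length = np.sum(np.hypot(np.diff(pitch), np.diff(roll)))
    ellipse_area = np.pi * 5.991 * np.sqrt(np.linalg.det(np.cov(pitch, roll)))
    return np.array([rms_pitch, rms_roll, path_length, ellipse_area])
```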
Experimental set-up
To train and evaluate our models, we split the data into training, validation, and test sets at the participant level. Specifically, we used data from participants 1–6 as the training set, data from participants 7 and 8 as the validation set, and data from participants 9 and 10 as the test set. Given the computational costs associated with training ML models, we considered only a single held-out test set, as in past work [29, 30]; however, the distribution of summarized PT labels was consistent across the training, validation, and test sets.
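A minimal sketch of the participant-level split, assuming a hypothetical record format in which each repetition carries a participant_id key; splitting by participant ensures no individual contributes data to more than one set.

```python
def split_by_participant(repetitions):
    """Partition repetitions (dicts with a hypothetical 'participant_id'
    key) so that no participant appears in more than one set."""
    ids = {"train": {1, 2, 3, 4, 5, 6}, "val": {7, 8}, "test": {9, 10}}
    return {name: [r for r in repetitions if r["participant_id"] in members]
            for name, members in ids.items()}
```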
Models were optimized by minimizing the cross-entropy loss. We used the Adam optimizer with a learning rate of \(1 \times 10^{-4}\), a batch size of 32, and weight decay tuned for each model [31, 32]. Each model was trained with a fixed budget of 2000 epochs, and hyperparameters (such as weight decay) were chosen based on performance on the validation set. When training the 2-dimensional CNN on the image representation, we augmented the dataset by randomly rotating each image in the training set three separate times, with each rotation angle randomly drawn from the set \(\{30^\circ , 60^\circ , 120^\circ , 150^\circ , 210^\circ , 240^\circ , 300^\circ , 330^\circ \}\) (see the sketch below). We selected the hyperparameters of the random forest, specifically the number of trees (1000), using grid search with a leave-one-participant-out cross-validation scheme to maximize performance on the held-out participants.
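A sketch of the rotation-based augmentation using SciPy's ndimage.rotate; whether angles were sampled with or without replacement, and the interpolation order, are not specified in the text, so both are our choices here.

```python
import random
from scipy.ndimage import rotate

# Fixed set of candidate rotation angles (degrees) from the text above
ANGLES = [30, 60, 120, 150, 210, 240, 300, 330]

def augment_image(image, n_copies=3):
    """Return n_copies rotated variants of one training image. Sampling
    with replacement and bilinear interpolation (order=1) are our
    assumptions; reshape=False keeps the 60 x 60 image dimensions."""
    return [rotate(image, random.choice(ANGLES), reshape=False, order=1)
            for _ in range(n_copies)]
```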
For the 1-dimensional CNN, we used one convolutional layer with eight filters and a kernel size of three, followed by max pooling, ReLU activation, and batch normalization. We experimented with additional convolutional layers but saw no improvement in performance. We followed this convolutional block with two fully-connected layers, with batch normalization and ReLU activation in between [33, 34]. To reduce over-fitting, we applied dropout with probability 0.5 after the first fully-connected layer [35]. For the 2-dimensional CNN, we used eight filters and a kernel size of \(3 \times 3\), followed by an architecture otherwise similar to the 1-dimensional CNN. Given that the two input channels likely encode different information, we used depth-wise convolutions in both CNN architectures, resulting in separate filters for each input channel [36].
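A PyTorch sketch consistent with this description of the 1-dimensional CNN; the hidden width (64), the input length (3000 samples, i.e., 30 s at 100 Hz), and the exact placement of dropout relative to batch normalization are our assumptions. The 2-dimensional CNN follows the same pattern with Conv2d and MaxPool2d.

```python
import torch.nn as nn

class Balance1DCNN(nn.Module):
    """Sketch of the 1-D CNN described above; layer ordering and the
    hidden width are assumptions where the text is silent."""
    def __init__(self, n_samples=3000, hidden=64, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            # depth-wise: groups=2 gives separate filters per input channel
            nn.Conv1d(2, 8, kernel_size=3, groups=2),
            nn.MaxPool1d(2),
            nn.ReLU(),
            nn.BatchNorm1d(8),
        )
        conv_out = 8 * ((n_samples - 2) // 2)  # length after conv + pooling
        self.classifier = nn.Sequential(
            nn.Linear(conv_out, hidden),
            nn.Dropout(0.5),                   # after first FC layer
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),      # logits for ratings 1-5
        )

    def forward(self, x):                      # x: (batch, 2, n_samples)
        return self.classifier(self.features(x).flatten(1))
```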
We evaluated models in terms of classification accuracy and the area under the receiver operating characteristic curve (AUROC). Accuracy indicated how often a model's estimate aligned with how a PT might have assessed performance during an exercise. The AUROC is a common evaluation metric that measures a model's ability to correctly rank examples. Given our multiclass setting, we considered the macro-averaged AUROC, which averages the one-vs-rest AUROC computed for each class independently (see the sketch below). We used validation AUROC as the performance metric during hyperparameter selection. For additional context, we compared the discriminative performance of the ML methods to two baselines: (1) naively predicting the mode label (i.e., ‘2’) for all ratings in the test set (which we term the ‘majority classifier’) and (2) using a participant’s self-assessment ratings as predictions for each exercise. Comparison to the simple majority classifier tested whether the ML models were learning something beyond the mode, while comparison to the self-assessment baseline demonstrated the ability of the ML models to improve upon an individual’s self-assessment when estimating ground-truth PT assessments of balance. All experiments were repeated 30 times with different random initializations of the model to evaluate stability. Throughout the rest of the paper, we report the average accuracy and AUROC over the 30 runs on the test set, as well as the standard deviation (SD). We implemented all neural network-based approaches in PyTorch and trained each model on a GeForce GTX 1080 Ti GPU.
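A sketch of both metrics using scikit-learn; the function and variable names are ours, and we assume model outputs are softmax probabilities (rows summing to one, as roc_auc_score requires in the multiclass setting).

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(y_true, y_prob, labels=(1, 2, 3, 4, 5)):
    """Accuracy and macro-averaged one-vs-rest AUROC for the 5-point
    scale. y_prob holds per-class probabilities with columns ordered to
    match `labels`; rows must sum to 1 (e.g., softmax outputs)."""
    y_pred = np.asarray(labels)[np.argmax(y_prob, axis=1)]
    acc = accuracy_score(y_true, y_pred)
    auroc = roc_auc_score(y_true, y_prob, multi_class="ovr",
                          average="macro", labels=list(labels))
    return acc, auroc
```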
Statistical analysis
We trained each model 30 times with different random initializations of the model parameters. To compare models, we used paired t-tests for differences in mean performance across the 30 runs. A paired t-test was appropriate because performance was calculated for each model on the same examples in the test set, making the measurements related. Significant differences were defined at a significance level of \(\alpha = 0.05\).
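A minimal sketch of this comparison with SciPy; the accuracy values below are synthetic and purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic per-run test accuracies for two models (30 runs each); runs
# are paired because both models are evaluated on the same test examples.
acc_model_a = rng.normal(0.80, 0.02, size=30)
acc_model_b = rng.normal(0.77, 0.02, size=30)

t_stat, p_value = stats.ttest_rel(acc_model_a, acc_model_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, significant: {p_value < 0.05}")
```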