Performance of machine learning models in estimation of ground reaction forces during balance exergaming

Background Balance training exercise games (exergames) are a promising tool for reducing fall risk in the elderly. Exergames can be used for in-home guided exercise, which greatly increases availability and facilitates independence. Providing biofeedback on weight-shifting during in-home balance exercise improves exercise efficiency, but suitable equipment for measuring weight-shifting is lacking. Exergames often use kinematic data as input for game control. Being able to use such data to estimate weight-shifting would be a great advantage. Machine learning (ML) models have been shown to perform well in weight-shifting estimation in other settings. Therefore, the aim of this study was to investigate the performance of ML models in estimation of weight-shifting during exergaming using kinematic data.
Methods Twelve healthy older adults (mean age 72 ± 4.2 years, 10 female) played a custom exergame that required repeated weight-shifts. Full-body 3D motion capture (3DMoCap) data and standard 2D digital video (2D-DV) were recorded. Weight shifting was directly measured by 3D ground reaction forces (GRF) from force plates, and estimated using a linear regression model (LinReg), a long short-term memory (LSTM) model and a decision tree model (XGBoost). Performance was evaluated using coefficient of determination ($R^2$) and root mean square error (RMSE).
Results Estimation of GRF components using 3DMoCap data gave a mean (± 1SD) RMSE, in % of total body weight (BW), for the vertical GRF component ($F_z$) of 4.3 (2.5), 11.1 (4.5), and 11.0 (4.7) for LSTM, XGBoost and LinReg, respectively. Using 2D-DV data, LSTM and XGBoost achieved a mean (± 1SD) RMSE in $F_z$ estimation of 10.7 (9.0) %BW and 19.8 (6.4) %BW, respectively. For the LSTM, $R^2$ in the $F_z$ component was > .97 using 3DMoCap data and > .77 using 2D-DV data. For XGBoost, $F_z$ $R^2$ was > .86 using 3DMoCap data and > .56 using 2D-DV data.
Conclusion This study demonstrates that an LSTM model can estimate 3-dimensional GRF components using 2D kinematic data extracted from standard 2D digital video cameras. The $F_z$ component is estimated more accurately than the $F_y$ and $F_x$ components, especially when using 2D-DV data. Weight-shifting performance during exergaming can thus be extracted using kinematic data only, which can enable effective independent in-home balance exergaming.

deteriorates gradually, increasing the risk of falls and decreasing community mobility and quality of life. These are major factors in the increased risk of disability and mortality in the elderly [2]. Targeted balance exercise improves postural control, and exercises typically included in balance training programs are, for example, leaning, reaching, and weight shifting [3]. These types of exercises have been shown to reduce fall risk [4,5] by improving dynamic stability during gait [6] as well as anticipatory and reactive balance ability [3]. Research has shown that technological tools that provide visual biofeedback and guidance can improve the potential effect of such exercises [7,8]. By using exercise games (so-called exergames), biofeedback can be provided in a motivational and fun manner [9,10]. In weight-shifting exercises, biofeedback is typically provided by force-sensing equipment placed under the person's feet or inside the shoes. One of the most accurate types of force measurement equipment is the piezoelectric force plate [11]. Force plates return three-dimensional ground reaction force (GRF) vectors, which are precise representations of the magnitude and direction of the force exerted on the plates by the person's feet.
Even though force plates are effective for providing biofeedback in balance exercise, they are rarely used outside laboratory settings as they are very costly and resource-demanding to use. More user-friendly substitutes, such as the Wii Balance Board (Nintendo Co Ltd, Japan), have been developed and are used in exergames for balance training. These, however, have drawbacks in settings other than casual gaming, as they are less accurate and register only limited information [12,13]. They have also been reported to cause uncomfortable and unsafe experiences, and to increase fear of falling (see, e.g., [14]). This makes the Wii Balance Board less suited for in-home use by elderly persons. More recent exergames for balance training have started using kinematic data from depth-sensing cameras such as the Kinect (Microsoft Inc). However, using kinematic data as a proxy for kinetic information is problematic due to insufficient accuracy in the kinematic data provided [15]. Accurate and useful information about exercise performance is vital if independent exercise in older adults is to be effective. At the same time, the equipment necessary to provide this information has to be easy to use and resource-friendly, without sacrificing accuracy.
We know from previous research that GRF can be successfully estimated in other movements using machine learning (ML) methods. In [16], GRF was estimated during gait using a long short-term memory (LSTM) model, achieving estimates of GRF components within 12% RMSE. In [17,18], feed-forward artificial neural networks (ANN) gave an RMSE of < 10 % in all three GRF components during gait and asymmetric movements. Additional studies successfully estimated GRF during running [19] and activities of daily living [20]. These studies base their estimation on a biomechanical model computed from a 3DMoCap system, which requires measuring several points on the body over time using, e.g., inertial measurement sensors [21]. Furthermore, this approach also requires physical measurements of the body of the person playing to scale the biomechanical model. This, combined with an additional computational layer for the calculation of the biomechanical model and the required practical procedures (e.g., full-body device placement), makes it an implausible method for use in in-home settings for elderly users, or outside of a laboratory in general [22].
Nonetheless, the use of LSTM does seem promising. LSTM is a form of neural network where sequential data is processed recurrently and important features are "remembered" for future predictions or estimations [23]. LSTMs are also relatively quick in estimation, allowing for real-time estimates, which is a requirement when giving feedback during exergaming. Another widely used approach, valued for its powerful way of representing the relationships in the data, is decision tree-based methods. Recently, a version of gradient-boosted decision trees, called extreme gradient boosting (XGBoost, [24]), has been shown to outperform other regression methods [25], including in estimation of forces in a biomechanical setting [26]. In addition, decision trees are inherently transparent in their decision-making process, which is a highly valuable feature. This can provide information about which joints are important in estimating GRF, which might inform decisions on relevant motion tracking tools in this context. Furthermore, it was recently shown that standard digital 2D video can be used to extract 2D kinematic data of joint positions (e.g. [27-29]). This makes it possible to use devices such as smartphones, tablets, or web cameras to capture movements. We propose utilizing positional data of joint centers from pose estimation systems in combination with machine learning methods to estimate 3D GRF components during balance exergaming. This would remove the need for any physical measurements or biomechanical model of the person playing, and achieving this using only a standard digital video camera would make the system very easy to use and suitable for in-home guided exercise. Therefore, the aim of this paper is two-fold: (1) to investigate the performance of an LSTM model and an XGBoost model for estimation of ground reaction forces during balance exergaming, and (2) to compare performance between using 3D and 2D kinematic data.

Participants and protocol
Twelve healthy older adults were recruited from local exercise groups. Mean age was 72 ± 4.2 years, and ten were female. Exclusion criteria were physical or cognitive injuries/impairments that affected their balance and gait ability, and age < 50 or age > 80 years. Data was collected at the Movement Capture and Visualization Laboratory at the Norwegian University of Science and Technology in Trondheim, Norway in June 2019.

The exergame
A custom exergame for balance training was used in this study, using Kinect (v2, Microsoft Inc) to track participants' movements for input to the game. The exergame was designed to elicit medio-lateral weight shifts from the user: an avatar representing the user was shown in a rail cart on a train rail, as seen in Fig. 1. Along each side of the rail there were coins that the user should try to hit by tilting the cart sideways, which was achieved by shifting their body weight over to the foot on the side of the coin (Fig. 2). There were never more than two coins consecutively on one side. There were approximately 100 coins in total, with 50 % appearing on each side.

Equipment
A four-camera (MX400, 90 Hz, Qualisys Inc, Sweden) setup was used for capturing 3D motion data (3DMoCap) from participants. The Plug-in-Gait Full Body (PiG-FB, [30]) marker setup, excluding head and hands, was used. Two digital cameras (GoPro Hero Black 3+, 25 Hz, GoPro Inc) placed 200 cm behind and to the side of the player were used to capture player movements simultaneously with the 3DMoCap system. To capture force data, two force plates (60 × 5 × 40 cm, 1000 Hz, Kistler AB) were used, one under each foot of the player. The experimental setup can be seen in Fig. 3.

Preprocessing
To extract joint center positions from 2D-DV data, the DeepLabCut (DLC, [28]) framework was used. The 3DMoCap data was gap-filled and the joint center positions were extracted using the standardized PiG-FB biomechanical model implemented in Nexus (v. 2.9, Vicon Motion Systems Ltd). The joint center positions extracted from both data sources were ankles, knees, hips, shoulders, elbows and wrists. From the 3DMoCap system the anterior-posterior (X), medio-lateral (Y) and vertical (Z) positions relative to the Qualisys global coordinate system origin were extracted, and in the 2D-DV data the vertical (Y) and medio-lateral (X) positions relative to the 2D-DV camera origin were extracted. This resulted in 36 input features from the 3DMoCap system (12 joint centers × 3 coordinates), and 24 features from the 2D-DV system (12 joint centers × 2 coordinates). The data was then normalized to the [0,1] range. Data was synchronized by resampling joint center data from digital video using the 3DMoCap data frequency as reference. Force components $F_x$ (anterior-posterior), $F_y$ (medio-lateral) and $F_z$ (vertical) were extracted from the force plate data. GRF components were scaled to body weight (BW) for each time frame. The video data of ankles was occluded in participants 4, 8, 9, and 10, resulting in missing ankle data for these participants. 3DMoCap data from participants 1 and 2 was corrupted, and not used in further analyses.
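The preprocessing steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline used in the study: all function names are hypothetical, linear interpolation is an assumed choice for the resampling step, and body-weight scaling is assumed to be expressed in percent of BW.

```python
JOINTS = ("ankle", "knee", "hip", "shoulder", "elbow", "wrist")
SIDES = ("left", "right")


def feature_names(axes):
    """One feature per side, joint center and coordinate axis."""
    return [f"{side}_{joint}_{axis}"
            for side in SIDES for joint in JOINTS for axis in axes]


def min_max_normalize(values):
    """Scale a sequence of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant signal: no range to scale
    return [(v - lo) / (hi - lo) for v in values]


def resample_linear(samples, src_hz, dst_hz):
    """Resample a uniformly sampled signal (at least 2 samples) from
    src_hz to dst_hz by linear interpolation (assumed method)."""
    n_out = int((len(samples) - 1) * dst_hz / src_hz) + 1
    out = []
    for i in range(n_out):
        t = i * src_hz / dst_hz              # position in source-sample units
        k = min(int(t), len(samples) - 2)    # index of the left neighbor
        frac = t - k
        out.append(samples[k] * (1 - frac) + samples[k + 1] * frac)
    return out


def scale_to_body_weight(force_newton, body_weight_newton):
    """Express each force sample as a percentage of body weight (assumed unit)."""
    return [100.0 * f / body_weight_newton for f in force_newton]
```

With these helpers, `feature_names("XYZ")` yields the 36 features of the 3D case and `feature_names("XY")` the 24 features of the 2D case, while `resample_linear(video_signal, 25, 90)` would bring a 25 Hz video signal to the 90 Hz motion capture rate.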

Machine learning models
Python v. 3.7.10 was used for all analyses and evaluation. scikit-learn [31] was used for multivariate linear regression (LinReg), GridSearchCV and feature importance, and for evaluation of model performances; the Keras framework [32] was used to build the LSTM model; and XGBoost was implemented using the XGBoost package for Python (https://github.com/dmlc/xgboost). LinReg was used as a baseline model for reference purposes. XGBoost is an improved version of decision tree models that combines the random forest technique of feature bagging with a gradient descent method to reduce boosting error, hence the name "gradient boosting". This has been shown to perform well on a wide range of non-linear estimation tasks [24]. The long short-term memory (LSTM) model is a version of a recurrent neural network. Stacked LSTM is a version of LSTM models that utilizes several layers of LSTM nodes, which has been shown to improve performance over single-layer LSTMs [33]. A schematic of the stacked LSTM model we implemented in this study can be seen in Fig. 4. There is one dense input layer, three hidden layers of 512 nodes each, a dropout layer (rate .2), and a dense output layer of 6 nodes with sigmoid activation, one node for each force dimension of each force plate.

Parameters and optimization
Hyperparameters for the XGBoost model were tuned using GridSearchCV with five cross-validation iterations. The hyperparameter grid searched can be found in Table 1. The hyperparameter values in bold font were the ones found to yield the highest performance, and were used in training the final XGBoost model. Optimization of the LSTM network was conducted using the Adam optimizer [34] with an initial learning rate of .0001, decay steps of 10,000, and a decay rate of .96. The model was trained for up to 200 epochs, and training stopped when the loss (mean squared error, MSE) improved by less than .0003 for three consecutive epochs.
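To make the LSTM training settings concrete, the following sketch shows one common reading of them. It assumes that the learning rate follows a continuous exponential decay, lr = lr0 · rate^(step / decay_steps), and that the stopping criterion works like a patience-based early-stopping rule; both are assumptions about the setup rather than details confirmed in the text, and the function names are hypothetical.

```python
def decayed_learning_rate(step, initial_lr=1e-4, decay_steps=10_000,
                          decay_rate=0.96):
    """Exponentially decayed learning rate at a given optimizer step
    (assumed continuous, non-staircase form)."""
    return initial_lr * decay_rate ** (step / decay_steps)


def stopping_epoch(losses, min_delta=3e-4, patience=3):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the loss has failed to improve on the best value
    so far by more than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(losses, start=1):
        if best - loss > min_delta:
            best = loss   # meaningful improvement: reset the counter
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None
```

Under these assumptions the learning rate has dropped by 4 % after 10,000 optimizer steps, so the decay is gentle relative to a training run of at most 200 epochs.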
A leave-one-group-out cross-validation was performed on all models, where one group was the data from one participant, which served as the test set in each iteration. This was performed on the joint data from both the 3DMoCap and 2D-DV systems. For evaluation, the root mean square error (RMSE) and the coefficient of determination ($R^2$) were averaged over the left and right foot, and their mean (1SD) across the cross-validation splits was computed.
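The evaluation scheme can be illustrated with a short plain-Python sketch. The study itself used scikit-learn for evaluation; the functions below are hypothetical stand-ins that only show the logic of holding out one participant at a time and computing RMSE and $R^2$ on that participant's data.

```python
import math


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices), one split per group.

    `groups` holds one participant label per sample.
    """
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


def rmse(y_true, y_pred):
    """Root mean square error between measured and estimated values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Because every split holds out all samples from one participant, the test data always comes from a person the model has not seen during training.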

Results
The results showing feature selection and subsequent estimation performance of LSTM, XGBoost and LinReg using 3DMoCap and 2D-DV data are presented as RMSE in Table 2 and $R^2$ in Fig. 7. Figure 8 shows illustrative example graphs of estimation performance of the three models using 3D and 2D data, over a randomly selected sequence (1000 frames) from one person during one trial of play. Furthermore, the contribution of each joint center to estimation performance was computed using a permutation procedure. Here, the data in each feature is shuffled in a random manner, which breaks the real-world relationship between the feature and the target. The resulting difference in estimation performance between using the shuffled and un-shuffled feature is indicative of how much the model depends on this feature [35]. This is repeated for all features and indicates which features, i.e. joint centers, are most important to the estimation performance. Results from the feature importance analysis, using 3DMoCap data, showed that eight joint centers contributed 82.9% of the information needed to estimate GRF components. These joint centers were right and left wrist, right elbow, left knee, and torso joint centers (left and right shoulders, and left and right hip joints). The models were subsequently retrained using these joints.
Using 2D-DV data, there were also eight joint centers that had a total contribution of 78%: Left wrist, shoulder, hip, knee and ankle, and right shoulder, knee, and ankle.
The relative contributions of all joint centers can be seen in Figs. 5 and 6.
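The permutation procedure described above can be sketched as follows. This is a simplified plain-Python illustration and not the scikit-learn routine used in the study: the function names, the use of MSE as the score, the number of repeats, and the normalization to percentage shares are all assumptions made for the example.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE when each feature column is shuffled.

    predict: function mapping a list of feature rows to a list of estimates.
    X: list of rows, each a list of feature values. y: list of targets.
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the feature-target relationship
            X_perm = [row[:j] + [column[i]] + row[j + 1:]
                      for i, row in enumerate(X)]
            increases.append(mean_squared_error(y, predict(X_perm)) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def relative_contribution(importances):
    """Express non-negative importances as percentage shares of their sum."""
    clipped = [max(v, 0.0) for v in importances]
    total = sum(clipped)
    return [100.0 * v / total for v in clipped]
```

A feature the model ignores leaves the error unchanged when shuffled and thus receives zero importance, while a feature the model relies on produces a clear increase in error.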

Estimation error
Prediction performance is presented in Table 2, with the mean (± 1SD) RMSE (% BW) for the three models using 3DMoCap and 2D-DV data for the three force components. The LSTM model outperforms both XGBoost and LinReg when using both 3DMoCap and 2D-DV data. The XGBoost model performs at the same level as LinReg using both 3DMoCap and 2D-DV data. The lowest mean RMSE (4.3% BW) was achieved by the LSTM model on the $F_z$ component using 3DMoCap data; the highest (23.5% BW) was obtained by the LinReg model in the $F_y$ component using 2D-DV data. RMSE was generally higher using 2D-DV data than when using 3DMoCap data.

Model fit
As shown in Fig. 7, the LSTM $R^2$ is consistently higher than that of the XGBoost and LinReg models using both MoCap and 2D-DV data. Using the MoCap data, the [...] The XGBoost model also estimates $F_z$ very well, but this is not seen to the same degree in $F_y$ and $F_x$. In $F_x$ and $F_y$ the XGBoost model is able to follow the major trends in the data, but rapid changes in force are not estimated well. The LinReg model is able to estimate major changes in $F_z$, but not with the level of detail seen in the LSTM or XGBoost model. $F_x$ and $F_y$ components, however, are not estimated as well by the LinReg model.

Discussion
This study investigated two facets of estimation of GRF components in balance training using machine learning models. First, we assessed the overall estimation performance of an LSTM and an XGBoost model on GRF components, comparing it to a baseline LinReg model's performance. Second, the performance of the LSTM and XGBoost models in estimating 3D GRF data using 2D joint data was examined. Overall, the LSTM model performance was very good, considering that joint position data was the only input data used for estimation. The LSTM RMSE was < 11 % BW for all GRF components when using 3DMoCap data, and $R^2$ was moderate to high (> .58 and > .79) for $F_x$ and $F_y$, and excellent (> .97) for $F_z$. This shows that the LSTM model was able to accurately estimate the $F_z$ component, while achieving only slightly less accurate results in the $F_x$ and $F_y$ components. The boxplots in Fig. 7 also show that the $F_z$ estimation was very stable around the median. This was the case in all three models. The most promising part of our results is that our method does not require information about the person playing or any calculations using the input data to represent the person; that is, no biomechanical model is needed. This makes our method less computationally expensive, and easier to implement in an in-home setting. Still, our findings on estimation of GRF from kinematic data are in line with related literature in gait analysis, such as Mundt et al. [16], Oh et al. [17], and Choi et al. [18]. The movement pattern is different, so a direct comparison of results is not feasible. These studies used 3DMoCap data to calculate biomechanical features such as joint angles [17,36], body segment velocities [18], and foot contact events [21], which are not obtainable using only joint position data. This demonstrates the strength of our results: our method uses the joint center positions directly, skipping both practical and computational steps that complicate the process.
This makes our method more accessible and easy to use, while being as accurate as more complicated methods.
Regarding performance using 2D-DV data, our findings support using this modality for estimation of $F_z$ during balance exergaming. This is a step in the right direction regarding in-home use of exergaming, as a standard digital camera that most people already possess can provide accurate information about weight-shifting performance during exergaming. Such a camera can be a smartphone or a web camera, so there is no need to acquire a specialized device such as a Kinect camera. However, our findings also show that when the context requires three-dimensional GRF data, the use of 3D kinematic data is preferred to ensure estimation accuracy in all three GRF components. This is also true when the context requires an error of < 10 % BW in components other than $F_z$. LinReg also performs surprisingly well in $F_z$, with comparable RMSE and $R^2$ to LSTM and XGBoost, although both the LSTM and XGBoost models are better at estimating the small changes in force that occur between lateral weight-shifts (i.e., when the person is standing with the majority of their BW on one foot).
The $F_z$ component is arguably the most informative of the three directions in balance training, as it represents the vertical force, i.e., the weight that is being pushed straight downwards onto the surface. In practice, this indicates how much body weight the person places on each leg, which is an indication of how well the person is performing a weight shift during exercise. However, $F_x$ and $F_y$ information may also be relevant to measure accurately, as the force exerted in these directions contributes to postural control. For example, force magnitude, directional accuracy, and variability in $F_y$ and $F_x$ in relation to an (externally or internally induced) disturbance in posture can be informative about balance ability [37,38].
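As a small illustration of how the vertical component translates into a weight-shift measure, the share of the total vertical force carried by each foot can be computed per time frame. The function name and the percentage convention are assumptions made for this example and are not taken from the study.

```python
def weight_distribution(fz_left, fz_right):
    """Per-frame share (%) of the total vertical force on each foot.

    fz_left, fz_right: vertical force samples from the two force plates,
    in any common unit (e.g. N or % BW).
    """
    shares = []
    for left, right in zip(fz_left, fz_right):
        total = left + right
        if total <= 0:
            shares.append((0.0, 0.0))  # no load registered on the plates
        else:
            shares.append((100.0 * left / total, 100.0 * right / total))
    return shares
```

For instance, a frame with 80 % BW on the left plate and 20 % BW on the right plate corresponds to a clear weight shift to the left foot.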
In medio-lateral weight-shifting the $F_x$ component might not be as critical to measure, as the $F_z$ component measures the same movement in this context. In contrast, control over anterior-posterior movement (and thus $F_y$) is important for maintaining a steady and stable sideways movement pattern, and for preventing large anterior-posterior movements during weight-shifting exercises that could create destabilizing conditions. This means that even though $F_z$ provides the main information about sideways weight-shifting performance, $F_y$ can inform about the variability and stability in a weight-shifting movement.
The feature importance information from the XGBoost model showed different joints to be important depending on the type of data used. When using 3D data, more joints on the right side contributed to estimation performance, while more joints on the left side were important when using 2D data. From these results we were not able to elucidate any systematic or clear pattern in joint importance, which might be caused by the limited set of movements performed in this study. This might be an interesting avenue to explore further using a data set richer in terms of movements.
The high $R^2$ achieved could be a sign of overfitting by the LSTM model [39]. However, the tenfold CV process showed a stable fit using test data, which can be seen in the low spread of the LSTM model in Fig. 7 as well. Results from test/train errors also support this, as the difference between test/train errors is low, as seen in Fig. 10. Even more reassuring is the fact that the CV process was not a holdout of random pieces of data, but a holdout of all the data from each person. Thus, estimation of GRF was performed on previously unseen data from a person with an unknown movement pattern.
The XGBoost model, however, does indeed seem to suffer from overfitting, which presents itself as higher RMSE when estimating based on unseen data compared to training data [40] (Fig. 9). This is likely caused either by too much noise in the data (especially in the 2D-DV data), where the limited tree depth (max depth = 12) does not allow the tree to fully model the real relationship in the data, or by the current data set being too sparse. Even though XGBoost inherently possesses features that are known to prevent overfitting, our findings indicate that this was not successful here.

Limitations
There are some limitations to be aware of in the current study. The movement pattern performed by participants was limited to sideways leaning, and the number of participants was low. The data was collected in a laboratory setting, and the models used require training data to be usable in a real-world setting.

Conclusion
In conclusion, the LSTM model performed very well, especially in $F_z$. 3DMoCap data produced the best results, and the best $F_z$ estimation from 2D video data was also achieved by the LSTM model. These findings show that it is feasible to develop exergames that provide weight-shifting biofeedback using only 2D joint position data from a standard digital video camera.
With the support of a standard camera, an exergame in balance training can incorporate the LSTM model to provide real-time biofeedback on weight-shifting performance. This warrants further investigation into how such systems can be integrated into exergames for in-home balance exercise, as it opens up broad opportunities for providing accurate feedback in a simple manner. The LSTM model and 2D-DV input data combination has the potential to facilitate more effective and motivating in-home balance training by incorporating accurate feedback on weight-shifting performance in exergames.