Dataset
The analytical flowchart of this study is presented in Fig. 1. Data from 139 patients who underwent Lokomat training at Taipei Medical University Hospital were retrospectively collected. After screening for data completeness, records of 91 adult patients with acute or chronic neurological disorders were included in the study. Clinical information and RAGT parameters of all sessions were incorporated as input variables to predict whether patients would show improvement, determined by comparing the FAC of the 12th session with that of the first session. Continuous variables (i.e., age, days to complete 12 sessions, BW support, GF, and speed) and categorical variables (i.e., initial FAC, gender, entry point, affected extremity, and diagnosis) served as inputs to the prediction models. In the final dataset, 60 patients (65.9%) showed improvement after completing 12 RAGT sessions, and the remaining 31 patients showed no improvement.

Given the importance of walking ability in stroke patients, the RAGT parameters extracted for analysis were BW support, GF, and treadmill speed, as they are responsive measures of gait ability. Each of these three parameters can be adjusted individually according to a patient's condition and therapeutic goals. At our clinic, therapists develop individualized Lokomat training protocols for each stroke patient. As patients gradually regain strength in their lower extremities, therapists can reduce the supported body weight to promote greater muscle activity. As patients regain better-timed muscle activation, the GF can be reduced to promote active participation in the predefined gait pattern. Finally, once a patient performs well under the adjusted support and guidance, therapists can increase the training walking speed to increase repetitions and challenge.

This study was approved by the Institutional Review Board of Taipei Medical University (no. 202005039).
Descriptive statistical analysis
Descriptive statistics were compared between patients undergoing RAGT who did and did not improve (i.e., 60 patients in the improvement group and 31 in the no-improvement group). Means and standard deviations (SDs) were computed for continuous variables, and frequencies and percentages were calculated for categorical variables. Baseline characteristics of the two groups were compared using Student's t-test for continuous variables and the Chi-squared test for categorical variables. Between-group differences were considered statistically significant at p < 0.05. Statistical analyses were conducted with RStudio version 1.2.5001 (RStudio, 2009–2019) and SAS 9.4 (SAS Institute, Cary, NC, USA).
Imbalanced dataset handling
The original dataset was imbalanced (65.9% of participants showed improvement and 34.1% did not), and machine learning algorithms trained on imbalanced data tend to favor the majority class in order to achieve better overall predictive performance. However, one of the most important aims of our study was to identify patients who do not show improvement despite completing many RAGT sessions. Oversampling the minority class or under-sampling the majority class are effective techniques for handling imbalanced datasets [19]. We therefore applied oversampling to double the no-improvement group, generating a final balanced dataset of 60 improved and 62 non-improved participants for development of the prediction models in this study.
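The oversampling step amounts to randomly duplicating minority-class records until the classes are roughly balanced. The following is a minimal sketch in Python (the study itself was conducted in R); the feature matrix is a synthetic placeholder, and only the class counts mirror the study:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility

def oversample_minority(X, y, minority_label, n_extra):
    """Randomly duplicate minority-class rows (sampling with replacement)."""
    idx = np.flatnonzero(y == minority_label)
    extra = rng.choice(idx, size=n_extra, replace=True)
    X_bal = np.vstack([X, X[extra]])
    y_bal = np.concatenate([y, y[extra]])
    return X_bal, y_bal

# Illustration with the study's class sizes: 60 improved (1), 31 not improved (0).
X = rng.normal(size=(91, 5))          # placeholder feature matrix
y = np.array([1] * 60 + [0] * 31)
X_bal, y_bal = oversample_minority(X, y, minority_label=0, n_extra=31)
# Doubling the 31-patient group yields 62 non-improved vs. 60 improved records.
```

Adding 31 resampled duplicates to the 31 original no-improvement records gives the 62-versus-60 balanced dataset described above.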
Machine learning algorithms
Machine learning is a research field in computer science that focuses on how algorithms learn from data [20]. These algorithms incorporate statistics to detect patterns and make predictions about a dataset [21]. The use of machine learning has become increasingly common for obtaining reliable predictions: compared to traditional methods, many machine learning techniques yield more-sensitive and more-specific screening models while relaxing the assumptions and restrictions of traditional regression [22]. The aim of this study was to build the best prediction model to distinguish patients with and without improvement after RAGT for functional gait recovery. Five machine learning algorithms were incorporated to predict changes in the FAC and compare performances: logistic regression (widely used in medical studies), the decision tree (which generates a tree-like model to support decision making), the support vector machine (SVM; which performs nonlinear classification by mapping input features to a higher-dimensional space), the random forest (RF), and extreme gradient boosting (XGBoost). Logistic regression and the SVM are two well-known statistical methods for binary classification. We additionally used tree-based algorithms for the same binary classification task. The decision tree is the fundamental tree-based algorithm; RFs and XGBoost are also tree-based but fundamentally differ: RFs use bagging, drawing bootstrap samples with replacement from the original dataset to grow each tree, whereas XGBoost builds trees sequentially through gradient boosting.
Cross-validation was incorporated to provide fair estimates of predictive performance without overestimation. We used ten-fold cross-validation on the balanced dataset (i.e., 60 in the improvement group and 62 in the no-improvement group): the dataset was randomly divided into ten subsets, and each subset in turn served as the test set while the remaining nine subsets were used as the training set to develop the prediction model. We built and evaluated all prediction models in the R environment.
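The comparison of the five algorithms under ten-fold cross-validation can be sketched as follows. This is a Python/scikit-learn illustration of the procedure (the authors worked in R); the data are synthetic placeholders, and scikit-learn's GradientBoostingClassifier stands in for XGBoost, which requires an external package:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(122, 10))          # placeholder features (clinical + RAGT)
y = np.array([1] * 60 + [0] * 62)       # balanced outcome labels

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),  # XGBoost stand-in
}

# Ten-fold cross-validation: each fold serves once as the test set while the
# other nine folds train the model; the mean AUC summarizes each algorithm.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
mean_auc = {name: cross_val_score(m, X, y, cv=cv, scoring="roc_auc").mean()
            for name, m in models.items()}
```

With real features in place of the random placeholders, the `mean_auc` dictionary would provide the per-algorithm comparison described in the text.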
Experimental design and prediction models
To investigate the effects of using different numbers of input sessions (with i denoting the number of input sessions), the clinical information and raw parameters from the first i RAGT sessions were incorporated as input features in the five machine learning algorithms to predict whether there was an improvement in the 12th session (compared to the 1st session) for the balanced dataset. The number of input sessions yielding the highest area under the receiver operating characteristic (ROC) curve (AUC) in the test set was selected as the optimal input length for predicting the success of RAGT. After the optimal number of input sessions was determined, we used it with ten-fold cross-validation to develop the prediction model for the balanced dataset (denoted model 1).
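The search over the number of input sessions can be sketched as a simple loop. Everything here is illustrative: hypothetical per-session arrays stand in for the real BW-support, GF, and speed records, and a random forest is used as a representative classifier under the same ten-fold scheme:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_patients, n_sessions = 122, 12
# Three hypothetical RAGT parameters per session: BW support, GF, speed.
sessions = rng.normal(size=(n_patients, n_sessions, 3))
y = np.array([1] * 60 + [0] * 62)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
best_i, best_auc = 0, -1.0
for i in range(1, n_sessions + 1):
    # Flatten the first i sessions into one feature vector per patient.
    X_i = sessions[:, :i, :].reshape(n_patients, -1)
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    auc = cross_val_score(clf, X_i, y, cv=cv, scoring="roc_auc").mean()
    if auc > best_auc:          # keep the session count with the highest mean AUC
        best_i, best_auc = i, auc
```

On real data, `best_i` would be the optimal number of input sessions carried forward into model 1.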
Furthermore, instead of predicting only whether a change would occur, we investigated whether using a finer granularity, predicting the amount of FAC change, would improve performance. The FAC is a six-point categorical scale that assesses the extent of support a patient needs when walking. FAC changes of 0 and 1 accounted for 50.8% and 33.6% of the balanced dataset, respectively (i.e., 62 and 41 patients). The few samples with FAC changes of ≥ 2 (i.e., 14, 3, 1, and 1 patients with FAC changes of 2, 3, 4, and 5, respectively) were therefore grouped together. We incorporated data from the optimal number of input sessions as features and FAC changes at three levels (i.e., 0, 1, and ≥ 2) as the outcome target variable to develop a prediction model by ten-fold cross-validation (denoted model 2).
In addition to accurate predictive performance, it is highly desirable that clinical insights can be drawn from the experimental results of an analytical study. The RF algorithm estimates the importance of a specific variable by observing how much the prediction error (i.e., the mean decrease in accuracy, MDA) increases when the out-of-bag data for that variable are permuted while all other variables remain unchanged [23].
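R's randomForest package reports the MDA directly from out-of-bag data; a comparable permutation-based importance can be sketched in Python with scikit-learn, computed here on a held-out split rather than out-of-bag samples. The data are synthetic, with the first feature deliberately made informative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(122, 5))
# Outcome driven mainly by feature 0, so its permutation should hurt accuracy.
y = (X[:, 0] + 0.1 * rng.normal(size=122) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time and record the mean drop in accuracy.
imp = permutation_importance(rf, X_te, y_te, scoring="accuracy",
                             n_repeats=30, random_state=0)
# imp.importances_mean[j] approximates the MDA-style importance of feature j.
```

Larger values of `imp.importances_mean` indicate variables whose permutation degrades accuracy most, mirroring the MDA ranking used in the study.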
Evaluation measures
The accuracy, sensitivity, specificity, and ROC curve were used to evaluate the performance of each prediction model. Accuracy, sensitivity, and specificity were defined as follows:
$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
$$Sensitivity=\frac{TP}{TP+FN}$$
$$Specificity=\frac{TN}{TN+FP}$$
where TP, TN, FP, and FN respectively denote numbers of true positives, true negatives, false positives, and false negatives.
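The three definitions above can be checked with a small self-contained function (a sketch, not the study's code); labels use 1 for the positive (improvement) class:

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

m = confusion_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
# TP=2, TN=2, FP=1, FN=1 → accuracy 4/6, sensitivity 2/3, specificity 2/3
```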
The area under the ROC curve (AUC) is an indicator used to evaluate the performance of classification models. A previous study suggested that the AUC is a better indicator for comparing and measuring the performance of classification algorithms [24] because it avoids the subjectivity of choosing a threshold when converting continuous predicted probabilities into binary labels, and it summarizes the performance of each prediction model across all possible thresholds [20]. Therefore, the AUC was used as the evaluation measure for comparing the different machine learning algorithms and selecting the best-performing classifier.
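The threshold-free character of the AUC is visible in its rank (Mann-Whitney) formulation: the AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as half. A minimal sketch:

```python
def auc_score(y_true, scores):
    """AUC via the Mann-Whitney formulation: fraction of positive-negative
    pairs ranked correctly, with ties counting as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = auc_score([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])
# Pairs: (0.9,0.6) win, (0.9,0.2) win, (0.4,0.6) loss, (0.4,0.2) win → 3/4
```

No threshold appears anywhere in the computation, which is exactly why the AUC sidesteps the threshold-selection subjectivity noted above.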