Estimating upper-extremity function from kinematics in stroke patients following goal-oriented computer-based training

Introduction After a stroke, a wide range of deficits can occur with varying onset latencies. As a result, assessing impairment and recovery are enormous challenges in neurorehabilitation. Although several clinical scales are generally accepted, they are time-consuming, show high inter-rater variability, have low ecological validity, and are vulnerable to biases introduced by compensatory movements and action modifications. Alternative methods need to be developed for efficient and objective assessment. In this study, we explore the potential of computer-based body tracking systems and classification tools to estimate the motor impairment of the more affected arm in stroke patients. Methods We present a method for estimating clinical scores from movement parameters that are extracted from kinematic data recorded during unsupervised computer-based rehabilitation sessions. We identify a number of kinematic descriptors that characterise the patients’ hemiparesis (e.g., movement smoothness, work area), we implement a double-noise model and perform a multivariate regression using clinical data from 98 stroke patients who completed a total of 191 sessions with RGS. Results Our results reveal a new digital biomarker of arm function, the Total Goal-Directed Movement (TGDM), which relates to the patients work area during the execution of goal-oriented reaching movements. The model’s performance to estimate FM-UE scores reaches an accuracy of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R^2$$\end{document}R2: 0.38 with an error (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sigma$$\end{document}σ: 12.8). Next, we evaluate its reliability (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r=0.89$$\end{document}r=0.89 for test-retest), longitudinal external validity (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$95\%$$\end{document}95% true positive rate), sensitivity, and generalisation to other tasks that involve planar reaching movements (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R^2$$\end{document}R2: 0.39). The model achieves comparable accuracy also for the Chedoke Arm and Hand Activity Inventory (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R^2$$\end{document}R2: 0.40) and Barthel Index (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R^2$$\end{document}R2: 0.35). Conclusions Our results highlight the clinical value of kinematic data collected during unsupervised goal-oriented motor training with the RGS combined with data science techniques, and provide new insight into factors underlying recovery and its biomarkers.


Introduction
Stroke is the second major cause of death and disability worldwide, with about 15 million new cases every year [1]. One-third of these cases lead to persistent cognitive and motor disabilities [2]. About 80% of stroke survivors present weakness and partial loss of voluntary control in the upper-extremities [3], or hemiparesis, which is often associated with other sensorimotor alterations, such as hypertonia or tremor.
Although hemiparesis is a highly prevalent symptom and severely limits the independence of affected patients, its causes and recovery dynamics are not fully understood [4]. Recent literature converges on the idea that recovery is mainly due to a combination of residual corticospinal tract capacity and an upregulation of the reticulospinal tract [5,6]. Further, recovery seems to follow a temporal structure where most of the improvement occurs during the first months post-stroke [7,8]. So far the assessment of the hemiparesis phenotype and its progression, however, are based on assessment methods with known limitations (e.g. Fugl-Meyer Assessment [9,10], Action Research Arm Test [11,12]) and there is a need for more sensitive, objective, and reliable alternatives that are also compatible with contemporary digital health technologies.
A recent systematic review [13] of a total of 225 studies (N = 6197) using 151 different kinematic metrics found that kinematic assessments of upper limb sensorimotor function are poorly standardised and rarely measure clinimetrics in an unbiased manner. Specifically, using descriptors of accuracy, efficacy, efficiency, movement planning, precision, spatial posture, speed, temporal posture, and range of movement together with clinimetric properties of these descriptors (i.e., reliability, measurement error, convergent validity, and external validity), the authors showed that the studies analysed exclusively focused on finding correlations between measures of impairment, and only two of the studies reported correlations in change. Overall, there is very limited information regarding test-retest reliability and the external validity of the change of kinematic outcome measures of reaching performance [14]. Exceptionally, Murphy et al. [15] explored external validity of the change in a number of kinematic descriptors and found a significant covariation of the Action Research Arm Test (ARAT) scores with movement time ( R 2 = 0.36), smoothness ( R 2 = 0.31), and trunk displacement ( R 2 = 0.35). Although the results are promising, this study involves a limited number of subjects (N = 24) from a highly homogeneous sample (i.e., acute patients only). Further, the ARAT clinical scale presents poor robustness to compensation and is especially vulnerable to the use of explicit strategies to improve performance. Majeed et al. [16] explored the application of models based on LASSO regression to predict changes in motor ability (FM-UE) and motor function (Wolf Motor Function Test, WMFT). These models proposed that recovery in both scales can be approximated by the patient's age, the patient's motor control during the execution of fast movements, and other demographic and clinical features, altogether accounting for 65% and 86% of the variability for the FM-UE and WMFT scales respectively. Although these models reached exceptional accuracy, their utility is limited because they make use of kinematic data obtained during the execution of very specific pointing movements supervised by clinicians and/or researchers, and are based on generic unbounded linear models, with the consequence that their estimated values could be largely outside the meaningful range of the scale.
We propose a new approach towards using kinematic data obtained in unsupervised rehabilitation sessions to estimate the level of impairment and functional recovery. Data is obtained from patients engaging with goaloriented embodied individualised training with the Rehabilitation Gaming System (RGS) [17,18]. The RGS combines the paradigm of action execution with that of observation of the corresponding movement in Virtual Reality (VR), this goal is achieved by having the patients perform tasks from a first-person perspective, where the movement of their limbs are captured by a camera or a depth sensor (i.e. Microsoft Kinect) and mapped to an analogous virtual representation on a computer screen. RGS includes individualisation mechanisms to adjust the difficulty of the task to the capabilities of the patient, contextual restrictions, and explicit and implicit feedback.
We first explore the potential of hand movements collected during unsupervised RGS sessions to characterise hemiparesis in stroke patients. Secondly, we build and analyse the performance (i.e., test-retest reliability, validity, sensitivity, and generalisation) [13] of a model for estimating impairment and recovery scores captured by standardised clinical scales.

Protocol
Participants followed a rehabilitation protocol including 3-5 weekly sessions of 30 minutes each for 3-12 weeks using the RGS in Fig. 1. The RGS consists of a PC, a 17 inch LCD touch display, an image-based motion capture device positioned on top of the screen [18]. The virtual tasks logic and graphics are implemented using the Torque 3D and Unity 3D game engines. The joint movements of the user's head, shoulders and elbows are tracked and mapped onto an avatar through a biomechanical model using a custom-developed vision-based motion capture system. Arm movements are displayed on a screen from a first-person perspective, realising a rehabilitation paradigm that combines goal-oriented embodied action execution and observation. For the RGS sessions in the main dataset (Table 1) the participants are instructed to intercept virtual spheres by executing horizontal bimanual movements over the surface of a table ('Spheroids' protocol, cf. inset in Fig. 1). The task parameters (the frequency of sphere appearance, their speed, their range, and size) are combined in a single parameter ('difficulty level') and automatically adjusted during the session in order to maintain the user's performance between 70 and 80% success rate [17,21]. The system allows for the storage and extraction of performance parameters as well as hand path trajectories derived from joints' positions and rotations recorded at about 100 Hz.
During the rehabilitation patients are evaluated using standard clinical scales: Fugl-Meyer Assessment for the upper extremity (FM-UE), Chedoke Arm and Hand Activity Inventory (CAHAI) and Barthel Index (BI). When collecting the 191 samples (Table 1), the following measures are taken to improve data quality: • The clinical score measurements (FM-UE, CAHAI, BI) are coupled to the RGS session closest in time,  [17,18] used in the rehabilitation protocol followed by the patients in the collection of the data in this study. In the inset, we show a screenshot from the 'Spheroids' activity during an RGS session with a maximum time separation of 4 days between the measurement and the RGS session; • The first two RGS sessions of a patient at the start of the rehabilitation trajectory are excluded to ensure that patients are familiar with the RGS environment for all collected samples.

Outcomes and analysis
To analyse the potential of RGS-derived movement descriptors for capturing both impairment and recovery in standardised clinical scales, we first extract a set of variables that are known to correlate with the severity of hemiparesis [13,16]. Next, we define a model that combines the information of several variables to estimate the patient's score on the FM-UE clinical scale. The model includes an error estimate. We use repeated cross-validation to avoid overfitting, allocating 50% of the samples to the training set and 50% to the validation set. We study the model evaluating its convergent validity, test-retest reliability, external validity, and generalisation to other tasks and clinical scales.

Analysis of convergent validity: estimating impairment
To explore the convergent validity of RGS-derived kinematic descriptors in comparison to standardised clinical assessments, we compute Pearson correlations between all the variables (the RGS-derived descriptors and the baseline characteristics) and the clinical scores. By comparing these to a randomised outcome distribution we identify a threshold of r ≃ 0.081 for all the Pearson correlations variable-scores ( Fig. 12 in Appendix). Subsequently, we adopt a nonlinear parametric model to combine information from several variables for estimating the patients' impairment level (i.e., FM-UE scale). We use only the patient's baseline characteristics and RGSderived movement descriptors (extracted from a single session logfile) as predictors. In particular, we rule out the two variables sessions completed so far and time since stroke and functions of them. In this way, the estimate is intended to assess the clinical status of the patient at a given moment without knowledge of the rehabilitation history. To avoid overfitting, we perform repeated crossvalidation with 50% of samples for training and 50% for validation, obtaining the optimal active set of variables possible for our dataset. In the cross-validation procedure, we define the accuracy of the model as the percentage of samples that are correctly estimated above or below the median score (e.g. 47 for FM-UE). Additionally, we report the values of the Leave-one-out cross-validation (LOOCV) for the optimal active set. To quantify the performance of the model, we compare true and estimated FM-UE scores reporting the average error, the Pearson correlation, and the coefficient of determination. Finally, we report the performance of the estimation of FM-UE obtained by a linear model with the same active set of variables.

Analysis of test-retest reliability
To evaluate the test-retest reliability, we consider two unseen datasets each composed of 921 RGS sessions, for a total of 1842 unseen RGS sessions. Each session in the first dataset 'test' is associated with a session in the second dataset 'retest' and obtained from the same patient within less than 48 h. The small time frame makes it plausible that the clinical state of the patient is unchanged between the two test-retest sessions, and so they can be used to assess the reliability of the regression models. These data were collected in the same trials as the main dataset, but they correspond to rehabilitation sessions for which we do not have an associated measurement of a clinical scale (so they cannot be used for training the model). To quantify the reliability in the estimation of the FM-UE scores, we evaluate the intraclass correlation coefficient (ICC) between the estimations obtained by the model for test and retest sessions.

Analysis of external validity: estimating the change in impairment
Starting from the original dataset (Spheroids, Table 1), we design a new dataset (recovery dataset, Table 2) composed of 54 samples where each sample represents a couple of sessions of the same patient for which the time lapsed between the two is larger than 16 days. Next, we compute the correlations between the change in the movement descriptors and clinical improvements. We utilise this dataset to analyse the external validity of the previous model (obtained for estimating impairment) in detecting changes in the clinical status of the same patient. Therefore, we adopt the same set of variables used for the estimate of impairment. We evaluate the performance of the model by comparing the true and predicted change of FM-UE score. Next, we identify 38 out of the 54 samples from the recovery dataset (Table 2) for which the associated FM-UE values exceed an MDC of 4 points. We then determine the sensitivity of the model evaluating the percentage of FM-UE that are correctly predicted above the MDC.

Analysis of task generalisation
The Spheroids protocol requires predominantly movements in the lateral (left/right) direction. To explore the potential of the model to generalise to other tasks that involve bimanual 2D planar reaching movements, we identify a second dataset of 37 samples from a previous study in which 19 subacute stroke patients with hemiparesis trained with a different RGS-based activity which is a variation on the well-known arcade game 'Whac-A-Mole' [18,22]. The observed FM-UE scores in this dataset are in the range [5,60], with a mean of 37 and an SD of 14. The gameplay of Whac-A-Mole requires movements on the full 2d plane so, besides the movement descriptors used for Spheroids, we define independent descriptors for the movements in the front/back direction. Subsequently, we adopt the nonlinear parametric model to combine information from different variables to estimate the impairment level as measured by the FM-UE scale. To quantify the performance of the model, we compare true and estimated FM-UE scores reporting the average error, the Pearson correlation, and the coefficient of determination.

Analysis of generalisation to other clinical scales
To explore the potential of the model to generalise for the estimation of impairment captured by other standardised clinical approaches, we considered the estimation of the CAHAI and BI scales that are available in the Spheroids dataset (Table 1). We, therefore, follow the same steps as above to analyse the convergent validity of the model in estimating CAHAI and BI scores. It is useful to anticipate here that the results for CAHAI are close to those for FM-UE. This similarity is expected as the two scales have a high relative correlations and comparable correlations to most of the variables, cf. Table 1. To be quantitative, we can consider the case in which the FM-UE score S FM i of the 'Spheroids' dataset is estimated by simply rescaling the corresponding CAHAI value S CAHAI i by 66/91; this leads to the standard error and an R 2 value of 1 − σ 2 FM,CAHAI /σ 2 FM ≃ 0.62 . Estimating BI scores from kinematic descriptors is instead generally harder. This can be explained by the lower correlation of BI scores with most of the variables (Table 1). If we estimate the FM-UE score by a simple rescaling of the BI value by 66/100 we get a standard error of and a R 2 value of 1 − σ 2 FM,BI /σ 2 FM ≃ 0.12 . So we see again that BI carries different information than FM-UE (or CAHAI). The above values give us a natural benchmark to assess the performance of the model estimation of the clinical scales.
We conducted these analyses using in-house software (SaddlePoint Signature v2.9.3). Table 2 Characteristics of the 54 samples composing the recovery dataset We select couple of sessions of the same patient with a delay of at least 16 days from the main dataset, Table 1. The last three columns report the Pearson coefficient correlation between the variable and the change in the clinical score between the two sessions. Correlations below the threshold r ∼ 0.14 are in parenthesis.
Characteristics of second-order variables for this dataset are shown in Table 6 in Appendix

Identification of variables
Given the technical constraints of the setup (e.g., sampling rate) and the nature of the training protocol (i.e., execution of horizontal reaching movements), we extracted all task-relevant kinematic measures of arm use that we could identify in the literature [13] . In total we obtained 31 variables, 12 first-order variables and 19 second-order variables obtained as functions of the first-order ones. We identify two groups of first-order variables: (1) demographic and physiological data at recruitment, variables 1-5 in Table 1, and (2) kinematic descriptors extracted for the more affected limb during training sessions, variables 6-12 in Table 1. For the evaluation of all kinematic descriptors, the first and last two minutes of each training session are discarded to avoid the interference of behaviours or events related to the start and the ending of the training session (i.e., revision of instructions, postural adjustments, exposure to the final score screen, etc.).
Second-order variables include chronicity (i.e. acute, sub-acute, and chronic categories) and the difference between the less and the more affected upper-limb in each of the quantitative first-order variables, as well as their logarithmic transformations. The descriptive statistics of second-order variables are given in Table 5 in the Appendix.
Among the above-mentioned variables, most are well-known [13]. The work area is computed as the area of the complex hull of the hand movements using standard methods (Jarvis' Algorithm [23]). The distance covered refers to the total length of the hand paths. The performance success rate is defined as the ratio of spheres intercepted over the total. However, we introduce also the new descriptors 'Smoothness' and 'Total-goal directed movement' (TGDM). Specifically, to extract information on UE motor function we introduce an original kinematic descriptor, J (σ ) , to assess the patient's movements at a specific temporal resolution, σ . This metric allows us to isolate goaloriented movements from the hand trajectory in a certain direction, assumed to be stored over time as a function f(t). For the main dataset, we consider the left/ right direction, as it is the principal axes in the movement dynamics of the 'Spheroids' protocol. We assume measurements are taken at discrete time points t i = i for i = 0, 1, . . . , T − 1 , with being the timestep (for the 'Spheroids' dataset we have ≃ 0.01 s). We define the total hand displacement during goal-oriented movements J (σ ) as the difference between the actual movements and a smoothed version of the discrete movements. The smoothed hand path f σ (t) is obtained using a Gaussian smoothing process where the parameter σ defines how smooth the new trajectory will be, see an example in Fig. 2. Therefore J (σ ) is obtained as Following this analysis method, we derive the two new variables corresponding to the value of J (σ ) at the two peaks in the σ-dependent Pearson correlation with the clinical scales, Fig. 3: 'Smoothness' in correspondence to the high-frequency peak, and 'TGDM' in correspondence to the low-frequency peak. The location of the two peaks is weakly dependent on the clinical scale considered, yet it appears to be related to the data structure: the high-frequency peak is linked to the time resolution of the data ( ≃ 0.01 s), while the low-frequency peak is related to the typical timescale of the Spheroids protocol (i.e. a set of spheres is launched every ∼ 10 s).

Models description
To combine variables for estimating clinical scores of impairment and recovery, we introduce a model that allows for the presence of noise on both the variables Z and the score S, and we hence name it a doublenoise parametric model. Its generative functional form is where θ = {β, β 0 , A, B, σ 1 , σ 2 } are the p + 5 model hyperparameters to be inferred: β (association parameters of p active variables), β 0 (parametric offset), A and B (range offsets), σ 1 and σ 2 (noise strengths). The sources of noise u (the covariate noise) and v (the score noise) are both assumed to be standard normally distributed. Since tanh(−x) = − tanh(x) , we remove the resulting parameter sign ambiguity by enforcing B ≥ A . Note that the saturation of the sigmoidal function captures the limits of the clinical scores so that the average over the noise of the estimated score S is constrained in the interval [A, B].
The probability of a particular score S, given the variables and the hyperparameters, is given by We take p(σ 1 ) ∼ e −q/σ 1 and similar for σ 2 with q being a very small number, typically of the order of the accuracy of the numerical work, e.g. q ∼ 10 −10 . These priors guarantee that the log-likelihood function is bounded from below.
We adopt Maximum A Posteriori (MAP) inference: given the dataset D = {(Z 1 , S 1 ), . . . , (Z n , S n )} , the optimal parameters θ correspond to the maximum of the posterior probability or, alternatively, to the minimum of the regularised log-likelihood function: The errors on the inferred parameters θ are estimated from the curvature of the regularised log-likelihood function at the minimum.
Two simpler models derived from Eq. 8 can be considered corresponding to having either score noise only or covariate noise only: • Score noise model, σ 1 = 0 . Taking the limit σ 1 → 0 in Eq. 8 gives • Covariate noise model, σ 2 = 0 Taking the limit σ 2 → 0 in Eq. 8 gives provided |S i − b| < a for all i, otherwise � cov (θ ) = ∞ . This implies that for each b the minimisation over a is to be carried out strictly over the open interval a > max i |S i − b|.

Identification of features
Several of the kinematic descriptors, in particular TGDM and the difficulty level reached (difficulty) correlate highly with all clinical scales, cf. Table 1  All kinematic variables show consistent inter-variables correlation, cf. Fig. 4. In particular, the maximum speed achieved during reaching (max-sp) correlates highly with the work area (w-area) ( r = 0.76 , p < 0.0001 ). The variable age correlates negatively with TGDM ( r = −0.31 , p < 0.0001 ) and difficulty ( r = −0.24 , p = .00098 ), while it is uncorrelated with FM-UE and CAHAI scores, but not to BI ( r = −0.30 , p < 0.0001 ). This suggests that agerelated technology proficiency may affect the kinematic descriptors. Nevertheless, this effect is relatively weak, i.e., the correlations of difficulty and TGDM with the clinical scores are significantly higher than the ones with age.
To better understand the meaning of the two variables obtained with the smoothing techniques (smoothness and TGDM), we also extract finite time-windowed variables for comparison (64 s for the range variable; 14 s for the speed). Specifically, we compute the maximum range of movement (left/right direction) within overlapping time windows of size σ , where σ is again fixed by the condition of maximum correlation with the clinical scale of interest, and we average the measurements across all possible windows along the whole RGS session. The resulting values show a very high correlation with the TGDM variable ( r = 0.98 , p < 0.01 ) and show very similar Pearson coefficients with the FM-UE ( r = 0.54 , p < 0.01 ), CAHAI ( r = 0.57 , p < 0.01 ) and BI ( r = 0.44 , p < 0.01 ) clinical scales. These results suggest that the TGDM is capturing information about the typical range of movement associated with the scenario events occurring within relevant time windows (i.e. about 10 seconds in the Spheroids scenario). Following the same method, we extract time-windowed maximum reaching speed. The resulting values show a very high correlation with the smoothness variable ( r = 0.87 , p < 0.01 ) together with very similar Pearson coefficients with the FM-UE ( r = 0.42 , p < 0.01 ), CAHAI ( r = 0.39 , p < 0.01 ) and BI ( r = 0.21 , p < 0.01 ) clinical scales. These results suggest that smoothness is linked to the ability of the patient to perform fast movements to complete the RGS tasks.

Convergent validity: estimating impairment
We found that the three models implemented (double noise model, score noise model, covariate noise model) offer comparable performance in estimating the FM-UE scale on the dataset in Table 1. The typical error in the final estimate is of order ∼ 10 , while the inferred noise score error σ 2 is typically ∼ 0.1 . This result supports that the score noise has a low impact, therefore we select the covariate noise model to decrease the number of parameters to be inferred.
The most relevant variables for the estimate of the FM-UE score are difficulty and Diff. TGDM (difference in TGDM between the non-paretic arm and the paretic arm). The total number of active variables is 6 and they are shown in Table 3. We have enforced the presence of Diff. distance covered in the active variable set since it is relevant in the estimate of clinical change, Tables 2 and 6. Note that the resulting active variable set does not contain the patient's baseline variables and the estimation of scores on clinical scales are solely obtained from RGSderived data. This also means that estimates made by this model for different RGS sessions of the same patient are considered independent measurements. In total, we use 10 parameters (6 association parameters and 4 hyperparameters) inferred from the 191 sessions.

Spheroids protocol
Note that for FM-UE change the value of a variable refers to the difference between the two sessions. The values listed here refer to the normalised variables, so that the values of the different β s are directly comparable. we obtain that for the active variable set Table 3  To further exemplify the features of the above model, we compare with the simple linear model S = β · Z + β 0 + σ u with the same covariates as in Table 3. The inferred values of β for normalised variables are in this case (same order as in Table 3 14 . Interestingly, the range of estimated FM scores in the dataset is now [16,84], with 46 scores (over 191) estimated above 66. This result exemplifies the problem of mapping unbounded variables on a finite range and illustrates the need for a nonlinear function as in Eq. 5.

Robustness: test-retest reliability
In Fig. 6 we show the FM-UE estimations obtained by the model for the test and retest sets. The ICC between the 'test' and 'retest' is 0.89, which is at the high end of the 'Good' agreement measure according to the Koo and Li guidelines [24]. In Fig. 6 we also show the interval defined by the standard error of the regression E ≃ 12.7 . We measure an average retest error i (S test i − S retest i ) 2 /N equal to 5.9. Finally, we estimate a minimally detectable change (MDC) [25] of 11.6 points.
These results support the internal consistency of this assessment method to estimate the clinical scores of the patients.

External validity: estimating the change in impairment
We observe that the Pearson correlation between change in FM-UE ( FM-UE ) and change in CAHAI ( CAHAI ) is r = 0.68 , Pearson between FM-UE and change in BI ( BI ) is r = 0.67 , while the Pearson correlation between CAHAI and BI is r = 0.72. For each sample, we consider now as variables the change (between the two sessions) of the original variables. By comparing it to a randomised outcome distribution, we identify a threshold of r ∼ 0.14 for the Pearson correlations variable-score. Several of the variables correlate highly with the change in all three clinical scores, cf.  estimating a single session's score, the variables age and chronic correlate more with the outcome when predicting the score change. The initial scores (the clinical scores at the first session) have a high correlation with change because of ceiling effects. In Fig. 7 we show the correlogram of all the variables and clinical scales. We observe that generally the correlations between kinematic descriptors in a single session (Fig. 4) are preserved also when considering the change between sessions; for example, the highest inter-variables correlation is for Change in w-area and Change in max-sp at ( r = 0.87 , p < 0.0001).
We compare the true and predicted FM-UE in Fig. 8. The Pearson correlation between true FM-UE and predicted FM-UE is 0.76. The value of the coefficient of determination R 2 is 0.57. These results show that the model has a good external validity to clinical change with a precision comparable to the one obtained for the crosssectional FM-UE score.

Sensitivity: estimating recovery
We obtain that the model correctly predicts recovery in 36 over 38 of the cases with FM-UE ≥ 4 , indicating a TPR of 95%.

Task generalisation
For the task generalisation analysis, we consider a dataset composed of 37 RGS sessions from 19 hemiparetic participants that trained in a rehabilitation protocol derived from Whac-A-Mole [18].
Unlike the Spheroids protocol, the gameplay of Whac-A-Mole requires movements on the full 2d plane. In response, we utilise the smoothing technique in both cardinal axes of the task. i.e. front/back and left/right directions. Pearson correlations between the clinical scales and the variable J (σ ) reveal a similar pattern to the one observed in the Spheroids scenario, with a peak of the Pearson coefficients at about 1s corresponding to the variable TGDM in each direction ( Fig. 13 in Appendix). The location of the main peak is again close to the typical timescale of the protocol. For the FM-UE score, the highest Pearson coefficient is observed in the frontal direction ( r = 0.54 for σ = 1.3s ); the lateral hand displacement peak is ( r = 0.50 at σ = 1.1s).
When predicting clinical scales, we use now only 2 active variables in order to limit overfitting: the variables TGDMfb (Total-goal directed movement for front/back direction) and TGDMlr (Total-goal directed movement for left/right direction). We then infer 6 parameters (2 association parameters + 4 hyperparameters) from the 37 RGS sessions. The two association parameters are (for normalised variables) β TGDMfb = 0.15(0.13) and β TGDMlr = 0.18(0.14) . The hyperparameters of the model that predicts FM-UE for the 'Whac-A-Mole' scenario are given by a = 31.9(3.5) , b = 31.0(2.5) , σ 1 = 0.49(0.11) , and β 0 = 0.21(0.13).  (Table 2), using the covariate noise model with association parameters given in Table 4 Page 12 of 17 Ballester et al. Journal of NeuroEngineering and Rehabilitation (2021) 18:186 The FM-UE estimates are shown against the true values in Fig. 9. The Pearson correlation between true FM-UE and predicted FM-UE is 0.63, the average error is E ∼ 11.2 , and the value of the coefficient of determination R 2 is 0.39.

Generalisation to CAHAI and BI
Overall, the CAHAI scale has similar properties to FM-UE in relation to the kinematic descriptors, Fig. 4. To stress the generalisation potential of the model, we can then adopt the same model introduced in Table 3 for estimating the FM-UE and CAHAI scores. The association parameters for CAHAI are reported in Table 4 Table 4 The association parameters β for estimating the CAHAI score, Spheroids scenario The active variables are the same as in Table 3. The values refer to the normalised variables so that the values of the different β s are directly comparable.  (Table 1), using the covariate noise model with association parameters given in Table 4 (Table 1). We use the covariate noise model with association parameters given in Table 3 for FM-UE and Table 4 for CAHAI In Fig. 11 we compare FM-UE and CAHAI, both for the true scores and the predicted scores. We note that the relationship between FM and CAHAI is generally well preserved in the estimates; for example, the Pearson between FM-UE and CAHAI scores is r = 0.89 for true values and r = 0.88 for estimates. The fact that the variability in the true FM-UE vs true CAHAI is seemingly comparable to the one in the model, reinforces the idea that the precision we achieve is similar to the one of estimating FM-UE directly from CAHAI, as we estimated using Eq. 1.
Finally, we observe that the model that predicts FM-UE and CAHAI scores does not work well for the BI. Most kinematics variables have a significantly smaller correlation with the BI (in particular work area, TGDM, smoothness) while baseline information and clinical history of the patient are comparatively more relevant (for example the patient's age, Table 1 and Fig. 4). The active variable set for BI is composed of 5 variables ( β values for normalised variables): age ( β = −0.21(0.15) ), sessions completed so far ( β = 0.24(0.19) ), difficulty ( β = 1.21(0.72) ), Log. time since stroke ( β = 0.14(0.14) ), Log. Difficulty ( β = −0.91(0.70) ). Using a double noise model, Eq. 8, we infer the hyperparameters a = 51.56(0.10) , b = 49.22(0.12) , σ 1 = 0.5948(0.0072) , σ 2 = 0.178(0.027) , and β 0 = 1.025(0.011) . Comparing true and predicted BI scores, we measure an average standard error of E BI ∼ 16.8 , a Pearson correlation true-prediction of 0.62, and a coefficient of determination R 2 of 0.35. This accuracy is comparable to the one achieved by the models for FM-UE and CAHAI scores. Nevertheless, the dataset Table 1 is very unbalanced towards high BI scores (mean score 80, with only 3 samples with a score below 25), so may not generalise well to homogeneous unseen BI data (i.e., the precision for low scores may be relatively poor).

Discussion and conclusion
Our understanding of post-stroke motor recovery depends on our capacity to evaluate and characterise impairment and disability. Current standardised assessment methods are mostly subjective and present relevant unsystematic variability due to differences in the evaluators' training, lack of systematicity in the administration of the assessments, and often are excessively focused on one single aspect of the impairment and/or disability.
Different rehabilitation approaches show a preference for using (and even targeting) specific assessment methods for the evaluation of their therapeutic efficacy, and often these methods have been developed by the same team of authors. For example, the effectiveness of Constraint-Induced Movement Therapy [26] is usually evaluated using the Wolf Motor Function Test [27] and the Motor Analog Scale [28], while the effectivity of occupational therapy has been frequently assessed using the Barthel Index [29] and the Functional Independence Measure [30]. There is an urgent need to establish alternative methods for a common evaluation protocol and characterisation of the hemiparesis phenotype, thus allowing us to identify specific impairment features that could advance our understanding of the recovery dynamics and guide the design of effective rehabilitation therapies. In pursuing this objective, we have conducted a careful analysis of the kinematic data from the upperextremities of 191 individuals with post-stroke hemiparesis, and we have constructed a model for estimating impairment and recovery. Our results reveal a new digital biomarker of upper-limbs motor impairment, the Total Goal-Directed Movement (TGDM), which relates to the patients' range of motion during the execution of meaningful goal-oriented reaching movements. The TGDM strongly correlates with the level of impairment captured by the FM-UE and the level of disability captured by the CAHAI, and also carries relevant information about the patients' progress, showing a high correlations with the magnitudes of improvement and deterioration estimated by both scales. The model presents high external validity of impairment estimates ( R 2 : 0.38), robustness (testretest reliability) (ICC: 0.89), external validity of recovery estimates ( R 2 : 0.57), sensitivity (TPR: 95%) and task generalisation ( R 2 : 0.39). Despite the high heterogeneity of the sample and the high level of noise of the selected kinematic parameters, the model's accuracy to estimate the FM-UE is comparable to other standardised clinical scales, such as CAHAI. These results are especially interesting given the currently limited evidence about the external validity of the change of kinematic outcome measures of reaching performance in people with hemiparesis after stroke [31]. According to a recent systematic review on the clinimetric properties of kinematic upper limb assessments [13], only two papers captured external validity of change (i.e., ability to capture longitudinal changes in the measured construct), and just nine parameters showed enough evidence to estimate recovery (i.e., number of velocity peaks, trunk displacement, task/ movement time). The quality of evidence however was very low for all metrics. Further, our results outperform previous methods to estimate the level of impairment [15,16] in two fundamental aspects: (1) it shows robustness to compensation and is resistant to using explicit strategies to boost performance, and (2) it exclusively relies on metrics collected under unsupervised rehabilitation sessions. Although current recommendations point out that wearables with integrated Inertial Measurement Units and vision-based tracking systems are insufficient to measure the quality of movement and improvement in motor function, our findings, together with the growing Although these results support that our method is able to capture improved sensorimotor control during gameplay (i.e., motor synergies), we cannot discard the contribution of task-related learning processes to the overall change measured. In addition, it is important to notice that the RGS setup did not provide antigravity support, thus the different capabilities of the participants to elevate the elbow and wrist may explain part of the unsystematic variability we observe in the model's performance.
The relevance of our results is emphasised by their consistency across clinimetric properties and by their generalisation potential, relying on a large and heterogeneous dataset of patients at different stages post-stroke. Notice however that its generalisation potential has been evaluated in similar setups using vision-based systems (camera-or depth-sensor-based) tracking the joints of the upper limbs during the performance of planar reaching arm movements. We believe that the applicability of the TGDM and its derived models to evaluate impairment and motor recovery is promising for a number of reasons: (1) it can be derived from unimanual displacements executed in the horizontal plane, (2) it does generalise to other tasks involving two-dimensional horizontal reaching movements towards targets, and (3) it can be Table 5 Characteristics of the secondary variables for the 191 samples composing the main dataset (obtained from first-order  variables, Table 1) The r columns refer to the Pearson correlation coefficients with the FM-UE, CAHAI, and BI clinical scales, respectively. Correlations below the threshold r ∼ 0.081 ( Fig. 12 in Appendix) are in parenthesis. From the time since stroke, we obtain the categories Acute (5-90 days), Sub-acute (3-12 months), and Chronic (over 1 year). The variables obtained directly from the RGS system log files are in Italic type. The Diff. variables are obtained as the difference between the value observed for the less affected arm and the value for the more affected one. The Log. variables are obtained as the natural logarithm of the corresponding first-order variables   Tables 1 and 5. The light blue line is the distribution obtained for random outcomes In Table 5 we report the descriptive statistics of the secondary variables (functions of primary variables in Table 1) for dataset Spheroids.
The distribution of the Pearson correlation coefficients of the FM-UE score with all the variables in Tables 1  and 5 is shown in Fig. 12. This is compared with the distribution obtained with the randomised outcome, whose standard deviation is r ≃ 0.081.
In Table 6 we report the descriptive statistics of the secondary variables (functions of primary variables in Table 2) for the recovery dataset (change of variables and clinical scales between two RGS sessions of the same patient with a delay of at least 16 days).
In Fig. 13 we show the Pearson's r between J (σ ) Eq. 4 and the clinical scales as a function of the timescale σ for the database Whac-A-Mole, Sec. Task Generalisation. This is a dataset composed of 37 RGS sessions with a limited range of the clinical scales: FM-UE range of observations [5,60], mean 37, SD 14; CAHAI range of observations [7,49], mean 32, SD 15; BI range of observations [48, 100], mean 85, SD 13. The Whac-A-Mole scenario requires movements in all 2D place (unlike Spheroids scenario, Table 1), so we define two independent J (σ ) : one associated with the front/back trajectory and one associated with the lateral (left/ right) trajectory. In Fig. 13 we show that in both directions there is a peak in the Pearson's r at about σ = 1 s for all three clinical scales. In correspondence with these two peaks, we define two variables: TGDMfb (Total-goal directed movement in front/back direction) and TGDMlr (Total-goal directed movement in the left-right direction). We stress that the timescale of the peak is roughly equivalent to the timescale of the gameplay of the Whac-A-Mole scenario. Finally, we note that for the same analysis done in the Spheroids scenario we observe two clear peaks in the lateral direction (Fig. 3). In Whac-A-Mole instead, the high-frequency peak is not clearly visible. One factor that may affect this difference is that the gameplay of Whac-A-Mole is faster than Spheroids (roughly 1 s instead of 10 s) so that the separation between the two potential peaks is smaller. Other factors that may influence the absence of the second peak are the fact that the gameplay is 2D (so Table 6 Characteristics of the secondary variables for the 54 samples composing the recovery dataset (obtained from first-order variables, Table 2) We select couple of sessions of the same patient with a delay of at least 16 days from the main dataset, Table 1. The r columns refer to the Pearson correlation coefficients with the FM-UE, CAHAI, and BI clinical scales, respectively. Correlations below the threshold r ∼ 0.14 are in parenthesis. The variables obtained directly from RGS system log files are in Italic type. The Diff. variables are obtained as the difference between the value observed for the less affected arm and the value for the more affected one. The Log. variables are obtained as the natural logarithm of the corresponding first-order variables  . For the FM-UE score, the peaks are at ( r = 0.54 for σ = 1.3s ) for the frontal direction and at ( r = 0.50 at σ = 1.1s ) for the lateral direction. For the CAHAI score, the peaks are at ( r = 0.44 for σ = 0.78s ) for the frontal direction and at ( r = 0.43 at σ = 0.31s ) for the lateral direction. For the BI, the peaks are at ( r = 0.44 for σ = 1.2s ) for the frontal direction and at ( r = 0.49 at σ = 1.1s ) for the lateral direction