Enhancing fall risk assessment: instrumenting vision with deep learning during walks

Moore, Jason; Catena, Robert; Fournier, Lisa; Jamali, Pegah; McMeekin, Peter; Stuart, Samuel; Walker, Richard; Salisbury, Thomas; Godfrey, Alan

doi:10.1186/s12984-024-01400-2

Research
Open access
Published: 22 June 2024

Journal of NeuroEngineering and Rehabilitation volume 21, Article number: 106 (2024)

Abstract

Background

Falls are common in a range of clinical cohorts, where routine risk assessment often comprises subjective visual observation only. Typically, observational assessment involves evaluation of an individual’s gait during scripted walking protocols within a lab to identify deficits that potentially increase fall risk, but subtle deficits may not be (readily) observable. Therefore, objective approaches (e.g., inertial measurement units, IMUs) are useful for quantifying high resolution gait characteristics, enabling more informed fall risk assessment by capturing subtle deficits. However, IMU-based gait instrumentation alone is limited, failing to consider participant behaviour and details within the environment (e.g., obstacles). Video-based eye-tracking glasses may provide additional insight into fall risk, clarifying how people traverse environments based on head and eye movements. Recording head and eye movements can provide insights into how the allocation of visual attention to environmental stimuli influences successful navigation around obstacles. Yet, manual review of video data to evaluate head and eye movements is time-consuming and subjective. An automated approach is needed but none currently exists. This paper proposes a deep learning-based object detection algorithm (VAFRA) to instrument vision and video data during walks, complementing instrumented gait.

Method

The approach automatically labels video data captured in a gait lab to assess visual attention and details of the environment. The proposed algorithm uses a YoloV8 model trained on a novel lab-based dataset.

Results

VAFRA achieved excellent evaluation metrics (0.93 mAP50), identifying and localizing static objects (e.g., obstacles in the walking path) with an average accuracy of 93%. Similarly, a U-Net based track/path segmentation model achieved good metrics (IoU 0.82), suggesting that the predicted tracks (i.e., walking paths) align closely with the actual track, with an overlap of 82%. Notably, both models achieved these metrics while processing at real-time speeds, demonstrating efficiency and effectiveness for pragmatic applications.

Conclusion

The instrumented approach improves the efficiency and accuracy of fall risk assessment by evaluating the visual allocation of attention (i.e., information about when and where a person is attending) during navigation, improving the breadth of instrumentation in this area. VAFRA could be used to instrument vision and better inform fall risk assessment by providing behaviour and context data to complement instrumented (e.g., IMU) data during gait tasks. That may have notable (e.g., personalized) rehabilitation implications across a wide range of clinical cohorts where poor gait and increased fall risk are common.

Introduction

Falls can lead to loss of independence and even death [1, 2]. Identifying those at risk of falling is an important clinical task often conducted in e.g., those with visual impairment [3], and the elderly [4,5,6]. Equally, fall risk assessment is of notable importance and pragmatically useful in people with a movement disorder, such as Parkinson’s disease (PD) [7,8,9] or stroke [10,11,12,13], due to observable functional deficits in motor control. Assessing fall risk is also important during pregnancy [14], where a third of pregnant women may fall [15]. In fact, there is a significant increase in falls from pre-pregnancy to the 3rd trimester which cannot be fully explained by morphological [16] or biomechanical [17] changes.

A comprehensive fall risk assessment is multifactorial and a time-consuming process including but not limited to medication review, cognitive screening, detailing a history of falls, as well as evaluating gait, balance [18], and environmental hazards or hazardous activities that have been documented in some cases to be responsible for 50% of falls [19]. For timeliness in many settings, assessing gait alone is usually conducted to evaluate intrinsic fall risk [20]. That is convenient as gait is a good marker of global health [21] and fundamental to many activities of daily life [1]. Consequently, a gait assessment with positive outcomes from subjective evaluation (by an assessor) provides insight into the patient’s independence and ability to ambulate with minimal fall risk. As described, an assessment is typically conducted by manual observation alone, where an assessor examines a person’s gait during a scripted task (i.e., walking protocol). Often, a protocol may include navigating (walking around or over) obstacles [22,23,24,25], deliberately challenging the person by increasing gait demands [26]. Yet, that also places extra burden on the assessor, challenging them to carefully observe the person’s gait during a more complex task. Instrumentation is needed to optimize assessment protocols while providing high resolution objective fall risk data.

The integration of digital technology as an objective standard in fall risk is not routine. While digital tools may provide clinicians with high-resolution data to potentially aid in determining a patient’s fall risk, there is still ongoing work to be done in understanding their full utility and developing appropriate methods. In recent years, technology has matured to include a wide selection of digital tools. Of course, 3D motion capture systems are a perceived gold/reference standard for human movement analysis, but they lack practicality and cannot readily be deployed in habitual settings. Moreover, applying reflective markers is time-consuming. In contrast, wearable devices (i.e., inertial measurement units, IMUs) are quickly attached and provide clinically relevant gait characteristics to a millisecond resolution in any environment [27,28,29,30].

An objective gait assessment to inform fall risk is usually conducted within a laboratory with a single IMU on the lower back [30]. Typically, participants are then asked to undertake a protocol representing walking challenges in daily life [31, 32], like obstacle crossing [25]. However, a key IMU limitation is the provision of inertial gait data only without any insights into navigating behavior and visual attention allocation to environmental/extrinsic details. Accordingly, there is no absolute clarity to understand how gait and fall risk is influenced by other intrinsic (e.g., visual attention) or extrinsic (e.g., obstacles) factors. For example, a comprehensive instrumented assessment would better understand how those being assessed allocate visual attention along their walking path for safe navigation while also determining the role of attention when e.g., peripheral obstacles cause a distraction. Supplementing IMU data with video data from video-based eye tracking wearable glasses could better define intrinsic and extrinsic factors, providing a contemporary and pragmatic approach to fall risk assessment with easily attached wearables. (Indeed, eye tracking offers an avenue for exploring neurocognitive changes as a reason for increased falls incidence.)

Commercial eye tracking glasses capture high quality video data, often in the standardized MP4 format with a resolution of 1920 × 1080. The video contains a superimposed crosshair to display eye location. Accordingly, videos contain data on the general environment and the specific objects the wearer is looking at, but processing eye-tracker videos is extremely time consuming and needs to be automated to allow clinical application [33]. Including eye tracking (to identify an object/obstacles of interest) with IMU data during a range of simulated free-living tasks (e.g., obstacle crossing) would provide a novel approach for simultaneously instrumenting visual attention during gait within a fall risk assessment. To accomplish this, a suitable methodology to instrument visual attention from video data must first be established as none currently exists. Accordingly, a novel vision-aided fall risk assessment (VAFRA) is proposed in this study.

Instrumenting vision: a computer vision approach

Video-based eye trackers can help understand the allocation of visual attention (and visual function) [34, 35]. Attention relates to the ability to focus on a task and, within the context of gait assessment for fall risk, could relate to when and how long one fixates on a hazardous obstacle [36]. It has been shown that duration of fixation, and therefore attention, on an obstacle is linked to avoidance or clearance and risk of tripping [37]. Quantifying fixation time on an obstacle can reveal the relationship between attention, obstacle avoidance or clearance, and fall risk. During navigation, the visual system provides critical information about factors like object depth, one’s heading direction, and time to contact with an object, but it also includes visual acuity, contrast sensitivity, depth perception, and visual field integrity, among others [38]. Those aspects of visual function contribute to detecting and perceiving extrinsic/environmental cues, including obstacles, during walks.

Eye trackers can help assess attention (and visual function) by quantifying gaze patterns, saccades, and fixations during walks with obstacle avoidance or crossing tasks [39]. Those with compromised visual function may exhibit inadequate gaze patterns [40] that can lead to inadequate attention allocation to obstacles which, in turn, can increase the risk of tripping or falling [41]. Conversely, those with intact visual function are more likely to allocate appropriate attention to obstacles and hence successfully traverse around them, reducing fall risk. Despite the promising opportunities with eye-trackers, there are two key challenges to be overcome: (i) the time-consuming review of video data with manual labelling of frames [33, 42] and (ii) objective identification of where the person is looking, and hence attending. The latter is complicated by any extrinsic obstacle which may impact walking and/or other items in the environment. Often, labs strive to maintain a clean environment by removing non-critical equipment but often that is not practical [33]. This is especially true in ecologically valid situations (i.e., everyday life), where distractions are common and can significantly contribute to fall risk. Moreover, many settings where fall risk assessment occurs (e.g., inpatient setting) have similar challenges to minimize distractions while maintaining optimal conditions.

An automated approach is necessary to analyze attention and visual function using video eye-tracking within the context of fall risk assessment, complementing the developments in IMU data processing for fall risk evaluation [30]. In fact, many populations do not exhibit their true motor deficits until attention is divided [43, 44] and other real-world motor tasks like object avoidance can be best measured through the person’s visual recognition [45]. Better (instrumented) methods are needed to measure visual attention within real-world motor tasks. Artificial intelligence (AI) methodologies within computer vision (CV) enable automated instrumental approaches with e.g., object detection. CV algorithms label many video frames in a timely manner, classifying environmental contexts while informing attentional behaviors from eye tracking i.e., automating eye location overlapping with extrinsic factors like obstacles, distractions and/or hazards. Importantly there are attainable contemporary CV methodologies that are state of the art and can be tailored to each specific use case e.g., YoloV8 [46].

To date, instrumented gait has received extensive research focus but fall risk assessment through gait alone, although useful, remains limited. Here we propose a deep learning-based object detection algorithm (VAFRA) for the novel instrumentation of allocation of visual attention (gaze) and contextualization of video data to better inform fall risk assessment within a lab during an obstacle crossing based continuous walk. The work is important as it presents a novel approach to better inform lab-based assessment of objective fall risk to advance approaches to rehabilitation via contemporary technologies. Specifically, we suggest pragmatic models to help instrument visual attention components during walks to better inform how video and eye-tracking can be used during fall risk assessments. The technology employed can capture and analyze gaze patterns and environmental interactions in a manner that is not dependent on the visual acuity of the user.

The paper is structured as follows. In the Materials and methods section, we provide an overview of the participants recruited, the lab-based protocol and the dataset used to train a model. That section also details (i) the mechanics of detecting eye location and overlap between objects, (ii) determining object row, and (iii) a U-Net approach for creating segmentation masks for the walking tracks/paths. The Results section provides preliminary statistics comparing foot clearance during attention vs. distraction as defined by the model. The Discussion provides insights into the instrumented approach and its limitations, and highlights future application to assess fall risk in cohorts needing this proposed approach.

Materials and methods

Participants

This research was approved for human subjects’ study by the Washington State University institutional (ethics) review board (IRB# 17442). All participants provided written informed consent. The study recruited a total of twenty healthy pregnant women (29.9 years old ± 4.9 years, 66.0 kg ± 10.5 kg, 166.0 cm ± 6.7 cm). All female participants were in approximately their 13th week of gestation (± 1 week). Recruitment was conducted through flyers distributed during their initial obstetrician visit. They volunteered by calling into an enrollment person/researcher for screening. They were excluded from participating if they were considered a high-risk pregnancy, unable to walk unassisted, had a cognitive inability to read and understand instructions, or if they could not commit to longitudinal testing for the entirety of the pregnancy.

Protocol

Participants were enrolled as part of a larger follow-up (longitudinal) study examining fall risk in pregnancy. Each follow-up timepoint (n = 5) contained a wide testing battery and each lasted approx. 60 min. Accordingly, about 100 h of video data were accumulated. During each testing session, participants wore eye tracking glasses (Tobii Pro2, Stockholm https://www.tobii.com), which captured environment/lab video data at 24 frames per second (fps) from test start to finish along with the participant’s gaze at each frame (1920 × 1080 px). The lab comprised two walking tracks/paths within a continuous loop with intentionally placed hurdles and distractors (Fig. 1). Six fully visible white PVC pipe obstacle hurdles were placed 3 m apart along the 12 m long, 0.92 m wide, two-sided black walking path. Some hurdles were always set to 10% body height, while surrounding hurdles were randomly assigned to 5%, 7.5%, 10%, and 12.5% of body height of the tested participant.

Dataset

Video data spanning the full 60-min per participant were utilized to train the proposed model. This was due to the frequent examples of more obscure angles and head positions captured when the participant was not performing a direct 2-min walk test and may for example be standing and talking to a researcher between tasks while looking around the lab setting. That aided the model to generalize to more diverse scenarios like rare head angles during the test. A key advantage to the data being captured within a controlled setting is homogeneity. Specifically, within all videos captured from participants, variables such as lighting, hurdles and video quality remain similar. For use within the produced dataset the frames of the captured footage were extracted and labelled using a Python-based tool [47] with example classes being: hurdle, tennis ball, animate distractor, bucket.

The labelling process resulted in a dataset consisting of 987 labelled frames and 18 classes across all frames. These classes represent a variety of objects or obstacles that are pertinent to determining whether a participant is paying attention (i.e., to obstacles along the path) or distracted. Of these 18 classes, 3 are defined as “core objects” (tennis ball, support, and hurdle) as these are the direct objects and points of investigation for the task. The labelled information was extracted using the inbuilt functionality of the label producing software. The images folder contained the full resolution raw images, with the accompanying labelled information stored in the annotations folder in .txt format. Annotations contained a line for each labelled object within the scene, holding the object class id and the object bounding box coordinates in the x mid, y mid, width, height format.
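To illustrate the annotation format described above, the following minimal sketch parses one annotation line and converts the centre-based box into pixel corner coordinates. It assumes the coordinates are normalised to the range 0 to 1 (the usual Yolo convention, which is not stated explicitly here); the function name `parse_annotation_line`, the `Box` type and the example values are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    class_id: int
    x1: float
    y1: float
    x2: float
    y2: float


def parse_annotation_line(line: str, img_w: int = 1920, img_h: int = 1080) -> Box:
    """Convert 'class x_mid y_mid width height' (assumed normalised) to pixel corners."""
    class_id, x_mid, y_mid, w, h = line.split()
    # Scale the normalised values to the frame size (1920 x 1080 px in this study).
    x_mid, w = float(x_mid) * img_w, float(w) * img_w
    y_mid, h = float(y_mid) * img_h, float(h) * img_h
    return Box(int(class_id), x_mid - w / 2, y_mid - h / 2, x_mid + w / 2, y_mid + h / 2)


# Hypothetical annotation line: class 3, centred in the frame.
box = parse_annotation_line("3 0.5 0.5 0.25 0.5")
```

The corner format produced here is the same x1, y1, x2, y2 layout used later for overlap detection.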

Object detection model (ODM)

Model implementation was performed using the Python-based deep learning library PyTorch and the Ultralytics suite of available Yolo algorithms. The final object detection algorithm used was the latest YoloV8 network [46]. That version was chosen as it has been shown to give more accurate results on images and video than YoloV7 [48], with a reported speed of 1.3 ms per image at 640 × 640 (the size used in this study). This architecture (Fig. 2) takes the image as input and feeds it through a series of convolution, pooling and batch norm layers before outputting predicted classes and bounding box coordinates based on the extracted features.

The output from the model was then further enhanced using non-maximum suppression, which removes duplicated bounding boxes and reduces noise in detection based on intersection over union (IoU), i.e., the degree of overlap between candidate bounding boxes. Given the minute pixel data required to accurately classify important obstacles within the track, the model was trained on images resized to 640 × 640 px to retain ample image information while balancing performance. This requirement was further supported by distributed focal loss (DFL), a custom loss function that improves a model’s ability to identify small objects within images (Eqs. 1–4), which was a core requirement for the dataset, and that also helps with class imbalances within the training data.
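The role of IoU in non-maximum suppression can be sketched as follows. This is a generic greedy formulation for illustration only, not the Ultralytics implementation; the function names and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over (box, score) pairs.

    Boxes are visited from highest to lowest score; a box is kept only if it
    does not overlap an already kept box by more than iou_threshold (assumed value).
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Two near-duplicate boxes and one separate box (hypothetical values).
detections = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8), ((50, 50, 60, 60), 0.7)]
kept = non_max_suppression(detections)
```

In this example the second box overlaps the first with an IoU of roughly 0.68 and is discarded, leaving one box per object.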

$$s_{j} = \frac{1}{{N_{j} }}\mathop \sum \limits_{i} \left[ {t_{i,j} \times \sqrt {\frac{{w_{j} }}{{h_{i, j} }}} } \right]$$

(1)

$$w_{j} = \frac{s}{{h_{j} }} + \epsilon$$

(2)

$$\mathrm{focal\_loss}_{i,j} = \mathrm{focal\_loss}_{i} \times \sqrt {\frac{{w_{j} }}{{h_{i, j} }}}$$

(3)

$$DFL\left( {p, y} \right) = - \alpha \left( {\frac{{n_{y} }}{n}} \right)(1 - p)^{\gamma } \log p$$

(4)

Equations 1–4: DFL loss, where s is the average object size for the batch, N_j is the number of anchor boxes in the batch with ground-truth label j, h_{i,j} is the height of the ground-truth bounding box for anchor box i with label j, epsilon is a small constant to avoid division by zero, and focal_loss_{i,j} is the focal loss for anchor box i with label j. The DFL loss is computed for each class j separately, and the final DFL loss is the sum of the DFL losses for all classes.

The training process was conducted within a Windows based Python 3.8 environment, on a system containing an RTX 3070 graphics card, Ryzen 7 3700X CPU and 24 GB of RAM and took ~ 3 h to train over 100 epochs. The dataset was split using a pragmatic 80:20 train-test ratio outputting evaluation metrics across both training and validation examples: train/box_loss, train/cls_loss, train/dfl_loss, precision, recall, mean average precision (mAP50 and mAP50-95), val/box_loss, val/cls_loss and val/dfl_loss.
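For illustration, an 80:20 split of the 987 labelled frames could be drawn as in the sketch below. How the split was actually drawn (e.g., randomly by frame or by participant) is not stated, so the random shuffle, the fixed seed and the function name `split_dataset` are assumptions.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Frame indices stand in for the 987 labelled images.
train_frames, test_frames = split_dataset(range(987))
```

With 987 frames this yields 789 training and 198 held-out frames.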

ODM: eye location

Classification of the objects provided context to the video data but, when considered in isolation, provided little meaningful information. To automate the detection of where visual attention is directed, a mechanism is required to relate the detected objects to eye location (Fig. 3). Within the model an algorithm was implemented to detect overlaps between the bounding box coordinates using the x1, y1, x2, y2 format. Algorithm 1 outlines the process: a for loop runs over each detected object, passing its coordinates as arguments to the overlap detection function. Those coordinates are then compared with the stored eye tracker coordinates, returning true if an overlap is detected.
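The overlap check described above can be sketched as follows. This is an illustrative reading of Algorithm 1, not the authors’ code: it assumes the eye location is stored as a small box around the crosshair, and the function names and coordinates are hypothetical.

```python
def boxes_overlap(obj, eye):
    """True if two (x1, y1, x2, y2) boxes share any area."""
    return obj[0] < eye[2] and eye[0] < obj[2] and obj[1] < eye[3] and eye[1] < obj[3]


def attended_objects(detections, eye_box):
    """Loop over detected objects and return the labels overlapping the eye location."""
    return [label for label, box in detections if boxes_overlap(box, eye_box)]


# Hypothetical detections for one 1920 x 1080 frame.
detections = [("hurdle", (800, 600, 1100, 700)), ("bucket", (100, 200, 250, 400))]
eye_box = (940, 620, 980, 660)  # assumed box around the crosshair
looked_at = attended_objects(detections, eye_box)
```

Here only the hurdle overlaps the eye location, so it would be the object recorded for this frame.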

ODM: object row mechanic

Whilst the proposed lab setup mimics that of an optimal walking path for gait assessment [25], it also includes potential distractors and hurdles along the path. The spatial context of these distractors and hurdles is vitally important for inclusion within the model, given their clinical significance and implications for assessing a participant’s visuospatial attention and ability to navigate environmental obstacles. To achieve this, detection of what the participant is looking at is performed first, followed by classification of the row the object belongs to, with this context appended to the CSV file (e.g., tennis ball row 2) upon completion.

When navigating the hurdles, a participant will encounter up to three sequential hurdles along each track/path, and it is important to understand which row the participant’s attention is on. For example, if it is known that the participant is looking at the immediate hurdle, it can be inferred they are paying attention to the hurdle and planning safe crossing (no contact). This assumption is based on typical gaze behavior observed in most individuals. However, we acknowledge that there may be exceptions, particularly among experienced participants or those familiar with the path. It can also be inferred that if the participant is not paying attention to the nearest hurdle before crossing, they are distracted. Here, across all scenes involving obstacle crossing, the same core objects are present and organized along the walking path into rows (Fig. 4): (i) a set of tennis balls (at each side of the walking track and used by the participant to judge horizontal opening size, defined as the horizontal distance between the two balls), (ii) supports (used to hold up the tennis balls, which can also be used by the participant to judge opening size) and (iii) a hurdle (the obstacle to be navigated by the participant).

Given the consistent spatial relationship of these objects, the vertical pixel coordinates can be used to cluster the objects into their respective rows (Algorithm 2). The algorithm first sorts the detected tennis ball objects based on their Y positions. Then, it iterates through the sorted list, calculating the distance between each consecutive pair of balls. If the distance is less than 50 pixels, the balls are considered to belong to the same row and are added to the current row array. If the distance exceeds 50 pixels, the current row is appended to the rows array, and a new current row is initiated to capture the next set of balls.
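The clustering step described above can be sketched as follows. This is a minimal illustration rather than the authors’ Algorithm 2; the function name `cluster_rows` is hypothetical, and treating a gap of exactly 50 pixels as the start of a new row is an assumption, since that boundary case is not specified.

```python
def cluster_rows(ball_ys, max_gap=50):
    """Group tennis ball Y positions (pixels) into rows.

    Positions are sorted, then consecutive balls closer than max_gap px are
    placed in the same row; a larger gap closes the current row.
    """
    rows, current = [], []
    for y in sorted(ball_ys):
        if current and y - current[-1] >= max_gap:
            rows.append(current)
            current = []
        current.append(y)
    if current:
        rows.append(current)  # do not lose the final row
    return rows


# Hypothetical Y positions of five detected tennis balls in one frame.
rows = cluster_rows([705, 410, 690, 420, 250])
```

For these values the result is three rows: `[[250], [410, 420], [690, 705]]`.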

Once the algorithm for clustering the tennis balls into rows was established, the loop responsible for classifying the actual row of each object was created. Algorithm 3 gives each detected object an associated row by looping over every detected object and determining the object’s midpoint. With this information deduced and each ball clustered into its respective row, the y point of the object can then be compared with the detected row lines. The row with the least absolute difference is classified as the row of the object.
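A minimal sketch of the row assignment follows. It assumes each row line is the mean Y position of the tennis balls clustered into that row and that rows are numbered from 1 starting at the top of the image; neither detail is specified in the text, and the function names are hypothetical.

```python
def row_lines(rows):
    """One representative Y value per row (assumed: mean of the ball positions)."""
    return [sum(row) / len(row) for row in rows]


def assign_row(box, lines):
    """Return the 1-based index of the row line closest to the box's vertical midpoint."""
    y_mid = (box[1] + box[3]) / 2
    nearest = min(range(len(lines)), key=lambda i: abs(lines[i] - y_mid))
    return nearest + 1


# Rows as produced by a clustering step (hypothetical pixel values).
lines = row_lines([[250], [410, 420], [690, 705]])
hurdle_box = (800, 380, 1100, 440)  # x1, y1, x2, y2
hurdle_row = assign_row(hurdle_box, lines)
```

The hurdle’s midpoint (y = 410) lies closest to the second row line (y = 415), so it would be recorded as, e.g., "hurdle row 2".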

Track segmentation model (TSM)

If participants are looking downwards at the track/path ahead of their immediate foot placement, this can provide context, i.e., that they are thinking about foot placement. Determining the exact spatial location of the participant’s walking path is more difficult, because a more detailed classification is required compared to general object detection. To address this, a further segmentation model was developed and deployed to provide a pixel-wise segmentation mask for exact track location. This means that the exact location of the tracks themselves was detected, not just a general bounding box. To develop this model, the same process for dataset collection was utilized as with the object detection tool. The videos were broken down into component frames to be used as images within the dataset. To create the segmentation masks (black and white images in which only the regions to be detected are white), the VGG image annotation tool [49] was used (https://www.robots.ox.ac.uk/~vgg/software/via/; Fig. 4).

Using VGG, the tool for creating segmentation masks to be used in AI models, a dataset of 388 frames and accompanying binary segmentation masks was created. This dataset was then used to create a U-Net based segmentation network (Fig. 5) with PyTorch. This model was then trained within the same Python 3.8 environment using a Ryzen 3700x, 24 GB of RAM and an RTX 3070ti based machine over a course of 100 epochs. After obtaining a binary (white/black) segmentation mask of track location, in which only the tracks are white, overlap between the eye location and the track mask can be identified (Algorithm 4).
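The mask lookup described above can be sketched as follows. This is an illustrative reading of Algorithm 4, assuming the eye location is reduced to a single pixel coordinate and the mask is a 2D array in which nonzero values mark the track; the function name and the toy mask are hypothetical.

```python
def gaze_on_track(mask, gaze_x, gaze_y):
    """True if the gaze pixel falls on a white (nonzero) pixel of the track mask.

    mask is indexed as mask[row][column], i.e., mask[y][x].
    Gaze points outside the frame are treated as off-track.
    """
    if 0 <= gaze_y < len(mask) and 0 <= gaze_x < len(mask[0]):
        return mask[gaze_y][gaze_x] > 0
    return False


# Toy 4 x 6 binary mask standing in for a full-resolution frame (255 = track).
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 255, 255, 0, 0],
    [0, 255, 255, 255, 255, 0],
    [0, 255, 255, 255, 255, 0],
]
on_track = gaze_on_track(mask, gaze_x=2, gaze_y=1)
```

In practice the mask would be the model’s predicted segmentation resized to the video frame, so that gaze and mask coordinates share the same pixel grid.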

TSM: left/right object direction

With a methodology in place to assess an object’s row, a methodology for detecting which track (left or right) a set of objects belongs to is required. Understanding which side of the tracks an object belongs to is an important classification for assessing whether or not the participant is distracted (like paying attention to your driving lane vs oncoming traffic on a two-lane road). The track being actively navigated will always be on the right from the participant’s perspective, meaning any attention paid to objects on the right track will be relevant to navigation planning either immediately or in the near future. Conversely, attention paid to obstacles on the left track indicates a distraction, as when they are visible they will be beyond the immediate area of the participant. A further algorithm (Algorithm 5) can be implemented to determine which side an object is on relative to the participant by inferring the mid-point between the different segmented track points.
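One possible reading of the left/right classification is sketched below. It assumes the mid-point is taken halfway between the leftmost and rightmost track pixels in the segmentation mask and that an object is assigned by the horizontal centre of its bounding box; the details of Algorithm 5 may differ, and the function names are hypothetical.

```python
def track_midpoint_x(mask):
    """Horizontal mid-point between the leftmost and rightmost track pixels, or None."""
    cols = [x for row in mask for x, value in enumerate(row) if value > 0]
    if not cols:
        return None  # no track visible in this frame
    return (min(cols) + max(cols)) / 2


def object_side(box, mask):
    """Classify an (x1, y1, x2, y2) box as 'left' or 'right' of the track mid-point."""
    mid = track_midpoint_x(mask)
    if mid is None:
        return None
    x_centre = (box[0] + box[2]) / 2
    return "left" if x_centre < mid else "right"


# Toy mask with two tracks: columns 1-3 (left) and 6-8 (right).
mask = [[0, 255, 255, 255, 0, 0, 255, 255, 255, 0] for _ in range(3)]
side = object_side((6, 0, 8, 2), mask)
```

Under the convention described above, an object classified as "right" lies on the track being navigated, whereas gaze on a "left" object would be flagged as a distraction.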