1 Introduction

Ambient assisted living (AAL) is an active research area that has attracted considerable interest in recent years through the development of solutions that enable independent living and promote quality of life and well-being for an ageing population (Blackman et al. 2016). AAL solutions employ assistive robots and other technologies to support daily routine activities. Such robots are deployed in a range of human–computer interaction applications involving people of all ages, including care for older adults (Xiao et al. 2014; Jayawardena et al. 2016).

However, the dynamic nature of real-world environments makes it challenging for assistive robots to execute their functions reliably. A specific case is assistive robots that interact with older adults as carers. Such robots learn tasks by observing a human carer perform them, extracting descriptive information about each activity in order to classify it as it is executed. This transfer of knowledge about the performed activity is referred to as transfer learning (Weiss et al. 2016).

Fig. 1 A conceptual overview of learning of human activity by an assistive robot using information from an RGB-depth sensor

Regardless of the method a robot uses to learn an activity, there is a knowledge gap between the information acquired from a person executing an activity and that needed by a robot to carry out a similar activity. Transfer learning helps to bridge this gap, enabling faster learning of activities and better collaboration of assistive robots in AAL environments (Helwa and Schoellig 2017). A conceptual overview of the processes involved in learning human activities for assistive robotics is given in Fig. 1. In this context, the ability to correctly recognise and learn a human activity (steps 1–3 of Fig. 1) largely determines how much knowledge can be transferred to an assistive robot.

To capture information about human activities as they are executed, recent research has made use of visual sensors such as RGB-D sensors (Sung et al. 2011, 2012; Han et al. 2017) and non-visual sensors such as wearables (Capela et al. 2015). Although non-visual sensors have certain advantages, they can be invasive and burdensome. Visual sensors such as RGB-D sensors provide a better means of detecting human pose for building human activity recognition systems (Han et al. 2017), offering body shape, depth maps and 3D skeleton joint positions that can be exploited in learning activities.

The aim of this paper is to propose a human activity learning (HAL) system for assistive robotics, which acts as part of the transfer learning process for assistive robots. The research presented here extends the system proposed earlier by Adama et al. (2018) and focuses on the three steps shown in Fig. 1. An RGB-D sensor is used to obtain 3D skeleton information of body joints while activities are executed by a human. Descriptive features are then extracted from the skeleton information, and the most informative features are selected for training a classifier model. Careful feature selection matters because redundant and noisy features can have a negative effect on system performance. An ensemble of classifiers is used to build the activity learning model, combining three classical machine learning algorithms: multiclass support vector machines (MSVMs), K-nearest neighbour (K-NN) and random forest (RF). The contribution lies not in the individual algorithms but in their combination in an ensemble, which is used to improve performance compared with a single classifier model (Tahir et al. 2012). The results discussed in subsequent sections show this improvement.

The remaining sections of the paper are structured as follows. Section 2 presents a review of relevant related work in this area with emphasis on the main contributions. In Sect. 3, details of the methods applied in 3D data processing and feature representation are explained. Section 4 explains the classifier ensemble model approach for human activity learning. Section 5 presents experimental results and their evaluation, and Sect. 6 summarises the main results and discusses future work.

2 Related work

Learning and classification of human activities using computational intelligence and soft computing techniques is often referred to as human activity recognition (HAR) (Iglesias et al. 2010; Jalal and Kamal 2014). One of the main objectives is to extract descriptive information (i.e. features) from human activities in order to distinctly characterise one activity from another. An integral component of learning an activity is how information about it is obtained or observed. Visual and non-visual sensors make it considerably easier to understand and learn activities as they are performed. Visual sensors such as RGB cameras can provide descriptive information about an activity in 2D, but this information is limited in how effectively it characterises an activity (Han et al. 2017). The additional depth information from RGB-D sensors provides several advantages, as these sensors are better suited to observing human activities and detecting the human poses used to build activity recognition systems.

To effectively characterise activities from information obtained using RGB-D sensors, soft computing techniques such as machine learning and reasoning methods have been applied by many researchers (Koppula et al. 2013; Li et al. 2015; Han et al. 2017). These methods provide an understanding of how activities are learned and of the relationships between activities. However, uncertainty about how one actor's execution of an activity differs from another actor's execution of a similar activity still hinders the wider adoption of HAR systems.

Data obtained from RGB-D sensors give a robot information relevant to understanding an activity, and the exploration of human pose detection with these sensors has advanced activity recognition in recent years (Sung et al. 2011; Faria et al. 2014). RGB-D sensors allow 3D skeleton data to be extracted from depth images and body silhouettes for feature generation. In Faria et al. (2014), an RGB-D sensor is used to generate a 3D human skeleton model of body parts linked by joints, and the positions of individual joints are extracted in 3D (x, y, z) form. Jalal and Kamal (2014) use a similar RGB-D sensor to obtain depth silhouettes of human activities, from which body point information is extracted for their activity recognition system. Zhou et al. (2018) used an RGB-D sensor to capture human skeleton information as part of a system for controlling a mobile robot with human gestures, a similar application to that proposed by Chao et al. (2017). Another approach is shown in Gu et al. (2012), where the RGB-D sensor provides an orientation-based representation of each joint relative to the human centroid in 3D space. Raw data obtained from these sensors have to be pre-processed to reduce redundancy and better represent the features of an activity.

Classification of human activities is carried out by extracting relevant features from data obtained using RGB-D sensors. In our previous work, a method for activity recognition using RGB-D data was proposed (Adama et al. 2018). The 3D joint position information extracted from the sensor is transformed into feature vectors by applying selected soft computing techniques to group key postures of an activity. The posture features are used as input to a learning algorithm for classification of human activities; SVM and K-NN algorithms were used separately to classify activities and their results compared. Faria et al. (2014) proposed a combination of multiple classifiers to form a Dynamic Bayesian Mixture Model (DBMM) to characterise activities using features obtained from distances between different parts of the body. Hussein et al. (2013) applied statistical covariance of 3D joints (Cov3DJ) as features to encode the skeleton data of joint positions, which are then used as input to an SVM model for activity recognition. Wei et al. (2013) used a sequence of joint trajectories and applied wavelets to encode each temporal sequence of joints into features used in activity classification. Deep learning neural networks (Ijjina and Chalavadi 2017) have more recently been applied to activity recognition problems, with results showing the robustness of the method. However, deep learning systems require large amounts of data to make reliable predictions and, in most cases, more resources such as training time and powerful processing architectures.

Fig. 2 Architecture of the proposed human activity learning model. Stage 1: model learning (top): human activities are learned by training a set of classifiers (SVM, K-NN and RF) on 3D skeleton features obtained from activity frames captured using an RGB-D sensor. Stage 2: activity classification (bottom): observations from human activity are used to extract and select relevant features, which are fed into the trained classifier models to detect the activities performed

3 Methodology for human activity data processing and feature representation

The proposed approach to HAL extracts features from 3D skeletal data and applies feature selection techniques to choose the most informative features for building a learning model of human activities. The system architecture shown in Fig. 2 illustrates the main stages of the process, which is divided into two stages as follows:

Stage 1 :

Model learning

  • Data are input into the system from a dataset containing 3D skeleton information of human joints. These data are captured using an RGB-D sensor and pre-processed before they are used to train the activity classifier ensemble model.

  • Features representing activities are computed from the data. This step also includes the selection of optimal features relevant for learning activities.

  • The selected classifier models are trained through supervised learning of activities. The output of this step is the learned classifier ensemble model, ready to be used for activity classification.

Stage 2 :

Activity classification

  • The data input in this stage is similar to that of the model learning stage; however, it must be unseen data in order to validate the performance of the learned models. The data can be obtained from a dataset or on the fly from an RGB-D sensor.

  • Similar features are extracted from the data to be classified. This stage differs from the model learning stage in that it uses unlabelled activity data, whereas model learning is based on labelled data. The extracted features are passed into the learned classifier ensemble model to identify the activity classes.

Fig. 3 Skeleton representation of Microsoft Kinect V2 with 25 joints. The 15 key joints used in this work are shown in the label definition in the figure

3.1 3D activity data pre-processing

A human activity is composed of a continuous transformation through a series of human poses. Pre-processing is necessary to reduce irregularities in the data obtained from the sensor. RGB-D sensors provide information in three modes, namely RGB image, depth image and skeleton joint coordinates; this work uses only the skeleton joint coordinates. A Microsoft Kinect V2 (Microsoft 2017) RGB-D sensor, whose skeleton model consists of 25 joints as shown in Fig. 3, is used in this work, and the 15 key joints outlined in Fig. 3 are selected from the information the Kinect sensor provides. Data are acquired from the sensor as frames containing the different poses that make up an activity. The 3D skeleton joint coordinates J are obtained from pose approximation in each frame (Yang and Tian 2014), with coordinates relative to the sensor position, where

$$\begin{aligned} J = [j_1, j_2, j_3, \ldots , j_d], \quad \text{ for } J \in \mathbb {R}^{3\times d} \end{aligned}$$
(1)

\(j_i\) represents the ith joint with coordinates x, y, z corresponding to horizontal, vertical and depth positions, respectively, and d is the total number of skeleton joints used.

To make the joint coordinates invariant to the sensor position, the origin of the skeleton is translated along the vector \(\overrightarrow{s_oj_t}\), where \(s_o\) is the sensor coordinate origin and \(j_t\) is the torso centroid joint of the skeleton. Each joint position \(\overrightarrow{j_i}\) (the vector of the ith joint's coordinates) is then expressed with reference to the new torso centroid origin as \(\overrightarrow{j_i} - \overrightarrow{j_t}\). The skeleton thus becomes independent of the sensor position, as shown in Fig. 4, and each sample posture of an activity is reformulated with respect to the torso centroid origin.
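A minimal sketch of this translation step, assuming the skeleton is stored as a NumPy array of joint coordinates and that the row index of the torso centroid joint is known (the index used below is illustrative only):

```python
import numpy as np

def translate_to_torso(joints, torso_idx):
    """Express every joint relative to the torso centroid (j_i - j_t),
    making the skeleton independent of the sensor position."""
    return joints - joints[torso_idx]

# Example: a toy 15-joint skeleton in sensor coordinates (x, y, z per row)
skeleton = np.random.rand(15, 3)
skeleton_centred = translate_to_torso(skeleton, torso_idx=2)  # torso index assumed
```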

Fig. 4 Translation of the skeleton coordinate system from the sensor origin to the torso centroid origin

A further pre-processing stage symmetrises the data in order to eliminate ambiguity between gestures performed by left- and right-handed people. This ensures each activity is also represented by a mirrored variation of its original form, as shown in Fig. 5. The symmetry is computed about the y-axis of the torso centroid origin.
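A corresponding sketch of the symmetrisation step, under the assumption that mirroring about the y-axis of the torso-centred frame amounts to negating the horizontal (x) coordinate of every joint:

```python
import numpy as np

def symmetrise(skeleton_centred):
    """Return the mirrored variant of a torso-centred posture about the
    y-axis by negating the horizontal (x) coordinate of every joint
    (assumed mirroring convention)."""
    mirrored = skeleton_centred.copy()
    mirrored[:, 0] *= -1.0
    return mirrored

mirrored_pose = symmetrise(np.random.rand(15, 3))  # toy torso-centred posture
```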

3.2 Extraction and representation of 3D features

Extraction of descriptive information from the acquired raw sensor data is crucial to any learning system, as raw data alone do not provide adequate information for learning. This step is carried out after the data are pre-processed. In this work, the features used fall into two categories: joint displacement-based features and statistical features in the time domain. Joint displacement-based features encode information about the position and motion of body joints (Yang and Tian 2014; Han et al. 2017), considering both the displacement between joints within an activity pose and the 3D position differences of skeleton joints across different time periods of an activity. Statistical time-domain features encode the variation across a collection of activity poses within a specified time window. The following sections provide details of the features used in this work.

3.2.1 Displacement-based features

  1. Spatial displacement between selected skeletal joint coordinates is computed as the Euclidean distance \(\delta \) between any two joints, as described in Eq. 2. The joints are selected based on their relevance to the activities.

    $$\begin{aligned} \delta _{(j_{m},j_{n})} = \sqrt{\sum _{x,y,z} (j_{m} - j_{n})^2}, \end{aligned}$$
    (2)

    for \(1\le m,n \le d\) and \(m \ne n\), where \(j_{m}\) and \(j_{n}\) are any pair of selected joints with coordinates x, y, z.

  2. Temporal joint displacement features consider the consecutive 3D motion of joints, \(t_{cp}\), and the overall motion dynamics of joints, \(t_{ci}\). \(t_{cp}\) is computed as the difference in joint coordinate positions between the current pose c and its preceding pose p (Eq. 3), and \(t_{ci}\) as the difference between each joint's current pose and the initial pose i (Eq. 4). Both displacement computations are sketched in code after this list.

    $$\begin{aligned} t_{cp}= & {} [j_m^c - j_n^p]; \quad \text{ for } j_m^c \in J^c \text{ and } j_n^p \in J^p \end{aligned}$$
    (3)
    $$\begin{aligned} t_{ci}= & {} [j_m^c - j_n^i]; \quad \text{ for } j_m^c \in J^c \text{ and } j_n^i \in J^i \end{aligned}$$
    (4)
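A minimal sketch of both displacement feature computations. The joint pairs used for the spatial distances and the array shapes are assumptions for illustration; in the paper the pairs are chosen for their relevance to the activities.

```python
import numpy as np

def spatial_displacements(pose, pairs):
    """Euclidean distances (Eq. 2) between selected joint pairs of one pose.
    `pose` is a (d, 3) array of joint coordinates; `pairs` lists (m, n) indices."""
    return np.array([np.linalg.norm(pose[m] - pose[n]) for m, n in pairs])

def temporal_displacements(pose_current, pose_previous, pose_initial):
    """Per-joint temporal differences: t_cp (Eq. 3) against the preceding pose
    and t_ci (Eq. 4) against the initial pose of the activity."""
    t_cp = (pose_current - pose_previous).ravel()
    t_ci = (pose_current - pose_initial).ravel()
    return t_cp, t_ci

# Toy example with 15 joints and two illustrative joint pairs
poses = np.random.rand(10, 15, 3)                      # 10 frames of one activity
delta = spatial_displacements(poses[5], [(4, 8), (4, 0)])
t_cp, t_ci = temporal_displacements(poses[5], poses[4], poses[0])
```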
Fig. 5 Skeleton symmetrisation of an activity posture about the y-axis: a the original activity posture; b the mirrored symmetry of the same posture

3.2.2 Statistical features in time domain

These features are computed as the difference between the joint coordinates \(j_i\) of the current pose c (also referred to as the current activity frame) and the mean, variance, standard deviation, skewness and kurtosis of the joint coordinates over an activity sequence. They are computed as follows (a code sketch follows the list):

  1. Joint coordinate-mean difference:

    $$\begin{aligned} j_{(i,\mathrm{mean})} = j_i - j_{\mathrm{mean}} \end{aligned}$$
    (5)

    where the mean of all positions for a joint coordinate is \(j_{\mathrm{mean}} = \frac{1}{N} \sum _{c=1}^N j_i\) and N is the total number of poses in an activity.

  2. Joint coordinate-variance difference:

    $$\begin{aligned} j_{(i,\mathrm{var})} = j_i - \frac{\sum _{c=1}^N (j_i - j_{\mathrm{mean}})^2}{N} \end{aligned}$$
    (6)
  3. Joint coordinate-standard deviation difference:

    $$\begin{aligned} j_{(i,\mathrm{std})} = j_i - \sqrt{\frac{\sum _{c=1}^N (j_i - j_{\mathrm{mean}})^2}{N}} \end{aligned}$$
    (7)
  4. Joint coordinate-skewness difference:

    $$\begin{aligned} j_{(i,\mathrm{skw})} = j_i - \frac{\sum _{c=1}^N (j_i - j_{\mathrm{mean}})^3}{(N - 1){\sigma }^3} \end{aligned}$$
    (8)

    where \(\sigma \) refers to the standard deviation of each joint coordinate for all poses in an activity.

  5. Joint coordinate-kurtosis difference:

    $$\begin{aligned} j_{(i,\mathrm{kur})} = j_i - \frac{\sum _{c=1}^N (j_i - j_{\mathrm{mean}})^4}{(N - 1){\sigma }^4} \end{aligned}$$
    (9)
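A sketch of the statistical time-domain features of Eqs. 5–9, assuming the activity is stored as an (N, d, 3) array of poses; SciPy's skewness and kurtosis conventions are used here, which differ slightly from the paper's (N - 1) normalisation.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def statistical_features(poses, c):
    """Differences between the coordinates of pose c and the mean, variance,
    standard deviation, skewness and kurtosis over the whole activity."""
    flat = poses.reshape(poses.shape[0], -1)          # (N, 3*d) per-coordinate view
    j_c = flat[c]
    return np.concatenate([
        j_c - flat.mean(axis=0),                      # Eq. 5
        j_c - flat.var(axis=0),                       # Eq. 6
        j_c - flat.std(axis=0),                       # Eq. 7
        j_c - skew(flat, axis=0),                     # Eq. 8 (SciPy convention)
        j_c - kurtosis(flat, axis=0, fisher=False),   # Eq. 9 (SciPy convention)
    ])

features = statistical_features(np.random.rand(30, 15, 3), c=10)  # toy activity
```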
Fig. 6 Overview of the weighted voting architecture of the classifier ensemble

All computed activity feature vectors are concatenated to form a matrix A of extracted activity features, in which the columns correspond to the feature vectors and the rows to the features extracted from different activity frames. A is represented as follows:

$$\begin{aligned} A = [\delta ,\quad t_{cp},\quad t_{ci},\quad \ldots ,\quad j_{(i,\mathrm{kur})}] \end{aligned}$$
(10)

3.3 Feature normalisation

HAL systems can perform poorly if the extracted features are not well processed, owing to their heterogeneity. A further pre-processing step is therefore needed to deal with feature heterogeneity before classification. This is done through feature normalisation, which is widely applied in machine learning applications (Sung et al. 2012; Capela et al. 2015). Each feature in the activity feature matrix obtained in Eq. 10 is normalised according to:

$$\begin{aligned} a_{\mathrm{norm}} = \frac{a_{cf} - \hbox {min}(A_f)}{\hbox {max}(A_f) - \hbox {min}(A_f)} \end{aligned}$$
(11)

where \(a_{cf}\) is the feature of the current pose c in the fth column feature vector \(A_f\). The feature matrix obtained after normalisation is denoted \(A_{\mathrm{norm}}\).
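A sketch of the column-wise min-max normalisation of Eq. 11 (the small epsilon guarding against constant columns is an implementation detail, not part of the paper):

```python
import numpy as np

def normalise_features(A):
    """Min-max normalise each column (feature) of the activity feature matrix A (Eq. 11)."""
    col_min = A.min(axis=0)
    col_max = A.max(axis=0)
    return (A - col_min) / (col_max - col_min + 1e-12)  # epsilon avoids division by zero

A_norm = normalise_features(np.random.rand(100, 153))  # toy feature matrix
```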

3.4 Feature selection

Feature selection is performed on the normalised activity feature matrix. This is important to any learning model as it enables faster training, reduces overfitting, improves accuracy and reduces model complexity, making the model easier to interpret (Gupta and Dallas 2014; Capela et al. 2015). In this paper, a filter feature selection method, Relief-F (Kononenko 1994), is applied. Filter methods are preferred to alternatives such as wrapper methods since they do not require a fixed learning mechanism and therefore generalise better across different learning models (Gupta and Dallas 2014).

The Relief-F method uses a statistical rather than heuristic approach to assign relevance weights that rank candidate features, and the features ranked above a set threshold are selected for the model. In this paper, the threshold is determined from the number of features that provide the best substitution accuracy with the learning model. The performance achieved using the selected features is presented in the experimental results in Sect. 5.
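A sketch of the ranking and thresholding step. It assumes the Relief-F implementation from the third-party skrebate package and a placeholder threshold; in the paper the threshold is tuned against the accuracy of the learning model.

```python
import numpy as np
from skrebate import ReliefF  # assumed third-party Relief-F implementation

def select_features(A_norm, labels, threshold=0.0):
    """Rank features with Relief-F and keep those whose relevance weight
    exceeds the threshold (placeholder value)."""
    relief = ReliefF(n_neighbors=10)
    relief.fit(A_norm, labels)
    keep = np.where(relief.feature_importances_ > threshold)[0]
    return A_norm[:, keep], keep
```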

4 Classifier ensemble model

The final stage in developing an activity learning system is training a classification model with the selected features to achieve a good learning performance. Building on previous work by Adama et al. (2018), in which a selection of learning models was used separately to identify activities, this work employs a combination of different learning models in a framework referred to as a bagging ensemble of classifiers in order to improve system performance. An ensemble of classifiers generally achieves better predictive performance than a single model (Diao et al. 2014; Yao et al. 2016). According to Tahir et al. (2012), ensemble models construct a set of classifiers that classify new information based on a weighted vote of the individual classifier predictions. Three base classifiers are used in this work to construct the ensemble: multiclass support vector machines (MSVMs), K-nearest neighbour and random forest. A pictorial overview of the ensemble method applied is shown in Fig. 6.

Weighted voting computes the weighted majority vote \(\hat{q}\) given in Eq. 12 by allocating a weight \(\omega _r\) to each classifier \(C_r\):

$$\begin{aligned} \hat{q} = \arg \max _{i} \sum _{r=1}^3 \omega _r \times (C_r(s) = i), \end{aligned}$$
(12)

where \((C_r(s) = i)\) is the characteristic (indicator) function that classifier \(C_r\) assigns label i to sample s, taken over the set of unique class labels.

The weights assigned to the individual classifiers in the ensemble are computed during the learning phase. Initially, uniform weights are set; they are then updated at each iteration of cross-validation. The updated weights used in succeeding iterations are computed as the ratios of the average precision obtained by each classifier in the ensemble in the preceding iteration.

The multiclass SVM model follows the configuration reported in Cippitelli et al. (2016) and Adama et al. (2018), which extends a binary classifier: a one-against-one approach that constructs several binary SVM classifiers for the M classes contained in a dataset (with \(M>2\)) is implemented as one of the base classifiers. The K-NN classifier is one of the simplest machine learning algorithms; it classifies an observation by a majority vote of its k nearest neighbours in the feature space (with \(k>0\)), typically using the Euclidean distance to find neighbours. In the proposed HAL model, \(k=5\) is used. The random forest classifier is itself an ensemble of decision trees, each trained on randomly selected samples of the original training set; in this work, RF is used with 10 decision trees, in a configuration similar to the RF implementation of Nunes et al. (2017).
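A sketch of one iteration of the weighted voting scheme with the three base classifiers configured as described above (one-against-one multiclass SVM, K-NN with k = 5, random forest with 10 trees). The scikit-learn estimators and the single train/validation split are assumptions for illustration; the paper updates the weights over successive cross-validation iterations.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score

# Base classifiers as described in the text
classifiers = {
    "msvm": SVC(decision_function_shape="ovo"),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "rf": RandomForestClassifier(n_estimators=10),
}

def fit_and_weight(X_train, y_train, X_val, y_val):
    """Train each base classifier and derive its voting weight from its
    average (macro) precision on a held-out fold."""
    weights = {}
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        weights[name] = precision_score(y_val, clf.predict(X_val),
                                        average="macro", zero_division=0)
    total = sum(weights.values()) or 1.0
    return {name: w / total for name, w in weights.items()}

def ensemble_predict(X, weights, classes):
    """Weighted majority vote of Eq. 12 over the trained base classifiers."""
    classes = np.asarray(classes)
    votes = np.zeros((len(X), len(classes)))
    for name, clf in classifiers.items():
        for row, label in enumerate(clf.predict(X)):
            votes[row, np.where(classes == label)[0][0]] += weights[name]
    return classes[votes.argmax(axis=1)]
```

In practice, `classes` would be the unique labels of the training data and the weights would be re-estimated at each cross-validation iteration as described above.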

5 Experiments and evaluation

To evaluate the performance of the proposed HAL system, data collected from our experimental setup are used first, as a limited test to verify the proposed system before applying it to public data. The system is then evaluated using a publicly available benchmark human activity dataset, the Cornell Activity Dataset (CAD-60) (Sung et al. 2011). The following sections describe the experiments conducted in this work and discuss the results obtained.

5.1 Experimental setup

Skeletal data are collected from three actors using a Microsoft Kinect V2 RGB-D sensor, as described in Sect. 3.1, at a frame rate of 30 frames per second (fps). Four activities are carried out, namely brushing teeth, pick up object (from the ground), sit on sofa and stand up. Each actor performs a single activity for a duration of 45–90 s. The sit on sofa activity is performed by an actor going through a sequence of sitting and getting-up poses with more time spent sitting, and the stand up activity is performed in a similar way with more time spent standing. A summary of the data collected is presented in Table 1.

Table 1 Summary of experimental human activity data collected from three actors using a Microsoft Kinect V2 RGB-D sensor

The acquired data are pre-processed following the procedure described in Sect. 3.1, and key features representing the activities are extracted from the processed data. Table 2 shows the number of activity features computed from the RGB-D sensor skeleton with 15 joints. The joints used in computing the spatial displacement features are selected based on their importance in carrying out the chosen activities: nine features are computed, representing the Euclidean distances between the left and right hands, each hand and the head, each hand and its corresponding foot, each shoulder and its corresponding foot, and each hip and its corresponding foot. The remaining features are obtained per joint coordinate; given that 15 joints are used, each such feature description comprises \(15\times 3 = 45\) extracted features.

Features selected from the experimental dataset are fed into the learning model to test the performance of the system. A K-fold cross-validation strategy is applied with \(K=4\): the data are split into four folds, three of which are used to train the model while the remaining fold is held out for validation. This process is repeated using each fold in turn for validation, and the final result is the average performance over all validation folds. A sketch of this evaluation loop is given below.
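A minimal sketch of the fourfold evaluation, assuming a helper that trains the ensemble on the three training folds and returns predictions for the held-out fold; the use of stratified, shuffled folds is an assumption, as the paper does not state how the folds are formed.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score

def four_fold_scores(X, y, train_and_predict):
    """Average macro precision/recall over 4 folds; `train_and_predict`
    stands for fitting the ensemble and predicting the held-out fold."""
    precisions, recalls = [], []
    for train_idx, test_idx in StratifiedKFold(n_splits=4, shuffle=True).split(X, y):
        y_pred = train_and_predict(X[train_idx], y[train_idx], X[test_idx])
        precisions.append(precision_score(y[test_idx], y_pred, average="macro"))
        recalls.append(recall_score(y[test_idx], y_pred, average="macro"))
    return np.mean(precisions), np.mean(recalls)
```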

Table 2 Activity features computed from raw RGB-D sensor information of skeleton with 15 joints used in this work

5.2 CAD-60 dataset and experiment

The CAD-60 dataset comprises RGB-D sequences of human activities acquired using an RGB-D sensor at a frame rate of 15 fps. The dataset contains RGB image, depth image and skeleton joint coordinate information for 15 skeletal joints of the activities carried out; the proposed HAL system utilises only the skeleton joint coordinate information. Four actors perform 12 activities in five locations, namely bathroom, bedroom, kitchen, living room and office. The activities performed are: rinsing mouth, brushing teeth, wearing contact lens, talking on the phone, drinking water, opening pill container, cooking (chopping), cooking (stirring), talking on couch, relaxing on couch, writing on whiteboard and working on computer, plus a random \(+\) still activity, which contains a sequence of random movements and a still pose performed by each actor. The stages of the proposed HAL system are applied with the CAD-60 dataset as raw input, and the same set of features shown in Table 2 is computed from the dataset.

Learning the activities is carried out per location, with the activities grouped by the location in which they are performed. This grouping, shown in Table 4, follows the format used by all the state-of-the-art approaches reported in Table 6. To test the trained model, leave-one-out cross-validation is carried out in which the model is trained on three actors and tested on the unseen actor; this is also called a "new person" test strategy.
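The "new person" strategy is equivalent to leave-one-group-out cross-validation with the actor identity as the group. A minimal sketch, assuming an `actor_ids` array that assigns each feature row to its actor:

```python
from sklearn.model_selection import LeaveOneGroupOut

def new_person_splits(X, y, actor_ids):
    """Yield train/test splits where the test set is all data from one
    actor unseen during training (CAD-60 'new person' test)."""
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=actor_ids):
        yield X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```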

5.3 Evaluation and discussion

The proposed HAL system is evaluated on both datasets described in Sects. 5.1 and 5.2 using the test methods outlined above. The CAD-60 tests follow the test methods described by Sung et al. (2011) and by the authors of the other approaches listed in Table 6. Test results and their discussion are presented in the following sections.

Table 3 Performance of the proposed HAL system on experimental dataset comprising four activities: brushing teeth, pick up object, sit on sofa, stand up

5.3.1 Experimental dataset results and evaluation

Table 3 shows the performance of the proposed HAL system on the experimental dataset in terms of precision and recall. The system achieves an overall average precision of \(70.65\%\) and recall of \(68.43\%\) on this dataset. The confusion matrix in Fig. 7 shows the percentages of correctly and falsely classified activities. The performance on pick up object (recall of \(94.69\%\)) and sit on sofa (recall of \(100\%\)) is strong. However, the model performs less well in correctly classifying the brushing teeth and stand up activities, because the two have closely related poses: brushing teeth is performed while standing. As a result, a large proportion of stand up data (\(64.87\%\)) is characterised as brushing teeth, which lowers the overall performance. Adequately testing the robustness of a supervised learning system also requires more data samples for proper training and validation, and the experimental dataset contains fewer samples than other human activity datasets such as CAD-60, which may also contribute to the performance obtained. We therefore also tested the HAL system with the CAD-60 dataset, which contains more samples of human activity.

Fig. 7 Confusion matrix of the proposed HAL system on the experimental data

Table 4 Performance of the proposed HAL system with selected features on the CAD-60 dataset using a “new person” test in different locations: bathroom, bedroom, kitchen, living room and office
Table 5 Performance of the proposed HAL system with all features extracted from the CAD-60 dataset using a “new person” test. This shows the average performance from different locations

5.3.2 CAD-60 dataset results and evaluation

The results of the proposed HAL system on this dataset are shown in Table 4, presented in terms of precision and recall. The proposed system achieved an overall average performance of \(92.32\%\) precision and \(89.66\%\) recall with features selected using the Relief-F method described in Sect. 3.4, and \(90.96\%\) precision and \(88.52\%\) recall when all the extracted features are used. Table 5 shows the results for the different locations when all features are used; comparison with Table 4 shows that the system achieves better performance with selected features than with all features. Table 6 compares the proposed system with state-of-the-art performances on the same dataset (Cornell University 2009), and also indicates which of those works employ extended modalities of RGB-D sensor information, i.e. skeletal joint coordinates combined with RGB image and/or depth image data. The proposed HAL system's performance indicates that the features extracted in our system sufficiently discriminate the selected human activities from skeletal joint information alone.

The comparison of the proposed HAL system with the state of the art on the CAD-60 dataset presented in Fig. 8 shows that the proposed system attains a strong performance. While some systems outperform the HAL system, the proposed system differs from these better-performing approaches in the following ways. The system proposed by Zhu et al. (2014) reported \(93.2\%\) precision and \(84.6\%\) recall; although their precision exceeds that of the proposed HAL system, our system performs better in terms of recall. Moreover, their system fuses spatiotemporal interest point features obtained from a combination of RGB-D modalities (depth image, RGB image and skeleton information, as indicated in Table 6), which can increase computational cost, whereas the proposed HAL system achieves its performance using only the skeleton information from the RGB-D sensor. This suggests that adding further computer vision information could push our system's performance even higher.

Table 6 Overall average precision and recall of the proposed HAL system compared with the state of the art on the CAD-60 dataset in a "new person" setting, in order of increasing precision as reported by Cornell University (2009)
Fig. 8 Precision and recall comparison of the proposed HAL system with state-of-the-art results on the CAD-60 dataset

Table 7 Performance comparison of the proposed classifier ensemble method with single classifier methods on the CAD-60 dataset

However, as the comparison of state-of-the-art results in Table 6 shows, the approach of Shan and Akella (2014) slightly outperforms our proposed HAL system. Their tests exclude the random \(+\) still activity performed by all actors in the dataset, whereas our tests include it; this activity is relevant to assessing the robustness of a system across varying human activities.

The system proposed by Cippitelli et al. (2016) attained higher precision and recall on CAD-60, of \(93.9\%\) and \(93.5\%\), respectively. Their system is tested in a similar way to that of Shan and Akella (2014), excluding the random \(+\) still activity. Another reason may be that the proposed HAL system uses all 15 skeleton joints of the CAD-60 dataset, whereas Cippitelli et al. (2016) used 11 selected joints to achieve their best performance. Their selected joints omit relevant joints such as the shoulders, which are needed for our intended application of assistive robots executing human activities via transfer learning. When Cippitelli et al. (2016) use all 15 skeleton joints, they report \(87.9\%\) precision and \(86.7\%\) recall, which the proposed HAL system exceeds.

The performance achieved by the proposed HAL system on both the experimental dataset and the public CAD-60 dataset demonstrates its potential for applications in which assistive robots learn human activities.

5.3.3 Comparison of classifier ensemble with single classifier performance

Using a classifier ensemble as proposed in this work increases activity learning accuracy compared with methods that use single classifiers. Table 7 compares the performance of the proposed classifier ensemble method with methods that apply single classifiers to learning human activities. The majority of these other approaches apply an SVM, which is also one of the classifiers used in the proposed ensemble, and the results show that the ensemble outperforms the single classifier methods. In addition, the proposed classifier ensemble attains high activity learning performance with a small number of training samples compared with other widely used methods such as deep learning neural networks (Ijjina and Chalavadi 2017), which require large amounts of data and longer training times to make reliable predictions.

6 Conclusion and future work

The work presented here proposes a system for human activity learning from skeletal data obtained using an RGB-D sensor. We have shown explicitly the process of refining the raw sensor data, computing relevant features and training the learning model. The main objective of this work is an activity learning system that can distinctly recognise activities as they are performed; such a system can then be incorporated in an assistive robot to help it learn to perform those human activities. The performance attained by the proposed system on the CAD-60 benchmark dataset indicates its reliability for use with an assistive robot.

Although three base classifiers were used in building the ensemble model, it could be extended to include additional classifiers, which may improve performance, as well as deep learning neural networks, which are increasingly used in human activity recognition systems. The system could also be extended to learn activities on the fly as they are carried out by an actor, which we plan to implement in future work. The next research direction following this work is to segment different aspects of each learned activity into representations that any assistive robot platform can adopt to reliably execute human activities.