A Systematic Review on Machine Learning and Deep Learning Based Predictive Models for Health Informatics

Review Article. Aloyuni; JPRI, 33(47B): 183-194, 2021; Article no.JPRI.76514

Health informatics (HI) has become a significant research area owing to the massive volume of digital health and medical data generated by biomedical and health research organizations. Health data are available in many forms, including electronic health records (EHRs), biomedical imaging, bio-signals, sensor data, genomic data, medical history, and social media data. Structured health data can be used for HI, and effective predictive modeling of health data assists the decision-making process. Recent artificial intelligence (AI), machine learning (ML), and deep learning (DL) techniques pave the way for such predictive modeling. Numerous ML- and DL-based HI approaches for various applications have been presented in the literature. With this motivation, this study reviews recent state-of-the-art ML- and DL-based predictive models for the health sector. The survey first identifies the differences between ML and DL architectures and their significance in the health sector. The existing works are then reviewed and compared in terms of objectives, underlying methodology, input source, dataset used, performance validation, metrics, and so on. Finally, the open challenges and future scope of HI are examined in detail. The survey should help readers identify the present state of research and the possible future scope of ML- and DL-based predictive models for HI.


INTRODUCTION
Health informatics (HI) is defined as the systematic application of information, computer science, and technology to public health, including surveillance, prevention, preparedness, and health promotion. The primary aim of HI is to promote the health of whole populations, which in turn promotes the health of individuals, and to prevent injuries and diseases by altering the conditions that put populations at risk [1]. In general, HI applies informatics to public health data collection, analysis, and action. Its emphasis on disease prevention in populations, its use of a wide range of interventions to realize its objectives, and its operation within a governmental setting distinguish HI from other fields of informatics. The scope of HI includes the conceptualization, design, development, deployment, refinement, maintenance, and evaluation of communication, surveillance, and information systems relevant to public health. HI can be regarded as one of the most suitable means of tackling epidemics, disease surveillance, bioterrorism, and natural disasters. Fig. 1 illustrates the number of publications related to health.
In recent years, HI research has grown rapidly. In particular, the health and biomedical research enterprise is generating a huge amount of digital health and medical data. Developments in health sensors and mobile applications have driven research on remote monitoring, health tracking systems, and telemedicine. Medical data sources generally include clinical text, electronic health records (EHRs), health sensor data, biomedical imaging and signals, spontaneous reporting systems, genomic and pharmaceutical databases, social media data, and the biomedical and health literature [2]. Researchers are pushing the boundaries to deliver improved medical solutions to health consumers and to empower patients to manage their own health and well-being. The EHR is the electronic form of a patient's health history, i.e., the record maintained by medical providers over time.
The EHR is intended to provide an effective and efficient way to access and edit a patient's record. Likewise, structured medical data are a good resource for predictive modeling in HI. However, different medical providers may use different EHR systems. Integrating and sharing data across these systems can be difficult, and a study may therefore be constrained to the data accessible from only some of them. Medical text contains medical data in unstructured form, generally obtained from speech recognition or the transcription of dictation. Extracting useful knowledge from it relies heavily on natural language processing methods. Biomedical imaging gives high-quality images of anatomical structures hidden by bone and skin, while biomedical signals capture the electrical or magnetic signals generated by biological activity. Biomedical signals and imaging include the phonocardiogram (PCG), computed tomography (CT), electrogastrogram (EGG), ultrasound (U/S), electromyogram (EMG), electroneurogram (ENG), electroencephalogram (EEG), electrocardiogram (ECG), positron emission tomography (PET), magnetic resonance imaging (MRI), etc. Fig. 2 shows the possible sources of health data.
Biomedical signals and imaging are generally regarded as part of pathology. Databases are built to distinguish pathological from normal cases. Classification by experts is time consuming and labor intensive, so intelligent automated classification using signal processing techniques is preferred to overcome the limitations of manual classification. Besides the medical sensor data produced in conventional clinical environments such as hospital wards, clinics, and intensive care units, there is increasing interest in ubiquitous wearable sensor data, i.e., data from sensors connected to mobile devices that continuously track the condition of health consumers. How smart methods can harness this large volume of medical sensor data in retrospective or real-time analytics to support precision and preventive treatment remains a wide-open area of investigation. Recently, ML and DL methods have come to predominate among the AI methods available in the health domain because of their implicit feature engineering ability, their capability to integrate word embeddings, their efficient solutions, and their ability to handle complex and unstructured data. Meanwhile, the availability of unprecedented amounts of health-related data, such as EHRs, the text within EHRs, medical text on social networking platforms, and health images, is largely responsible for the growing popularity of DL in the health domain. This paper presents a comprehensive survey of recently developed predictive models based on machine learning (ML) and deep learning (DL) approaches in the health sector. The review starts with a detailed discussion of the major differences between ML and DL models and their significance in the health sector. The available ML and DL models are then surveyed along several dimensions, namely objectives, underlying methodology, input source, dataset used, performance validation, metrics, and so on.
Lastly, the open challenges and future scope of HI are discussed to offer new ideas to readers.

Fig. 2. Possible sources of health data
The rest of the survey is organized as follows. Section 2 discusses the existing survey papers related to HI, and Section 3 compares ML and DL models. Section 4 reviews the existing ML and DL techniques available in the literature. Section 5 identifies the open challenges in the HI area. Finally, Section 6 concludes the study.

PRELIMINARY LITERATURE REVIEW
This section reviews the existing survey papers on ML and DL models in health. Solares et al. [3] performed a detailed survey of the major DL models that have been applied to EHRs. In addition, the large Clinical Practice Research Datalink (CPRD) dataset is introduced for training effective models. The survey also shares guidelines for assessing the performance of DL techniques in predicting clinical risks and outlines possible research directions in the health domain. Esmaeilzadeh [4] examined the use of AI-based medical devices with decision-making tools from the consumer's perspective. An online survey gathered data from 307 individuals in the United States. The study identified the sources of motivation and of concern for patients with respect to AI-enabled devices. The results showed that technical, ethical, and regulatory concerns are significantly related to the perceived risks of employing AI in the medical field.
Miotto et al. [5] reviewed recently presented DL models in the health sector. They suggest that DL models can be used to translate massive amounts of health data into improved human health. The drawbacks of current models and the need for enhanced ones are also identified at the end of the study. Lamba et al. [6] surveyed ML and prediction models for HI, investigating current practices, major definitions, and research issues. They then present works on active, semi-supervised, and transfer learning using DL models. Gupta and Katarya [7] presented a bibliometric analysis of 1240 articles from reputed publications over the period 2010-2018. The articles are surveyed with respect to the ML models used to examine health-oriented text posted on social networking sites. Shamshirband et al. [8] investigated the use of DL models in HI by surveying recent technologies, application areas, and industrial trends. Salman et al. [9] offered a comprehensive review of ML models in the domain of electronic emergency triage (E-triage), which prioritizes patients for quick medical services in telemedicine applications. The review also highlighted the effective performance of ML models in remote telemedicine systems.

OVERVIEW OF MACHINE LEARNING VS DEEP LEARNING
ML is a common learning approach within artificial intelligence (AI) in which computers learn from data without requiring human intervention. ML methods are mostly used to derive a predictive model without taking into account the underlying mechanism, which is often unknown or not fully described. An ML method can be carried out in four phases: representation learning, data organization, data evaluation, and model fitting [10]. Conventional ML models first require domain expertise and feature engineering to transform the raw data into a suitable internal representation, from which a learning subsystem such as a classifier can identify patterns in the dataset. Such techniques typically involve linear transformations of the input data space and are limited in their ability to process raw data as such. DL is an emerging ML method that differs in that it learns representations directly from raw data. Conventionally, training multilayer neural networks often became trapped in local optima or could not guarantee convergence. Fig. 3 demonstrates the differences between ML and DL techniques.
DL addresses this problem with a two-phase scheme, pre-training followed by fine-tuning, to train the network efficiently. In recent times, the growth in computing power and in data size has made DL methods more common. With the emergence of big data, DL models have become widespread in offering solutions by processing and analyzing such data. Training large-scale DL models requires high-performance computing systems. With a GPU-based DL architecture, training time can be reduced from several weeks to around one day. DL models thus first undergo unsupervised pre-training; supervised training is then used to fine-tune the network and to learn representations and features of big data for pattern detection and classification tasks [11]. Beyond the health domain, DL methods have shown significant results in speech recognition, computer vision, natural language processing, and so on.

Fig. 3. Difference between ML and DL
Formerly, deep neural networks (DNNs) received less attention because of the high computational capability required for training and processing, especially in real-time applications. More recently, advances in hardware technologies, together with GPU acceleration, parallelization, cloud computing, and multicore processing, have overcome this limitation, enabling DNNs to become a common AI-based learning method.

REVIEW OF EXISTING ML AND DL MODELS FOR HI
This section provides a comprehensive review of existing ML and DL models related to HI, as shown in Table 1. Hsu et al. [12] presented an efficient computer-aided diagnosis scheme assisted by an intelligent learning model. ML-based feature modeling is used to improve prediction performance. A supervised learning algorithm is used to train and validate the model on the reduced, optimal feature set. Furthermore, the scheme acts as a general-purpose tool for capturing patterns from various medical tests for many types of cancer. Awotunde et al. [13] proposed an architecture for an IoT WBN based on an ML model. The information gathered from distinct wearable sensor nodes, such as glucose, body temperature, chest, and heartbeat sensors, is transferred by an IoT device to integrated cloud databases. ML is employed to select the best features from the captured information, and the sensor signals are examined by an ML model to diagnose the patient's condition.
The presented architecture could be widely employed in remote areas to monitor and diagnose patients' health conditions, eliminating or reducing medical errors, minimizing pressure on medical experts, reducing health costs, enhancing patient satisfaction, and increasing productivity.
Saha et al. [14] developed an automatic detection system called EMCNet for identifying COVID-19 patients by evaluating chest X-ray images. A convolutional neural network (CNN), designed with an emphasis on model simplicity, extracts deep, high-level features from X-ray images of patients infected with COVID-19. Using the extracted features, binary ML classifiers (support vector machine (SVM), random forest (RF), AdaBoost, and decision tree (DT)) are applied to detect COVID-19. Lastly, the outputs of these classifiers are combined into an ensemble classifier, which ensures effective results on datasets of different resolutions and sizes. Kumar et al. [15] selected three major diseases: diabetes, coronavirus, and heart disease. In the presented method, data are entered into an Android application, the analysis is carried out in a real-time database with pretrained ML models that were trained on similar datasets and hosted in Firebase, and finally the disease detection results are displayed in the Android application. Logistic regression (LR) is employed to compute the predictions.
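The ensemble step described for EMCNet can be illustrated with a small sketch. The text does not state how the classifier outputs are combined, so majority voting is assumed here purely for illustration. The threshold rules standing in for trained SVM, RF, AdaBoost, and DT models, and the name `majority_vote`, are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A base classifier maps a feature vector to a label: 1 = COVID-19, 0 = normal.
Classifier = Callable[[Sequence[float]], int]


def majority_vote(classifiers: List[Classifier], features: Sequence[float]) -> int:
    """Return the label chosen by most base classifiers.

    Ties are resolved in favour of label 1 (an assumption: for a
    screening setting, a doubtful case is flagged rather than dismissed).
    """
    votes = Counter(clf(features) for clf in classifiers)
    return 1 if votes[1] >= votes[0] else 0


# Hypothetical stand-ins for four trained classifiers; real models would be
# fitted on CNN-extracted features instead of using fixed thresholds.
base_classifiers: List[Classifier] = [
    lambda f: int(f[0] > 0.5),
    lambda f: int(f[1] > 0.3),
    lambda f: int(f[0] + f[1] > 1.0),
    lambda f: int(f[1] > 0.8),
]

print(majority_vote(base_classifiers, [0.7, 0.4]))  # prints 1 (three of four vote 1)
```

Hard voting is only one option. Averaging predicted probabilities or training a second-stage model on the base outputs are common alternatives.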
Elhoseny et al. [16] proposed an automatic heart disease diagnosis (AHDD) scheme that incorporates a binary CNN with a novel MAFW model. The MAFW model includes four software agents that operate a genetic algorithm (GA), an SVM, and a naive Bayes (NB) classifier. The agents instruct the GA to perform a global search over the heart disease (HD) features and to adjust the weights of the SVM and NB during the initial classification.
A tuning process is then applied to the CNN model to ensure that an optimal group of features is used in HD identification. The CNN has five layers and classifies patients as healthy or as having HD based on analysis of the enhanced HD features. Ryzhikova et al. [17] described a novel technique for diagnosing Alzheimer's disease (AD) from cerebrospinal fluid (CSF) using near-infrared (NIR) Raman spectroscopy and ML analysis. Raman spectroscopy can probe the entire biochemical composition of a biological fluid at once. It has high potential for detecting the slight changes specific to AD at the early stages of pathogenesis. NIR Raman spectra were measured for CSF samples obtained from sixteen healthy control (HC) subjects and twenty persons diagnosed with AD. Artificial neural network (ANN) and SVM-DA statistical models were used for discrimination, and they distinguished AD from HC subjects with 84% sensitivity and specificity.
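For reference, sensitivity and specificity as quoted above are computed from the counts of a binary confusion matrix: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The following minimal sketch shows the calculation. The labels are invented toy values, not data from the cited study, and the function name is an assumption.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Compute (sensitivity, specificity) for binary labels (1 = diseased, 0 = healthy)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # diseased, detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # diseased, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Toy example: four diseased and four healthy subjects.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # sensitivity=0.75, specificity=0.75
```

Reporting both values matters in diagnosis because a classifier can reach high accuracy on imbalanced data while still missing most diseased cases.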
Lamba et al. [18] proposed a speech-signal-based hybrid scheme for the early diagnosis of Parkinson's disease. To this end, the researchers tried several combinations of feature selection (FS) approaches and classification algorithms and built the method on the best-performing combination. To form the combinations, three FS approaches (GA, mutual information gain, and extra trees) and three classifiers (NB, k-nearest neighbors (KNN), and RF) were employed. Rasheed et al. [19] investigated the potential of ML approaches for the automated, high-performance diagnosis of COVID-19 from X-ray images. Two widely used classifiers were selected: LR and CNN. The primary objective is to make the scheme fast and efficient. Furthermore, a dimensionality reduction method based on principal component analysis (PCA) was examined to further accelerate learning and improve classification performance by selecting highly discriminative features. DL-based methods demand far more training samples than traditional methods, but a sufficient number of labeled training instances was not available for COVID-19 X-ray images.
Hence, data augmentation with a generative adversarial network (GAN) model was used to further enlarge the training set and reduce overfitting.
Li et al. [20] proposed an accurate and efficient ML-based system for the diagnosis of heart disease. The system is built on classification algorithms including SVM, LR, ANN, KNN, NB, and DT, while standard FS algorithms, namely Relief, minimum redundancy maximum relevance, least absolute shrinkage and selection operator, and local learning, are employed to remove irrelevant and redundant features. They also presented a new fast conditional mutual information FS algorithm to solve the FS problem. The FS algorithms are employed to increase classification performance and decrease the runtime of the classification method. O'Connor et al. [21] demonstrated an effective DL approach for cell identification and disease diagnosis using spatio-temporal cell data recorded in a digital holographic microscopy system. Shearing digital holographic microscopy is implemented in a low-cost, compact, 3D-printed, field-portable microscopy system to record video-rate data of live biological cells with nanometer sensitivity to axial membrane fluctuations. Features are then extracted from the reconstructed phase profiles of the segmented cells at all time instances for classification. The time-varying data of all extracted features are input to a recurrent Bi-LSTM network that learns to classify cells according to their time-varying behavior.
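To make the idea of information-based feature selection concrete, the sketch below ranks discrete features by their mutual information with the class label, I(X; Y) = Σ p(x, y) log2( p(x, y) / (p(x) p(y)) ). This is plain mutual information, not the conditional variant proposed by Li et al., and the feature names and toy values are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def mutual_information(xs: Sequence[int], ys: Sequence[int]) -> float:
    """Mutual information (in bits) between two discrete sequences of equal length."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_joint = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_joint * math.log2(p_joint / (p_x * p_y))
    return mi


def rank_features(columns: Dict[str, Sequence[int]], labels: Sequence[int]) -> List[str]:
    """Return feature names ordered from most to least informative about the label."""
    return sorted(
        columns,
        key=lambda name: mutual_information(columns[name], labels),
        reverse=True,
    )


# Toy data: one feature that mirrors the label and one that is independent of it.
labels = [1, 1, 0, 0]
features = {
    "noise": [1, 0, 1, 0],
    "chest_pain": [1, 1, 0, 0],
}
print(rank_features(features, labels))  # ['chest_pain', 'noise']
```

A filter of this kind scores each feature independently, so it cannot detect redundancy between features. Conditional criteria address that by scoring a candidate given the features already selected.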
Kavitha et al. [22] aimed to identify and diagnose the coronavirus family rapidly. A ResNet-100 CNN (RCNN) DL technique combined with an LR classifier is used for rapid identification. The AI applications against COVID-19 include medical imaging, cough samples, molecular-scale work from proteins to drug development, lung delineation, etc. Chai [23] employed a knowledge graph technique to connect scattered and trivial knowledge across many health data systems in order to assist disease diagnosis. Taking thyroid disease as an example, the study constructs a medical knowledge graph and applies it to smart health diagnosis. First, the relationships among biomedical entities are extracted to construct a biomedical knowledge graph. Next, the entities and relationships in the knowledge graph are converted into low-dimensional continuous vectors by a knowledge graph embedding model. Lastly, the recognized pathological disease relationship data are used to train the disease diagnosis system of the BSTLM model.
Stephen et al. [24] presented a CNN trained from scratch to detect and classify the presence of pneumonia in a set of chest X-ray images. Unlike approaches that rely solely on transfer learning (TL) or on conventional hand-crafted methods to achieve high classification accuracy, they built a CNN from scratch to extract features from a given chest X-ray image and classify it to determine whether the patient has pneumonia. This method can help mitigate the reliability and interpretability problems frequently encountered when handling health images. Chen et al. [25] proposed an intelligent IoT-based application with ML algorithms for diagnosing human brain hemorrhage. Using computed tomography scan images from an intracranial dataset, an SVM and a feedforward neural network (FFNN) were used for classification. Overall classification results of 80.67% and 86.7% were obtained for the SVM and the FFNN, respectively, from which it is concluded that the FFNN performs better in classifying intracranial images. Civit-Masot et al. [26] examined the efficacy of VGG16-based DL models for detecting COVID-19 and pneumonia from chest radiographs. The results show a high sensitivity in the detection of COVID-19, close to one hundred percent, together with a high specificity, indicating that the approach could be used as a screening test. The AUC of the ROC curve exceeds 0.9 for every class considered.
Recent work on disease detection, classification, and image segmentation using various deep learning techniques has been discussed by many authors [27]. Venugopal et al. [28] presented a unique multi-modal data fusion-based feature extraction technique with a deep learning (DL) model, abbreviated as FFE-DL, for intracranial haemorrhage detection and classification, also known as FFEDL-ICH. Vaiyapuri et al. [27] presented an IoT-enabled elderly fall detection model using an optimal deep convolutional neural network (IMEFD-ODCNN) for smart homecare. Sundaram et al. [29] presented a novel deep transfer learning based framework for COVID-19 detection and segmentation of infections from chest X-ray images. It was realized as a two-stage cascaded framework with classifier and segmentation subnetwork models. Yacin et al. [30] tested the efficacy of a classification model for skin lesion diagnosis that combines the GrabCut algorithm with an adaptive neuro-fuzzy classifier (ANFC) model.

OPEN ISSUES AND FUTURE SCOPE
This section discusses the open issues in decision-making in the health sector. Medical text contains a great deal of unstructured data that is useful for making decisions.
Automatically transforming unstructured data into structured data is a costly and difficult process that requires trained personnel. Automated relation extraction approaches, both ML-based and rule-based, are available; however, they depend on extracting the modifiers of medical entities. ML-based methods work well for this purpose, but they require difficult feature engineering. DL removes this burden, but it is constrained by the absence of techniques that can efficiently capture and represent every syntactic and semantic feature of a long, complex sentence. For that reason, relation extraction remains difficult for researchers. To make effective use of medical text, it must be converted into labeled output that carries accurate information. However, automatic classification of medical text using key annotations is difficult because the text contains narrative sentences, ambiguous words and concepts, unclear sentence boundaries, and abbreviations. This lack of key annotations is the critical factor preventing medical text from being used in related fields of study.
Medical records contain a great deal of private and personal data about patients. The process of identifying and removing these data is known as de-identification.
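A rule-based de-identification pass built on regular expressions can be sketched as follows. This is a minimal illustration, not a production tool: the three patterns, the tag names, and the sample note are all hypothetical, and real clinical text contains far more identifier formats than these rules cover.

```python
import re

# Illustrative patterns only (assumed formats for dates, phone numbers,
# and medical record numbers); real identifiers vary much more widely.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d+\b"),
}


def deidentify(text: str) -> str:
    """Replace every match of each pattern with a bracketed placeholder tag."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text


# Invented sample note, not taken from any real record.
note = "Pt seen 03/14/2021, MRN: 448812, call 555-123-4567. Follow-up w/ Dr Smith."
print(deidentify(note))
# Pt seen [DATE], [MRN], call [PHONE]. Follow-up w/ Dr Smith.
```

The clinician's name passes through untouched because no rule anticipates it, which shows how fixed patterns struggle with the irregular terms and abbreviations typical of medical text.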
Conventional de-identification approaches, such as rule-based methods, which depend on rare data in the medical text, machine learning methods, which depend on data not in the dictionary, and other methods that rely on regular expressions, heuristics, and dictionary lookups, are not well suited to medical text, which contains irregular terms, many abbreviations, and untokenized words.
Despite recent work on visualizing high-level features through the filter weights of a CNN, DL models as a whole are not always interpretable. Consequently, many authors use DL methods as a black box, with no means of explaining why they give better results or of making modifications when misclassification occurs. Training effective and reliable models also requires a huge set of training data for expressing novel concepts.
Another important consideration when applying DL models is that, for many applications, the raw data cannot be used directly as input to the DNN. Preprocessing, normalization, or a change of input domain is therefore frequently required before training. Furthermore, setting the various hyperparameters that control the architecture of a DNN, such as the number and size of filters in a CNN or its depth, is still a blind exploration process that typically requires careful validation. Finding the right preprocessing of the data and the optimal set of hyperparameters can be difficult, as it lengthens the training procedure further and requires considerable human expertise and training resources, without which an efficient classification model cannot be obtained.
In addition, it is possible to add small variations to input examples (such as imperceptible noise in images) that cause the samples to be misclassified. It should be noted, however, that most ML methods are vulnerable to this issue.
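The sensitivity to small input variations can be demonstrated even on a linear classifier. The sketch below shifts each feature by a small step against the sign of its weight, in the spirit of the fast gradient sign method. The weights, bias, input, and step size are invented for illustration and do not come from any model discussed in this survey.

```python
from typing import List


def predict(weights: List[float], bias: float, x: List[float]) -> int:
    """Linear classifier: label 1 if the weighted sum plus bias is positive."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def sign(v: float) -> int:
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def perturb(weights: List[float], x: List[float], epsilon: float, label: int) -> List[float]:
    """Shift every feature by at most epsilon in the direction that opposes `label`.

    For a linear score the gradient with respect to the input is the weight
    vector, so stepping along -sign(w) lowers the score and +sign(w) raises it.
    """
    direction = -1 if label == 1 else 1
    return [xi + direction * epsilon * sign(w) for w, xi in zip(weights, x)]


# Hypothetical model and input (all values assumed for illustration).
weights = [0.8, -0.6, 0.4]
bias = -0.1
x = [0.5, 0.4, 0.1]
epsilon = 0.1

original = predict(weights, bias, x)
x_adv = perturb(weights, x, epsilon, original)
print(original, predict(weights, bias, x_adv))  # prints: 1 0
```

Each feature moves by only 0.1, yet the score drops by epsilon times the sum of the absolute weights (0.18 here), which is enough to cross the decision boundary. In high-dimensional inputs such as images, many tiny per-pixel changes add up in the same way.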

CONCLUSION
This paper has presented a detailed review of recently developed predictive models using ML and DL approaches in the health sector. The study began with an elaborate discussion of the major differences between ML and DL models and their significance in the health sector. The existing ML and DL models were then reviewed along several dimensions, namely objectives, underlying methodology, input source, dataset used, performance validation, metrics, and so on.
Finally, the open challenges and future scope of HI were discussed to offer new ideas to readers. In future, the focus will be on designing advanced DL architectures with hyperparameter optimizers to achieve maximum prediction performance in the health sector.

DISCLAIMER
The products used for this research are commonly and predominantly used products in our area of research and country. There is absolutely no conflict of interest between the authors and the producers of the products, because we do not intend to use these products as an avenue for any litigation but for the advancement of knowledge. Also, the research was not funded by the producing company; rather, it was funded by the personal efforts of the authors.

CONSENT
It is not applicable.

ETHICAL APPROVAL
It is not applicable.