Machine Learning Algorithms to Predict Disease from Symptoms.
Contents. Problem Definition. Methodology. Dataset. System Design and Implementation.
Problem Definition. Medical data mining holds considerable promise for uncovering hidden patterns in large medical data sets, and these patterns can support clinical diagnosis. The available raw medical data, however, is widely dispersed, heterogeneous, and voluminous. These details must be gathered in a systematic manner, and the information can then be integrated into a hospital information system. Data mining offers a user-friendly way of discovering new and hidden patterns in data. In medicine, disease diagnosis is a crucial and difficult task that must be completed accurately and efficiently, and automating it would be tremendously beneficial.
Methodology. This research relies on a small set of algorithms that are well suited to selecting the correct class for a given input. These methods are based on decision-making techniques and are described below.
Naive Bayes. The Naive Bayes algorithm (NB) is a Bayesian graphical model with a node for each of the columns, or features. It is called naive because it assumes that all features are conditionally independent given the class, and that the rows (examples) are independent of one another. This simplifying assumption is both an advantage (fast training, good behavior with limited data) and a disadvantage (features are rarely truly independent).
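A minimal sketch of Naive Bayes on binary symptom vectors, using synthetic data rather than the actual Kaggle files; the symptom names and disease labels are hypothetical, and BernoulliNB is assumed here because the symptoms are 0/1 flags.

```python
# Naive Bayes sketch on 0/1 symptom flags (synthetic, hypothetical data).
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Each row is a patient: 1 = symptom present, 0 = absent.
X = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1]])
y = np.array(["flu", "flu", "allergy", "allergy"])  # hypothetical labels

model = BernoulliNB()
model.fit(X, y)

print(model.predict([[1, 0, 1, 0]]))        # most likely disease
print(model.predict_proba([[1, 0, 1, 0]]))  # posterior probability per class
```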
Decision Tree. Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation. For example, a decision tree can learn from data to approximate a sine curve with a set of if-then-else decision rules, as in the sketch after this paragraph. The deeper the tree, the more complex the decision rules and the better the fit.
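A small sketch of that sine-curve illustration, assuming scikit-learn's DecisionTreeRegressor; the depths and data ranges are arbitrary choices for demonstration.

```python
# A decision tree learns piecewise-constant if-then-else rules that
# approximate sin(x); deeper trees produce finer approximations.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)   # inputs in [0, 5)
y = np.sin(X).ravel()                      # sine targets

shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)  # coarse rules
deep = DecisionTreeRegressor(max_depth=5).fit(X, y)     # finer rules

X_test = np.arange(0.0, 5.0, 0.5)[:, np.newaxis]
print(shallow.predict(X_test))
print(deep.predict(X_test))
```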
Random Forest. A random forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each tree in the forest outputs a class prediction, and the class with the most votes becomes the model's prediction.
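A brief sketch of that voting behavior with scikit-learn's RandomForestClassifier; the synthetic data from make_classification stands in for the real symptom columns.

```python
# Random forest sketch: individual trees vote, the forest takes the majority.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, n_classes=3,
                           n_informative=5, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# Inspect the first few individual tree votes for one sample,
# then the forest's aggregated (majority-vote) prediction.
votes = [tree.predict(X[:1])[0] for tree in forest.estimators_]
print("first 5 tree votes:", votes[:5])
print("forest prediction:", forest.predict(X[:1]))
```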
Gradient Boosting. The main idea behind this algorithm is to build models sequentially, with each subsequent model trying to reduce the errors of the previous one. How is the error reduced? By fitting each new model to the errors, or residuals, left by the model before it.
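A hand-rolled illustration of this residual-fitting idea, shown in a regression setting for clarity; the learning rate, tree depth, and number of steps are arbitrary choices, not the study's settings.

```python
# Boosting sketch: each new tree is fit to the residuals (errors) left by
# the ensemble so far, and its prediction is added with a small step size.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 1) * 6
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

prediction = np.zeros_like(y)   # start from a trivial model
learning_rate = 0.1
for step in range(50):
    residual = y - prediction                        # errors of the current model
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * stump.predict(X)   # correct the errors a little

print("mean squared error:", np.mean((y - prediction) ** 2))
```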
Dataset. The Disease Prediction Using Machine Learning dataset is available on Kaggle. The complete dataset consists of two CSV files, one for training and one for testing the model. Each CSV file has 133 columns: 132 of these columns are symptoms that a person experiences, and the last column is the prognosis. These symptoms are mapped to 41 diseases, so each set of symptoms is classified into one of those classes.
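A sketch of loading the two CSVs with pandas; the file names "Training.csv" and "Testing.csv" and the exact label column name "prognosis" are assumptions about the Kaggle download.

```python
# Load the assumed training/testing CSVs and split symptoms from the label.
import pandas as pd

train = pd.read_csv("Training.csv")   # assumed file name
test = pd.read_csv("Testing.csv")     # assumed file name

X_train = train.drop(columns=["prognosis"])   # 132 symptom columns (0/1 flags)
y_train = train["prognosis"]                  # disease label, 41 classes

print(X_train.shape)      # expected: (n_rows, 132)
print(y_train.nunique())  # expected: 41 distinct diseases
```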
System Design and Implementation. This study was implemented in the Python language: the SKLearn (scikit-learn) library was used to build the machine learning algorithms, and the Pandas library was used to read the data and load it into the algorithms. The LabelEncoder class converts class names into numbers so that they can be passed to the algorithms. The Matplotlib and Seaborn libraries were used to plot and display the confusion matrix in order to evaluate the performance of the algorithms. The study design and implementation process is summarized in Figure 5.
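An end-to-end sketch of the described pipeline, assuming the file and column names introduced above and using a random forest as a stand-in for whichever algorithm is being evaluated.

```python
# Pipeline sketch: encode labels, train a model, plot a confusion matrix.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

train = pd.read_csv("Training.csv")   # assumed file name
test = pd.read_csv("Testing.csv")     # assumed file name

encoder = LabelEncoder()
y_train = encoder.fit_transform(train["prognosis"])  # disease names -> integers
y_test = encoder.transform(test["prognosis"])
X_train = train.drop(columns=["prognosis"])
X_test = test.drop(columns=["prognosis"])

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix heatmap for evaluating per-class performance.
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```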
Thanks!