COMPARISON OF SOME CLASSICAL AND META-HEURISTIC OPTIMIZATION TECHNIQUES IN THE ESTIMATION OF THE LOGIT MODEL PARAMETERS

Ozge Akkus 1 and Emre Demir 2. 1. Department of Statistics, Faculty of Science, Mugla Sıtkı Kocman University, Turkey. 2. Department of Biostatistics, Faculty of Medicine, Hitit University, Turkey.

In the second part of the application, the performance of the studied optimization approaches is discussed over a real data set on Alopecia disease. The best model results are interpreted and the statistically significant factors influencing the disease are determined.

Literature Review:-
Some of the studies in the current literature related to the GA and NM algorithms are presented below. Lee and Kim (2015) compared the performance of the genetic algorithm and particle swarm optimization approaches for inverse surface radiation problems.
In their study, Stylianou et al. (2015) discussed the effectiveness of neural networks, genetic algorithms, support vector machines and many other heuristic optimization techniques to investigate the mortality rate due to burns. Yuan and Lee (2015) compared the effectiveness of different methods for calculating a risk probability, taking into account the financial indicators of 88 Taiwan companies in the period 2001-2010. Pfeifer et al. (2015) focused on delays in the completion of projects; a genetic algorithm approach was utilized to estimate the maximum risk that project delays would pose.
In the study made by Hadji et al. (2015), theoretical and experimental analyses of maximum power point tracking in photovoltaic (PV) systems using a genetic algorithm were presented.
Aguilar-Rivera et al. (2015) discussed the use of many evolutionary computation methods, such as genetic algorithms, genetic programming and multi-objective evolutionary algorithms, for the solution of financial problems. It is concluded that while the interest in other methods varies over time, the genetic algorithm maintains its popularity in every period.
In his study on the bankruptcy probability of SMEs in Italy, Gordini (2014) compared the results obtained from the genetic algorithm, logistic regression and support vector machines. It is observed that the genetic algorithm gives significantly more successful results. Johnson et al. (2013) used logistic regression and a genetic algorithm in the diagnosis of Alzheimer's disease. Meng and Weng (2011) used the genetic algorithm approach and logistic regression analysis to assess the risk of the working area. The results demonstrated that the genetic algorithm approach performs better than the binary logistic regression model.
In the study of Babaoğlu et al. (2010), the efficiency of particle swarm optimization and the genetic algorithm is compared over a data set on coronary artery disease.
Koh et al. (2008) used a genetic algorithm approach in constructing a probability scoring system for assessing adverse drug reactions. The newly founded system was observed to provide excellent signals at a rate of 83%. Liu and Ong (2008) used a genetic algorithm approach to determine the most effective variables and the number of clusters in order to get successful results from cluster analysis for marketing segmentation. Considering the number of clusters and variables determined, it is observed that the genetic algorithm is extremely successful in finding the global optimum points.
In the study made by Hadi and Gonzalez-Andujar (2009), seedling time was estimated as a function of thermal time using both nonlinear regression and a genetic algorithm. Given its flexibility and less restrictive assumptions, they conclude that the genetic algorithm is the more effective method.

Methodology:-
Binary Logit Model (BLM):-
The purpose of the BLM is to establish an acceptable model which can identify the relationship between the dependent variable and the explanatory variables, using the minimum number of variables, in order to achieve the best model fit. The BLM is easy to apply and interpret; thus, it is commonly used in many areas, especially in the social sciences, biology, medicine, economics, agriculture and veterinary sciences.
In BLM, the dependent variable Y has two categories, coded as "0" and "1". The probability expression of the model, which uses the "logit link function", is expressed as the following.

P(Y = 1 | x) = exp(x'β) / (1 + exp(x'β))     (1)

In Eq. (1), P(Y = 1 | x) represents the probability of observing the category "1" conditional on the explanatory variables X, β is the vector of model parameters to be estimated and N is the number of observations. The relationship between the expected value of Y conditional on X is not in linear form. Considering that the model is also nonlinear in the parameters, MLE is commonly used for obtaining efficient estimates of the parameters (see Menard, 2002 for model assumptions, theoretical background of the model and other details).

Maximum Likelihood Estimation Method for BLM:-
Probability models for a binary dependent variable are generally expressed as E(Y | X) = P(Y = 1 | X) = P_i. Considering that each observation has a Bernoulli distribution with parameter P_i, the probability of Y_i being equal to 0 or 1 is

P(Y_i = y_i) = P_i^(y_i) (1 - P_i)^(1 - y_i),   y_i = 0, 1.
The probability of observing the N-sample, depending on the explanatory variables X and the dependent variable set Y, is equal to the product of the N individual probability expressions under the assumption that the observations are independent.
Because P_i and P(Y | X) depend on the parameters, and the aim is to estimate the unknown parameters, this dependency can be made explicit by defining a likelihood function L(.).
L(.) denotes the probability (or likelihood) of observing the sample Y. The basis of MLE is to find the optimal estimated β vector which maximizes the probability of observing Y.
The likelihood function for BLM is given below. The logarithm of both sides of Eq. (7) is taken for easier calculation.
If the estimated β vector maximizes L(Y | X, β), then it also maximizes log L(Y | X, β). Therefore, in order to obtain the parameter estimates maximizing the probability of observing the sample Y, the first-order derivatives are taken and set equal to 0, and a total of K equations for K parameters are obtained. The simultaneous solution of these K equations provides the likelihood estimates. The likelihood equations for the BLM are derived from the expression given below.
Because the likelihood equations are not linear in β_k, closed-form solutions of these equations cannot be obtained. In this case, some traditional iterative optimization procedures, such as the Newton-Raphson or Fisher-Scoring algorithms, are applied (Agresti, 2002).
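The objective all three algorithms in this study work with is the (negative) log-likelihood described above. A minimal sketch, with variable names chosen by us rather than taken from the paper:

```python
import math

def neg_log_likelihood(beta, X, y):
    """Negative log-likelihood of the binary logit model: the quantity
    whose minimum the NR, NM and GA methods all search for."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(xj * bj for xj, bj in zip(xi, beta))
        p = 1.0 / (1.0 + math.exp(-z))  # P(Y=1 | xi), Eq. (1)
        total += yi * math.log(p) + (1.0 - yi) * math.log(1.0 - p)
    return -total
```

Maximizing log L is equivalent to minimizing this function, which is why the tables later in the paper report the minimum value of the negative log-likelihood.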

Newton-Raphson (NR) Algorithm:-
In the NR algorithm, the approximate parameter estimates depend on the starting point. Using a Taylor series expansion and omitting higher-order terms, one step of the NR iteration can be expressed as follows.

β^(t+1) = β^(t) − H^(−1) g     (10)

In Eq. (10), g is the gradient vector of the log-likelihood function and H is the Hessian matrix consisting of its second-order partial derivatives. These algorithms, based on the unconstrained optimization of an objective function, are the traditional methods commonly used for the maximization of the log-likelihood function. However, due to the restrictive assumptions mentioned above, the efficiency of the NM algorithm in the estimation of the BLM parameters is discussed as well.
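The NR step in Eq. (10) applied to the logit log-likelihood can be sketched with NumPy as follows; the small data set, the all-zero starting point and the tolerances are illustrative assumptions, not the study's settings.

```python
import numpy as np

def newton_raphson_logit(X, y, tol=1e-10, max_iter=100):
    """Iterate beta_new = beta - H^(-1) g until the step size is tiny."""
    beta = np.zeros(X.shape[1])  # all-zero starting point
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # fitted probabilities
        g = X.T @ (y - p)                          # gradient of log L
        H = -(X * (p * (1 - p))[:, None]).T @ X    # Hessian of log L
        step = np.linalg.solve(H, g)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical data: an intercept plus one binary regressor.
X = np.array([[1, 0], [1, 0], [1, 0], [1, 1], [1, 1], [1, 1]], dtype=float)
y = np.array([1, 0, 0, 1, 1, 0], dtype=float)
beta_hat = newton_raphson_logit(X, y)
```

For this toy data the group proportions are 1/3 and 2/3, so the NR solution reproduces the closed-form log-odds, illustrating that the iteration finds the root of the likelihood equations.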

Nelder-Mead (NM) Algorithm:-
The classical NM method, which is one of the multivariate, unconstrained, derivative-free optimization techniques, is a simplex method developed by John Ashworth Nelder and Roger Mead in 1965 in order to find the local minimum of a multivariate function. A simplex is an n-dimensional geometrical form containing (n+1) points. Thus, a simplex in two dimensions is a triangle, and a simplex in three dimensions is a tetrahedron with four triangular faces. To find the coordinates and the minimum value of the point where the n-dimensional function is minimal with this method, the initial coordinates of (n+1) points must be provided. Given these coordinates, the algorithm searches the surface topography and moves toward a minimum. Depending on the starting coordinates, the calculated value may be a local or the global minimum.
In this method, the (n+1) points are placed as a regular simplex in n-dimensional Euclidean space. As the simplex is a triangle in two dimensions, the process begins by taking the three corner points of the first triangle. The point which has the largest function value in a maximization problem and the smallest function value in a minimization problem is called the Best Vertex (B), while the next one is called the Good Vertex (G). The point with the smallest function value in maximization problems and the point with the largest function value in minimization problems is called the Worst Vertex (W). This process is repeated at each iteration, moving toward the optimum solution through a sequence of triangles in two-dimensional problems. The point where the function values at the corner points B, G and W become equal to each other is the optimal point.
In two dimensions, this method is a pattern search method comparing the function values at the corners of the triangle. Considering the minimization of a two-variable function z = f(x, y), the function takes its greatest value at the worst corner of the triangle. Therefore, this worst corner is replaced by a new point, and thus a new triangle is obtained. The search process is then conducted in this new triangle, so the process becomes a sequence of triangles whose corners get closer together at each iteration step. The process ends with the triangles approaching a single point, which is the desired minimum point. This method, developed for two-variable functions, generalizes naturally to n dimensions.

Midpoint of Good Side:-
It is used for calculating the midpoint M of the line drawn between points B and G. This point is found by taking the mean of their coordinates. Reflecting the worst point W through M gives the reflection point R. If the function value at R is smaller than the function value at W (f(R) < f(W)), then the iteration moves towards the minimum. The minimum point could lie further beyond R; therefore, when the line from point M to point R is extended by the length d, the end point E is obtained. Thus, a BGE triangle is obtained from the BGW triangle by using the expansion operation.

Fig 2:- Producing the BGE Triangle and Point E by the Expansion Operator
If the function value at point E is smaller than the function value at point R (f(E) < f(R)), a corner better than R has been found. The vectorial formulation for point E obtained by the expansion process, E = 2R − M, is given in Eq. (13) (Mathews and Fink, 2004).
Finding Point C Using the Contraction Procedure:- If the function value at point R is equal to the function value at point W (f(R) = f(W)), another point must be tested. The function value may be smaller at point M, but W cannot simply be replaced by M, because a new triangle needs to be obtained. When the midpoints of the lines WM and MR are called C1 and C2 respectively, the new triangle will be called BGC. In this case, BGC1 and BGC2 are not different in two dimensions; however, there is a difference in higher dimensions (Mathews and Fink, 2004).

Logical Decisions for Each Step:-
In each step described above, a new corner point is found and replaces W. This can be explained via the algorithm given below, which chooses among the reflection, expansion and contraction operations using the most suitable criterion. The NM algorithm seems to be a good alternative to the NR algorithm.
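The decision logic above can be sketched as a simplified two-dimensional triangle search in Python; the termination tolerance, iteration limit and shrink criterion are our assumptions, not taken from the paper.

```python
import numpy as np

def nelder_mead_2d(f, simplex, tol=1e-8, max_iter=500):
    """Simplified 2-D NM search over a triangle with corners B, G, W."""
    pts = [np.asarray(p, dtype=float) for p in simplex]
    for _ in range(max_iter):
        pts.sort(key=f)                  # order corners as B, G, W
        B, G, W = pts
        M = (B + G) / 2.0                # midpoint of the good side
        R = 2.0 * M - W                  # reflection of W through M
        if f(R) < f(G):
            if f(B) < f(R):
                new = R                  # accept the reflected point
            else:
                E = 2.0 * R - M          # expansion, Eq. (13)
                new = E if f(E) < f(R) else R
        else:
            C1, C2 = (W + M) / 2.0, (M + R) / 2.0   # contraction points
            new = C1 if f(C1) < f(C2) else C2
            if f(new) >= f(W):           # contraction failed: shrink toward B
                pts = [B, (B + G) / 2.0, (B + W) / 2.0]
                continue
        pts = [B, G, new]
        if max(np.linalg.norm(p - B) for p in pts) < tol:
            break                        # triangle has collapsed to a point
    return min(pts, key=f)
```

On a smooth convex function, the sequence of triangles shrinks toward the minimum exactly as the text describes.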

Shrinking Toward B:-
If the contraction step also fails to improve on W, the triangle is shrunk toward B: G and W are replaced by the midpoints of the sides BG and BW.
However, considering the restrictive assumptions of the iterative methods, GA is one of the most commonly used meta-heuristic methods, especially in the optimization of non-linear functions.
Genetic Algorithm:- GA is an intelligence technique that searches for the best solution to a problem using evolutionary processes. Influenced by Charles Darwin's theory of evolution, GA is a meta-heuristic algorithm that was introduced in the "evolution strategies" work of Rechenberg (1973), a pioneer in evolutionary computation and artificial evolution; it was developed by John Holland (1975) and popularized by the book published by Goldberg in 1989 (for details see Genetic Algorithms in Search, Optimization and Machine Learning). GA, like other heuristic algorithms, does not guarantee the optimum point which minimizes or maximizes an objective function; however, its solution quality is very high.
Unlike other optimization methods, GA conducts the search using a population of candidate solutions instead of just one candidate solution. Thus, the solution space is scanned in parallel from multiple starting points. This characteristic is the major factor enabling GAs to find the global optimum value without getting stuck at local optima. GA works with coded representations of candidate solutions (formed according to certain rules) instead of the candidate solutions themselves. In classical genetic algorithms, this encoding is formed using an alphabet consisting of 0s and 1s (BIT - binary digit). While binary coding is common in GAs, symbolic and real-number encodings are also used in many applications (Mitchell, 1999; Karr and Freeman, 1999).
In GA, the search for the optimum solution of a problem is performed by some basic operations: determination of the objective function, encoding, creating the initial population, and the selection, crossover and mutation operations. First, encoding is performed in accordance with the nature of the problem. Then, the starting population is created, often randomly. The fitness value of each string in the initial population is calculated. Finally, the selection, crossover and mutation operators are used to change the population and create a new generation. GA proceeds by assigning fitness values to each new population until the best solution is found (for details of the schema theorem, see Holland, 1992; Goldberg, 1989; Reeves and Rowe, 2002).
Each individual in the population is called a chromosome. In accordance with the fitness condition, chromosomes evolve and create new generations. Two chromosomes from the current generation are combined using the crossover or mutation operators to create new individuals; such a new individual is called a "child". Individuals matching the eligibility criterion (fitness value) are kept in the generation, while those that do not are eliminated. This process is called natural selection. Thus, GAs produce constantly improving solutions in accordance with the "survival of the fittest" rule found in nature. For this purpose, GAs use a fitness function to determine what is best, and operators like recombination, crossover and mutation to create new solutions. During these processes, the chance for the best chromosomes to be selected is maximized. A few generations later, the algorithm converges to the optimum solution where the best chromosomes exist.
Encoding:- It is necessary to decide how the solutions (individuals) in the population are represented. Representation usually takes the form of chromosomes, and the special structures within each chromosome are called "genes". In binary encoding, every chromosome consists of a sequence of characters (0 or 1). Binary encoding is the most commonly used encoding method as well as the one most suitable for the structure of GA. However, there is no general coding technique that applies to all problems.
Starting Population:- In this step, n chromosomes are chosen randomly as a starting group. The determination of the initial population is important with respect to whether the GA converges fast or gets stuck in local optima. Population size is one of the parameters that affect the performance of the solution (Pasia, Hermosilla and Ombao, 2005). A small population size may cause the search space not to be searched effectively, while a large population size may increase the processing load and slow the algorithm down. Some studies suggest that the population size should be around 20-30, while others argue that the ideal population size should be between 50 and 100 (Reeves and Rowe, 2002).

Fitness Value:-
In each iteration step of GA, the fitness value of each individual in the population is calculated and recorded. Individuals with higher fitness values are more likely to be transferred to the next generation; GA cannot perform selection without fitness values. A new population is created in each iteration step. In this population, every individual has a fitness value, and the chromosome with the best fitness value is recorded. After the iterations are completed, all populations are compared and the best of each population's best values is considered as the optimum solution of the GA.

Selection Operator:-
The selection process chooses chromosomes from the old population for the new population according to their fitness values, in order to have individuals with higher fitness values in new generations. A parent with a high fitness value increases its chance of selection, depending on the principle of the selection method used. For the selection process, the Tournament, Roulette Wheel, Sorting, Steady State and Stochastic Uniform methods can be used. Goldberg and Deb (1991) stated that none of these methods is superior to the others.

Crossover Operator:- The basic function of this operator is to increase the probability of a desired outcome by enabling new generations to contain individuals different from earlier generations, thereby exploring the solution space. Crossover employs the idea that new chromosomes become better by taking good genes from old chromosomes; the crossover operator is expected to create better chromosomes by combining good characteristics of the current chromosomes. The crossover probability specifies how often the crossover operation is performed. If there is no crossover (a crossover probability of 0%), the children have the same chromosomes as the parents; if the crossover probability is 100%, the children are completely created by crossover.
Mutation Operator:- Mutation prevents the search from getting stuck at local optimum points. While the frequency of crossover is determined by the crossover probability (Pc), the frequency of mutation is determined by the mutation probability (Pm). Mutation is usually applied with a much smaller probability than crossover, in order not to lose the high fitness levels achieved by crossover. The crossover probability is often between 0.25 and 1, while the mutation probability is between 0.001 and 0.01. The mutation probability determines the frequency at which chromosome parts mutate. If there is no mutation, the offspring are copied directly without change; if there is mutation, one or more genes of the chromosome change. If the mutation probability is 100%, all genes change; if it is 0%, none of them change.
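The operators described above (binary encoding, selection, crossover and mutation) can be sketched as a minimal GA. The parameter values, the tournament selection size and the OneMax test function used below are illustrative assumptions, not the settings used in the study.

```python
import random

def genetic_algorithm(fitness, n_bits, pop_size=30, pc=0.8, pm=0.01,
                      generations=100, seed=0):
    """Minimal GA: binary encoding, tournament selection,
    single-point crossover and bit-flip mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        new_pop = []
        while len(new_pop) < pop_size:
            # tournament selection of two parents (tournament size 3)
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            # single-point crossover with probability pc
            if rng.random() < pc:
                cut = rng.randrange(1, n_bits)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:
                c1, c2 = p1[:], p2[:]
            # bit-flip mutation with probability pm per gene
            for child in (c1, c2):
                for i in range(n_bits):
                    if rng.random() < pm:
                        child[i] = 1 - child[i]
                new_pop.append(child)
        pop = new_pop[:pop_size]
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) > fitness(best):
            best = gen_best          # keep the best-so-far chromosome
    return best
```

For example, with the OneMax fitness (the number of 1-bits in the chromosome), the population converges toward the all-ones string within a few dozen generations, illustrating the "survival of the fittest" dynamic described above.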

Application:-
In this part of the study, the performances of the NR, NM and GA methods in the estimation of the BLM parameters are compared over a simulated data set and a real data set on Alopecia disease.

Simulated Data Results:-
NR results for the estimation of BLM parameters:-
Practitioners generally use the results of this algorithm without considering the iteration number, the starting points or whether the function is differentiable. However, as mentioned earlier, this method may not converge within a reasonable number of iterations if the starting point is chosen far away from the best solution. The NR results for the BLM are given below. The minimum value of the negative log-likelihood function is 32.4960 according to the NR algorithm.

NM results for the estimation of BLM parameters:-
In this part of the study, parameter estimates for the same data set are obtained using the NM algorithm. Since this algorithm is not included in the standard package programs, the Matlab program is used for constructing the required commands for estimation. The general representation of the NM command is as follows.

[x, y, exitflag, output] = fminsearch(fun, [0, 0, 0, 0, 0])     (14)

Here, x represents the estimated parameter values; y represents the optimized value of the log-likelihood function; exitflag shows one of the integer values (-1, 0, 1) specifying the reason the algorithm stopped; and output reports the results of the run. The fminsearch command is entered in the "command window" to obtain the best parameter estimates. The starting points for all parameters are taken as "0". After the command is run, the results given in the last row of Table 4 are obtained. When Table 4 is examined, the optimum value of the log-likelihood function is found to be 32.48285466470697 in 5146 iterations. Three different sets of starting points and different iteration numbers are selected.
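For readers without Matlab, an equivalent search can be run with SciPy's Nelder-Mead implementation. The small data set below is a hypothetical stand-in for the simulated data in the text, and the tolerances are our choices.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical stand-in data: intercept plus one binary regressor.
X = np.array([[1, 0], [1, 0], [1, 0], [1, 1], [1, 1], [1, 1]], dtype=float)
y = np.array([0, 1, 0, 0, 1, 1], dtype=float)

def neg_log_lik(beta):
    """Negative log-likelihood of the binary logit model."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Nelder-Mead search from the all-zero starting point, as in Eq. (14).
res = minimize(neg_log_lik, x0=np.zeros(2), method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-10, "maxiter": 5000})
```

Here `res.x` plays the role of `x` in Eq. (14) (the estimated parameters) and `res.fun` the role of `y` (the minimized negative log-likelihood).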

GA results for the estimation of BLM parameters:-
In this section, the performance of GA is discussed in terms of the best starting population; the selection, crossover and mutation operators; and the crossover and mutation probabilities, over randomly generated data. Tables 5, 6 and 7 give the estimated LR results under different combinations of GA alternatives such as "Population Size", "Selection Function", "Number of Generations" and "Stopping Criteria". In order to achieve the optimum results, at least 100 different combinations are examined by individually trying the alternatives: four alternatives for the mutation function (use constraint dependent default, Gaussian, uniform, adaptive feasible), ten mutation rates, six crossover functions (scattered, single point, two point, intermediate, heuristic, arithmetic), ten crossover rates and five selection preferences.
Table 5 contains the estimated parameters and the minimum values of the negative log-likelihood function for different population sizes while the other factors are kept the same. It shows that the best result (the smallest function value, 32.48285466470753) is obtained when the population size is 40. This table also emphasizes the importance of selecting the optimal population size for getting better results. Table 6 shows the results under different numbers of generations and stopping criteria for the iteration process. We can easily see that better results are obtained as the value of the stopping criterion decreases; in this case the iteration number increases, so it is an expected result that much better results are obtained as the iteration number increases.
"Stopping criteria" and "number of generations" are taken as variable and the population size is set to 60. The smallest function value (32.482854664706964) is obtained with a stopping criterion of "1e-6" and "20000" generations. This means that the optimal number of generations has to be determined correctly for better model results.
Additionally, when the results are examined according to the three different stopping values of "1e-6", "1e-10" and "1e-15", the optimal function value is obtained for the "1e-6" stopping value in the scenario in which the number of generations is set to 20000. In Table 6, the optimal value is obtained when the "Roulette" selection function is used. As a result of all the treatments made up to this point, even though they do not differ very much from each other, the smallest value found for the negative log-likelihood function is 32.482854664706964. In Table 7, the results obtained with different selection functions ("Stochastic Uniform", "Roulette", "Uniform" and "Tournament") are given. The best solution is obtained from the "Roulette" function, whereas the worst solution belongs to the "Tournament" function. The combination of the optimum results over all the trials is given in Table 8, and the graphical representation of the iterative process is presented in Figure 5.

Comparison of NR, NM and GA results:-
The final comparison concerns the performance of the NR, NM and GA methods with respect to the minimum value of the negative log-likelihood in Table 10. When the results are compared, no significant variations are observed in terms of the parameter estimates. According to the minimum values of the negative log-likelihood function, GA is the most powerful method, followed by NM and NR. This result implies the success of GA in the parameter estimation of categorical dependent variable models even when the assumptions of the classical optimization techniques are provided. It also suggests that GA will probably give efficient estimates even when the classical optimization assumptions are not met.

Real data application:-
In this part of the study, we compare the performance of the NR, NM and GA approaches in the estimation process over a real data set on Alopecia disease and interpret the best model results obtained. The data set has been taken from the records of the Dermatology Unit of the Çorum Training and Research Hospital of Hitit University in 2013. The total number of patients who applied to the hospital with hair loss problems during 2013 is 95.
The study variable (Alopecia disease) is binary: diagnosed patients are coded as "1" and non-diagnosed patients as "0". The variables affecting the Alopecia diagnosis and the necessary abbreviations are given in Table 11; the quantitative variables include AST (Aspartate Aminotransferase), X5: Glucose, X6: HDL (High Density Lipoprotein), X7: LDL (Low Density Lipoprotein), X8-X10: Cholesterol, Free t3 and Free t4 thyroid hormones, X11: TSH (Thyroid Stimulating Hormone) and X12: Triglyceride (Tg). The explanations of some medical terms used in the analysis are presented below. ALT and AST are enzymes found in liver cells. Cholesterol is a fatty substance in the circulatory system. HDL ("good" cholesterol) and LDL ("bad" cholesterol) are special types of lipoproteins generated in the liver that transport cholesterol in the blood. Free t3 and Free t4 are thyroid gland hormones, TSH is the thyroid stimulating hormone, and Triglyceride is a principal component of fat. The unique qualitative explanatory variable, Gender, is coded as "1" for male patients and "0" for females.
MLE estimation results of BLM model parameters obtained via NR, NM and GA are given in Table 12.
The optimal values of the statistically significant parameters (i.e., the roots of the likelihood equations) are marked in bold.
In GA, we get the following log-likelihood function by substituting the independent variable values of the patients into Eq. (5).
The results show that we can obtain very close, and sometimes even better, results in BLM using GA with different operators and rates when the necessary conditions for applying the classical methods are satisfied. Additionally, these findings represent the success of GA, because the GA method can be used successfully under more flexible circumstances.

A Brief Interpretation of the Statistical Results:-
According to the NM algorithm results in Table 12, the statistically significant factors affecting Alopecia disease at the 5% significance level are Gender (p=0.010), AST (p=0.022), Free t3 thyroid (p=0.048) and Free t4 thyroid (p=0.028).
Exp ( ) values give the odds ratios in LR analysis. This value is 5.063 for Males. This means that Males are nearly 5 times possible to exposure Alopecia disease compared with Females.
Since the estimated coefficients of AST and Free t4 thyroid are positive, a unit increase in their values is associated with a higher probability of having the disease. Conversely, a unit change in Free t3 thyroid leads to a decrease in the probability of having the disease. In terms of the odds ratios, a unit increase in the AST and Free t4 thyroid values multiplies the odds of having the disease by 1.143 and 10.72, respectively, while a unit decrease in the value of Free t3 thyroid multiplies the odds of having the disease by nearly 5.025 (1/0.199).
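The reciprocal interpretation of a protective odds ratio can be checked with a line of arithmetic; the value 0.199 is taken from the text.

```python
# Odds ratio reported for Free t3 thyroid: OR < 1, a protective factor.
or_free_t3 = 0.199
# For a protective factor, the reciprocal is the odds ratio associated
# with a one-unit DECREASE in the variable:
print(round(1 / or_free_t3, 3))  # 5.025
```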

Conclusion:-
In this study, the success of two classical (NR and NM) and a meta-heuristic optimization method (GA) in the parameter estimation of LR model are examined over simulated and real data sets.

In the LR model, parameter estimation is based on the maximization of the likelihood function via MLE. In the classical optimization approach, the performance of the process depends on the validity of the required assumptions; that is, the likelihood function has to be differentiable and appropriate starting points for the parameters have to be defined. In addition, iterative methods should be used to obtain approximate parameter estimates in categorical dependent variable modeling, due to the nonlinear structure of the likelihood equations obtained by taking the first-order derivatives of the log-likelihood function with respect to each parameter. The classical NR algorithm is one of the best-known and most widely used root-finding methods. The NM algorithm differs somewhat from NR in terms of the required assumptions: it belongs to the search-method category of classical optimization and, unlike NR, does not require a differentiable objective function.
In the NR and NM methods, the search for the solution starts from a certain point and moves to another point according to some criteria. In order to find better parameter estimates with these methods, a starting point can be determined with a preliminary graphical drawing. However, an incorrect selection of the starting point may cause the solution not to be found or the solution time to increase.
In GA, the search process is performed on a set of potential solutions instead of a single point, and solutions are evaluated until the best solutions are found. GA has certain advantages, like not requiring derivatives or other auxiliary information and not getting stuck at local optimum points while searching for the global optimum. For this reason, the performance of the GA method in the parameter estimation of the binary logit model is examined when the assumptions of the classical optimization techniques are satisfied, and its success in parameter estimation is demonstrated over an application.
In the application part of the study, randomly generated data with a binary dependent variable is used first. The LR model parameters are estimated using the NR, NM and GA methods. In GA, many different combinations of the algorithm components are applied in order to find the combination that gives the best set of estimated parameters. The results show that GA can be used in categorical dependent variable modeling as successfully as the classical approaches, on condition that the components of the algorithm are defined correctly. Additionally, we can infer from the results that, since GA is powerful in the estimation even when the strict conditions of the classical methods are satisfied, it can also be applied with confidence when those conditions are not met.