EXAMINING STUDENT RETENTION WITH DATA ANALYTICS

Student retention is a critical issue for institutions today. As students have increasing options for educational and career opportunities, institutions must engage and retain students so that they complete their degrees. Simultaneously, demand for Data Analytics knowledge and talent is exploding in the Information Systems field. This paper applies prevalent Data Analytics techniques, using the open source software packages R and RStudio, to examine characteristics important for student retention. K-means clustering and Decision Tree analyses were conducted to tell the story of student retention. Data were collected in a large freshman class at a public institution. Results show that the perceived atmosphere at the institution, a student's GPA, and a lack of financial pressure are critical factors for retention. Other important factors are also discussed, and suggestions are offered to increase student retention.

Methodology:
The research question explores student retention over multiple potential factors. The goal is to apply multiple data analytics techniques to multiple potential retention factors to build a story of, and guidelines for, student retention. The potential factors come from Xu's (2015) study. To explore the research question, surveys were made available electronically, on a voluntary basis, to 302 students in a large business information systems course at a public institution. Because the goal of this research was not to create and validate a new instrument, the questions from Xu's (2015) study on student retention were used; the Appendix lists the questions used for this study. The questions were administered during a normal class period using a student response system, and extra credit was given for completing the survey. Students were informed that all responses were anonymous, that participation was voluntary, and that all collected results would be reported in aggregate only. Response counts varied by one or two students per question (a student might leave to use the restroom or otherwise skip a question), and 252 of the 302 enrolled students responded to the survey (an 83% response rate). The 252 respondents comprised 162 freshmen, 44 sophomores, 34 juniors, 1 senior, 4 "other", and 7 students who did not identify their year. All students were enrolled in the College of Business. (IRB Protocol # HS15-0346, "Examining student retention with data analytics", reviewed by Institutional Review Board #2 on 11-Nov-2015 and determined to meet the criteria for exemption, as defined by the U.S. Department of Health and Human Services Regulations for the Protection of Human Subjects, 45 CFR 46.101(b), 2.)

Data Collection and Analysis:
A five-point Likert scale was used for each question in Xu's (2015) survey (see Appendix) administered to the 252 students. Answers ranged from "Strongly Agree" (numeric value 1) to "Strongly Disagree" (numeric value 5). The open source software packages R and RStudio were used to examine the data and determine the characteristics important for student retention. K-means clustering and Decision Tree analyses were conducted to tell the story of student retention.
The first analysis performed was K-means clustering. According to EMC (2015), K-means should be used when the data scientist wants to group items by similarity and to find structure (commonalities) in the data. In that sense, K-means is often an exploratory technique, used as a prelude to classification. K-means uses Euclidean distance to assign each data point to the nearest cluster centroid, grouping together points that are closest to one another. This is an excellent technique for starting to tell the story of student retention: the idea is to cluster student answers into similar groupings. Most notably, if students indicated they were considering dropping out or leaving the institution, what other survey answers were grouped in that cluster? What traits showed an inclination to "be or not be retained"? The plot of the "Within Groups Sum of Squares" against the potential number of clusters is used to determine K, the final number of clusters. For this data set, K was chosen to be 8 (8 clusters). Table 1 shows the survey response means for the 8 clusters, with the clusters appearing as rows; the larger boldfaced means are of note. Questions 24 and 26 (see Appendix) were used for the output variable of Retention. Cluster 1 has very low means (1.68 and 1.47) for Questions 24 ("I have seriously considered dropping out of college") and 26 ("I have seriously considered leaving my institution for another school"), with "1" being "Strongly Agree". Therefore, Cluster 1 could be considered the "retention risk" cluster. These results are discussed in the following section. While K-means clustering provides good guidance for determining what factors help retain (or not retain) students, in data analytics a more sophisticated classification technique is generally applied after clustering. For this paper, that technique is Decision Trees. Table 2 shows the R code for fitting a Decision Tree to the retention data, and for plotting and summarizing the tree.
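The clustering step described above can be sketched in R roughly as follows. The retention survey data file is not available here, so simulated five-point Likert responses stand in for the 252 real responses; the variable names are illustrative, not the authors' actual code.

```r
# Sketch only: simulated stand-in for the 252 survey responses
# (ten five-point Likert items; column names are generic).
set.seed(42)
survey <- as.data.frame(matrix(sample(1:5, 252 * 10, replace = TRUE),
                               nrow = 252))

# "Within Groups Sum of Squares" for K = 1..15, plotted to pick K (the elbow)
wss <- numeric(15)
for (k in 1:15) {
  wss[k] <- sum(kmeans(survey, centers = k, nstart = 25,
                       iter.max = 50)$withinss)
}
plot(1:15, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within Groups Sum of Squares")

# Final clustering with K = 8, as chosen in the paper, then
# per-cluster response means in the style of Table 1
fit <- kmeans(survey, centers = 8, nstart = 25, iter.max = 50)
aggregate(survey, by = list(cluster = fit$cluster), FUN = mean)
```

The `nstart` argument reruns the algorithm from multiple random starting centroids and keeps the best solution, which makes the WSS curve more stable.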
R uses Information Gain and Entropy to determine the most appropriate nodes at which to make a decision, in this case a prediction of Retention ("Student will be retained"/"Student will not be retained"). As mentioned, Questions 24 and 26 (see Appendix) were used for the output variable of Retention. For the training data used in this paper, if a student answered "Strongly Agree" (1) or "Agree" (2) to either Question 24 or 26, the student was labelled a "NO" for Retention. The second check was whether the student answered "Neutral" (3) to either Question 24 or 26; if so, the student was labelled a "MAYBE". The third and final check was whether the student answered "Disagree" (4) or "Strongly Disagree" (5) to both questions; if so, the student was labelled a "YES". "MAYBE" students were discarded, leaving a training data set of 115 NOs and 87 YESs. Table 3 shows the R summary output after fitting a Decision Tree to the retention training data. As EMC (2015) notes, "The output produced by the summary is difficult to read and comprehend" (EMC Education Services, 2015, pp. 209). Nonetheless, the "Variable Importance" section provides good guidelines for building the retention story. Figure 1 displays the Decision Tree itself, as produced by RStudio, adding more information.
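The three-step labelling rule just described can be expressed as a small R function; the function name is ours, not from the paper's code.

```r
# Sketch of the labelling rule described in the text (function name is ours).
# q24/q26 are coded 1 = "Strongly Agree" ... 5 = "Strongly Disagree".
label_retention <- function(q24, q26) {
  if (q24 <= 2 || q26 <= 2) {
    "NO"      # agreed with dropping out or leaving
  } else if (q24 == 3 || q26 == 3) {
    "MAYBE"   # neutral on at least one question; discarded from training
  } else {
    "YES"     # disagreed with both questions
  }
}

label_retention(2, 5)  # "NO"
label_retention(4, 3)  # "MAYBE"
label_retention(5, 4)  # "YES"
```

Note that the checks must run in this order: a student who agrees with Question 24 is a "NO" even if neutral on Question 26.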
Table 2: R code for fitting a Decision Tree to the retention data, and plotting and summarizing the tree

> install.packages("rpart.plot")
> library(rpart)
> library(rpart.plot)
> Retain_decision <- read.table("DataforRBin.csv", header=TRUE, sep=",")
> fit <- rpart(Retain ~ Year + GPA + StudyGroup3 + FacultyInteraction5 + Greek6 + ResHall7 + Cultural8 + CommServ9 + Friends11 + SocialLife12 + CoursesInter13 + AcademicDevel14 + FinSupportFamily15 + FinancialEase16r + AcademicGood18 + AtmosphereGood19 + AdvisingSufficient21r + AccessFaculty22 + ImpFinish623, method="class", data=Retain_decision)
> rpart.plot(fit, type=4, extra=1)
> summary(fit)

Discussion and Conclusions:
Looking first at the K-means clustering output in Table 1, Cluster 1 has low means for Questions 24 and 26, the "Retention" questions, and it is important to note how the students in this cluster responded to the other questions. Question 2, the GPA question, was coded from left to right as 1 to 6, so a lower survey response average indicates a lower GPA. The Cluster 1 respondents have an average response of 4.05, which translates to a GPA of 2.5-3.0. While not failing by any measure, this is still easily the lowest GPA of the 8 clusters. This could indicate the intuitive result that as a student's GPA goes down, so does the likelihood of retaining that student. Additionally, Question 12 ("I am satisfied with my social life on campus"), Question 14 ("I am satisfied with my academic development at this institution"), Question 15 ("I have financial support from my family members for completing my college degree"), Question 16 ("Financial pressure is distracting me from my college coursework"), and Question 19 ("The atmosphere at the institution which I attend is good") proved to be important questions in the Retention story, as shown in Table 1. All of these factors emerged from the clustering as important to the Retention story, and need to be fed into the following technique, the Decision Tree.
The Decision Tree shown in Figure 1 demonstrates that the top node, the Root Node, branches on atmosphere (Question 19: "The atmosphere at the institution which I attend is good"). If students chose a response greater than 2.5 (meaning at least the "Neutral" response, value 3, and extending to "Disagree", value 4, and "Strongly Disagree", value 5), then they were highly likely not to be retained (45 NO, or "Student Not Retained", versus 9 YES). Note that, with default pruning, R returned the tree in Figure 1, in which not all leaf nodes carry explicit decisions. That is, the tree shows that a student who chooses "Strongly Disagree", "Disagree", or "Neutral" on the atmosphere question (#19) is 83% (45/54) likely not to be retained, but this determination is not at 100%. A tree with forced 100% leaf nodes was also constructed, but it contained 113 nodes compared to the 23 nodes in the tree shown in Figure 1. Such a tree, completing every decision at 100%, is often referred to in data analytics as "over-fit".
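In rpart (the package used in Table 2), the trade-off between a default-pruned tree and a fully grown, over-fit tree is governed by the complexity parameter cp. The sketch below illustrates the mechanism; since the retention data set is not available here, the kyphosis data that ships with rpart stands in.

```r
library(rpart)

# Default pruning (cp = 0.01) stops splitting when a split does not
# improve the fit enough, yielding a small tree.
fit_default <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                     method = "class")

# cp = 0 with a tiny minsplit grows the tree until leaves are (nearly)
# pure: the "forced 100% leaf nodes" style of tree, typically over-fit.
fit_full <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class",
                  control = rpart.control(cp = 0, minsplit = 2))

nrow(fit_default$frame)  # node count of the pruned tree
nrow(fit_full$frame)     # much larger node count for the unpruned tree
printcp(fit_full)        # cross-validated error table used to choose cp
```

Inspecting the `printcp` table is the usual way to pick a cp value that balances tree size against cross-validated error.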
Next in importance after the atmosphere at the institution is GPA (Question 2), not surprisingly. Of the 148 students who felt the atmosphere was good, 50 had low GPAs (less than 3.0) and were 66% (33/50) likely not to be retained. Those in the "atmosphere is good" group with GPAs above 3.0 were 62% (61/98) likely to be retained. Question 16, "Financial pressure is distracting me from my college coursework" (scale reversed prior to input into R), was the next most important factor. Similar guidelines can be obtained by following down the tree and reviewing the summary results from R; for example, next in retention importance is Question 13: "I find my courses to be generally interesting." The three most important factors, Atmosphere, GPA, and Financial Pressure, are consistent with the K-means clustering results, providing confidence that these factors are critical to student retention.
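The importance ordering discussed above comes from rpart's variable-importance scores, which can be read directly from the fitted object rather than from the hard-to-read summary(fit) output. A sketch, again using the bundled kyphosis data in place of the retention data:

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

# Named numeric vector, sorted with the most important variable first;
# this is the same "Variable Importance" that summary(fit) prints.
round(fit$variable.importance, 2)

# Optional visual ranking of the same scores
barplot(sort(fit$variable.importance), horiz = TRUE, las = 1)
```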
The decision tree provides many opportunities for administrators and policy makers in higher education. Knowing that the student response to "The atmosphere at the institution which I attend is good" was the most important branching variable in the decision tree is critical. What makes an atmosphere "good" is, of course, another discussion, but one possible action would be to conduct follow-up surveys on multiple potential factors for what makes an atmosphere good. Many of these factors would undoubtedly be easier to "solve" or implement than something like financial pressure or a low GPA. Similar direction can be obtained elsewhere in the decision tree, and R and data analytics have begun to build a story of student retention.