Using artificial intelligence to create biology multiple choice questions for higher education

ABSTRACT


INTRODUCTION
Algorithms driven by machine-learning technologies are now reaching maturity, and ChatGPT is one such innovation. ChatGPT is an interactive chatbot created by OpenAI, a California-based artificial intelligence (AI) startup (Susnjak, 2022). It is a large language model trained on a massive corpus of text data with deep learning algorithms to produce human-like replies to natural language questions (ChatGPT, 2023). The ChatGPT AI bot is accessible at https://chat.openai.com/chat. AI natural language processing (NLP) technologies such as ChatGPT provide a means through which computers can engage with human language. A crucial stage in NLP, known as tokenization, transforms unstructured information into structured text suitable for computation (Hosseini et al., 2023). ChatGPT AI is interactive: it can comprehend what is being requested and deliver it, provided the request complies with application policies and the necessary data are available. For example, if you ask a search engine such as Google for a list of questions on a particular topic, Google returns links to websites containing information relevant to the query. When the same request is given to ChatGPT AI, the application generates the questions directly in the chat.
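As a simple illustration of tokenization, consider the following sketch. It assumes the open-source tiktoken library, which implements the byte-pair encodings used by OpenAI models; this is one possible tooling choice for illustration, not part of this study's method.

# Minimal tokenization sketch (illustrative only; assumes the tiktoken library).
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # a byte-pair encoding used by OpenAI models

text = "Photosynthesis converts light energy into chemical energy."
token_ids = encoding.encode(text)                    # unstructured text -> integer token IDs
tokens = [encoding.decode([t]) for t in token_ids]   # decode each ID back to its text fragment

print(token_ids)  # a list of integers, one per token
print(tokens)     # the corresponding text fragments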
The emergence of ChatGPT AI resembles that of other innovative technologies which, if used appropriately, have the potential to benefit education, even though ChatGPT AI can also be used for activities that are not acceptable in the academic sector. Students, for example, may use ChatGPT AI to generate assignments such as essays. Teachers, however, may be able to use AI to detect AI-generated work.
Teachers can use ChatGPT AI in a variety of ways, including asking information-related questions, confirming the accuracy of data, and reviewing topics. Teachers can also ask ChatGPT AI to generate multiple-choice questions for tests. Clearly, in its current version, ChatGPT AI cannot create an assessment instrument that accurately measures a learning objective unless it is given explicit instructions by an expert or teacher. However, it is not impossible that future versions of ChatGPT AI will be able to generate complex questions, given access to a huge amount of data and extensive training.
A question arises regarding the kinds of questions the current version of ChatGPT AI is capable of composing. How valid and reliable are the question sets generated by ChatGPT AI? What is the difficulty level of the questions it creates? What do students think about the questions created by ChatGPT AI? Are they easy to read and understand? Are they relevant to the material being studied? Are they comparable to questions written by humans?
Reliability and validity are, at a minimum, the two most important and essential aspects to consider when evaluating any measurement instrument or tool (Mohajan, 2017). A measurement instrument is valid when it measures what it is intended to measure (Muijs, 2011). In other words, if an instrument measures a required variable accurately, it is termed a valid instrument for that variable (Ghazali, 2016). Reliability, in comparison, is defined as "the degree to which test scores are free of measurement error" (Muijs, 2011). It is a measure of the stability or internal consistency of an instrument used to measure a particular variable (Jackson, 2003). Multiple-choice questions are regarded as having high reliability since they are scored objectively (Considine et al., 2005; Haladyna, 1999). Validity and reliability are related: an instrument can be reliable but not valid, yet it cannot be valid if it is not reliable (Jackson, 2003). In other words, a valid instrument must also be reliable (Ghazali, 2016).
The quality of a multiple-choice test instrument can be determined by its validity and reliability, as well as by its level of difficulty and discrimination power (Considine et al., 2005; Friatma & Anhar, 2019; Setiawaty et al., 2017; Rao et al., 2016; Salwa, 2012). An item's difficulty corresponds to the proportion of correct responses (McCowan & McCowan, 1999), that is, the frequency with which test-takers select the correct answer (Thorndike et al., 1991). Items with a higher difficulty index are less difficult: a question answered correctly by 75% of test-takers has a difficulty level of 0.75, while one answered correctly by 35% of test-takers has a difficulty level of 0.35 (McCowan & McCowan, 1999). Item discrimination contrasts the proportion of high scorers and low scorers who correctly answer a given item; it refers to the degree to which an item discriminates between students in the high and low groups. The whole test and each individual item should assess the same concept, so high performers should be more likely to answer a good item correctly, and low performers more likely to answer it incorrectly (McCowan & McCowan, 1999).
This study aims to determine the validity, reliability, level of difficulty, and discrimination power of an AI-generated collection of biology questions for higher education. Students' responses to AI-generated questions are also presented in this study.

METHODS
This research is a descriptive quantitative study explaining the validity and reliability of the questions created by ChatGPT AI. Before the analysis was conducted, the questions obtained from ChatGPT AI were compiled and administered to students. The research steps are described in more detail below.

Accessing ChatGPT Artificial Intelligence
The researcher accessed the ChatGPT AI website in 2023, created an account, and logged into the application. The 30 January 2023 version of ChatGPT AI was used (Figure 1).

Creating questions
The researchers asked ChatGPT AI to create questions using the prompt "write me a multiple choice question with one correct answer option and four wrong answer options about <subject> for bachelor's degree, tag the correct answer", where <subject> was replaced with each of seven basic biology topics covered in high school and university biology courses. The seven topics and the distribution of questions generated by ChatGPT AI are shown in Table 1.
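Although the questions in this study were generated through the ChatGPT web chat interface, the same prompt could in principle be issued programmatically. The sketch below is a hypothetical illustration using the OpenAI Python client; the model name and client usage are assumptions, not the procedure followed in this study.

# Hypothetical sketch: issuing the study's prompt through the OpenAI Python client.
# This study used the web chat interface; the model name below is an assumption.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

PROMPT = ("write me a multiple choice question with one correct answer option "
          "and four wrong answer options about {subject} for bachelor's degree, "
          "tag the correct answer")

subjects = ["change and growth", "cell", "biodiversity", "genetics",
            "evolution", "ecology", "biotechnology"]  # the seven topics in Table 1

for subject in subjects:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model choice
        messages=[{"role": "user", "content": PROMPT.format(subject=subject)}],
    )
    print(response.choices[0].message.content)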
In accordance with the request, ChatGPT AI successfully created 21 questions, each with five options: one correct answer and four incorrect answers. ChatGPT AI also marked the correct answer for each question. The questions created by ChatGPT AI were written in English; they were then translated into Indonesian, and the translations were evaluated by English and Indonesian lecturers with expertise in both languages. The researcher then compiled the 21 questions in a Google form and administered them to students in person, presenting each question in both English and Indonesian. The test was administered under strict supervision and with closed books to ensure that students' responses were based solely on their own knowledge and not on assistance from others, the internet, or books. The test lasted 42 minutes, with two minutes allotted per question.

Students' Responses to Artificial Intelligence-Generated Questions
We gathered student responses to the AI-generated questions using the criteria developed by Susnjak (2022) in his research assessing AI responses. Students completed this questionnaire after answering the AI-generated questions. They were told that the questions they had just answered had been created by AI and were given 10 minutes to complete the questionnaire. Completing the questionnaire was voluntary; only students who were willing to do so took part. Table 2 displays the response questionnaire and its criteria.

Participants and Data collection
This study was carried out at the department of science education of a state university in East Java, Indonesia. A sample of 272 students was selected using a random sampling technique from two study programs, namely biology education and natural science education. Participation was voluntary: not every student in class answered the questions, and only willing students completed the response questionnaire. Of the students who answered the questions, 68% (185 students) also completed the response questionnaire.

Statistical Analysis
All statistical analyses were performed using IBM SPSS Statistics 26 software. The validity of the questions was determined using the Pearson product-moment correlation (Ahrens et al., 2020; Cho et al., 2006; Harahap et al., 2019; Mutmainah & Isdiati, 2022; Salwa, 2012). The reliability of the questions was determined using the Cronbach's alpha value (Ahrens et al., 2020; Cho et al., 2006; Harahap et al., 2019; Mutmainah & Isdiati, 2022; Salwa, 2012). The level of difficulty of each question was determined using the following formula from McCowan and McCowan (1999): difficulty index (P) = (number of test-takers who answered the item correctly) / (total number of test-takers). The difficulty level of a question is classified as follows: difficult if below 0.3, medium if between 0.3 and 0.7, and easy if above 0.7. The discrimination power of each question was determined using the following formula from Salwa (2012): discrimination index (D) = (number of top test-takers who answered the item correctly) / (total number of top test-takers, the top 27% of all students) − (number of bottom test-takers who answered the item correctly) / (total number of bottom test-takers, the bottom 27% of all students).
The discrimination power of a question is classified as follows: poor if below 0.2, adequate if between 0.2 and 0.4, good if between 0.4 and 0.7, and excellent if above 0.7. If the result is negative, the item's discrimination power is inadequate and the item must be eliminated. Students' responses to the AI-generated questions were analyzed descriptively.
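To make these item analyses concrete, the following sketch reproduces the validity, reliability, difficulty, and discrimination calculations in Python with numpy, using a synthetic 0/1 response matrix in place of the real data. The analyses reported in this study were performed in SPSS; this code is only an equivalent illustration under those stated assumptions.

# Item-analysis sketch in Python/numpy; the study itself used IBM SPSS Statistics 26.
import numpy as np

# Rows are students, columns are items; 1 = correct, 0 = incorrect.
# A synthetic matrix stands in for the real 272-student x 21-item data.
rng = np.random.default_rng(0)
responses = (rng.random((272, 21)) > 0.4).astype(int)

def item_validity(responses):
    """Pearson product-moment correlation of each item with the total score."""
    totals = responses.sum(axis=1)
    return np.array([np.corrcoef(responses[:, j], totals)[0, 1]
                     for j in range(responses.shape[1])])

def cronbach_alpha(responses):
    """Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / variance of totals)."""
    k = responses.shape[1]
    item_vars = responses.var(axis=0, ddof=1)
    total_var = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def difficulty(responses):
    """Difficulty index P: proportion of test-takers answering each item correctly."""
    return responses.mean(axis=0)

def discrimination(responses, frac=0.27):
    """Discrimination index D: P in the top 27% minus P in the bottom 27%,
    with groups formed by ranking students on their total score."""
    totals = responses.sum(axis=1)
    order = np.argsort(totals)
    k = int(round(frac * responses.shape[0]))
    bottom, top = responses[order[:k]], responses[order[-k:]]
    return top.mean(axis=0) - bottom.mean(axis=0)

P, D = difficulty(responses), discrimination(responses)
print("difficult items (P < 0.3):", np.where(P < 0.3)[0] + 1)   # 1-indexed item numbers
print("easy items (P > 0.7):", np.where(P > 0.7)[0] + 1)
print("poorly discriminating items (D < 0.2):", np.where(D < 0.2)[0] + 1)
print("item validities:", np.round(item_validity(responses), 2))
print("Cronbach's alpha:", round(cronbach_alpha(responses), 3))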

RESULTS

ChatGPT Artificial Intelligence-Generated Questions
ChatGPT AI successfully generated 21 questions. Appendix A shows a list of all questions.

Validity
The results of the validity test of all ChatGPT AI-generated questions can be seen in Table 3.

Reliability
The results of the reliability test for all ChatGPT AI-generated questions are shown in Table 4 with the invalid question (question no. 17) retained, and in Table 5 with the invalid question removed.

Level of Difficulty and Discrimination Power
The results of the level of difficulty and discrimination power of all ChatGPT AI-generated questions can be seen in Table 6.

Student Responses to Artificial Intelligence-Generated Questions
The percentage of student responses to questions generated by ChatGPT AI can be seen in Figure 2.

DISCUSSION
According to Kimberlin and Winterstein (2008), validity is generally described as the degree to which an instrument measures what it claims to measure. An instrument must be valid so that it can be used to measure its intended subject. Using the Pearson product-moment correlation method to assess the validity of the questions, 20 of the 21 items were determined to be valid, while one item was invalid. The invalid question is number 17, which relates to ecology. The results of the validity test thus indicate that 20 of the 21 AI-generated questions are valid and may be used.
Question number 17, which is invalid, asks students to choose the term used to describe the way organisms obtain energy from their environment. Option D (photosynthesis) is the correct answer, selected by 90 students (33.1%). Option A (metabolism) was selected by 113 students (41.5%), option B (ecosystem) by 40 students (14.7%), option C (biodiversity) by 21 students (7.7%), and option E (biogeography) by eight students (2.9%); all of these are incorrect answer choices. Based on the students' choices, more students selected option A (an incorrect answer) than the correct answer. Two inferences are possible: first, that answer choice A is an excellent distractor, or second, that there is a problem with question number 17. According to follow-up interviews with three randomly chosen students who reported choosing answer option A, they were confused by the wording of the question. Had the question been 'how do plants get energy from their environment' or 'how do organisms obtain energy from nature', these students would likely have chosen option D (the correct one). However, because question 17 asks how organisms obtain energy from their environment, the students interpreted the organisms in question as animals as well as plants, and the environment as other living species (by way of prey). Language and sentence issues that may be present in multiple-choice questions created by ChatGPT AI, such as question number 17 in this research, could be corrected by experts using content and face validity, as suggested by Considine et al. (2005) and Harahap and Nasution (2022). Nevertheless, we did not do so in this research because we wanted to ensure that the questions generated by ChatGPT AI were free from any human adjustment.
Cronbach's alpha was used to assess the scale's internal consistency. The Cronbach's alpha coefficient was 0.623 when item 17 (the invalid item) was retained and 0.655 when item 17 was removed. The acceptable values for Cronbach's alpha vary depending on the source. According to van Griethuijsen et al. (2014), acceptable values of Cronbach's alpha are 0.7 or 0.6. Arulogun et al. (2020), George and Mallery (2003), Morgan et al. (2004), Rii et al. (2020), Taber (2018), and Wongpakaran and Wongpakaran (2012) make the same point: an instrument with a Cronbach's alpha above 0.6 can be regarded as reliable. By this criterion, the multiple-choice questions generated by ChatGPT AI in this study may be deemed reliable. Several other sources, however, state that the acceptable values for Cronbach's alpha are 0.8 or even 0.9; by that standard, the multiple-choice questions created by ChatGPT AI in this study would be regarded as unreliable.
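For reference, the Cronbach's alpha reported here is the standard internal-consistency coefficient for a test of k items:

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right)

where \sigma^{2}_{Y_i} is the variance of scores on item i and \sigma^{2}_{X} is the variance of total test scores; alpha rises as the items covary more strongly with one another.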
Evaluating the difficulty level of the questions showed that, of the 21 questions created by ChatGPT AI, nine were classified as easy, 10 as medium, and two as difficult. It is preferable to use a proportionate distribution of easy, medium, and difficult multiple-choice questions. In this context, proportionate means that there should be at least twice as many medium-level questions as easy and difficult ones, with equal numbers of easy and difficult questions. ChatGPT AI generated nearly equal numbers of easy and medium questions, and only two items (9.5%) were classified as difficult. It would be preferable to revise the easy and difficult questions so that the distribution is more proportional, or to bring them to a medium difficulty level. Rao et al. (2016) stated that, ideally, multiple-choice questions have a medium level of difficulty. Of course, this should be adjusted depending on the aim of the assessment.
Assessing the discrimination power of the questions showed that, of the 21 questions created by ChatGPT AI, four had poor discrimination power, nine had adequate discrimination power, and the remaining eight had good discrimination power. Questions with poor discrimination power should be revised until their discrimination power is adequate or better. No items had negative discrimination power, suggesting that no questions should be deleted on the basis of the discrimination analysis. One item, however, had a discrimination value of zero, indicating very poor discriminatory power: the numbers of students who answered it correctly in the upper and lower groups were identical. This question turned out to be number 17, which was classified as invalid by the validity test, so it is not surprising that it discriminates very poorly. Moreover, the difficulty index and discrimination index are reciprocally related (Chauhan et al., 2013; Mehta & Mokhasi, 2014; Rao et al., 2016; Suruchi & Rana, 2014). For instance, if a question is determined to have a low level of difficulty and poor discriminating power, the question should be revised (Rao et al., 2016).
Based on student responses to the questions produced by ChatGPT AI, 79% of students indicated that the AI-generated questions were relevant to the departmental subject they study. This finding suggests that ChatGPT AI is capable of generating questions on a specified subject, in this case biology within natural science, including change and growth, cell, biodiversity, genetics, evolution, ecology, and biotechnology. 72% of students reported that the questions generated by AI were clear, suggesting that the majority of students could comprehend the questions posed by ChatGPT AI. Clarity was assessed through three survey items. The first asks whether the questions generated by ChatGPT AI are simple to comprehend; 66% of students indicated that the questions were straightforward. The second asks whether the questions are logically structured and ordered; according to 76% of students, they were. The last asks whether the questions employ proper language; 73% of students felt the language was suitable. The questions in an assessment must be clear and concise: questions that are difficult to understand make it harder for students to answer, and students may respond incorrectly not because of incompetence but because of an error in the question.
73% of students stated that the AI-generated questions were accurate, meaning that the majority of students considered the questions created by AI to be accurate and saw no grammatical or conceptual errors in them. However, students' opinions alone cannot confirm the accuracy of a question; several experts should be consulted to validate it. Nevertheless, as stated previously, the questions in this study were not evaluated by experts, so that they could be examined exactly as AI generated them.
74% of students indicated that the questions generated by AI were precise, suggesting that the majority of students considered the AI-generated questions explicit and detailed: students understood the intent of the questions and the responses required. If questions are not clear and explicit, students may have difficulty answering them.
71% of students indicated that the questions posed by AI were of sufficient depth. The majority of students found the questions generated by ChatGPT AI challenging, not overly simple, and appropriate for their college or university level. Measuring the difficulty level of the questions, as was done in this study, is another method for determining whether questions are too easy or too difficult. Only two of the twenty-one AI-generated questions are difficult, while nine are quite easy.
According to the results of the student response questionnaire, the majority of students responded positively to the questions generated by ChatGPT AI. Teachers can therefore use AI to assist them in constructing assessment tools, but this must be complemented by the teacher's capacity to give the AI clear instructions and to verify and optimize the resulting assessment tool as needed. Further study is required to determine whether students can differentiate between questions developed by AI and those created by humans, as well as their perspectives on the requirements AI-created questions should meet.
Given that constructing multiple-choice questions is a complex and time-consuming process (Rao et al., 2016), it would be highly beneficial if AI could assist teachers and the education sector in developing standardized, high-quality multiple-choice questions in the future. Nevertheless, the present version of ChatGPT AI has several limitations, as mentioned by OpenAI on its website (ChatGPT, 2023), such as the possibility of producing incorrect information, harmful instructions, or biased content, and limited knowledge of the world and events after 2021. Quite likely, ChatGPT AI will acquire more data and better training over time, allowing it to assist its users more effectively.

CONCLUSION
Based on the research findings, twenty of the twenty-one questions generated by ChatGPT AI are valid; the only invalid question is one related to ecology. The Cronbach's alpha coefficient for the twenty valid questions was 0.655. Assessing the difficulty level of the questions showed that, of the 21 questions created by ChatGPT AI, nine were rated easy, 10 medium, and two difficult. Assessing the discrimination power showed that four questions had poor discrimination power, nine had adequate discrimination power, and the remaining eight had good discrimination power. Based on student responses to the AI-generated questions, 79% of students indicated that the questions were relevant to the class subject, 72% reported that their clarity was acceptable, 73% reported that their accuracy was good, 74% reported that their precision was good, and 71% reported that their depth was acceptable.
Funding: No funding source is reported for this study. Ethical statement: The author stated that all participants were over the age of 18 and that their participation was entirely voluntary. The author also stated that, since no personal data were analyzed and pseudonyms were used in this article, no ethics committee approval was required.