The use of open-ended items, especially in large-scale tests, creates scoring difficulties. This problem can be overcome with an approach based on automated scoring of open-ended items. The aim of this study was to examine the reliability of data obtained by scoring open-ended items automatically. One objective was to compare different machine learning algorithms for automated scoring (support vector machines, logistic regression, multinomial Naive Bayes, long short-term memory, and bidirectional long short-term memory). The other objective was to investigate how the reliability of automated scoring changes as the proportion of data used to test the scoring system varies (33%, 20%, and 10%). While examining the reliability of automated scoring, a comparison was made with the reliability of data obtained from human raters. In this study, which represents the first attempt at automated scoring of open-ended items in the Turkish language, Turkish test data from the Academic Skills Monitoring and Evaluation (ABIDE) program administered by the Ministry of National Education were used. Cross-validation was used to test the system. As coefficients of agreement indicating reliability, the percentage of agreement, the quadratic-weighted Kappa, which is frequently used in automated scoring studies, and Gwet's AC1 coefficient, which is not affected by the prevalence problem in the distribution of data across categories, were used. The results showed that automated scoring algorithms could be utilized. The best-performing algorithm was bidirectional long short-term memory. Long short-term memory and multinomial Naive Bayes performed worse than support vector machines, logistic regression, and bidirectional long short-term memory.
In automated scoring, the coefficients of agreement at the 33% test data rate were slightly lower than at the 10% and 20% test data rates, but remained within the desired range.
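As a minimal illustration of the agreement coefficients mentioned above, the following pure-Python sketch computes the quadratic-weighted Kappa and Gwet's AC1 for two raters; the score vectors are invented for illustration and are not ABIDE data.

```python
def quadratic_weighted_kappa(r1, r2, q):
    """Quadratic-weighted Kappa for two raters scoring in categories 0..q-1."""
    n = len(r1)
    obs = [[0.0] * q for _ in range(q)]        # observed joint proportions
    for a, b in zip(r1, r2):
        obs[a][b] += 1.0 / n
    m1 = [sum(obs[i][j] for j in range(q)) for i in range(q)]  # rater-1 marginals
    m2 = [sum(obs[i][j] for i in range(q)) for j in range(q)]  # rater-2 marginals
    num = sum((i - j) ** 2 * obs[i][j] for i in range(q) for j in range(q))
    den = sum((i - j) ** 2 * m1[i] * m2[j] for i in range(q) for j in range(q))
    return 1.0 - num / den

def gwet_ac1(r1, r2, q):
    """Gwet's AC1: chance agreement is based on mean marginal proportions,
    which makes it robust to skewed category prevalence."""
    n = len(r1)
    pa = sum(a == b for a, b in zip(r1, r2)) / n              # observed agreement
    pi = [(sum(a == c for a in r1) + sum(b == c for b in r2)) / (2 * n)
          for c in range(q)]
    pe = sum(p * (1 - p) for p in pi) / (q - 1)               # chance agreement
    return (pa - pe) / (1 - pe)

human = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]   # illustrative human-rater scores
auto  = [0, 1, 2, 1, 1, 0, 2, 1, 0, 2]   # illustrative automated scores
qwk = quadratic_weighted_kappa(human, auto, 3)   # ≈ 0.92
ac1 = gwet_ac1(human, auto, 3)                   # ≈ 0.85
```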
|
This study compares the Sequential Probability Ratio Test (SPRT) and Confidence Interval (CI) classification criteria and the Maximum Fisher Information item selection methods based on the estimated ability (MFI-EB) and on the cut point (MFI-CB) in Computerized Adaptive Classification Testing (CACT), with Weighted Likelihood Estimation (WLE) as the ability estimation method. Comparisons are made in terms of Average Classification Accuracy (ACA), Average Test Length (ATL), and measurement precision under content balancing (Constrained Computerized Adaptive Testing: CCAT and Modified Multinomial Model: MMM) and item exposure control (Sympson-Hetter method: SH and Item Eligibility method: IE) when classification is based on two, three, or four categories for a unidimensional pool of dichotomous items. Forty-eight conditions were created in a Monte Carlo (MC) simulation with data generated in R, including 500 items and 5000 examinees, and the results were calculated over 30 replications. The results showed that CI performs better in terms of ATL, whereas SPRT performs better in terms of ACA and the correlation, bias, Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) values; MFI-EB is more useful than MFI-CB. It was also seen that MMM is more successful in content balancing, whereas CCAT is better in terms of test efficiency (ATL and ACA), and that IE is superior in item exposure control, although SH is more beneficial for test efficiency. Moreover, increasing the number of classification categories increases ATL but decreases ACA, and it gives better results in terms of the correlation, bias, RMSE, and MAE values.
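The SPRT classification criterion compared above can be sketched as follows under a 2PL response model; the item parameters, indifference-region width (delta), and nominal error rates here are illustrative assumptions, not the study's settings.

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def sprt_classify(responses, items, cut, delta=0.3, alpha=0.05, beta=0.05):
    """Sequential Probability Ratio Test against a single cut point.
    Tests H0: theta = cut - delta vs H1: theta = cut + delta by accumulating
    the log-likelihood ratio; returns 'above', 'below', or 'continue'.
    responses: list of 0/1; items: list of (a, b) parameter pairs."""
    lo, hi = cut - delta, cut + delta
    upper = math.log((1 - beta) / alpha)     # accept H1 boundary
    lower = math.log(beta / (1 - alpha))     # accept H0 boundary
    llr = 0.0
    for x, (a, b) in zip(responses, items):
        p_hi, p_lo = p_2pl(hi, a, b), p_2pl(lo, a, b)
        llr += math.log(p_hi / p_lo) if x else math.log((1 - p_hi) / (1 - p_lo))
        if llr >= upper:
            return "above"
        if llr <= lower:
            return "below"
    return "continue"   # test length exhausted without a decision
```

In a CACT simulation this decision rule is checked after each administered item, so ATL is simply the number of items seen before "above" or "below" is returned.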
|
This study aimed to examine the effect of rater training on differential rater functioning (rater error) in the process of assessing the academic writing skills of higher education students. The study was conducted with a pre-test and post-test control group quasi-experimental design. The study group consisted of 45 raters, of whom 22 were in the experimental group and 23 in the control group. The raters were pre-service teachers who had not participated in any rater training before, and it was verified that they had similar experience in assessment. The data were collected using an analytic rubric developed by the researchers and an opinion-based writing task prepared by the International English Language Testing System (IELTS). Within the scope of the research, the compositions of 39 students written in a foreign language (English) were assessed. The Many-Facet Rasch Model was used for the analysis of the data, and this analysis was conducted under a fully crossed design. The findings revealed that the rater training was effective on differential rater functioning, and suggestions based on these results were presented.
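A minimal sketch of the Many-Facet Rasch Model underlying the analysis (rating scale form): the probability of each rating category depends additively on person ability, item difficulty, and rater severity. The threshold values below are illustrative, not estimates from the study.

```python
import math

def mfrm_category_probs(theta, item_diff, rater_sev, thresholds):
    """Many-facet Rasch model (rating scale form): probabilities of the
    rating categories 0..m, where each step logit is
    theta - item_diff - rater_sev - tau_k."""
    logits = [0.0]            # cumulative logits; category 0 is the reference
    s = 0.0
    for tau in thresholds:
        s += theta - item_diff - rater_sev - tau
        logits.append(s)
    m = max(logits)           # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A more severe rater (larger rater_sev) shifts probability mass toward
# lower categories for the same person and item.
lenient = mfrm_category_probs(0.0, 0.0, 0.0, [-1.0, 1.0])
severe  = mfrm_category_probs(0.0, 0.0, 1.0, [-1.0, 1.0])
```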
|
In this study, differential item functioning (DIF) and differential bundle functioning (DBF) analyses of the Academic Staff and Postgraduate Education Entrance Examination Quantitative Ability Tests were carried out. The Mantel-Haenszel, logistic regression, SIBTEST, Item Response Theory-Likelihood Ratio, and BILOG-MG DIF algorithm methods were used for the DIF analyses; SIBTEST was used for the DBF analyses. Data sets for the study came from an earlier administration of the examination. Gender DIF analyses showed that eleven items exhibited DIF: four favored male applicants, while seven favored female applicants. To investigate the sources of DIF, we consulted experts. In general, items that could be solved using routine algorithmic operations and that were presented in an algebraic, abstract format showed DIF in favor of females, whereas the "real-life" word problems favored males. According to the DBF analyses, the operations item bundle favored females and the word-problems item bundle favored males.
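The Mantel-Haenszel procedure named above can be sketched as follows; the 2×2 tables (one per matched total-score stratum) are invented for illustration.

```python
import math

def mantel_haenszel_dif(strata):
    """Mantel-Haenszel common odds ratio for DIF.
    strata: list of (ref_correct, ref_wrong, foc_correct, foc_wrong) tuples,
    one per matched total-score level.
    Returns (alpha_MH, ETS delta); alpha > 1 (delta < 0) favors the
    reference group, and |delta| >= 1.5 is the usual large-DIF flag."""
    num = den = 0.0
    for A, B, C, D in strata:
        N = A + B + C + D
        if N == 0:
            continue
        num += A * D / N
        den += B * C / N
    alpha = num / den
    delta = -2.35 * math.log(alpha)   # ETS delta scale
    return alpha, delta

# Illustrative strata with identical odds in both groups -> no DIF.
no_dif = [(30, 10, 30, 10), (20, 20, 20, 20)]
alpha, delta = mantel_haenszel_dif(no_dif)
```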
|
In this study, the aim is to show how student achievement can be monitored using cognitive diagnosis models. For this purpose, responses to the 6th, 7th, and 8th grade Mathematics subtests of the High School Placement Tests (HSPT) in 2009, 2010, and 2011, respectively, which provide longitudinal data, were used. The data sets contained the responses of 49,933 examinees. The attributes measured by these tests were determined by Mathematics experts, and a Q-matrix consisting of five attributes was developed. As a result of the analysis, it was seen that the largest latent class in all three years consisted of examinees who had mastered none of the attributes. It was observed that the probability of attribute mastery increased in the 7th grade and decreased in the 8th grade. The high classification accuracy obtained from the analysis of the HSPT, which was not designed for cognitive diagnosis, shows that the results can be used for monitoring student achievement.
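As a minimal sketch of how a Q-matrix drives a cognitive diagnosis model, the following uses the DINA model (the abstract does not state which model was fitted, so the model choice, Q-matrix row, and slip/guess values are assumptions for illustration).

```python
def dina_p_correct(alpha, q_row, guess, slip):
    """DINA model: an examinee with attribute profile alpha answers an item
    correctly with probability 1 - slip if all attributes required by the
    item's Q-matrix row are mastered, and with probability guess otherwise."""
    eta = all(a >= q for a, q in zip(alpha, q_row))   # mastery of all required attributes
    return (1.0 - slip) if eta else guess

# Hypothetical item from a five-attribute Q-matrix: requires attributes 1 and 3.
q_row      = [1, 0, 1, 0, 0]
master     = [1, 1, 1, 0, 0]   # has both required attributes
non_master = [1, 0, 0, 0, 0]   # lacks attribute 3
```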
|
The purpose of this study is to demonstrate the equatability of pre- and post-tests with the Rasch model and to make individual and interindividual ability changes observable by evaluating the equated tests with stacked analysis within the scope of Rasch Measurement Theory. The pre-test and post-test data used in this study were derived from project No. 115K531, A Model Proposal to Increase Turkey's Success in the Field of Mathematics in International Large-Scale Exams: Effectiveness of the Cognitive Diagnosis Based Monitoring Model, which started on 15/11/2015 and was supported by the TUBİTAK SOBAG 3501 program. The tests were analyzed with the Rasch model and the fit of the data to the model was evaluated; then the separate-estimation common-person method was applied for the equating process. Lastly, individual and interindividual ability changes were observed by applying the stacked analysis method with the Rasch model. The analysis of the pre- and post-tests with the Rasch model showed that they meet the requirements of the model. As a consequence of the equating process, the equatability of the pre-test and post-test was demonstrated, and it was observed that individual and interindividual ability change could be evaluated by analyzing the pre-test and post-test data with the stacked analysis method.
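The separate-estimation common-person linking step can be sketched as follows: the equating constant is the mean difference of the same persons' ability estimates from the two separate calibrations, and adding it to one form's estimates places them on the other form's scale. The ability estimates below are illustrative, not from the project data.

```python
def common_person_equating_constant(theta_form_a, theta_form_b):
    """Common-person linking constant for two separately calibrated Rasch
    scales: the mean difference of the common persons' ability estimates."""
    assert len(theta_form_a) == len(theta_form_b)
    n = len(theta_form_a)
    return sum(a - b for a, b in zip(theta_form_a, theta_form_b)) / n

def equate_to_form_a_scale(theta_form_b, constant):
    """Shift form-B estimates onto the form-A scale."""
    return [t + constant for t in theta_form_b]

# Illustrative common persons estimated on both forms.
pre  = [0.5, 1.0, -0.2]
post = [0.8, 1.3, 0.1]
c = common_person_equating_constant(pre, post)   # -0.3: post scale sits 0.3 logits higher
```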
|
Propensity score analysis, such as propensity score matching and propensity score weighting, is becoming increasingly popular in educational research. When a propensity score analysis is conducted, examining the covariate balance is considered crucial for justifying the quality of the analysis results. However, it has been pointed out that solely considering how covariates balance after matching may not be enough to justify the quality of the propensity score analysis results: suitable covariate balance may still yield biased estimates of treatment effects. The current study aimed to demonstrate this problem systematically through a series of simulation studies. As a result, it was revealed that a good covariate balance on the mean and/or the variance does not guarantee reduced bias in an estimated treatment effect. It was also found that estimation of the treatment effect can be unbiased to some degree, even with a lack of balance, under specific conditions.
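The balance diagnostic at issue can be sketched with the standardized mean difference (SMD), the usual covariate-balance check. The toy example below (invented, far simpler than the study's simulations) makes the abstract's point concrete: two groups with identical means and variances give SMD = 0, yet their third moments differ, so an outcome depending on X³ would still yield a biased naive group comparison.

```python
import math

def standardized_mean_diff(x_t, x_c):
    """Standardized mean difference between treated and control covariate
    values; |SMD| < 0.1 is a common rule of thumb for acceptable balance."""
    mt = sum(x_t) / len(x_t)
    mc = sum(x_c) / len(x_c)
    vt = sum((v - mt) ** 2 for v in x_t) / (len(x_t) - 1)
    vc = sum((v - mc) ** 2 for v in x_c) / (len(x_c) - 1)
    return (mt - mc) / math.sqrt((vt + vc) / 2)

# Both groups: mean 0, sample variance 4 -> SMD = 0 ("balanced").
x_treat   = [-2.0, 0.0, 2.0]                                     # symmetric
a, b = -2.0 / math.sqrt(3), 4.0 / math.sqrt(3)
x_control = [a, a, b]                                            # skewed
# Yet mean(X^3) is 0 for the treated group and about 3.08 for the control
# group, so balance on mean/variance alone hides this difference.
```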
|
The performance of education systems is now closely monitored through national and international achievement studies. In this monitoring process, countries are evaluated not only on academic performance but also on equity in education. The relationship of socioeconomic status and between-school achievement gaps with academic achievement is prominent among these evaluations. In educational debates in Turkey, between-school achievement gaps stand out as a chronic problem, and socioeconomic differences are acknowledged to be one of the factors underlying this problem. This study aimed to determine the relationship of students' socioeconomic status and between-school achievement gaps with academic achievement in Turkey using data from the 2011, 2015, and 2019 TIMSS cycles. Multilevel regression analysis was used for this purpose. The findings showed that, although the average score increased between the 2011 and 2019 cycles, the relationship between socioeconomic characteristics and achievement remained at a similar level, decreasing partially in the 2019 cycle. This result indicates that, although Turkey's TIMSS performance increased, the relationship between socioeconomic characteristics and performance did not accompany this increase. Another finding of the study is that between-school achievement gaps increased in the 2019 cycle at both grade levels. The results show that the significant increase in Turkey's TIMSS performance points to a large improvement at the 8th grade level, and that the partial decrease in the variance in achievement explained by socioeconomic characteristics is positive in terms of equity, but that there is considerable room for improvement in reducing between-school achievement gaps. The fact that there is still a significant relationship between socioeconomic characteristics and academic achievement, despite its partial weakening in the TIMSS 2019 cycle, shows the importance of compensating for students' socioeconomic disadvantages through support programs.
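In a two-level (students-in-schools) model such as the one used above, "between-school achievement gaps" are typically summarized by the intraclass correlation computed from the model's variance components. The sketch below uses illustrative variance values, not estimates from the TIMSS analyses.

```python
def intraclass_correlation(between_school_var, within_school_var):
    """Intraclass correlation (ICC) from a two-level model's variance
    components: the share of achievement variance lying between schools.
    A larger ICC means larger between-school achievement gaps."""
    return between_school_var / (between_school_var + within_school_var)

# Illustrative variance components: 2500 between schools, 7500 within.
icc = intraclass_correlation(2500.0, 7500.0)   # 0.25: a quarter of the
                                               # variance is between schools
```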
|
The question of how observable variables should be associated with latent structures has been at the center of psychometrics. A recently proposed alternative to traditional factor retention methods is Exploratory Graph Analysis (EGA). This method belongs to the broader family of network psychometrics, which assumes that the associations between observed variables are caused by a system in which variables have direct and potentially causal interactions. EGA approaches psychological data in an exploratory manner and enables both the visualization of the relationships between variables and the allocation of variables to dimensions in a deterministic manner. In this regard, the aim of this study was to compare EGA with traditional factor retention methods when the data are unidimensional and the items use a polytomous response format. Simulated data sets were used, and three conditions were manipulated: sample size (250, 500, 1000, and 3000), number of items (5, 10, and 20), and internal consistency of the scale (α = 0.7 and α = 0.9). The results revealed that EGA is a robust method, especially when used with the graphical least absolute shrinkage and selection operator (GLASSO) algorithm: it recovers the true number of dimensions better than Kaiser's rule and yields results comparable to the other traditional factor retention methods (optimal coordinates, acceleration factor, and Horn's parallel analysis) under some conditions. These results were discussed in light of the existing literature, and suggestions were given for future studies.
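One of the comparison methods above, Horn's parallel analysis, can be sketched with NumPy: factors are retained while the sample eigenvalues exceed the mean eigenvalues obtained from random data of the same shape. The simulated unidimensional data and loading values are illustrative assumptions, not the study's design.

```python
import numpy as np

def parallel_analysis(data, n_sims=200, seed=0):
    """Horn's parallel analysis: number of eigenvalues of the sample
    correlation matrix exceeding the mean eigenvalues of correlation
    matrices computed from random normal data of the same shape."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    rand = np.zeros(p)
    for _ in range(n_sims):
        r = rng.standard_normal((n, p))
        rand += np.sort(np.linalg.eigvalsh(np.corrcoef(r, rowvar=False)))[::-1]
    rand /= n_sims
    return int(np.sum(obs > rand))

# Illustrative unidimensional data: 10 items loading 0.8 on one factor.
rng = np.random.default_rng(1)
f = rng.standard_normal((1000, 1))
items = 0.8 * f + 0.6 * rng.standard_normal((1000, 10))
n_factors = parallel_analysis(items)   # should recover a single dimension
```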
|
In this study, person parameter recoveries are investigated by retrofitting polytomous-attribute cognitive diagnosis and multidimensional item response theory (MIRT) models. The data are generated using two cognitive diagnosis models (pG-DINA: the polytomous generalized deterministic inputs, noisy "and" gate model, and fA-M: the fully additive model) and one MIRT model (the compensatory two-parameter logistic model). Twenty-five replications are used for each of the 54 conditions resulting from varying the item discrimination index, the ratio of simple to complex items, the test length, and the correlations between skills. The findings are obtained by comparing the person parameter estimates of all three models to the actual parameters used in the data generation. According to the findings, the most accurate estimates are obtained when the fitted models correspond to the generating models. Comparable results are obtained when the fA-M is retrofitted to other data or when the MIRT model is retrofitted to fA-M data. However, the results are poor when the pG-DINA is retrofitted to other data or the MIRT model is retrofitted to pG-DINA data. Among the conditions used in the study, test length and item discrimination have the greatest influence on person parameter estimation accuracy. Variation in the simple-to-complex item ratio has a notable influence when the MIRT model is used, and although the impact of the correlation between skills on person parameter estimation accuracy is limited, its effect on MIRT data is more significant.
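The compensatory two-parameter logistic MIRT model used for data generation above can be sketched as follows; the parameter values are illustrative. "Compensatory" means high ability on one dimension can offset low ability on another, because the dimensions enter through a single linear combination.

```python
import math

def mirt_c2pl_p(theta, a, d):
    """Compensatory multidimensional 2PL model:
    P(X = 1 | theta) = logistic(a . theta + d),
    where theta is the ability vector, a the discrimination vector,
    and d the intercept."""
    z = sum(ai * ti for ai, ti in zip(a, theta)) + d
    return 1.0 / (1.0 + math.exp(-z))

# Compensation: (2, -2) and (0, 0) give the same linear combination
# when both discriminations are equal, hence the same probability.
p_mixed   = mirt_c2pl_p([2.0, -2.0], [1.0, 1.0], 0.0)   # 0.5
p_average = mirt_c2pl_p([0.0, 0.0], [1.0, 1.0], 0.0)    # 0.5
```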
|