Abstract:
This thesis develops an integrative approach to automatic linguistic complexity analysis for German and applies it to predict the proficiency of learner writing and the readability of texts for native and non-native speakers of German. Complexity is a central concept in applied linguistics and has been used in Second Language Acquisition (SLA) research to characterize and benchmark language proficiency and to track the developmental trajectories of learners (Ortega, 2012). However, SLA complexity research has focused largely on the analysis of syntax and lexicon and on the English language (Housen et al., 2019; Wolfe-Quintero et al., 1998). More research on other linguistic domains, such as morphology or discourse, is needed to model complexity as a multidimensional construct, and more languages should be studied to broaden complexity research. Measures of linguistic complexity have also been found to be important features in computational linguistic research on Automatic Proficiency Assessment (APA) and Automatic Readability Assessment (ARA). This thesis combines insights from SLA complexity research with computational linguistic approaches to APA and ARA in order to address important research gaps in SLA complexity research and in work on APA and ARA for educational contexts.
We propose a linguistically broad approach to complexity that combines measures of syntactic, lexical, and morphological complexity with measures of discourse, human processing, and language use. In doing so, we integrate theories and concepts from different research disciplines, including SLA complexity research, computational linguistics, and psychology. We implemented a system that automatically calculates these measures using Natural Language Processing (NLP) techniques. With 543 measures, it computes what is, to the best of our knowledge, the largest and most diverse collection of measures of absolute and relative complexity for German. To make this resource accessible to other researchers and thereby promote the comparability and reproducibility of complexity research for German, we integrated the system into the Common Text Analysis Platform (CTAP) by Chen and Meurers (2016). We generalized the originally monolingual English web platform to support multilingual analyses, leading to its extension to several additional languages. In an empirical study on the impact of non-standard language on the NLP annotations and the subsequent calculation of measures, we confirmed that our analysis remains robust overall even on language produced by beginning learners, and that annotation errors barely affect our complexity estimates or the models trained with them.
We then demonstrate the value of our integrative, linguistically broad approach to complexity modeling for APA and ARA. First, we provide an overview of the current research landscape for both domains by conducting two systematic surveys of automatic approaches for German published in the past twenty years. Both surveys highlight the need for more research on approaches targeting second or foreign language (L2) learners and young native speakers, more cross-corpus testing, and more accessible models. For ARA, we observed that traditional readability formulas remain the de facto standard in research that is not specifically dedicated to the development of new ARA approaches, even though ARA researchers have criticized them as overly simplistic and they generally perform below the current state of the art (SOTA). Second, we report on several machine learning experiments that build on these insights and take into consideration the research needs we identified. We train models for predicting the language proficiency of L2 learners on long texts, rated on the full Common European Framework of Reference for Languages (CEFR) scale (A1 to C1/C2), and on short answers to reading comprehension questions, labeled with course levels (ranging from A1.1 to A2.2). We also train a model for capturing early native language (L1) academic language proficiency of students in terms of grade levels (1st to 8th grade). For text readability, we train models for L2 learners on longer texts (distinguishing texts for learners at CEFR levels A2, B1/B2, and C1) and on sentences (using a 7-point Likert scale), as well as a model for German media language aimed at children or adults (a binary distinction). We test these models across corpora and on held-out data sets, illustrating their generalizability across different task contexts, elicitation contexts, languages, and publishers. We also perform linguistic analyses on all data sets studied, yielding important insights into the characterization of developmental trajectories in German. This thesis makes a particular methodological contribution to ARA by compiling a total of three new readability corpora, which for the first time facilitate cross-corpus and cross-language testing for German ARA.
In sum, this thesis provides novel insights into the developmental variation of linguistic complexity in German and its role for text readability. It also contributes important new resources for research on complexity, ARA, and APA by making available the multilingual CTAP system, new readability corpora, and new models for German.