January 14, 2022 | 11:00AM
Zoom
The Language Evolution, Acquisition and Processing (LEAP) workshop will meet next Friday, January 14, from 11:00am to 12:20pm (CT). Note that this meeting will be fully online, in accordance with the University’s instruction on remote-only meetings for the first two weeks of the quarter. Click here to join the meeting; the link can also be found at the end of this email.
This year’s first guest speaker is Hyunji Hayley Park, a PhD candidate in computational linguistics at the University of Illinois at Urbana-Champaign. Drawing on insights from linguistics, Hayley will talk to us about the pitfalls and possibilities of NLP systems. Please find the abstract and some relevant work below.
-------------- Abstract --------------
Title: Pitfalls and possibilities: What NLP systems are missing out on
Despite recent advancements in language modeling and NLP in general, there are still many areas where NLP systems face difficulty. With NLP research disproportionately dedicated to English and a few other high-resource languages, the effect of morphology on NLP systems is clearly an under-studied area. Most high-resource languages such as English and Chinese utilize little morphology, encoding more information syntactically (e.g. word order) than morphologically (e.g. case inflection). Morphologically rich languages like Turkish and St. Lawrence Island Yupik use much more variation in word forms to encode meaning and have flexible or free word order. To address this issue, I present two studies that augment the existing data to investigate how morphology interacts with NLP systems. First, I compile a parallel Bible corpus and a linguistic typology database to study the effect of morphology on LSTM language modeling difficulty. The results show that morphological complexity, characterized by higher word type counts, makes a language harder to model. Subword segmentation methods such as BPE and Morfessor mitigate the effect of morphology for some languages, but not for others. Even when they do, they still lag behind morpheme segmentation methods based on FSTs. Next, I develop the first dependency treebank for St. Lawrence Island Yupik and demonstrate how morphology interacts with syntax in this morphologically rich language. I argue that the Universal Dependencies (UD) guidelines, which focus on word-level annotations only, should be extended to morpheme-level annotations for morphologically rich languages.
As another area that requires further research, I present a recent study on long document classification. Several methods have been proposed for the task of long document classification using Transformers. However, there is a lack of consensus on a benchmark to enable a fair comparison among different approaches. In this study, I provide a comprehensive evaluation of existing models' relative efficacy against various datasets and baselines, both in terms of accuracy and in terms of time and space overheads. The results show that existing models often fail to outperform simple baseline models and yield inconsistent performance across the datasets. The findings also emphasize that future studies should consider comprehensive baselines and datasets that better represent the task of long document classification to develop robust models.
References:
Hyunji Hayley Park, Katherine J. Zhang, Coleman Haley, Kenneth Steimel, Han Liu, and Lane Schwartz. 2021. Morphology Matters: A Multilingual Language Modeling Analysis. Transactions of the Association for Computational Linguistics, 9: 261–276.
Hyunji Hayley Park, Lane Schwartz, and Francis M. Tyers. 2021. Expanding Universal Dependencies for Polysynthetic Languages: A Case of St. Lawrence Island Yupik. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP), pages 131–142, Online. Association for Computational Linguistics.
Hyunji Hayley Park, Yogarshi Vyas, and Kashif Shah. Under Review. Efficient Classification of Long Documents Using Transformers.
-----------------------------
* To join the Zoom meeting room: https://uchicago.zoom.us/j/94052449015?pwd=aUNQNU15RnQvSTIzd041dU5QK2t1dz09
Meeting ID: 940 5244 9015
Passcode: 151010
* If you would like to have an individual meeting with the guest speaker, please send an email to the coordinators (cc’ed on this email) no later than this Friday (Jan 7).
* This quarter, we will be meeting on Jan 14, Feb 11, Feb 18, and Feb 25. Please visit the LEAP website for the updated schedule and presenter information.
We look forward to seeing you next week!