DiCanio Colloquium

April 9
Pick 016
University at Buffalo
"Big data" in small languages: challenges in extracting phonetics from endangered language corpora

We live in an era of big data, in which it is easier than ever to collect vast amounts of information about the observable world. This capability has even had an impact in the small corners of the globe where linguists have gone to investigate and document undescribed languages spoken by ethnic minorities. Yet, as with all big data, a vast amount of linguistic data is useful only if there is some way to analyze it. In this talk, I examine the challenges one faces in extracting the phonetic features of a language from a large corpus of recordings and how some computational tools have recently come to aid in this process. Investigating data from Yoloxóchitl Mixtec and Arapaho, I evaluate the accuracy of these methods and demonstrate their efficacy in an investigation of vowel production data in spontaneous and elicited speech. These findings not only demonstrate effective methods for more quickly examining the structure of a language from large data sets, but also pinpoint the areas where computational and linguistic expertise best cooperate.