Sravana Reddy

sravana.reddy at

I am a researcher at Spotify in Boston, where I work on projects related to natural language processing and machine learning.

I got my PhD in Computer Science from the University of Chicago, and have spent time at USC ISI, Dartmouth and Wellesley.


I've spent several years studying NLP, speech, machine learning, and linguistics. Most of my academic research centers around language variation: both dealing with it in practical systems, and analyzing it using large corpora. I'm also interested in the applications of computation to literature and writing.

I developed and maintain DARLA, a web application for automating sociophonetics.


Implementing a Hidden Markov Model Toolkit. Sravana Reddy. In Proceedings of the AAAI 2017 Symposium on Educational Advances in Artificial Intelligence (EAAI): Model AI Assignments.


Hidden Markov Models (HMMs) are a backbone of speech and natural language processing (NLP) and computational biology. They are a great way to teach general concepts such as dynamic programming, data-driven learning and inference, and expectation maximization for unsupervised learning. We have found that asking students in undergraduate NLP or machine learning courses to implement an HMM program is helpful in solidifying these ideas. Students also derive a great deal of satisfaction from completing a sizable project. However, the algorithms can seem abstract, and the scope of the project may be intimidating to some students. Our assignment breaks down the HMM into modular chunks, and motivates HMMs through applications drawn from NLP research rather than toy problems. We include
  • Starter code in an object-oriented Python framework, and programming guidelines.
  • Data files that draw from part-of-speech tagging, unsupervised discovery of vowels and consonants, and decipherment of substitution ciphers.
  • Lecture slides on HMMs, including step-through visualizations of the inference algorithms that translate easily into code.
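The Viterbi inference described above can be sketched compactly. Below is a minimal dynamic-programming decoder for a toy part-of-speech HMM; the states, words, and probabilities are illustrative inventions, not the assignment's starter code or data:

```python
import math

# Toy HMM for POS tagging (all numbers are made up for illustration).
states = ["DET", "NOUN", "VERB"]
start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans = {
    "DET":  {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1,  "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5,  "NOUN": 0.3, "VERB": 0.2},
}
emit = {
    "DET":  {"the": 0.9,  "dog": 0.05, "barks": 0.05},
    "NOUN": {"the": 0.05, "dog": 0.8,  "barks": 0.15},
    "VERB": {"the": 0.05, "dog": 0.05, "barks": 0.9},
}

def viterbi(words):
    """Return the most likely state sequence via dynamic programming."""
    # V[t][s] = log-probability of the best path ending in state s at time t
    V = [{s: math.log(start[s]) + math.log(emit[s][words[0]]) for s in states}]
    back = []
    for w in words[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            row[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][w])
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    # Recover the best path by following backpointers from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']
```

The same trellis structure, with max replaced by sum, gives the forward algorithm used in expectation maximization.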

Obfuscating Gender in Social Media Writing. Sravana Reddy and Kevin Knight. In Proceedings of the EMNLP 2016 Workshop on Natural Language Processing and Computational Social Science.


The vast availability of textual data on social media has led algorithms to automatically predict user attributes such as gender based on the user's writing. These methods are valuable for social science research as well as targeted advertising and profiling, but also compromise the privacy of users who may not realize that their personal idiolects can give away their demographic identities. Can we automatically modify a text so that the author is classified as a certain target gender, under limited knowledge of the classifier, while preserving the text's fluency and meaning? We present a basic model to modify a text using lexical substitution, show empirical results with Twitter and Yelp data, and outline ideas for extensions.

Toward completely automated vowel extraction: Introducing DARLA. Sravana Reddy and James N. Stanford. Linguistics Vanguard (2015).

preprint paper

Automatic Speech Recognition (ASR) is reaching further and further into everyday life with Apple's Siri, Google voice search, automated telephone information systems, dictation devices, closed captioning, and other applications. Along with such advances in speech technology, sociolinguists have been considering new methods for alignment and vowel formant extraction, including techniques like the Penn Aligner (Yuan and Liberman, 2008) and the FAVE automated vowel extraction program (Evanini et al., 2009, Rosenfelder et al., 2011). With humans transcribing audio recordings into sentences, these semi-automated methods can produce effective vowel formant measurements (Labov et al., 2013). But as the quality of ASR improves, sociolinguistics may be on the brink of another transformative technology: large-scale, completely automated vowel extraction without any need for human transcription. It would then be possible to quickly extract vowels from virtually limitless hours of recordings, such as YouTube, publicly available audio/video archives, and large-scale personal interviews or streaming video. How far away is this transformative moment? In this article, we introduce a fully automated program called DARLA (short for "Dartmouth Linguistic Automation"), which automatically generates transcriptions with ASR and extracts vowels using FAVE. Users simply upload an audio recording of speech, and DARLA produces vowel plots, a table of vowel formants, and probabilities of the phonetic environments for each token. In this paper, we describe DARLA and explore its sociolinguistic applications. We test the system on a dataset of the US Southern Shift and compare the results with semi-automated methods.

A Web Application for Automated Dialect Analysis. Sravana Reddy and James N. Stanford. In Proceedings of NAACL 2015 (Demos).

paper poster website

Sociolinguists are regularly faced with the task of measuring phonetic features from speech, which involves manually transcribing audio recordings -- a major bottleneck to analyzing large collections of data. We harness automatic speech recognition to build an online end-to-end web application where users upload untranscribed speech collections and receive formant measurements of the vowels in their data. We demonstrate this tool by using it to automatically analyze President Barack Obama’s vowel pronunciations.

Decoding Running Key Ciphers. Sravana Reddy and Kevin Knight. In Proceedings of ACL 2012.


There has been recent interest in the problem of decoding letter substitution ciphers using techniques inspired by natural language processing. We consider a different type of classical encoding scheme known as the running key cipher, and propose a search solution using Gibbs sampling with a word language model. We evaluate our method on synthetic ciphertexts of different lengths, and find that it outperforms previous work that employs Viterbi decoding with character-based models.
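For readers unfamiliar with the scheme: a running key cipher shifts each plaintext letter by the corresponding letter of a key drawn from natural-language text, so unlike a simple substitution cipher the mapping changes at every position. A minimal sketch of the cipher itself (the example strings are arbitrary):

```python
from string import ascii_uppercase as ALPHA

def encrypt(plaintext, key):
    """Shift each plaintext letter by the corresponding key letter (mod 26)."""
    return "".join(
        ALPHA[(ALPHA.index(p) + ALPHA.index(k)) % 26]
        for p, k in zip(plaintext, key)
    )

def decrypt(ciphertext, key):
    """Invert the shift to recover the plaintext."""
    return "".join(
        ALPHA[(ALPHA.index(c) - ALPHA.index(k)) % 26]
        for c, k in zip(ciphertext, key)
    )

cipher = encrypt("ATTACKATDAWN", "THESECONDCIP")
print(cipher)                            # TAXSGMOGGCEC
print(decrypt(cipher, "THESECONDCIP"))   # ATTACKATDAWN
```

Decoding without the key means searching jointly over plausible plaintexts and plausible keys, which is what motivates the sampling approach in the paper.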

G2P Conversion of Proper Names Using Word Origin Information. Sonjia Waxmonsky and Sravana Reddy. In Proceedings of NAACL 2012.

paper poster data

Motivated by the fact that the pronunciation of a name may be influenced by its language of origin, we present methods to improve pronunciation prediction of proper names using word origin information. We train grapheme-to-phoneme (G2P) models on language-specific data sets and interpolate the outputs. We perform experiments on US personal surnames, a data set where word origin variation occurs naturally. Our methods can be used with any G2P algorithm that outputs posterior probabilities of phoneme sequences for a given word.
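The interpolation step can be sketched abstractly: each language-specific G2P model produces a posterior over pronunciations, and these are mixed using weights reflecting the name's estimated origin. The pronunciations and weights below are invented for illustration, not drawn from the paper's data:

```python
from collections import defaultdict

def interpolate(posteriors_by_origin, origin_weights):
    """Mix per-origin posteriors over pronunciations, weighted by P(origin | name)."""
    mixed = defaultdict(float)
    for origin, posterior in posteriors_by_origin.items():
        w = origin_weights.get(origin, 0.0)
        for pron, p in posterior.items():
            mixed[pron] += w * p
    return dict(mixed)

# Hypothetical posteriors for a surname, from English- and French-trained models.
posteriors = {
    "english": {"R EY N AH L T": 0.7, "R EH N OW": 0.3},
    "french":  {"R EH N OW": 0.9, "R EY N AH L T": 0.1},
}
weights = {"english": 0.25, "french": 0.75}
mixed = interpolate(posteriors, weights)
best = max(mixed, key=mixed.get)
print(best)  # R EH N OW
```

Because the mixing happens at the level of output distributions, any G2P system that exposes posterior probabilities can be plugged in unchanged.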

Learning from Mistakes: Expanding Pronunciation Lexicons Using Word Recognition Errors. Sravana Reddy and Evandro Gouvêa. In Proceedings of Interspeech 2011.

paper slides

We introduce the problem of learning pronunciations of out-of-vocabulary words from word recognition mistakes made by an automatic speech recognition (ASR) system. This question is especially relevant in cases where the ASR engine is a black box -- meaning that the only acoustic cues about the speech data come from the word recognition outputs. This paper presents an expectation maximization approach to inferring pronunciations from ASR word recognition hypotheses, which outperforms pronunciation estimates of a state-of-the-art grapheme-to-phoneme system.

Unsupervised Discovery of Rhyme Schemes. Sravana Reddy and Kevin Knight. In Proceedings of ACL 2011.

paper slides data code

This paper describes an unsupervised, language-independent model for finding rhyme schemes in poetry, using no prior knowledge about rhyme or pronunciation.

What We Know About The Voynich Manuscript. Sravana Reddy and Kevin Knight. In Proceedings of the ACL 2011 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities.

paper slides code and data press

The Voynich Manuscript is an undeciphered document from medieval Europe. We present current knowledge about the manuscript's text through a series of questions about its linguistic properties.

An MDL-based Approach to Extracting Subword Units for Grapheme-to-Phoneme Conversion. Sravana Reddy and John Goldsmith. In Proceedings of NAACL 2010.


We address a key problem in grapheme-to-phoneme conversion: the ambiguity in mapping grapheme units to phonemes. Rather than using single letters and phonemes as units, we propose learning chunks, or subwords, to reduce ambiguity. This can be interpreted as learning a lexicon of subwords that has minimum description length. We implement an algorithm to build such a lexicon, as well as a simple decoder that uses these subwords.

Substring-based Transliteration with Conditional Random Fields. Sravana Reddy and Sonjia Waxmonsky. In Proceedings of the ACL 2010 Named Entities Workshop.


Motivated by phrase-based translation research, we present a transliteration system where characters are grouped into substrings to be mapped atomically into the target language. We show how this substring representation can be incorporated into a Conditional Random Field model that uses local context and phonemic information. Our training and test data consists of three sets: English to Hindi, English to Kannada, and English to Tamil (Kumaran and Kellner, 2007) from the NEWS 2009 Machine Transliteration Shared Task (Li et al., 2009).

Understanding Eggcorns. Sravana Reddy. In Proceedings of the NAACL 2009 Workshop on Computational Approaches to Linguistic Creativity.


An eggcorn is a type of linguistic error where a word is substituted with one that is semantically plausible -- that is, the substitution is a semantic reanalysis of what may be a rare, archaic, or otherwise opaque term. We build a system that, given the original word and its eggcorn form, finds a semantic path between the two. Based on these paths, we derive a typology that reflects the different classes of semantic reinterpretation underlying eggcorns.


These are non-archival papers at conferences.

A large-scale online study of dialect variation in the US Northeast: Crowdsourcing with Amazon Mechanical Turk. Chaeyoon Kim, Sravana Reddy, Ezra Wyschogrod, and James Stanford. In NWAV 2016.

Due to the Founder Effect and the early English colonies, the US Northeast has some of the smallest dialect sub-regions in North America. Can these fine-grained distinctions be observed in an online crowd-sourced survey? Moreover, Carver predicts that new features emerge along the same lines as previous generations. Is this happening in Eastern New England? We used Amazon Mechanical Turk for two online crowdsourcing tasks: a self-reporting survey and an audio-recording task. We analyze the data graphically and statistically and compare with prior work.

Automatic speech recognition in sociophonetics. Sravana Reddy, James N. Stanford, and Michael Lefkowitz. Workshop (tutorial) in NWAV 2015. link

Is the Future Almost Here? Large-Scale Completely Automated Vowel Extraction of Free Speech. Sravana Reddy and James N. Stanford. In NWAV 2014.


Automatic Speech Recognition (ASR) is reaching farther into everyday life through applications like Apple’s Siri. Likewise, sociolinguists have been considering new technologies for vowel formant extraction, including semi-automated alignment/extraction techniques like the Penn Aligner and Forced Alignment Vowel Extraction (FAVE). With humans transcribing recordings into sentences, these semi-automated methods produce effective results. But sociolinguistics may be on the brink of another transformative technology: large-scale, completely automated vowel extraction without any need for human transcription. It would then be possible to quickly extract vowels from virtually limitless hours of recordings, such as YouTube, publicly available audio/video archives, and even live-streaming video. How far away is this transformative moment? In the present study, we apply state-of-the-art ASR to a real-world sociolinguistic dataset (U.S. Southern Vowel Shift) as a feasibility test.

A Twitter-Based Study of Newly Formed Clippings in American English. Sravana Reddy, James N. Stanford, and Joy Zhong. In ADS 2014.

slides press

Following Baclawski (2012), this study uses Twitter to examine newly formed clippings among younger speakers, including awks (awkward), adorb (adorable), ridic (ridiculous), hilar (hilarious). We analyzed 94 million tweets from 334,000 U.S. Twitter users who posted during 2013 (cf. Eisenstein et al. 2010; Bamman et al. 2012). We find that while women and men both use truncated forms, women are the leaders of the newer, primarily adjectival forms. These recently coined forms are also more common in tweets from urban locations. We compare our results to classic principles (Labov 2001), illustrating how large-scale Twitter analyses can be valuable in American dialectology.

A Document Recognition System for Early Modern Latin. Sravana Reddy and Gregory Crane. In DHCS 2006.

Large-scale digitization of manuscripts is facilitated by high-accuracy optical character recognition (OCR) engines. The focus of our work is on using these tools to digitize Latin texts. Many texts in the language, especially the early modern, make heavy use of special characters like ligatures and accented abbreviations. Current OCR engines are inadequate for our purpose: their built-in training sets do not include all of these special characters, and because post-processing of OCR output relies on data and methods specific to the domain language, most current systems do not implement error-correction tools for Latin. This abstract outlines the development of a document recognition system for medieval and early modern Latin texts. We first evaluate the performance of the open source OCR framework, Gamera, on these manuscripts. We then incorporate language modeling functions to sharpen the character recognition output.


Learning Pronunciations from Unlabeled Evidence. 2012. Doctoral Dissertation, The University of Chicago. front matter

Part of Speech Induction Using Non-negative Matrix Factorization. 2009. Master's Thesis, The University of Chicago.

Unsupervised part-of-speech induction involves the discovery of syntactic categories in a text, given no additional information other than the text itself. One requirement of an induction system is the ability to handle multiple categories for each word, in order to deal with word sense ambiguity. We construct an algorithm for unsupervised part-of-speech induction, treating the problem as one of soft clustering. The key technical component of the algorithm is the application of the recently developed technique of non-negative matrix factorization to the task of category discovery, using word contexts and morphology as syntactic cues.
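The soft-clustering idea can be sketched with a tiny example: factor a nonnegative word-by-context count matrix V as V ≈ WH, so that each row of W gives a word's soft loading on each induced category. The sketch below uses generic multiplicative updates (Lee and Seung) on made-up counts; it is an illustration of the technique, not the thesis's exact algorithm or data:

```python
import random

def nmf(V, k, iters=200, seed=0):
    """Factor a nonnegative matrix V (n x m, list of lists) as V ~ W H
    using multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    eps = 1e-9

    def product(A, B):  # plain matrix product on lists of lists
        return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
                 for j in range(len(B[0]))] for i in range(len(A))]

    for _ in range(iters):
        WH = product(W, H)
        # H <- H * (W^T V) / (W^T W H)
        for a in range(k):
            for j in range(m):
                num = sum(W[i][a] * V[i][j] for i in range(n))
                den = sum(W[i][a] * WH[i][j] for i in range(n)) + eps
                H[a][j] *= num / den
        WH = product(W, H)
        # W <- W * (V H^T) / (W H H^T)
        for i in range(n):
            for a in range(k):
                num = sum(V[i][j] * H[a][j] for j in range(m))
                den = sum(WH[i][j] * H[a][j] for j in range(m)) + eps
                W[i][a] *= num / den
    return W, H

# Hypothetical word-by-context counts: rows are words, columns are context
# features (e.g. "the _", "_ s", "to _"); the numbers are invented.
words = ["dog", "cat", "run", "walk"]
V = [[5, 2, 0],
     [4, 3, 0],
     [0, 1, 4],
     [0, 1, 5]]
W, H = nmf(V, k=2)
# Each row of W is a soft category membership; take the argmax for display.
clusters = [max(range(2), key=W[i].__getitem__) for i in range(len(words))]
print(dict(zip(words, clusters)))
```

Because W is nonnegative rather than one-hot, a word occurring in both noun-like and verb-like contexts receives weight on both factors, which is how the soft clustering handles ambiguity.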

Students Advised

Ian Stewart (Senior Thesis at Dartmouth). Now a PhD student at Georgia Tech.

  • Now We Stronger Than Ever: African-American Syntax in Twitter. In Proceedings of the Student Research Workshop at EACL 2014. paper

Emily Ahn (Senior Thesis at Wellesley). Now a student at Carnegie Mellon LTI.

  • A Computational Approach to Foreign Accent Classification. thesis

Teaching & Service

I was the local organizer for the North American Computational Linguistics Olympiad (NACLO) at Dartmouth, and one of the co-chairs of the demo session at NAACL 2016. I also review for *ACL conferences and workshops, and various journals.

I organized the Wellesley CS Colloquium in my last year.

Tools and Data

This is a collection of resources I created or collected in the course of my work that may be useful to others.

Python Autograder with HTML Output (under development). Sravana Reddy and Daniela Kreimerman. 2016.
Autograder for the Introductory CS class at Wellesley.


Transcriptions for the CSLU Foreign-Accented English Corpus. Emily Ahn and Sravana Reddy. 2016.
The CSLU Foreign-Accented Speech Corpus is a great source of speech data from non-native English speakers. We crowdsourced transcriptions for 7 of the 23 native languages on Mechanical Turk, and are making them available here.


DARLA (Dartmouth Linguistic Automation). Sravana Reddy and James Stanford, with assistance from Irene Feng. 2015-2016.
DARLA is a suite of automated analysis programs tailored to research questions in sociophonetics.

website code available on request

Chicago Rhyming Poetry Corpus. Morgan Sonderegger and Sravana Reddy. 2011.
A collection of rhyming poetry in English and French, manually annotated with rhyme schemes.