research and tinkering

Influence Pathways

A platform for accelerating knowledge exploration and opportunity discovery by constructing influence graphs across scientific publications. The IE component is powered by Odin. This is an ongoing, collaborative project with Marco Valenzuela, Mihai Surdeanu, and Zechy Wong in partnership with the Bill and Melinda Gates Foundation.

Odin (open-domain informer)

A rule-based information extraction language based on linguistic features. This is a collaborative project with Marco Valenzuela. Odin allows you to define traversal patterns over syntactic dependency graphs.
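To give a feel for what a traversal pattern over a dependency graph means, here is a toy sketch in Python. It is not Odin's rule language or its implementation: the `traverse` function, the `EDGES` triples, and the `>label` / `<label` step notation (outgoing and incoming edges) are all assumptions made up for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Index (head, label, dependent) triples for lookup in both directions."""
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for head, label, dependent in edges:
        outgoing[head].append((label, dependent))
        incoming[dependent].append((label, head))
    return outgoing, incoming


def traverse(edges, start, pattern):
    """Follow a sequence of steps from `start` and return the tokens reached.

    Each step is ">label" (follow an outgoing edge to a dependent)
    or "<label" (follow an incoming edge back to a head).
    """
    outgoing, incoming = build_graph(edges)
    frontier = {start}
    for step in pattern:
        direction, label = step[0], step[1:]
        if direction not in "<>":
            raise ValueError("step must start with '>' or '<': %r" % step)
        index = outgoing if direction == ">" else incoming
        frontier = {
            target
            for node in frontier
            for edge_label, target in index[node]
            if edge_label == label
        }
    return frontier


# Hypothetical dependency parse of "You killed my father."
EDGES = [
    ("killed", "nsubj", "You"),
    ("killed", "dobj", "father"),
    ("father", "poss", "my"),
]
```

With these edges, `traverse(EDGES, "killed", [">dobj", ">poss"])` walks from the verb to its object and then to the object's possessor. A real system would identify tokens by position rather than by surface string, since the same word can occur more than once in a sentence.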

path-o-gen (forthcoming)

A method that uses distant supervision and neural network language models (NNLMs) to generate partially lexicalized, syntax-based information extractors that capture relations and events.

why-this (private contract)

A web-based framework for active learning of relevant, interpretable features for document classification.

I was contracted to develop the (non-public) prototype (UI and library) for an NLP startup.


reach is an information extraction system for the biomedical domain that is designed to read scientific literature and extract cancer signaling pathways.

A few things that reach can currently do:
  • recognize biochemical entities (proteins, chemicals, etc.)
  • ground entities to knowledge bases (e.g., UniProt)
  • resolve coreference for both entities and interactions
  • detect and consolidate duplicate mentions of an entity or event
  • find causal precedence relations between events
The system makes extensive use of Odin.


I am a contributor to Mihai Surdeanu's Scala library for natural language processing.


curl -H "Content-Type: application/json" -X POST -d '{"text": "My name is Inigo Montoya. You killed my father. Prepare to die."}' http://localhost:8888/annotate

Actor-based RESTful services for interacting with processors. Built using spray.
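The same request can be built from Python 3 using only the standard library. This is a minimal sketch that mirrors the curl call above. The helper names `build_annotate_request` and `annotate` are hypothetical, and treating the response body as JSON is an assumption.

```python
import json
import urllib.request


def build_annotate_request(text, host="localhost", port=8888):
    """Build a POST request for the /annotate endpoint shown in the curl call."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/annotate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def annotate(text, **kwargs):
    """Send the request (needs a running server) and decode the reply.

    Assumption: the server answers with a JSON document.
    """
    request = build_annotate_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect the URL and payload without a server running.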


from processors import *

# Specify the port on which to expose the server.
proc = Processor(port=8886)

# Start the server.

doc = proc.annotate("My name is Inigo Montoya.")

A Python library (compatible with both 2.x and 3.x) for interacting with processors using the processors-server.


The successor to Jeff Berry's AutoTrace, autotres is a Python 3.x library for training deep belief networks to automatically detect and trace tongue contours in ultrasound images. Built with Benjamin Martin using Lasagne.

tongue tracing

An HTML5 + JavaScript method for tracing tongue contours in ultrasound images captured during speech. Written for use by Diana Archangeli's Arizona Phonological Imaging Lab.

web services for psycholing research

Some simple web services built for Ken Forster's lab to help speed up the development of psycholinguistic experiments. The site uses a MongoDB store of word2vec embeddings trained on the English Gigaword to quickly perform vector-based comparisons of words.
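As an illustration of a vector-based word comparison, here is a small sketch using cosine similarity. Cosine similarity is an assumption about the kind of comparison involved, the in-memory `EMBEDDINGS` dictionary stands in for the MongoDB store, and the three-dimensional vectors are made up rather than taken from the Gigaword-trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings):
    """Rank every other word by similarity to `word`, highest first."""
    target = embeddings[word]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors for illustration only (not real word2vec output).
EMBEDDINGS = {
    "cat": [1.0, 0.0, 0.0],
    "kitten": [0.9, 0.1, 0.0],
    "car": [0.0, 0.0, 1.0],
}
```

Here `most_similar("cat", EMBEDDINGS)` ranks "kitten" ahead of "car", because the toy vectors for "cat" and "kitten" point in nearly the same direction.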


A simple Python bot that uses praw to scour Reddit for posts made by celebrities. The retrieved text is then run through an NLP pipeline that uses nltk for the heavy lifting, plus some custom post-processing to correct sentence boundaries and tokenization. I used kikimimi to start the Snoop corpus.

AZP2FA (forced aligner)

A forced aligner built with Python and HTK. A fork of the Penn Phonetics Lab Forced Aligner (P2FA).