Projects
research and tinkering

Odin (open-domain informer)


A rule-based information extraction language that operates over linguistic features. This is a collaborative project with Marco Valenzuela. Odin allows you to define traversal patterns over syntactic dependency graphs.
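
To give a feel for what a traversal pattern over a dependency graph means, here is a minimal Python sketch. It is not Odin's rule syntax or implementation: the helper names (`build_graph`, `traverse`), the hand-written parse, and the dependency labels are all assumptions made for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Index (head, label, dependent) triples by their head token."""
    graph = defaultdict(list)
    for head, label, dependent in edges:
        graph[head].append((label, dependent))
    return graph


def traverse(graph, start, path):
    """Follow a sequence of dependency labels outward from `start`.

    Returns every token reachable by taking the labels in `path` in order.
    """
    frontier = [start]
    for label in path:
        frontier = [
            dependent
            for node in frontier
            for (edge_label, dependent) in graph[node]
            if edge_label == label
        ]
    return frontier


# Hand-written (hypothetical) parse of "You killed my father."
edges = [
    ("killed", "nsubj", "You"),
    ("killed", "dobj", "father"),
    ("father", "nmod:poss", "my"),
]
graph = build_graph(edges)

# One-step pattern: the subject of the trigger verb.
agent = traverse(graph, "killed", ["nsubj"])
# Two-step pattern: the possessor of the verb's direct object.
possessor = traverse(graph, "killed", ["dobj", "nmod:poss"])
```

Each pattern here is just an ordered list of edge labels anchored at a trigger word, which is roughly the idea behind matching arguments by their syntactic path.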

reach

reach is an information extraction system for the biomedical domain that is designed to read scientific literature and extract cancer signaling pathways.

A few things that reach can currently do:
  • recognize biochemical entities (proteins, chemicals, etc.)
  • ground entities to knowledge bases (e.g., UniProt)
  • perform coreference resolution for both entities and interactions
  • detect and consolidate duplicate mentions of an entity or event
  • find causal precedence relations between events
The system makes extensive use of Odin.

Causal connectivity


A visualization of a causal graph of events automatically extracted from biomedical literature using Odin rules. This is an ongoing, collaborative project with Marco Valenzuela and Mihai Surdeanu.

processors


I am a contributor to Mihai Surdeanu's Scala library for natural language processing.

processors-server

curl -H "Content-Type: application/json" -X POST -d '{"text": "My name is Inigo Montoya. You killed my father. Prepare to die."}' http://localhost:8888/annotate

Actor-based RESTful services for interacting with processors. Built using spray.
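
The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl call above (same URL, header, and JSON body). The helper name `build_annotate_request` is made up for illustration, and the commented-out send step assumes the server replies with JSON.

```python
import json
import urllib.request


def build_annotate_request(text, host="localhost", port=8888):
    """Build a POST request equivalent to the curl example (hypothetical helper)."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url="http://{}:{}/annotate".format(host, port),
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_annotate_request(
    "My name is Inigo Montoya. You killed my father. Prepare to die."
)

# Sending the request requires a running processors-server.
# The JSON response format is an assumption:
# with urllib.request.urlopen(request) as response:
#     annotated = json.loads(response.read().decode("utf-8"))
```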

py-processors

from processors import *

# Specify the port on which to expose the server.
proc = Processor(port=8886)

# Start the server (the path to the jar must be passed as a string).
proc.start_server("path/to/processors-server.jar")

doc = proc.annotate("My name is Inigo Montoya.")

A Python library, compatible with both 2.x and 3.x, for interacting with processors via the processors-server.


autotres


The successor to Jeff Berry's AutoTrace, autotres is a Python 3.x library for training deep belief networks to automatically detect and trace tongue contours in ultrasound images. Built with Benjamin Martin using Lasagne.

tongue tracing


An HTML5 + JavaScript method for tracing tongue contours in ultrasound images captured during speech. Written for use by Diana Archangeli's Arizona Phonological Imaging Lab.


web services for psycholing research


Some simple web services built for Ken Forster's lab to help speed up the development of psycholinguistic experiments. The site uses a MongoDB store of word2vec embeddings trained on the English Gigaword to quickly perform vector-based comparisons of words.
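
A vector-based comparison of two words usually comes down to cosine similarity between their embeddings. The sketch below shows that computation in plain Python. The three-dimensional vectors are toy values invented for illustration (not real word2vec output), and the MongoDB lookup is left out.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Treat comparisons against a zero vector as "no similarity".
        return 0.0
    return dot / (norm_u * norm_v)


# Toy embeddings (made-up values standing in for stored word vectors).
embeddings = {
    "cat": [1.0, 2.0, 0.0],
    "dog": [2.0, 4.0, 0.0],
    "car": [0.0, 0.0, 3.0],
}

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
```

Vectors pointing in the same direction score close to 1 regardless of length, while orthogonal vectors score 0.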


kikimimi


A simple Python bot that uses praw to scour Reddit for posts made by celebrities. The retrieved text is then run through an NLP pipeline that relies on nltk for the heavy lifting, plus some custom post-processing to correct sentence boundaries and tokenization. I used kikimimi to start the Snoop corpus.

AZP2FA (forced aligner)


A forced aligner built with Python and HTK. A fork of the Penn Phonetics Lab Forced Aligner (P2FA).