GOAL: intelligently identify the part of speech of each word in a sentence
SYNOPSIS: Individual project | Spring 2018 | Java
KEY SKILLS: Hidden Markov Models | Viterbi Algorithm | Parsing files, nested Maps, Graphs, Priority Queues
VALIDATION: 96.4% accuracy on the Brown corpus (35,090/36,394 words correct)
Unfortunately, this program needs to read files that are too large for a web browser to handle.
So instead of a trinket, I will soon add a short video here of the tagger in action.
Key Process:
- Given a training corpus (sentences with the parts of speech already tagged), parsed it into nested maps to compute the probability of (a) a word being a particular part of speech and (b) each part of speech transitioning to each other part of speech
- Combined these probabilities into a graph, forming a Hidden Markov Model
- Implemented the Viterbi algorithm. Each word in a sentence can be followed by several candidate parts of speech. The Viterbi algorithm scores every candidate transition using the probabilities above, keeps only the best-scoring path into each part of speech, and so recovers the most likely sequence of parts of speech for the whole sentence.
- Used Scanner to create a user interface that tags each word the user types.
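
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name HmmTaggerSketch, the "#" start tag, the -100 log-penalty for unseen words, and the three-sentence toy corpus are all assumptions made for the example. It counts tag-to-word and tag-to-tag events in nested maps, converts the counts to log probabilities, and then runs Viterbi with back pointers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a bigram HMM tagger; every name here is illustrative. */
class HmmTaggerSketch {
    static final String START = "#";      // assumed pseudo-tag that precedes the first word
    static final double UNSEEN = -100.0;  // assumed log-penalty for a word never seen with a tag

    // tag -> (word -> log P(word | tag))
    final Map<String, Map<String, Double>> emissions = new HashMap<>();
    // tag -> (next tag -> log P(next tag | tag))
    final Map<String, Map<String, Double>> transitions = new HashMap<>();

    /** Counts emissions and transitions, then normalises each row to log probabilities. */
    void train(List<String[]> sentences, List<String[]> tags) {
        for (int s = 0; s < sentences.size(); s++) {
            String prev = START;
            for (int i = 0; i < sentences.get(s).length; i++) {
                String tag = tags.get(s)[i];
                bump(emissions, tag, sentences.get(s)[i].toLowerCase());
                bump(transitions, prev, tag);
                prev = tag;
            }
        }
        normalise(emissions);
        normalise(transitions);
    }

    private static void bump(Map<String, Map<String, Double>> table, String outer, String inner) {
        table.computeIfAbsent(outer, k -> new HashMap<>()).merge(inner, 1.0, Double::sum);
    }

    private static void normalise(Map<String, Map<String, Double>> table) {
        for (Map<String, Double> row : table.values()) {
            double sum = 0;
            for (double count : row.values()) sum += count;
            final double total = sum;
            row.replaceAll((key, count) -> Math.log(count / total));
        }
    }

    /** Viterbi: keep the best log score into each tag per word, then walk back pointers. */
    List<String> tag(String[] words) {
        Map<String, Double> scores = new HashMap<>();
        scores.put(START, 0.0);
        List<Map<String, String>> backPointers = new ArrayList<>();

        for (String raw : words) {
            String word = raw.toLowerCase();
            Map<String, Double> nextScores = new HashMap<>();
            Map<String, String> back = new HashMap<>();
            for (Map.Entry<String, Double> cur : scores.entrySet()) {
                Map<String, Double> outgoing =
                        transitions.getOrDefault(cur.getKey(), Collections.emptyMap());
                for (Map.Entry<String, Double> edge : outgoing.entrySet()) {
                    String next = edge.getKey();
                    double emit = emissions.getOrDefault(next, Collections.emptyMap())
                                           .getOrDefault(word, UNSEEN);
                    double score = cur.getValue() + edge.getValue() + emit;
                    if (!nextScores.containsKey(next) || score > nextScores.get(next)) {
                        nextScores.put(next, score);
                        back.put(next, cur.getKey());
                    }
                }
            }
            scores = nextScores;
            backPointers.add(back);
        }

        // Pick the best final tag, then follow the back pointers to the start.
        String best = null;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (best == null || e.getValue() > scores.get(best)) best = e.getKey();
        }
        List<String> path = new ArrayList<>();
        for (int i = words.length - 1; i >= 0 && best != null; i--) {
            path.add(best);
            best = backPointers.get(i).get(best);
        }
        Collections.reverse(path);
        return path;
    }

    public static void main(String[] args) {
        HmmTaggerSketch tagger = new HmmTaggerSketch();
        tagger.train(  // toy corpus, invented for this example
            Arrays.asList("the dog barks".split(" "), "I fish".split(" "), "the fish swims".split(" ")),
            Arrays.asList("DET N V".split(" "), "PRO V".split(" "), "DET N V".split(" ")));
        System.out.println(tagger.tag("the fish swims".split(" ")));
    }
}
```

With this toy corpus, "fish" is seen as both a noun and a verb, and the transition probabilities are what settle it: after "the" it is tagged N, after "I" it is tagged V. Working in log probabilities keeps long sentences from underflowing to zero. If the decoder reaches a point where no known transition applies, this sketch returns an empty list rather than guessing.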