Taaltechnologien

Website for the course Taaltechnologien for AI students.

The course is based on the book Speech and Language Processing by Daniel Jurafsky and James H. Martin.

Exercises

  1. (chapter 4): Transcribe 12 Dutch words of your choosing into Worldbet (PostScript version at CSLU.CSE.OGI).
  2. (chapter 5): Implement the edit distance algorithm from figure 5.05, but include back-pointers so that the best alignment can be traced (use 'S', 'I', and 'D' to indicate Substitution, Insertion, and Deletion). Test it on a few examples (see the sketch after this list). NOTE: there are errors in the pseudo-code: the iterations should start at 1, not 0, and row 0 and column 0 of the distance matrix must be initialized.
  3. (chapter 6): Implement the backoff smoothing (equation 6.37) based on Katz discounting (equation 6.29) in well-documented pseudo-code or a working program (see the sketch after this list). Note that
    1) the new, smoothed, zero count becomes N1/N0, i.e., the number of word N-grams seen exactly once divided by the number of unseen word N-grams;
    2) in the backoff procedure, you count only over the N-grams actually seen.
    That is, summing over all words Wn with the previous words kept fixed, P(Wn|Wn-N+1..Wn-1) becomes: sum the smoothed counts Counts(Wn-N+1..Wn) over all seen N-grams ending in Wn, and divide them by the sum of the smoothed counts over all N-grams, seen and unseen. Do the same for the backoff probabilities, but use the smoothed counts of the (N-1)-grams while still counting over the seen N-grams.
  4. (chapters 7 & 8): POS-tagger implements a (toy) HMM POS tagger. It can use a POS bigram table to tag a sentence (note: it is extremely slow). The tagger uses a Viterbi search to find the path with the minimal summed cost, i.e., Cost = -log2(P(word|tag)) + m * -log2(P(tag|previous tag)), where m is the language match factor (see the sketch after this list). Answer the following questions:
  5. Student Presentations of recent papers
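
A minimal Python sketch for exercise 2, with the two pseudo-code fixes applied. The function name and the substitution cost of 2 (the Levenshtein variant used in the book) are assumptions, not part of the exercise text:

```python
def min_edit_distance(source, target, sub_cost=2):
    n, m = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]

    # Initialize row 0 and column 0 (the step missing from the pseudo-code).
    for i in range(1, n + 1):
        dist[i][0] = i
        back[i][0] = 'D'
    for j in range(1, m + 1):
        dist[0][j] = j
        back[0][j] = 'I'

    # The iterations start at 1, not 0.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (0 if source[i - 1] == target[j - 1] else sub_cost)
            dele = dist[i - 1][j] + 1     # delete source[i-1]
            ins = dist[i][j - 1] + 1      # insert target[j-1]
            dist[i][j], back[i][j] = min((sub, 'S'), (dele, 'D'), (ins, 'I'))

    # Follow the back-pointers from the bottom-right corner to recover the
    # alignment ('S' also covers exact matches, at zero cost).
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        op = back[i][j]
        ops.append(op)
        if op == 'S':
            i, j = i - 1, j - 1
        elif op == 'D':
            i -= 1
        else:
            j -= 1
    return dist[n][m], ''.join(reversed(ops))

print(min_edit_distance("intention", "execution"))  # distance 8, the book's example
```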
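
For exercise 3, here is a minimal bigram sketch of Katz backoff with Good-Turing discounting. The toy corpus and all function names are invented for illustration, and for brevity it backs off to the raw unigram distribution; the exercise additionally asks you to use the smoothed counts of the (N-1)-grams:

```python
from collections import Counter
from math import isclose

corpus = "the cat sat on the mat the cat ate".split()   # toy corpus (assumed)
vocab = sorted(set(corpus))
V = len(vocab)

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

# N_c = number of bigram types seen exactly c times; N_0 counts the unseen.
Nc = Counter(bigrams.values())
Nc[0] = V * V - len(bigrams)

def gt_count(c):
    """Good-Turing discounted count c* = (c+1) * N_{c+1} / N_c (eq. 6.29).
    For c = 0 this is N1/N0, as in point 1) of the exercise."""
    if Nc[c + 1] == 0:        # no higher count observed: keep the raw count
        return c
    return (c + 1) * Nc[c + 1] / Nc[c]

def p_katz(prev, w):
    """Katz backoff probability P(w | prev) for a bigram model (eq. 6.37)."""
    c = bigrams[(prev, w)]
    if c > 0:
        # Seen bigram: discounted count over the history's token count.
        return gt_count(c) / unigrams[prev]
    # Left-over mass: 1 minus the mass of the *seen* bigrams only,
    # per point 2) of the exercise.
    beta = 1.0 - sum(gt_count(bigrams[(prev, v)]) for v in vocab
                     if bigrams[(prev, v)] > 0) / unigrams[prev]
    # Distribute beta over the unseen continuations, proportional to
    # their unigram counts.
    denom = sum(unigrams[v] for v in vocab if bigrams[(prev, v)] == 0)
    return beta * unigrams[w] / denom

# Sanity check: the distribution over the next word must sum to 1.
assert isclose(sum(p_katz("the", v) for v in vocab), 1.0)
```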
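
For exercise 4, a minimal sketch of the Viterbi search using the cost formula above. The tag set, the probability tables, and the default match factor m = 1.0 are invented here; the course's POS-tagger program uses its own POS bigram table:

```python
from math import log2, inf

# Toy tables (assumed, not taken from the course's POS-tagger program).
emit = {                     # P(word | tag)
    ("the", "DET"): 0.5,
    ("dog", "N"): 0.1, ("barks", "N"): 0.01,
    ("dog", "V"): 0.01, ("barks", "V"): 0.2,
}
trans = {                    # P(tag | previous tag); "<s>" starts a sentence
    ("<s>", "DET"): 0.6, ("<s>", "N"): 0.3, ("<s>", "V"): 0.1,
    ("DET", "N"): 0.7, ("DET", "V"): 0.05, ("DET", "DET"): 0.05,
    ("N", "V"): 0.5, ("N", "N"): 0.2, ("N", "DET"): 0.1,
    ("V", "DET"): 0.4, ("V", "N"): 0.2, ("V", "V"): 0.1,
}
tags = ["DET", "N", "V"]

def cost(p):
    """-log2(p), with an infinite cost for zero probability."""
    return -log2(p) if p > 0 else inf

def viterbi(words, m=1.0):
    """Return (total cost, tag sequence) minimizing the summed cost
    -log2(P(word|tag)) + m * -log2(P(tag|previous tag))."""
    # best[t] = (cheapest cost of any path ending in tag t, that path)
    best = {t: (cost(emit.get((words[0], t), 0))
                + m * cost(trans.get(("<s>", t), 0)), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            # Cheapest predecessor for tag t at this position.
            c, path = min((best[p][0] + m * cost(trans.get((p, t), 0)),
                           best[p][1]) for p in tags)
            new[t] = (c + cost(emit.get((w, t), 0)), path + [t])
        best = new
    return min(best.values())

print(viterbi("the dog barks".split()))  # -> (cost, ['DET', 'N', 'V'])
```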

Materials

Course texts from Speech and Language Processing