A simple project for an NLP class would be to make a decent corpus-based re-emvoweler. This one really isn't very good, and could easily be improved by considering bigram frequencies.
What a marvelous idea! You're right, that one isn't very good. It mapped a "t" to "to" instead of "it". Since "it" is a pronoun and "to" is a preposition, it should be fairly easy to decide between t = "to" and t = "it". Figuring out whether t = "tie" would be a little more complicated.
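For instance, here's a toy illustration of how a bigram model settles "t" -- the table and its probabilities are invented purely to show the mechanics, not taken from any real corpus:

# Toy sketch: the same disemvoweled token 't' resolves differently
# depending on the previous word. These probabilities are made up.
cPw = {('want', 'to'): 0.3, ('want', 'it'): 0.05,
       ('do', 'it'): 0.2, ('do', 'to'): 0.01}.get

def emvowel_t(prev):
    return max(['to', 'it'], key=lambda w: cPw((prev, w), 0.0))

print emvowel_t('want')  # -> 'to'
print emvowel_t('do')    # -> 'it'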
Are you sure that isn't how that one works? I'd be rather surprised...
Would be interesting to play with a multi-level system: first use intraword letter bigrams to guess the vowels, then check that result against interword bigrams to confirm that the resulting sequence of words makes sense, the way a good spell-checker does.
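Something like this for the first stage, maybe -- a minimal sketch of scoring candidate words by intraword letter bigrams. The letter_bigram_logprob table here is hypothetical (you'd estimate it from a corpus), and the second stage would then re-check the winners with word bigrams:

# Hypothetical table of log10 P(letter | previous letter), estimated
# from a corpus; '^' and '$' mark word start and end.
letter_bigram_logprob = {}

def letter_score(word, floor=-6.0):
    # Sum the letter-bigram log-probabilities over the word,
    # with a smoothing floor for unseen pairs.
    letters = ['^'] + list(word) + ['$']
    return sum(letter_bigram_logprob.get(pair, floor)
               for pair in zip(letters, letters[1:]))

def guess_by_letters(candidates):
    # Stage 1: rank candidate emvowelings by letter bigrams alone.
    return max(candidates, key=letter_score)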
I was thinking that there would be a small number k of matching words for each disemvoweled word, making it easy to exhaustively test all k^2 bigram frequencies for each adjacent pair. That is, we wouldn't need to worry about letter bigrams.
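Roughly like this, per adjacent pair of positions -- a sketch assuming a conditional bigram probability cPw(word, prev) like the one in the code below. Note that ranking each pair independently is exactly the shortcut the Viterbi version further down avoids:

def score_pairs(prev_candidates, cur_candidates, cPw):
    # Exhaustively score all k*k (previous, current) emvoweling pairs
    # under the word-bigram model; cheap as long as k stays small.
    return sorted(((cPw(cur, prev), prev, cur)
                   for prev in prev_candidates
                   for cur in cur_candidates),
                  reverse=True)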
I'll post a reformatted version after I fix the code -- to use bigrams properly I shouldn't assume the previous word was guessed correctly; I should use the Viterbi algorithm instead.
import math
import re

# pdist.Pw is the unigram model, and pdist.cPw the bigram.
import pdist

# A special token our models take to signify the start of a sentence.
sentence_token = '<' + 'S' + '>'

y_is_a_vowel = False

def disemvowel(t):
    if y_is_a_vowel:
        return re.sub(r'[aeiouyAEIOUY]', '', t)
    return re.sub(r'[aeiouAEIOU]', '', t)

# Map each disemvoweled form to every corpus word that produces it.
emvowel_dict = {}
for token in pdist.Pw.iterkeys():
    emvowel_dict.setdefault(disemvowel(token), []).append(token)

def all_emvowelings(t):
    t = t.lower()
    return emvowel_dict.get(t, [t])

def emvowel1(t):
    # Unigram model: the most frequent word with this disemvoweling.
    return max(all_emvowelings(t), key=pdist.Pw.get)

def greedy_emvowel2(tokens):
    # Bigram model, greedily: condition each choice on the previous guess.
    rv = [sentence_token]
    for t in tokens:
        prev = rv[-1]
        best = max(all_emvowelings(t),
                   key=lambda candidate: pdist.cPw(candidate, prev))
        rv.append(best)
    return rv[1:]

def viterbi_emvowel2(tokens):
    # Bigram model, globally: the most probable word sequence overall.
    logprob, words = max(emvoweling(tuple(tokens)))
    return words[1:]

@pdist.memo
def emvoweling(tokens):
    # Returns, for each candidate last word, the best-scoring path
    # ending in that word.
    if not tokens:
        return [(0.0, [sentence_token])]
    def extend((logprob, words), c):
        return (logprob + math.log10(pdist.cPw(c, words[-1])),
                words + [c])
    def extend2(prevresult, tween, c):
        # Only used by the commented-out 'tween' variant below.
        return extend(extend(prevresult, tween) if tween else prevresult,
                      c)
    prevs = emvoweling(tokens[:-1])
    return [max(extend(prev, c)
                for prev in prevs)
            # This is to try inserting missing 'a' and 'I' words; but it never
            # seems to judge them worth inserting, in practice:
            # for tween in ['a', 'i', None])
            for c in all_emvowelings(tokens[-1])]

def try_on(sample):
    print sample
    tokens = re.findall(r"['\w]+", sample)
    print
    print ' '.join(map(emvowel1, tokens))
    print
    print ' '.join(greedy_emvowel2(tokens))
    print
    print ' '.join(viterbi_emvowel2(tokens))
    # Let's look into why Viterbi over bigrams is so slow here:
    options = [len(all_emvowelings(t)) for t in tokens]
    print options
    print sum(options), 'steps for greedy_emvowel2'
    print sum(prev * next for prev, next in zip([1] + options, options)), \
        'steps for viterbi_emvowel2'
    # So it turns out to be common in our corpus for very short words
    # to have 100-200 emvowelings. This is expensive since we're
    # quadratic in that number (while of course linear in the length
    # of the input).

samples = [
    """f t's tr tht r spcs s ln n th nvrs, thn 'd hv t sy tht th nvrs md rthr lw nd sttld fr vry lttl.
Grg Crln""",
    """Bttr t rmn slnt nd b thght fl thn t spk t nd rmv ll dbt.
brhm Lncln""",
    """ghty prcnt f sccss s shwng p.
Wdy lln""",
    """d nt wnt ppl t b grbl, s t svs m th trbl f lkng thm.
Jn stn""",
    """n rl lf, ssr y, thr s n sch thng s lgbr.
Frn Lbwtz""",
    """Sbd yr pptts, my drs, nd y'v cnqrd hmn ntr.
Chrls Dckns""",
    """A simple project for an NLP class would be to make a decent corpus-based re-emvoweler. This one really isn't very good, and could easily be improved by considering bigram frequencies."""]

for sample in samples:
    print '---------------------------------------------'
    try_on(disemvowel(sample))
    print
Looks like a unigram model there. I just hacked one up and got almost exactly the same result on their sample input. Surprisingly, the bigram model does worse -- maybe it's a cherry-picked input?
So, I was talking about more than just different orthography for nasalizing vowels. I actually find [nv] easier to pronounce than [mv] for some reason.
I think you're not crazy -- m and v don't have exactly the same place of articulation, and it's much easier to go from the alveolar n to the labiodental v than it is to go from the bilabial m to the labiodental v.
Here's the quick hack I mentioned above:

import re
import pdist

sample = """\
f t's tr tht r spcs s ln n th nvrs, thn 'd hv t sy tht th nvrs md rthr lw nd sttld fr vry lttl.
Grg Crln"""

def disemvowel(t):
    #return re.sub(r'[aeiouyAEIOUY]', '', t)
    return re.sub(r'[aeiouAEIOU]', '', t)

def reemvowel1(t):
    # Unigram: the most frequent word matching this disemvoweling.
    t = t.lower()
    candidates = dis.get(t, [t])
    #print '1', t, [(c, pdist.Pw.get(c)) for c in candidates]
    return max(candidates, key=pdist.Pw.get)

def reemvowel2(s, t):
    # Bigram: the most likely word given the previous guess s.
    if s != '': s = s.lower()
    t = t.lower()
    candidates = dis.get(t, [t])
    #print '2', (s, t), [(c, pdist.cPw(c, s)) for c in candidates]
    return max(candidates, key=lambda x: pdist.cPw(x, s))

print disemvowel("it's")

dis = {}
for k in pdist.Pw.iterkeys():
    dis.setdefault(disemvowel(k), []).append(k)

print sample
print
print ' '.join(map(reemvowel1, re.findall(r"['\w]+", sample)))

rv = ['']
for t in re.findall(r"['\w]+", sample):
    rv.append(reemvowel2(rv[-1], t))
print
print ' '.join(rv[1:])
How are you writing your escape strings?
Or are you asking how I get these magical strings to appear? By knowing them; then I type them by hand.
I keep being surprised when Romance languages are the same.
Spanish and Italian, I think, usually pronounce them as "n" or "m" (exception: "tengo").
English words such as "emphatic" can feel a bit unnatural, especially for Romance speakers.