[personal profile] gusl
A simple project for an NLP class would be to make a decent corpus-based re-emvoweler. This one really isn't very good, and could easily be improved by considering bigram frequencies.
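A minimal sketch of such a corpus-based re-emvoweler, using toy unigram counts in place of a real corpus (the words and counts below are purely illustrative):

```python
import re

# Toy unigram counts standing in for real corpus frequencies (hypothetical values).
word_counts = {'it': 500, 'to': 900, 'tie': 30, 'true': 80, 'tree': 60}

def disemvowel(word):
    return re.sub(r'[aeiouAEIOU]', '', word)

# Map each disemvoweled form to its possible original words.
candidates = {}
for w in word_counts:
    candidates.setdefault(disemvowel(w), []).append(w)

def reemvowel(token):
    # Pick the most frequent candidate; fall back to the token itself.
    options = candidates.get(token, [token])
    return max(options, key=lambda w: word_counts.get(w, 0))

print(reemvowel('t'))   # most frequent of it/to/tie
print(reemvowel('tr'))  # most frequent of true/tree
```

With only unigram counts, 't' always maps to whichever of "it"/"to"/"tie" is globally most frequent, which is exactly the weakness the bigram suggestion addresses.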

(no subject)

Date: 2009-03-08 11:44 pm (UTC)
From: [identity profile] altamira16.livejournal.com
What a marvelous idea! You are right, that one isn't very good. It mapped "t" to "to" instead of "it". Since "it" is a pronoun and "to" is a preposition, it should be fairly easy to decide between t = "to" and t = "it". Figuring out whether t = "tie" would be a little more complicated.

(no subject)

Date: 2009-03-08 11:47 pm (UTC)
From: [identity profile] gfish.livejournal.com
Are you sure that isn't how that one works? I'd be rather surprised...

It would be interesting to play with a multi-level system: first use intraword letter bigrams to guess the vowels, then check that result against interword bigrams to confirm that the resulting sequence of words makes sense, the way a good spell-checker does.
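The first stage of that idea, scoring candidate emvowelings by intraword letter bigrams, might look something like this sketch; the tiny corpus and smoothing scheme here are just placeholders:

```python
from collections import Counter

# Toy corpus for letter-bigram statistics; a real system would use a large corpus.
corpus = "the quick brown fox jumps over the lazy dog it is true that tie"

bigrams = Counter(corpus[i:i+2] for i in range(len(corpus) - 1))
total = sum(bigrams.values())

def letter_bigram_score(word):
    # Product of smoothed bigram frequencies; higher means more word-like.
    score = 1.0
    for i in range(len(word) - 1):
        score *= (bigrams[word[i:i+2]] + 1) / (total + 1)
    return score

# Rank candidate emvowelings of 't' by how plausible their letter sequences look.
for cand in ['it', 'to', 'tie']:
    print(cand, letter_bigram_score(cand))
```

The second stage would then re-rank the surviving candidates with interword bigrams, as discussed below in the thread.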

(no subject)

Date: 2009-03-09 01:44 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
I was thinking that there would be a small number k of matching words for each disemvoweled word, making it easy to exhaustively test all k^2 bigram frequencies. Meaning, we wouldn't need to worry about letter bigrams.
Edited Date: 2009-03-09 01:44 am (UTC)

(no subject)

Date: 2009-03-09 01:52 am (UTC)
From: [identity profile] darius.livejournal.com
Yeah, here's my horrible ugly code. pdist.Pw is the unigram model, and pdist.cPw the bigram.

import re

import pdist

sample = """\
f t's tr tht r spcs s ln n th nvrs, thn 'd hv t sy tht th nvrs md rthr lw nd sttld fr vry lttl.

Grg Crln"""

def disemvowel(t):
    #return re.sub(r'[aeiouyAEIOUY]', '', t)
    return re.sub(r'[aeiouAEIOU]', '', t)

def reemvowel1(t):
    t = t.lower()
    candidates = dis.get(t, [t])
    #print '1', t, [(c, pdist.Pw.get(c)) for c in candidates]
    return max(candidates, key=pdist.Pw.get)

def reemvowel2(s, t):
    if s != '<S>': s = s.lower()  # don't lowercase the special start-of-sentence token
    t = t.lower()
    candidates = dis.get(t, [t])
    #print '2', (s, t), [(c, pdist.cPw(c, s)) for c in candidates]
    return max(candidates, key=lambda x: pdist.cPw(x, s))

print disemvowel("it's")

dis = {}
for k in pdist.Pw.iterkeys():
    dis.setdefault(disemvowel(k), []).append(k)

print sample
print 
print ' '.join(map(reemvowel1, re.findall(r"['\w]+", sample)))

rv = ['<S>']  # start with the special start-of-sentence token
for t in re.findall(r"['\w]+", sample):
    rv.append(reemvowel2(rv[-1], t))

print 
print ' '.join(rv[1:])

(no subject)

Date: 2009-03-09 01:53 am (UTC)
From: [identity profile] darius.livejournal.com
Er, the strikethrough is from the special token for starting a sentence (a left-angle, capital-S, right-angle).

(no subject)

Date: 2009-03-09 02:00 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
hmm... I think you can use the < pre > tag, or format comments as HTML.
'< S>'; this is a comment
fail! Inserting a space is the only trick I know. :-P

(no subject)

Date: 2009-03-09 02:22 am (UTC)
From: [identity profile] bhudson.livejournal.com
&gt; and &lt; are the greater-than and less-than symbols.

(no subject)

Date: 2009-03-09 02:24 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
does LJ offer a way to automatically convert "<" into "<" and so forth?

(no subject)

Date: 2009-03-09 02:25 am (UTC)
From: [identity profile] bhudson.livejournal.com
The web client doesn't appear to. You could paste it into emacs or vi and just do a global search & replace.

(no subject)

Date: 2009-03-09 02:26 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
I meant "<" into "& l t ;".

How are you writing your escape strings?

(no subject)

Date: 2009-03-09 02:56 am (UTC)
From: [identity profile] bhudson.livejournal.com
Are you asking how I get an ampersand to show up? With &amp;.

Or are you asking how I get these magical strings to appear? By knowing them, then I type them by hand.
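For what it's worth, this conversion can be automated rather than typed by hand; for instance, modern Python's standard `html` module does it (a small sketch, unrelated to LJ's own tooling):

```python
import html

# A line of code whose angle brackets would otherwise be eaten as an HTML tag.
snippet = 'rv = ["<S>"]'

# html.escape replaces &, <, > (and, by default, quotes) with character entities.
escaped = html.escape(snippet)
print(escaped)  # angle brackets become &lt; and &gt;
```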

(no subject)

Date: 2009-03-09 02:44 am (UTC)
From: [identity profile] darius.livejournal.com
I'll post a reformatted version after I fix the code -- to use bigrams properly, I shouldn't assume the previous word was guessed correctly; I should use the Viterbi algorithm instead.

(no subject)

Date: 2009-03-09 03:24 am (UTC)
From: [identity profile] darius.livejournal.com
That does fix it -- in that we now do better with bigrams than unigrams. It's still a long way from human level.

(no subject)

Date: 2009-03-09 05:06 am (UTC)
From: [identity profile] darius.livejournal.com
Here's my homework:

import math
import re

# pdist.Pw is the unigram model, and pdist.cPw the bigram.
import pdist

# A special token our models take to signify the start of a sentence.
sentence_token = '<' + 'S' + '>'

y_is_a_vowel = False

def disemvowel(t):
    if y_is_a_vowel:
        return re.sub(r'[aeiouyAEIOUY]', '', t)
    return re.sub(r'[aeiouAEIOU]', '', t)

emvowel_dict = {}
for token in pdist.Pw.iterkeys():
    emvowel_dict.setdefault(disemvowel(token), []).append(token)

def all_emvowelings(t):
    t = t.lower()
    return emvowel_dict.get(t, [t])

def emvowel1(t):
    return max(all_emvowelings(t), key=pdist.Pw.get)

def greedy_emvowel2(tokens):
    rv = [sentence_token]
    for t in tokens:
        prev = rv[-1]
        best = max(all_emvowelings(t),
                   key=lambda candidate: pdist.cPw(candidate, prev))
        rv.append(best)
    return rv[1:]

def viterbi_emvowel2(tokens):
    logprob, words = max(emvoweling(tuple(tokens)))
    return words[1:]

@pdist.memo
def emvoweling(tokens):
    if not tokens:
        return [(0.0, [sentence_token])]
    def extend((logprob, words), c):
        return (logprob + math.log10(pdist.cPw(c, words[-1])),
                words + [c])
    def extend2(prevresult, tween, c):
        return extend(extend(prevresult, tween) if tween else prevresult,
                      c)
    prevs = emvoweling(tokens[:-1])
    return [max(extend(prev, c)
                for prev in prevs)
# This is to try inserting missing 'a' and 'I' words; but it never 
# seems to judge them worth inserting, in practice:
#                for tween in ['a', 'i', None])
            for c in all_emvowelings(tokens[-1])]

def try_on(sample):
    print sample
    tokens = re.findall(r"['\w]+", sample)
    print 
    print ' '.join(map(emvowel1, tokens))
    print 
    print ' '.join(greedy_emvowel2(tokens))
    print 
    print ' '.join(viterbi_emvowel2(tokens))
    # Let's look into why Viterbi over bigrams is so slow here:
    options = [len(all_emvowelings(t)) for t in tokens]
    print options
    print sum(options), 'steps for greedy_emvowel2'
    print sum(prev * next for prev, next in zip([1] + options, options)), \
        'steps for viterbi_emvowel2'
    # So it turns out to be common in our corpus for very short words
    # to have 100-200 emvowelings. This is expensive since we're
    # quadratic in that number (while of course linear in the length
    # of the input).

samples = \
["""f t's tr tht r spcs s ln n th nvrs, thn 'd hv t sy tht th nvrs md rthr lw nd sttld fr vry lttl.

Grg Crln""",
 """Bttr t rmn slnt nd b thght fl thn t spk t nd rmv ll dbt.

brhm Lncln""",
 """ghty prcnt f sccss s shwng p.

Wdy lln""",
 """d nt wnt ppl t b grbl, s t svs m th trbl f lkng thm.

Jn stn""",
 """n rl lf, ssr y, thr s n sch thng s lgbr.

Frn Lbwtz""",
 """Sbd yr pptts, my drs, nd y'v cnqrd hmn ntr.

Chrls Dckns""",
 """A simple project for an NLP class would be to make a decent corpus-based re-emvoweler. This one really isn't very good, and could easily be improved by considering bigram frequencies."""]

for sample in samples:
    print '---------------------------------------------'
    try_on(disemvowel(sample))
    print

(no subject)

Date: 2009-03-09 01:50 am (UTC)
From: [identity profile] darius.livejournal.com
Looks like a unigram model there. I just hacked one up and got almost exactly the same result on their sample input. Surprisingly, the bigram model does worse -- maybe it's a cherry-picked input?

(no subject)

Date: 2009-03-09 04:07 am (UTC)
From: [identity profile] roseandsigil.livejournal.com
That word should really assimilate to "envowel".

(no subject)

Date: 2009-03-09 04:07 am (UTC)
From: [identity profile] roseandsigil.livejournal.com
Wait...I'm wrong. Huh. "Envowel" sounds a lot better to me than "emvowel", though. Weird.

(no subject)

Date: 2009-03-09 04:11 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
Ironically, you seem to be following a Portuguese spelling rule (always use "n" except before "b" and "p").
Edited Date: 2009-03-09 04:11 am (UTC)

(no subject)

Date: 2009-03-09 05:02 am (UTC)
From: [identity profile] bhudson.livejournal.com
Oh, you have that too?

I keep being surprised when romance languages are the same.

(no subject)

Date: 2009-03-09 05:10 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
There's more: Portuguese and French both realize those "n"s and "m"s as [~], i.e. mere nasalization.

Spanish and Italian, I think, usually pronounce them as "n" or "m" (exception: "tengo").

(no subject)

Date: 2009-03-09 08:19 am (UTC)
From: [identity profile] roseandsigil.livejournal.com
So, I was talking about more than just different orthography for nasalizing vowels. I actually find [nv] easier to pronounce than [mv] for some reason.

(no subject)

Date: 2009-03-09 08:29 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
Yes. That is no coincidence. Lip mechanics explains what you find easy, and also the evolution of orthography.

English words such as "emphatic" can feel a bit unnatural, especially for Romance speakers.

(no subject)

Date: 2009-03-09 07:14 pm (UTC)
From: [personal profile] chrisamaphone
omphaloskepsis!

(no subject)

Date: 2009-03-09 04:53 pm (UTC)
From: [identity profile] wjl.livejournal.com
I think you are not crazy -- m and v don't have the same place of articulation, and it's much easier to go from the alveolar n to the labiodental v than it is to go from the bilabial m to the labiodental v.
