A simple project for an NLP class would be to make a decent corpus-based re-emvoweler. This one really isn't very good, and could easily be improved by considering bigram frequencies.
What a marvelous idea! You're right, that one isn't very good. It mapped a "t" to "to" instead of "it". Since "it" is a pronoun and "to" is a preposition, it should be fairly easy to decide between t = "to" and t = "it". Figuring out whether t = "tie" would be a little more complicated.
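For instance, here's a toy illustration of how a bigram model settles "t" -- the table and its probabilities are invented purely to show the mechanics, not taken from any real corpus:

# Toy sketch: the same disemvoweled token 't' resolves differently
# depending on the previous word. These probabilities are made up.
cPw = {('want', 'to'): 0.3, ('want', 'it'): 0.05,
       ('do', 'it'): 0.2, ('do', 'to'): 0.01}.get

def emvowel_t(prev):
    return max(['to', 'it'], key=lambda w: cPw((prev, w), 0.0))

print emvowel_t('want')  # -> 'to'
print emvowel_t('do')    # -> 'it'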
Are you sure that isn't how that one works? I'd be rather surprised...
Would be interesting to play with a multi-level system: first use intraword letter bigrams to guess the vowels, then check that result against interword bigrams to confirm that the resulting sequence of words makes sense, the way a good spell-checker does.
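Something like this for the first stage, maybe -- a minimal sketch of scoring candidate words by intraword letter bigrams. The letter_bigram_logprob table here is hypothetical (you'd estimate it from a corpus), and the second stage would then re-check the winners with word bigrams:

# Hypothetical table of log10 P(letter | previous letter), estimated
# from a corpus; '^' and '$' mark word start and end.
letter_bigram_logprob = {}

def letter_score(word, floor=-6.0):
    # Sum the letter-bigram log-probabilities over the word,
    # with a smoothing floor for unseen pairs.
    letters = ['^'] + list(word) + ['$']
    return sum(letter_bigram_logprob.get(pair, floor)
               for pair in zip(letters, letters[1:]))

def guess_by_letters(candidates):
    # Stage 1: rank candidate emvowelings by letter bigrams alone.
    return max(candidates, key=letter_score)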
I was thinking that there would be a small number k of matching words for each disemvoweled word, making it easy to exhaustively test all k^2 bigram frequencies for each adjacent pair. That is, we wouldn't need to worry about letter bigrams.
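Roughly like this, per adjacent pair of positions -- a sketch assuming a conditional bigram probability cPw(word, prev) like the one in the code below. Note that ranking each pair independently is exactly the shortcut the Viterbi version further down avoids:

def score_pairs(prev_candidates, cur_candidates, cPw):
    # Exhaustively score all k*k (previous, current) emvoweling pairs
    # under the word-bigram model; cheap as long as k stays small.
    return sorted(((cPw(cur, prev), prev, cur)
                   for prev in prev_candidates
                   for cur in cur_candidates),
                  reverse=True)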
I'll post a reformatted version after I fix the code -- to use bigrams properly I shouldn't assume the previous word was guessed correctly; I should use the Viterbi algorithm instead.
import math
import re

# pdist.Pw is the unigram model, and pdist.cPw the bigram.
import pdist

# A special token our models take to signify the start of a sentence.
sentence_token = '<' + 'S' + '>'

y_is_a_vowel = False

def disemvowel(t):
    if y_is_a_vowel:
        return re.sub(r'[aeiouyAEIOUY]', '', t)
    return re.sub(r'[aeiouAEIOU]', '', t)

# Map each disemvoweled form to every corpus word that produces it.
emvowel_dict = {}
for token in pdist.Pw.iterkeys():
    emvowel_dict.setdefault(disemvowel(token), []).append(token)

def all_emvowelings(t):
    t = t.lower()
    return emvowel_dict.get(t, [t])

def emvowel1(t):
    # Unigram model: the most frequent word with this disemvoweling.
    return max(all_emvowelings(t), key=pdist.Pw.get)

def greedy_emvowel2(tokens):
    # Bigram model, greedily: condition each choice on the previous guess.
    rv = [sentence_token]
    for t in tokens:
        prev = rv[-1]
        best = max(all_emvowelings(t),
                   key=lambda candidate: pdist.cPw(candidate, prev))
        rv.append(best)
    return rv[1:]

def viterbi_emvowel2(tokens):
    # Bigram model, globally: the most probable word sequence overall.
    logprob, words = max(emvoweling(tuple(tokens)))
    return words[1:]

@pdist.memo
def emvoweling(tokens):
    # Returns, for each candidate last word, the best-scoring path
    # ending in that word.
    if not tokens:
        return [(0.0, [sentence_token])]
    def extend((logprob, words), c):
        return (logprob + math.log10(pdist.cPw(c, words[-1])),
                words + [c])
    def extend2(prevresult, tween, c):
        # Only used by the commented-out 'tween' variant below.
        return extend(extend(prevresult, tween) if tween else prevresult,
                      c)
    prevs = emvoweling(tokens[:-1])
    return [max(extend(prev, c)
                for prev in prevs)
            # This is to try inserting missing 'a' and 'I' words; but it never
            # seems to judge them worth inserting, in practice:
            # for tween in ['a', 'i', None])
            for c in all_emvowelings(tokens[-1])]

def try_on(sample):
    print sample
    tokens = re.findall(r"['\w]+", sample)
    print
    print ' '.join(map(emvowel1, tokens))
    print
    print ' '.join(greedy_emvowel2(tokens))
    print
    print ' '.join(viterbi_emvowel2(tokens))
    # Let's look into why Viterbi over bigrams is so slow here:
    options = [len(all_emvowelings(t)) for t in tokens]
    print options
    print sum(options), 'steps for greedy_emvowel2'
    print sum(prev * next for prev, next in zip([1] + options, options)), \
        'steps for viterbi_emvowel2'
    # So it turns out to be common in our corpus for very short words
    # to have 100-200 emvowelings. This is expensive since we're
    # quadratic in that number (while of course linear in the length
    # of the input).

samples = [
    """f t's tr tht r spcs s ln n th nvrs, thn 'd hv t sy tht th nvrs md rthr lw nd sttld fr vry lttl.
Grg Crln""",
    """Bttr t rmn slnt nd b thght fl thn t spk t nd rmv ll dbt.
brhm Lncln""",
    """ghty prcnt f sccss s shwng p.
Wdy lln""",
    """d nt wnt ppl t b grbl, s t svs m th trbl f lkng thm.
Jn stn""",
    """n rl lf, ssr y, thr s n sch thng s lgbr.
Frn Lbwtz""",
    """Sbd yr pptts, my drs, nd y'v cnqrd hmn ntr.
Chrls Dckns""",
    """A simple project for an NLP class would be to make a decent corpus-based re-emvoweler. This one really isn't very good, and could easily be improved by considering bigram frequencies."""]

for sample in samples:
    print '---------------------------------------------'
    try_on(disemvowel(sample))
    print
Looks like a unigram model there. I just hacked one up and got almost exactly the same result on their sample input. Surprisingly, the bigram model does worse -- maybe it's a cherry-picked input?
So, I was talking about more than just different orthography for nasalizing vowels. I actually find [nv] easier to pronounce than [mv] for some reason.
I think you're not crazy -- m and v don't have exactly the same place of articulation, and it's much easier to go from the alveolar n to the labiodental v than it is to go from the bilabial m to the labiodental v.
Here's the quick hack I mentioned above:

import re
import pdist

sample = """\
f t's tr tht r spcs s ln n th nvrs, thn 'd hv t sy tht th nvrs md rthr lw nd sttld fr vry lttl.
Grg Crln"""

def disemvowel(t):
    #return re.sub(r'[aeiouyAEIOUY]', '', t)
    return re.sub(r'[aeiouAEIOU]', '', t)

def reemvowel1(t):
    # Unigram: the most frequent word matching this disemvoweling.
    t = t.lower()
    candidates = dis.get(t, [t])
    #print '1', t, [(c, pdist.Pw.get(c)) for c in candidates]
    return max(candidates, key=pdist.Pw.get)

def reemvowel2(s, t):
    # Bigram: the most likely word given the previous guess s.
    if s != '': s = s.lower()
    t = t.lower()
    candidates = dis.get(t, [t])
    #print '2', (s, t), [(c, pdist.cPw(c, s)) for c in candidates]
    return max(candidates, key=lambda x: pdist.cPw(x, s))

print disemvowel("it's")

dis = {}
for k in pdist.Pw.iterkeys():
    dis.setdefault(disemvowel(k), []).append(k)

print sample
print
print ' '.join(map(reemvowel1, re.findall(r"['\w]+", sample)))

rv = ['']
for t in re.findall(r"['\w]+", sample):
    rv.append(reemvowel2(rv[-1], t))
print
print ' '.join(rv[1:])
How are you writing your escape strings?
Or are you asking how I get these magical strings to appear? By knowing them; then I type them by hand.
I keep being surprised when Romance languages are the same.
Spanish and Italian, I think, usually pronounce them as "n" or "m" (exception: "tengo").
English words such as "emphatic" can feel a bit unnatural, especially for Romance speakers.