Monday, November 9, 2009

The H Algorithm

The letter h has an interesting property in English: you can insert quite a few h's into a word without changing its pronunciation. Take the word 'name': if we spell it 'nhahmheh', it's still recognizable and pronounceable as the same word.

What happens if we make an algorithm that figures out how many h's a word can hold without changing its sound?

Assume, for now, that vowels do not change sound when followed by an h. Part of that depends on not adding multiple h's in a row, so that the word doesn't become unrecognizable. As long as we stick to that, we're cognitively golden with the vowels, but only for a familiar word, and only for now. Alright. Let N(w) be the number of h's in the padded word, and let L(w) be the number of letters in the original word. So, here's the first part of the algorithm:


N(w) = L(w)
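As a quick sketch of this first pass (the function names here are my own, not part of any real library), padding a word and counting its h's could look like this in Python:

```python
def add_h_after_every_letter(word: str) -> str:
    """Naive first pass: follow every letter with an 'h'."""
    return "".join(ch + "h" for ch in word)


def n_first_pass(word: str) -> int:
    """N(w) = L(w): one h per letter, ignoring every exception."""
    return len(word)


print(add_h_after_every_letter("name"))  # nhahmheh
print(n_first_pass("name"))              # 4
```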


This assumes that you can just add an 'h' after every letter in the word. (Only after: if an h were added in front of the word, that first 'h' would have to be pronounced.) However, this algorithm is so imperfect that it hurts. What about existing h's in the word, such as in 'the'? This is simple: just subtract one h for every existing h. Therefore, 'the' becomes 'theh'. So, let H(w) be the number of preexisting h's in the word:


N(w) = L(w) - H(w)
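Here's a minimal Python sketch of that second version. The function name is made up for illustration, and lowercasing the word first is my own assumption so that a capital 'H' counts too:

```python
def n_minus_existing_h(word: str) -> int:
    """N(w) = L(w) - H(w): subtract one for every h already in the word."""
    w = word.lower()  # assumption: 'H' and 'h' are treated the same
    return len(w) - w.count("h")


print(n_minus_existing_h("the"))   # 2, matching the two h's in 'theh'
print(n_minus_existing_h("name"))  # 4, no existing h's to subtract
```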


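What about consonants that would change sound if followed by an h, such as sh or ch? We'd have to do a huuuuge comparison function. I haven't worked that function out yet, so I'll just call it ?. I'm going to pass it w, along with all of the consonants that would change sound significantly, and subtract one h for each of them it finds in w: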


N(w) = L(w) - H(w) - ?(w, c, soft g, p, q, s, t)


And, lastly, let's drop the earlier assumption and say that vowels do change sound. In general, long vowels subtract one h, but all others do not. (Check me on this one.) I mean, the 'a' in 'name' is a long a, and it would change, but the other a's, as in 'and' and 'all', would not. 'Y' does not count. Y is a consonant here.


N(w) = L(w) - H(w) - ?(w, c, soft g, p, q, s, t, long a, long e, long i, long o, long u)


which, once h is folded into the list, is just


N(w) = L(w) - ?(w, c, soft g, h, p, q, s, t, long a, long e, long i, long o, long u)
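For anyone who wants something to poke at, here is a rough Python sketch of that last formula. Everything in it beyond the formula itself is my own guesswork: the function names are invented, 'soft g' is approximated as a g followed by e, i, or y, and 'long vowel' is approximated by the silent-e pattern only (vowel, one consonant, final e). Both heuristics are crude and will be wrong for plenty of words.

```python
VOWELS = set("aeiou")                # 'y' is treated as a consonant here
BLOCKING_CONSONANTS = set("chpqst")  # c, h, p, q, s, t from the formula


def is_soft_g(w: str, i: int) -> bool:
    """Crude assumption: g is soft when followed by e, i, or y."""
    return w[i] == "g" and i + 1 < len(w) and w[i + 1] in "eiy"


def is_long_vowel(w: str, i: int) -> bool:
    """Crude assumption: only the silent-e pattern (vowel, consonant, final e)."""
    return (
        w[i] in VOWELS
        and i + 2 == len(w) - 1      # exactly one letter before the final letter
        and w[i + 1] not in VOWELS   # that one letter is a consonant
        and w[-1] == "e"             # and the word ends in e
    )


def blocked_count(word: str) -> int:
    """Stand-in for ?(w, ...): count the letters that cannot take an h."""
    w = word.lower()
    return sum(
        1
        for i, ch in enumerate(w)
        if ch in BLOCKING_CONSONANTS or is_soft_g(w, i) or is_long_vowel(w, i)
    )


def n_h(word: str) -> int:
    """N(w) = L(w) - ?(w, ...)."""
    return len(word) - blocked_count(word)


print(n_h("name"))  # 3: the long a is blocked
print(n_h("and"))   # 3: nothing is blocked
print(n_h("the"))   # 1: t and h are blocked
```

The two heuristic functions are kept separate on purpose, so a better rule for long vowels or soft g can be swapped in without touching the counting.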


To be fair, we do lose information here: the nature of the vowel before the h was added. Going back to the example of 'theh', it can be pronounced a couple of ways, either as the traditional 'the' or with the 'e' having the same sound as it does in 'empty'. Drat. This is even worse: how do you make an algorithm for determining long vowels? Soft g is a bit easier, but long vowels? Is there a rule?

Does anyone feel like making a program? How about stress-testing the algorithm?
