A few days ago, at a job interview, someone asked the following question.
How do you find an English word’s anagrams from a file of 10,000 arbitrary words? Anagram is a random order of certain set of characters, i.e. Tea, Ate, Eat, eta are all anagrams of each other.
I tried very hard explaining to my interviewer that the best solution for solving this problem is using common sense.
While one can always perform a character sort function on each word of those 10,000 words and then sticking them in hash table (the traditional way), i.e. Hash #1 – Tea, Ate, Hash #2 – Nose, Ones. Then, match the word with those hashes. There are better ways. Imagine performing 10,000 or more sort activities, it is such a waste of CPU time, plus storing 10,000 words in the memory just for the matching process. What an inefficient design, especially if you need to process matching across multiple 10,000 words files+. Huge Hashes, long execution time!
Anyway, my solution is based on the business rules. It is obvious that 5 letters words will never be anagrams of a 4 letters word, i.e. SNOWS is not same as SNOW. Even though they look very much similar. Hence, the following:
- Character Counting – Finding the length of the input word. This is as simple as doing a <word>.length() function. With that, when you read in the 10,000 words file, any words that doesn’t have the correct length can be thrown out. It will be a waste of CPU time to process and store them.
- Sum of character’s decimal value – Sorting function is a slow process, involving shifting characters of a word (an array, in fact). Why even wasting time, if there is a better way? We all know each character has a certain ASCII / Unicode representation. For example, the letter “A” is 65 in Decimal representation. The letter “s” is 115 in Decimal representation. Hence, the sum of decimals of “As” is 180. Doing a sum on decimal representation of characters is probably faster than sorting / character shifting of a word. (see http://www.asciitable.com/)
- Regexp of Vowels + Prime number. (improved method) It is true after performing step #2, there could be collisions, i.e. Both “As” and “Gm” will have 180 as the sum of their decimal values. Again, common sense helps us solving this problems. In English, there are 5 vowels, “A”, “E”, “I”, “O”, and “U”. “A” – 65, “E” – 69, “I” – 73”, “O” – 81, “U” – 85. The numerical distance between each is even number based. Between “A” and “I”, it is 4. Between “E” and “I” is also 4. Between “I” and “O”, it is 8. And, between “O” and “U” it is also “4”. As well as, between “A” and “O”, it is 16. All those distances are based on 4, an even number. If you have studied math before, you should know that prime number is not divisible, except it’s self and number one, i.e. 11, 3, 5, and 7 are all prime numbers. To solve the mis-summing problem posed by step 2, you simply do a regexp counting of vowels. (See http://rubular.com/) with [aeiou]. In our case, we jag out the multiplication. i.e. Sum of (Vowels in Decimal representation * A prime number (for example 11)) + Sum (Constant in Decimal). I bet you the odds of seeing a collision are less than one in a billion. So, this will be our improved algorithm.
A lot of programmers like to solve problem in programmatic ways. But, common senses (business rules) are far more efficient at solving complex problems. In our case, imagine you have to do anagram matching across one hundred 10,000 words files. What speed gains you will get by applying business rules. 1. Character counting, 2) following formula => Sum of (Vowels in Decimal representation * A prime number (for example 11)) + Sum (Constant in Decimal). Newton’s method is not precise. But, after a few iterations, the numerical solution to a problem is fairly precise. Same logic applies here. (http://en.wikipedia.org/wiki/Newton’s_method)
There are multiple ways to solving a problems. All paths lead to Rome. Except, some paths are shorter and faster.
Well, my job interview is still a mystery. So far, I am still waiting for a response from that company. It’s tougher than I thought finding a job nowadays. I guess that’s an exception to common senses. :-)
Love to hear your comments!!!