Sunday, September 10, 2006

The Expansion Problem

Given an initial corpus of text for training and a target text, find the minimum number of letters the text can be reduced to while maintaining enough information to be reconstructed without ambiguity.

For example, the training text might be a few hundred articles from the New York Times. The target text might be this sentence. Perhaps we could encode the target text:

t targ tex mi b thi sent.
The encoding and decoding of these sentences are deeply connected. There are also multiple levels of information. The first two levels can be represented by a Markov chain algorithm:
  1. Words: "thereb" encodes "thereby", because nothing else starts with "thereb".
  2. Grammar: "tha man" encodes "that man", because "that" is the word most likely to precede "man" and start with "tha".
  3. Context: more general than grammar; when talking about flowers, we are also likely to mention colors, and so on. This captures long-distance relationships between words.
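The first two levels can be sketched in a few lines. This is a toy illustration, not a full implementation: the vocabulary and bigram counts below are made up, standing in for statistics gathered from the training corpus.

```python
from collections import Counter

def expand_word(prefix, vocab):
    """Level 1: a prefix expands only if exactly one known word starts with it."""
    matches = [w for w in vocab if w.startswith(prefix)]
    return matches[0] if len(matches) == 1 else None

def expand_grammar(prefix, next_word, vocab, bigrams):
    """Level 2: among prefix matches, pick the word most often seen before next_word."""
    matches = [w for w in vocab if w.startswith(prefix)]
    if not matches:
        return None
    return max(matches, key=lambda w: bigrams[(w, next_word)])

# Hypothetical vocabulary and bigram counts, as if from a training corpus.
vocab = {"that", "than", "thereby", "man"}
bigrams = Counter({("that", "man"): 5, ("than", "man"): 1})

print(expand_word("thereb", vocab))                  # -> thereby (unique prefix)
print(expand_word("tha", vocab))                     # -> None (ambiguous)
print(expand_grammar("tha", "man", vocab, bigrams))  # -> that (most likely before "man")
```

The third level, context, would need something richer than bigrams, like topic models or long-range co-occurrence statistics, which is where the search space really blows up.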
The problem has a huge search space, but it might be implemented efficiently with some creative heuristics. The application to natural language compression is obvious, but I think this would also be nice in a real-time system that expanded your text as you typed (say, an email). For keyboard-based input, the slowdown of watching for text expansion would probably outweigh the benefits, but in situations where entering input is an expensive operation, this would be ideal. (I think some cell phones implement a version of this that considers just the first level, word frequency, for key disambiguation.) There are also creative applications if you apply this to music (a language, just one generally limited to poetic communication).

1 comment:

Kyle said...

Jason LaPorte emailed me a really cool solution to part of this problem that uses Radix Trees.
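For the curious, here is my guess at the shape of that kind of solution, not Jason's actual code: a trie (a radix tree is the path-compressed form) where each node counts the words passing through it gives the shortest unique prefix of any word in one downward walk.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # number of vocabulary words whose path passes through this node

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
        node.count += 1

def shortest_unique_prefix(root, word):
    """Walk down until only one word shares the path; that prefix suffices."""
    node, prefix = root, ""
    for ch in word:
        node = node.children[ch]
        prefix += ch
        if node.count == 1:
            return prefix
    return prefix  # word is a prefix of another word, so the full word is needed

root = TrieNode()
for w in ["that", "than", "thereby", "therefore"]:
    insert(root, w)

print(shortest_unique_prefix(root, "thereby"))    # -> thereb
print(shortest_unique_prefix(root, "therefore"))  # -> theref
```

This handles the first (word) level only; the grammar and context levels would need the disambiguation to look beyond the prefix itself.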