The Expansion Problem
Given an initial corpus of text for training and a target text, find the minimum number of letters the text can be reduced to while maintaining enough information to be reconstructed without ambiguity.
For example, the training text might be a few hundred articles from the New York Times. The target text might be this sentence. Perhaps we could encode the target text:
t targ tex mi b thi sent.There encoding and decoding of these sentences are deeply connected. There are also multiple levels of information. The first two can be represented by a Markov chain algorithm:
- Words: "thereb" encodes "thereby", because nothing else starts with "thereb"
- Grammar: "tha man" encodes "that man" because "that" is most likely to precede "man" and start with "tha".
- Context: More general than grammar, when talking about flowers we might talk about colors as well, etc. This is a long-distance relationship between words.
1 comment:
Jason LaPorte emailed me a really cool solution to part of this problem that uses Radix Trees.
Post a Comment