
Could you somehow use it in reverse? That is, could you take a random text generator for a certain language and use it to determine whether a given text is in that language?


Yes.

Given the string so far, see how probable it is that the generator would produce the next character, p(x_n | x_<n). Running through the whole string, you can build up the log probability of the whole string: log p(x) = \sum_n log p(x_n | x_<n). Comparing the log probabilities under different models gives you a language classifier. For a first stab at the one-class problem (deciding whether a text is in the language at all, with no competing model), compare the log probability to what the model typically assigns to strings it randomly generates.
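
A rough sketch of that in Python, in case it helps. It makes several assumptions that are mine and not from the comment above: the generator is a character bigram model, so p(x_n | x_<n) is approximated by p(x_n | x_{n-1}); counts are add-one smoothed; text is reduced to lowercase a-z plus space; and the two training strings are toy data. The names CharBigramModel, classify and typical_score are made up for illustration.

```python
import math
import random
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "
START = "^"  # context symbol for the first character; never generated

# Toy training data (assumption); a real model needs far more text.
ENGLISH_SAMPLE = ("the quick brown fox jumps over the lazy dog and "
                  "the cat sat on the mat with the other cats")
GERMAN_SAMPLE = ("der schnelle braune fuchs springt ueber den faulen hund "
                 "und die katze sitzt auf der matte")


def normalize(text):
    """Lowercase and map anything outside ALPHABET to a space."""
    return "".join(c if c in ALPHABET else " " for c in text.lower())


class CharBigramModel:
    def __init__(self, training_text):
        self.counts = defaultdict(Counter)
        padded = START + normalize(training_text)
        for prev, cur in zip(padded, padded[1:]):
            self.counts[prev][cur] += 1

    def prob(self, prev, cur):
        """p(cur | prev) with add-one smoothing over ALPHABET."""
        row = self.counts.get(prev, Counter())
        return (row[cur] + 1) / (sum(row.values()) + len(ALPHABET))

    def log_prob(self, text):
        """log p(x) = sum_n log p(x_n | x_{n-1})."""
        padded = START + normalize(text)
        return sum(math.log(self.prob(p, c)) for p, c in zip(padded, padded[1:]))

    def sample(self, length, rng):
        """Generate `length` characters from the model."""
        out, prev = [], START
        for _ in range(length):
            weights = [self.prob(prev, c) for c in ALPHABET]
            prev = rng.choices(ALPHABET, weights=weights)[0]
            out.append(prev)
        return "".join(out)


def classify(text, models):
    """Return the label whose model gives `text` the highest log probability."""
    return max(models, key=lambda label: models[label].log_prob(text))


def typical_score(model, length, trials=200, seed=0):
    """Mean per-character log probability of the model's own samples."""
    rng = random.Random(seed)
    total = sum(model.log_prob(model.sample(length, rng)) for _ in range(trials))
    return total / (trials * length)
```

To classify, build one model per language, e.g. models = {"english": CharBigramModel(ENGLISH_SAMPLE), "german": CharBigramModel(GERMAN_SAMPLE)}, and call classify(text, models). For the one-class case, compare model.log_prob(text) / len(text) against typical_score(model, len(text)); how far below typical counts as "not this language" is a threshold you would have to tune on real data.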

For more on information theory, modelling and inference you might like: http://www.inference.phy.cam.ac.uk/mackay/itila/book.html


I recently did exactly this to discriminate English text from gibberish.

https://github.com/rrenaud/Gibberish-Detector


It sounds like you are talking about a naive Bayes classifier. Paul Graham wrote a couple of articles on his experience with these for spam filtering (http://www.paulgraham.com/spam.html and http://www.paulgraham.com/better.html). They're probably a decent high-level introduction to the area.

For a more in-depth, yet very accessible discussion, I would recommend "Speech and Language Processing" by Jurafsky & Martin (http://books.google.com/books/about/SPEECH_AND_LANGUAGE_PROC...). It's considered by many to be the Bible of NLP.



