
Glad to see work like this being shared!

There are some well-known text classification datasets, e.g. the Reuters news dataset from David Lewis of Bell Labs:

  http://www.daviddlewis.com/resources/testcollections/reuters21578/
More background here:

  https://link.springer.com/content/pdf/bbm%3A978-3-642-04533-2%2F1.pdf
Here's a result from ReelTwo's Classification System circa 2003 (based on a Bayesian learner; related to the University of Waikato WEKA ML system), if you'd be up for a comparison:

  https://web.archive.org/web/20040606002449/http://www.reeltwo.com/datasets.html
Categories: 10. Documents: 2,535. Build time: 15 s (~170 docs/sec; these were short news abstracts; see the PDF for an example). F-measure: 0.9121.

Build Time is the time to load, model and evaluate (using Leave-One-Out evaluation) a dataset on a WinXP/1GHz Celeron/256MB computer. F-Measure is the micro-averaged F-Measure across all categories in the dataset.
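
For anyone wanting to compare numbers, here is a minimal sketch of a micro-averaged F-measure: true positives, false positives and false negatives are pooled across all categories before computing precision and recall. It assumes the balanced F1 variant (the original page does not say which beta was used), and the function name and toy labels are made up for illustration.

```python
def micro_f_measure(gold, predicted):
    """Micro-averaged F1 over per-document label sets.

    gold, predicted: equal-length lists of sets of category labels.
    Assumption: balanced F-measure (beta = 1).
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels correctly assigned
        fp += len(p - g)   # labels assigned but not in gold
        fn += len(g - p)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical labels, not taken from any real dataset).
gold = [{"earn"}, {"acq"}, {"earn", "grain"}, {"crude"}]
predicted = [{"earn"}, {"earn"}, {"earn"}, {"crude"}]
score = micro_f_measure(gold, predicted)
print(round(score, 4))
```

Because the counts are pooled, frequent categories dominate the score; a macro-average (mean of per-category F-measures) would weight rare categories equally.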


