Hey HN!
I'm one of the two developers behind Hackerhunt. As much as I love Hacker News and its ranking algorithm for front-page news, it has a downside for Show HN submissions. A lot of cool and useful stuff people have actually made themselves gets lost in /shownew without a real chance of reaching the right audience. That's where the idea of a curated and categorised list, à la Product Hunt, was born.
This is a very early proof of concept and any suggestions on how to make it better are welcome!
Yes, we're analyzing the titles. They're vectorized and fed into an LSTM net, which was trained on a manually tagged training set.
The set is not very big yet but yields good enough results for the initial proof of concept.
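To make "vectorized" a bit more concrete, here is a minimal sketch of one common way to turn titles into fixed-length integer sequences, the kind of input an LSTM layer usually expects. This is an assumption for illustration, not necessarily what Hackerhunt does: the helper names (`tokenize`, `build_vocab`, `vectorize`), the `max_len` of 8, and the sample titles are all made up.

```python
import re
from collections import Counter

PAD, UNK = 0, 1  # reserved ids: padding and out-of-vocabulary words

def tokenize(title):
    """Lowercase a title and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())

def build_vocab(titles, min_count=1):
    """Map each word seen at least min_count times to an integer id >= 2."""
    counts = Counter(tok for t in titles for tok in tokenize(t))
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i + 2 for i, w in enumerate(words)}

def vectorize(title, vocab, max_len=8):
    """Encode a title as a fixed-length id list, truncated or right-padded."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(title)][:max_len]
    return ids + [PAD] * (max_len - len(ids))

# Hypothetical sample titles, for illustration only.
titles = [
    "Show HN: A tiny static site generator",
    "Show HN: Open source genome browser",
]
vocab = build_vocab(titles)
encoded = [vectorize(t, vocab) for t in titles]
```

Each encoded title can then go through an embedding layer and into the LSTM. Words the vocabulary has never seen map to `UNK`, which matters when the training set is small.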
The samples were selected randomly at the beginning. We came up with tags as we went, trying to generalize into broader categories.
After the initial tagging, we counted the number of samples in each category and grouped similar underrepresented tags together. We also added more samples for the smaller categories by filtering for titles that matched specific keywords and then hand-picking from those.
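The rebalancing steps described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names, the `min_count` threshold, the merge map, and the sample data are hypothetical, and the final hand-picking step stays manual.

```python
from collections import Counter

def merge_rare_tags(samples, merge_map, min_count):
    """Relabel samples whose tag has fewer than min_count examples.

    samples: list of (title, tag); merge_map: rare tag -> broader tag.
    """
    counts = Counter(tag for _, tag in samples)
    return [
        (title, merge_map.get(tag, tag) if counts[tag] < min_count else tag)
        for title, tag in samples
    ]

def keyword_candidates(untagged_titles, keywords):
    """Return titles containing any keyword, to be reviewed by hand."""
    kws = [k.lower() for k in keywords]
    return [t for t in untagged_titles if any(k in t.lower() for k in kws)]

# Hypothetical tagged samples, for illustration only.
samples = [
    ("Show HN: A Rust web framework", "web"),
    ("Show HN: A CSS grid playground", "web"),
    ("Show HN: DNA sequence viewer", "genetics"),
    ("Show HN: Telescope control app", "astronomy"),
]
merge_map = {"genetics": "science", "astronomy": "science"}
merged = merge_rare_tags(samples, merge_map, min_count=2)
tag_counts = Counter(tag for _, tag in merged)

# Pull extra candidates for a small category by keyword match.
candidates = keyword_candidates(
    ["Show HN: Protein folding visualizer", "Show HN: Yet another todo app"],
    ["protein", "genome"],
)
```

Here the two single-sample tags are folded into a broader "science" category, and the keyword filter surfaces one more candidate for it.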
We initially tried training the classifier only on GitHub-based samples, using the user-given tags from there. Although we grouped those tags into a reasonable number of distinct categories, the way GitHub users tag their projects turned out to be too inconsistent and often unrelated to the titles, so manual tagging looked like the better option for getting decent results fast enough.
If you have any more specific questions, feel free to drop me an email at arturs@finch.io
For those of us in industries at the fringes of HN (but still of interest to it), would you consider adding an 'Other' category? :-) We make scientific/genetic software, and I'm not sure where we would fit in your sidebar.
Getting to the front page via 'Show HN' was very helpful to us. It'd be nice (for others too) to be able to both replicate that success, and soften the blow when you get a grand total of 2 upvotes.