I don't have a doctorate and I'm not an academic. Let me try to be helpful.
IDF is old; most of the work dates from the 1950s to the 1970s. You have to put things in context. Digital storage was for surveys and structured documents: things that could be computed upon. Spending computer time on literature and opinions wasn't part of the application space. Why would you be storing newspaper opinion columns on your UNIVAC?
So if we are looking at, say, demographic population surveys, essentially almanac data, then IDF serves you well. If I were searching for "Duluth", the document concerning the measurements taken in Duluth would have that word in it frequently, and few other documents would have it at all.
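To make the Duluth case concrete, here's a minimal sketch of the classic IDF weight, log(N / df), over a toy set of survey records. The documents and the `idf` helper are made up for illustration, and real systems usually add some smoothing on top of this.

```python
import math

def idf(term, docs):
    """Classic IDF: log(N / df), where df counts the documents containing term."""
    n = len(docs)
    df = sum(1 for d in docs if term in d)
    # A term that appears nowhere gets weight 0 here (an arbitrary choice).
    return math.log(n / df) if df else 0.0

# Hypothetical almanac-style records, one set of terms per document.
docs = [
    {"duluth", "population", "census", "1970"},
    {"fargo", "population", "census", "1970"},
    {"omaha", "population", "census", "1970"},
]

rare = idf("duluth", docs)        # appears in 1 of 3 documents
common = idf("population", docs)  # appears in all 3 documents
```

The place name gets a high weight because only one record mentions it, while a word every record shares gets a weight of zero.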
Academically there's something called a "Zipfian" distribution and some math to back it up, but I have very little confidence in my mathematical formalism skills (been working on it for 20 years, still not very good) so excuse me for skipping over it.
I think, as a whole, we've understated the human interface problem in search. There are some technical tools that, if exposed and used, can dramatically improve results.
For instance, a swap operator.
In some of my search systems I permit a syntax:
"s(phrase[0]:...:phrase[n])"
That's because sometimes you can have a phrase, say "x y z", where "z y x" and "x z y" are also common but have different meanings, and you only want a subset.
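Here's a rough sketch of how I read the basic form of that operator: take the colon-separated slots and produce every ordering of them as a candidate phrase. The `expand_swap` name and the "every ordering" semantics are my assumptions for illustration; a real system might only allow some of the orderings.

```python
from itertools import permutations

def expand_swap(query):
    """Expand 's(a:b:c)' into every ordering of its colon-separated slots."""
    if not (query.startswith("s(") and query.endswith(")")):
        raise ValueError("expected a query of the form s(...)")
    slots = query[2:-1].split(":")
    # Each permutation becomes one space-joined candidate phrase.
    return [" ".join(p) for p in permutations(slots)]

phrases = expand_swap("s(x:y:z)")
```

Each candidate phrase can then be looked up on its own, which is what lets you keep the orderings you want and drop the rest.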
Then you have polysemy and homonymy ... so sometimes you want exact terms, sometimes you want a fuzzy search, sometimes you want a collection of exact terms. So we can do exact with, say, "=", collection with a "|", and so on...
's(=x:y0|y1|y2:z)'
Now pretend you also want a range in there:
's(=x:y0|y1a..y1z|y2:z)'
and so on. This syntax is also quite fast to process: you can permute all the elements without too much effort, look them up, and aggregate the results in very little time. But we are really starting to get into a regex-like query system, which means a decent amount of technical knowledge is needed, and that's where we have the interface problem.
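The permute, look up, and aggregate steps can be sketched roughly like this. It is not my actual implementation, just a toy under stated assumptions: `=` means exact match, a bare term is treated as a prefix match (a stand-in for fuzzy matching), `a..b` is an inclusive lexicographic range over the known vocabulary, and the index is a plain dict from term tuples to document ids. All names here are hypothetical.

```python
from itertools import permutations, product

def expand_slot(slot, vocab):
    """Return the vocabulary terms matched by one slot of the query."""
    matched = set()
    for alt in slot.split("|"):
        if ".." in alt:              # range lo..hi, inclusive (assumed semantics)
            lo, hi = alt.split("..", 1)
            matched |= {t for t in vocab if lo <= t <= hi}
        elif alt.startswith("="):    # exact term
            matched |= {t for t in vocab if t == alt[1:]}
        else:                        # bare term: prefix match as a fuzzy stand-in
            matched |= {t for t in vocab if t.startswith(alt)}
    return matched

def search(query, phrase_index):
    """Evaluate s(...) against a dict mapping term tuples to sets of doc ids."""
    if not (query.startswith("s(") and query.endswith(")")):
        raise ValueError("expected a query of the form s(...)")
    vocab = {t for phrase in phrase_index for t in phrase}
    slots = [expand_slot(s, vocab) for s in query[2:-1].split(":")]
    hits = set()
    for order in permutations(slots):      # every ordering of the slots
        for phrase in product(*order):     # every choice of term per slot
            hits |= phrase_index.get(phrase, set())  # aggregate by union
    return hits

# Toy phrase index: term tuple -> ids of documents containing that phrase.
phrase_index = {
    ("x", "y1m", "z"): {1},
    ("z", "y0", "x"): {2},
    ("x", "y3", "z"): {3},
    ("xx", "y2", "z"): {4},
}
result = search("s(=x:y0|y1a..y1z|y2:z)", phrase_index)
```

With this toy index the query matches documents 1 and 2. Document 3 misses because "y3" is outside the middle slot, and document 4 misses because "=x" rules out "xx".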
These systems are relatively easy to code and run on modest hardware (as in, something you can easily fit under a desk), but they require more from the user: a level that, frankly, you just aren't going to get. Let's be real here. I've found lots of otherwise competent programmers who struggle with, say, regex and BPF; I think I have some natural talent there that is not super common.
This is an example of why I think search is maybe 50% a human interface problem.
The pros/cons appear in the "application" sections when describing a technique. You can find these on Wikipedia. Here's Wikipedia's category on it: https://en.wikipedia.org/wiki/Category:Information_retrieval...