maxbachmann's comments

maxbachmann · on March 30, 2020

For everyone wondering about the actual runtime difference between RapidFuzz and FuzzyWuzzy, I added a first couple of benchmarks based on the ones FuzzyWuzzy uses: https://github.com/rhasspy/rapidfuzz/blob/master/Benchmarks....

maxbachmann · on March 30, 2020

In the last release I added a module `rapidfuzz.levenshtein`, which can calculate a normal Levenshtein distance and a weighted version where substitutions have a weight of 2 (this is the one FuzzyWuzzy actually uses for ratio calculations). If there is interest in other weighting options, I can add a version that accepts custom costs for insertions, deletions, and substitutions.

germanjoey · on March 30, 2020

I am very interested in a version that allows you to pass in a custom substitution cost! It would be very useful for OCR applications!

maxbachmann · on March 30, 2020

Well then I will add it :)

maxbachmann · on March 31, 2020

`rapidfuzz.levenshtein.weighted_distance` now supports the three parameters `insert_cost`, `delete_cost` and `replace_cost`.
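
To illustrate what those three costs change, here is a minimal pure-Python sketch of a cost-weighted edit distance. It is not the RapidFuzz implementation (which is written in C++); the function name `weighted_distance_sketch` is hypothetical, and only the parameter names are borrowed from the comment above. The default values are an assumption chosen for the example.

```python
def weighted_distance_sketch(s1, s2, insert_cost=1, delete_cost=1, replace_cost=1):
    """Cost of turning s1 into s2 (illustrative sketch, not the library code)."""
    # previous[j] = cost of turning the already-processed prefix of s1 into s2[:j].
    # For the empty prefix, that is j insertions.
    previous = [j * insert_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        # Turning s1[:i] into the empty string takes i deletions.
        current = [i * delete_cost]
        for j, c2 in enumerate(s2, start=1):
            # Matching characters cost nothing; otherwise pay for a replacement.
            substitute = previous[j - 1] + (0 if c1 == c2 else replace_cost)
            current.append(min(
                previous[j] + delete_cost,     # delete c1 from s1
                current[j - 1] + insert_cost,  # insert c2 into s1
                substitute,
            ))
        previous = current
    return previous[-1]


# Normal Levenshtein distance: every operation costs 1.
plain = weighted_distance_sketch("kitten", "sitting")
# Substitutions weighted 2, as described for FuzzyWuzzy's ratio calculation.
weighted = weighted_distance_sketch("kitten", "sitting", replace_cost=2)
```

With a replacement cost of 2, a substitution is never cheaper than a deletion plus an insertion, so the result depends only on how many characters have to be removed and added.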

edraferi · on March 30, 2020

Yeah that would be pretty cool. I’d be interested to use that to implement phonetic distance scoring.

maxbachmann · on March 30, 2020

Yes, the API and results stay mostly the same. A small difference is that return values are always floats, e.g. a ratio of 94.664 rather than one rounded to 95 as in FuzzyWuzzy, so results can be compared more precisely.
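
The difference between a float and a rounded score can be sketched as follows. This assumes the ratio is normalized as `100 * (len(a) + len(b) - distance) / (len(a) + len(b))`, with the distance counting substitutions as weight 2; that formula is an assumption for illustration, since the thread does not spell out the exact normalization, and the helper names are hypothetical.

```python
def indel_style_distance(s1, s2):
    """Edit distance with insert/delete cost 1 and substitution cost 2 (sketch)."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            substitute = previous[j - 1] + (0 if c1 == c2 else 2)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitute))
        previous = current
    return previous[-1]


def float_ratio(s1, s2):
    """Similarity in [0, 100] as a float (assumed normalization, see above)."""
    total = len(s1) + len(s2)
    if total == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (total - indel_style_distance(s1, s2)) / total


score = float_ratio("this is a test", "this is a test!")
rounded = round(score)  # what an integer-returning API would report
```

Two candidate matches whose float scores differ slightly can collapse to the same integer after rounding, which is why the unrounded value is easier to rank on.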

maxbachmann · on March 30, 2020

Well, it is completely reimplemented in C++, and I actually changed the algorithm in some places so that it still gives the same results with a lower time complexity. But it is still a derived work, just not of the GPL-licensed version: https://github.com/rhasspy/rapidfuzz/blob/master/LICENSE

josegonzalez · on March 30, 2020

Former fuzzywuzzy maintainer here.

We relicensed fuzzywuzzy mostly because I am not a lawyer and really just wanted to stop having the same argument with the same three people about whether what we were doing was legal. I actually think that we were violating the license by not including the license header on the included file, but were fine with the rest of the project being MIT, but again, not a lawyer, and the emails from the same ~3 people were becoming quite annoying.

Link here: https://github.com/seatgeek/fuzzywuzzy/issues/130

That said, if their interpretation is right, then unless you clean-roomed the reimplementation, that derivative is also GPL.

Again, not a conversation I personally care for, and I no longer work for SeatGeek so it's no longer really on me :)

maxbachmann · on March 30, 2020

That's why I based it on a version from before python-Levenshtein was even added to the project. The Levenshtein part is just the normal Levenshtein algorithm, which is a standard task at university I suppose, so it was definitely faster to implement it on my own than to read ugly C code. From my tests it is about as fast as the one used by python-Levenshtein (a little bit slower), but in exchange I can read it ...

maxbachmann · on March 30, 2020

FuzzyWuzzy was MIT-licensed in the beginning, but in 2011 it started using python-Levenshtein, which is GPLv2-licensed, and therefore the whole project became GPLv2-licensed. RapidFuzz is based on this version of FuzzyWuzzy: https://github.com/rhasspy/fuzzywuzzy and implements similar algorithms in C++.

edraferi · on March 30, 2020

Good answer. Nice to see licensing issues taken seriously.

maxbachmann · on March 30, 2020

This was the main reason to write it. I wrote a small script that searched GitHub for projects using FuzzyWuzzy and sorted them by license into three lists: a) GPL license, b) incompatible license, c) no license. Of these three, only a) is licensed correctly. Out of 1k repositories, only about 100 used the GPL, 300 an incompatible license, and 600 no license at all. So I figured there has to be some interest in a more openly licensed alternative ;)

Hamuko · on March 30, 2020

Fair enough.