3/31/2024 0 Comments Super vectorizer 2 for windows 10The following table gives an example: Company Nameįor the human reader it is obvious that both Mc Donalds and Mac Donald’s are the same company. A similar problem occurs when you want to merge or join databases using the names as identifier. This is a problem, and you want to de-duplicate these. Databases often have multiple entries that relate to the same entity, for example a person or company, where one entry has a slightly different spelling then the other. Update: run all code in the below post with one line using string_grouper: Name MatchingĪ problem that I have witnessed working with databases, and I think many other people with me, is name matching. Using this approach made it possible to search for near duplicates in a set of 663,000 company names in 42 minutes using only a dual-core laptop. Using TF-IDF with N-Grams as terms to find similar strings transforms the problem into a matrix multiplication problem, which is computationally much cheaper. Traditional approaches to string matching such as the Jaro-Winkler or Levenshtein distance measure are too slow for large datasets.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |