Address matching can be a long, tedious, challenging process that doesn’t always yield great results. It’s also restricted by very specific formatting conventions and yet inputs are often open for the user, resulting in a variety of different formats and spellings of the same address. Show
Nội dung chính
Manual address matching and legacy software are extremely time-consuming and difficult to match addresses with any degree of flexibility. With programming languages, you can run scripts that will automatically execute tasks, saving you time and increasing
accuracy. One of the most popular methods of scripting is Python, which can be leveraged to perform better address matching.
By the end of this article, you’ll know what address matching is and the best methods to use Python to match addresses. What is Python address matching?Python address matching is simply address matching using the Python programming language. As a high-level and general-purpose programming language, Python is widely used because of its code readability. Using Python for address matching automates much of the process, increasing your ability to accurately match addresses. With Python, you can compare multiple records, automating their processing and greatly increasing the speed at which you can do this. As you can imagine, manually matching records takes significant time. With Python, you can set up your data sets, establish rules for comparison, and then compare the data sets to match addresses. Address matching is actually the process of matching an address in a database to an actual location on a map, sometimes known as geocoding. In many cases, address matching using Python is actually address validation and address standardization, where you compare addresses to ensure they are accurate, and to match addresses for deduplication. To better understand this, we explain why address validation isn’t a substitution for address matching. Address matching problems Python solvesAddress matching can be a tedious, time-consuming process. In some cases, it’s barely possible to do manually. Fortunately, scripting languages like
Python solve a number of the shortcomings of direct address matching.
Why you should use Python for address matchingThere is no particular reason to use Python specifically for address matching, as there are alternative methods. Mainly, Python would be your go-to if you like to use Python or if your project requires it because of the other programs you use or the SDK you currently use.
A lot of this decision comes down to whether you are using an SDK, and need to either use a language that already works with the set of tools you currently use, or if you are using a direct API. An SDK may require Python, or it may be the easiest language to use, as all tools will communicate more effectively. Downsides of using Python for address matchingDespite Python being a
viable solution for many of the common address matching problems you’d run into, there are still shortcomings. We cover the main problems with using Python for address matching, and we’ll also discuss why Placekey’s tool offers a superior solution without any of the downsides of Python address matching.
There are a number of ways to use Python for address matching that allow you to find exact and partial matches. As Python is a rather general coding language that has many applications, you can use it in a variety of ways depending on the output you are looking for. We will start with some of the more basic methods, and work our up to more complex ways of address matching. Method 1: Using deterministic address matchingOne of the most basic ways to match addresses using Python is by comparing two strings for an exact match. It’s important to note that this won’t account for spelling mistakes, missing words, and when parts of the address are entered in different orders. To determine whether something is absolutely true or false.
Expected output: Method 2: Using data preprocessing for a better matchComparing for an exact match is very limiting, as all characters must match exactly, including their case. To increase your chances of a match, you should first convert your addresses to lower case. Oftentimes, users forget to input characters in the correct case, especially in the case of street names like ‘McCormick’, where users fail to input the middle C in upper case. Without converting to lowercase, we get false result:
Expected output: By first converting to lowercase, we get this:
Expected output: By first converting both of the strings you are comparing to lower case, a complete match can be made without the character case affecting it. With how often errors are made inputting addresses, this can be extremely useful and save manual review. Method 3: Using fuzzy logic for partial truthsThe next method to use is the Levenshtein distance, which will allow you to account for partial matches, rather than only exact matches. This is made possible using fuzzy logic, which can account for partial truth. This allows you to determine the likelihood of a match between 1 (exact match) and 0 (not an exact match). The Levenshtein distance of two strings (a and b; of length |a| and |b| respectively), can be calculated using the following formula: You can use the Levenshtein distance to assess addresses for a true match, instead determining how likely they are to be a match. This will allow you to better assess addresses, while accounting for input errors, misspellings, word
order, and more. You can input your strings for comparison, using the Levenshtein distance to get a partial match score.
Expected output:
This gives a very high likelihood of a match. However, there are ways we can increase our chances of determining a true match. If we combine this with method 2, and first convert our address to lowercase, it increases the likelihood of a match. You can also first convert both addresses to lower case during this check
Expected output:
As you can see, the Levenshtein distance used with strings in lowercase gives us a higher rating, and a better indication of a match. You can also perform the above using the Levenshtein package within Python:
Expected output: It is essential to preprocess your data prior to analyzing, as it can have a significant impact on your results, improving accuracy. Over time, you can get better at it, determining this
difference with greater accuracy, efficiency, and automation. Method 4: Using fuzzymatcherThis Python package enables fuzzy matching between two panda dataframes using sqlite3’s Full Text Search. Once matches have been detected, it determines their match score using probabilistic record linkage. You can use the match quality scores to determine the likelihood of a true match. First, you need to install fuzzymatcher. To do this, you
will need a build of sqlite that includes FTS4. To install fuzzymatcher, enter the following: Once installed, you can pair two address tables to find a match score rating. Image Credit: GitHub fuzzymatcherLink those address lists by doing the following: Image Credit: GitHub fuzzymatcherThis will return a report of the fuzzy match rating. You can then determine if these are true matches or not, and further investigate only relevant results. Image Credit: GitHub fuzzymatcherYou can then automatically identify addresses by setting a threshold that uses the partial match to determine if an address matches. For example, you can set a threshold of 0.8, and any address with a score higher than this will be determined a match. Method 5: Using Python Record Linkage ToolkitYou can easily link records easily using Python Record Linkage Toolkit, helping you deduplicate records and manage your data effectively. It uses Python’s pandas, which is a flexible data analysis and manipulation library built specifically for Python and can be used to match addresses bad on parameters. You receive a match score, that helps you determine the likelihood of a true match between
strings. First, install the library using pip, and then set up an explicit index column to read the data: Image Credit: Practical Business PythonYou can then define linkage rules, allowing you to leverage Record Linkage Toolkit’s complex configuration capabilities. Create an indexer object by doing the following: This can then be used to evaluate the potential matches. The most likely problem is that there can be a high number of paired records, some of which will be incorrect. This also results in too much unrelated information being captured. Instead, determine how many comparisons will be run, so you can potentially restrict your comparison and save processing time. Image Credit: Practical Business PythonOnce the data sets have been set up and the comparison criteria have been defined, you can run the comparison logic using: Image Credit: Practical Business PythonThere are a number of ways to refine your
comparison and speed up the processing, saving you time. Specify your search for the information you need most, and set up thresholds to automate address matching. Use blockers to eliminate certain elements from the search, so that you can refine your search a lot. For example, you can set up a blocker on the city to ensure you are only matching addresses within a specific municipality. Image Credit: Practical Business PythonWith your search refined to the relevant data sets, you will be compute this much faster. You can address matches quickly and efficiently, without matching addresses you don’t need. Next, you need to ensure that you account for spelling mistakes, allowing for flexibility in user input. To do this, set up a SortedNeighborhood algorithm to check misspellings (i.e. “Tenessee” vs “Tennessee”). Image Credit: Practical Business PythonOnce done all this, you will end up with a features DataFrame. Column labels are based on the elements you set up for comparison, so that you can easily see them displayed in the table. A 1 notes a full match; a 0 notes a full negative. Image Credit: Practical Business PythonWith this table, you can determine how many addresses matches there are. Image Credit: Practical Business PythonWe now know exactly how many matches there are, and how many of the sets didn’t match. The rows with the most matches are likely to be full matches, so you can work your way down, ignoring ones without any matches at all. You can compare data sets and then add quality scores to determine the likelihood of a match. Image Credit: Practical Business PythonFind identical address matches, and pull individual records and identify if they are a match. Image Credit: Practical Business PythonNow you are able to confirm a match and merge data sets. Method 6: Using this FuzzyWuzzy package for PythonWhile the Levenshtein and
Damerau-Levenshtein distances are extremely useful, they struggle to match address information that is input out of order. While there is standardized formatting for input, some people still make errors and different systems format information differently. For instance, users can input the same address in the two ways below:
In the case above, the Levenshtein and Damerau-Levenshtein distances would need to
have very low thresholds for addresses to be matched, as many characters would need to be transposed. That would result in many false positives, and ultimately result in poor results. FuzzyWuzzy solves this problem by first tokenizing strings and preprocessing them (by removing punctuation and converting them to lowercase). This makes it possible to compare addresses that aren’t ordered the same. Image Credit: DataCampThis also lets you pair data that was input in entirely different formats. From formatted fields to a sentence, you can extract the meaningful portions of data, and then easily compare the elements separately. Image Credit: DataCampBeing able to match addresses that are out of order will help you match addresses with greater accuracy. You can then deduplicate your data more effectively, combining records. Now you know how to use deterministic data matching (where you get a yes or no match) and probabilistic data matching
(where you get a partial match score, indicating the likelihood of a full match) for address matching in Python. Use one or any combination of the methods above, dependent on how flexible your address matching needs to be. Alternatively, learn how Placekey’s universal address identifier can be used to identify, match, compare addresses. With better physical location accuracy and the ability to track multiple POIs at a single address (at the same time and over time), Placekey is a more
useful address identification system to use. Even better, it doesn’t require any coding. |