Address matching can be very challenging to execute effectively, with high accuracy and few errors. It can also take a lot of manual time. Instead, leverage SQL for address matching, allowing you to automate much of the process.
Address matchings software is often inadequate, and manual address matching simply takes too long. If you use SQL, we will teach you how to do SQL address matching by covering the following:
- What is SQL address matching?
- Address matching problems SQL can solve
- Why you should use SQL for address matching
- Downsides of using SQL for address matching
- Top 5 methods & tools for SQL address matching
Once done, you’ll know how to match addresses using SQL and be able to choose the methods that suit you best.
What is SQL address matching?
SQL address matching is address matching using the SQL programming language. SQL is a domain-specific language built for relational database management, and can be used effectively for address matching, validation, and standardization. Using SQL, you can automate address matching, saving time and improving accuracy.
As you can imagine (or may already know), manually address matching is extremely time consuming, while still being prone to error. Using SQL, you can automate much of this process, improving the speed at which you can match addresses while increasing accuracy. Set up rules for comparisons, clean data according to your parameters, and run data comparisons to match addresses.
Address matching problems SQL can solve
There are a number of issues with traditional address matching that SQL can solve. We cover these below:
- Automating manual processes - Instead of requiring manual comparisons, you can use SQL to automatically run address matches, comparing tables and determining full and partial matches.
- Provide a match score - SQL will compare addresses, determining the likelihood of a full match. With this information, you can confirm address matches and even automate this process by setting a threshold.
- Determine a match regardless of formatting irregularities - Preprocess address data and use the Levenshtein distance and other strategies to match addresses, regardless of case, word order, misspellings, and other formatting and input issues.
Why you should use SQL for address matching
There are a number of methods to use for address matching, including Python. Typically, the choice comes down to what you like to use or what your project demands. Below are the main reasons to use SQL for address matching:
- I know and prefer SQL: You already use or are familiar with SQL, and you prefer it to alternative programming languages available.
- It is the only SDK available to me: Either because of the other solutions you currently use or because of the functionality you desire, SQL may be the only SDK available.
- My company or project demands it: If your company exclusively uses SQL, or you require it for this specific project, you may be restricted to using SQL for address matching.
In many cases, this decision will come down to whether you are already using an SDK and want to use a solution that can integrate with the solutions you already use. Alternatively, you can use a direct API, which in many cases will offer greater flexibility (although this also requires more work).
Downsides of using SQL for address matching
Although SQL can solve many of the address matching issues you may run into, it is not perfect. Below, we outline the main issues with using SQL for address matching and discuss why Placekey’s universal identifier is a superior solution to address matching.
- Must have experience with SQL: To effectively use SQL for address matching, you will need to have experience using SQL. If no one on your team already uses SQL, it will be a steep learning curve to adopt SQL for address matching.
- Significant processing required: It takes a significant amount of time to process this level of data, as you will be comparing two records. With longer records and larger comparisons, this can take a substantial amount of time. With a tool like Placekey, there is no need for processing at all!
- Data must be preprocessed for quality results: To ensure your comparisons run smoothly in SQL, you will first need to preprocess - or clean - your data. To do this, you will need to standardize formats and correct misspellings, segment data, and more, all of which takes significant time.
- Difficult to account for phonetic differences: Users often input addresses based on how they sound, which leads to various formatting and spelling errors. Although SQL can use the Levenshtein distance to account for differences, there is still room for error.
Top 5 methods & tools for SQL address matching
SQL is flexible enough to perform a variety of relational data functions, which can be used to find exact and partial address matches. Below are the top 5 methods for using SQL for address matching.
Method 1: Preprocessing and cleaning your data for better results
In general, cleaning and standardizing elements of your data prior to processing will ensure more accurate results, as well as reduce unrelated information. The methods below are all ways of preprocessing your data for better results, and you can use them individually or collectively, depending on your needs.
Using the COMPRESS function
The COMPRESS function in SQL will remove extra spaces, unnecessary punctuation, trailing blanks, and other small errors that would otherwise cause addresses to not match. By cleaning your data in this way, you ensure direct comparisons of data, without extra characters that aren’t meant to be there. It is an ideal feature to use to account for input errors.
- s = removes space characters (blank, tab, feed, etc.)
- t = removes trailing blanks
- p = removes punctuation marks
Using the TRANWRD function
The TRANWRD function standardizes different formatted elements of addresses, such as the street label, direction, or suffix. Removing or replacing these elements makes it easier to match addresses, as the strings being compared are less likely to have formatting differences and errors.
For example, “St”, St.”, and “Street” will all be standardized to a single version, ensuring that these differences don’t cause an address match to fail. Similarly, you can use this to ensure directions will also match (“N” and “North”). By standardizing these parts of the address first, you ensure that a more clean comparison can be made, and reduce the likelihood of a matching error due to formatting.
- source = source string
- target = string of characters that are searched to be replaced
- replacement = string of characters that will replace the target
Using the COMPGED function
The COMPGED function measures the difference between two strings using the Levenshtein distance, allowing you to determine the likelihood of a match. The Levenshtein distance is the fewest number of character edits needed to match two strings of data. The output is the number of deletions or insertions required to transform the string to be an exact match.
- COMPGED (string-1, string-2, <cutoff>,<modifiers>)
Method 2: Using T-SQL for fuzzy matching
Transact-SQL (T-SQL) is Microsoft and Sybase’s SQL extension that allows interaction with relational databases. It offers additional functionality, such as procedural programming and enhanced support functions for string processing and more. T-SQL is a tool that is a core component of using Microsoft SQL Server.
You can use the T-SQL algorithm to perform fuzzy matching, comparing two strings and returning a score between 1 and 0 (with 1 being an exact match). With this method, you can use fuzzy logic for address matching, which helps you account for partial matches. This will expand your ability to match addresses, which is extremely useful as addresses are often input incorrectly due to formatting and misspelling.
SOUNDEX takes an alphanumeric string of data, and converts it to a four-character code based on how it sounds in English. This code is a combination of the first character of character_expression (converted to uppercase) and three numbers that represent the letters in the expression. Vowels are ignored unless they are the first letter in the string, and zeroes are added at the end of codes to ensure all codes are four-characters long.
This helps you compare data based on phonetics, as people often spell addresses the way they think they sound. This leads to spelling mistakes that can cause your address data to fail a match test, even though they are the same. Using the DIFFERENCE function, you can determine the probability of a match between two strings.
This is the lowest difference score, as the vowels and duplicate characters are not counted. Alternatively, you can get a greater difference with results that have more variance:
While this may not be as closely related to address matching as you may think, many people input addresses based on how the word sounds; this is especially true of street names. SOUNDEX can be extremely useful for determining matches for street names.
Using the Levenshtein distance algorithm and the GetSimilarityScore, you can determine a match score between 1 and 0, with 1 being an exact match. You can then also integrate a threshold to limit your search results to the most relevant and likely matches. You will need to adjust your threshold depending on how wide you want your net to be. The lower the threshold, the more results you will need to comb through, and the more false positives you’ll end up with.
If we take the principles from Fuzzy Search, we know you need to compare two tables of data against each other. Using a full-text search will allow you to compare single addresses against other data sets in bulk, without needing to do a table comparison. By doing this, you create a Full-Text index, which can be used to build tables with the following columns:
The problem with this is that you are always taking a single address, and searching for a match. While this will help you find matches with some degree of error when inputting your search query, it limits you to searching each address at a time, as it compares one record against an entire collection of records.
Placekey solves the problems of SQL address matching by using a universal system for unique address identification (represented using an alphanumeric code). With this system, POIs can be encoded as Placekeys, removing the need for address matching altogether.
Placekeys are a combination of physical location and address record data, so address matching, address validation, and address standardization have already been done for each Placekey!