How to Clean and Deduplicate Addresses ("De-Duping")

Summary

Placekey is geospatially-aware universal place identifier useful for joining together disparate geospatial, address, and point-of-interest datasets
Placekey is an open and free to use
Here we demonstrate how to obtain Placekeys for point-of-interest or address-based data using the placekey-py python library and Placekey API

Take a look at our Google Colab notebook for this tutorial to access the code and run it yourself!

‍

What do you need?

Your Placekey API key --> Register Here - it is FREE and approval is automatic after signup!
Link to Address 3k subset [3k rows]

‍

Have questions or ideas?

Join thousands of geospatial data scientists in the SafeGraph Slack Community to get help and share ideas
Follow Placekey on Github
Tweet @placekeyio
Join the community with the SafeGraph Slack

‍

Setup

This section is authenticating your Google Drive and download/reading in the appropriate files from their respective Google Drive locations

# special libraries to allow file access
from google.colab import drive as mountGoogleDrive 
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import pandas as pd

# These commands allow you to read directly from SafeGraph's public GoogleDrive containing Census Data and Sample Data
auth.authenticate_user()  # Authenticate and create the PyDrive client. 
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
print("You are fully authenticated and can edit and re-run everything in the notebook. Enjoy!")

Expected outcome:

You are fully authenticated and can edit and re-run everything in the notebook. Enjoy!

def pd_read_csv_drive(id, drive, dtype=None):
  downloaded = drive.CreateFile({'id':id}) 
  downloaded.GetContentFile('Filename.csv')  
  return(pd.read_csv('Filename.csv',dtype=dtype))

def get_drive_id(filename):
    
    drive_ids = {
                  'business_3k' : '1T2nqXa4MhYQM1SnLExZORicGLOoFwCUk'
                 }
    return(drive_ids[filename])

places_df = pd_read_csv_drive(get_drive_id('business_3k'), drive=drive, dtype={'Zip':str}) # It is very important to keep columns like Zip and NAICS as string to make sure Pandas doesn't mistake it for integers and drop leading zeros
places_df.head()

Install the Placekey library using Pip and then import it

!pip install placekey

import numpy as np
import json
from placekey.api import PlacekeyAPI

‍

Add your API Key below

Note:

You will need to add your own API key in the cell below for this tutorial to work (Free and Easy!)
The commented out file path variable called data_path is for non CoLab use
Setting dtypes is super important when working with address data to ensure leading zeros don't lead to inaccurate data

placekey_api_key = "hF6yVUCjc6lP0IK6RHw3KU5dWGFlnuBc"
pk_api = PlacekeyAPI(placekey_api_key)

‍

Column Mapping

In the cell below you are telling the API, in a JSON format, which columns in your dataframe relate to the API query. Note: Country 'US' is hardcoded below. While it is not required, the more data you give Placekey, the easier the match process will be.

orig_df = places_df.copy() #Only required because of auto read in for CoLab*

The optional step below is only if you don't already have a unique identifier for your data

The unique identifier ensures that we can remerge on the correct rows later

# Optional

li = []

for x in range(orig_df.shape[0]):
    li.append(f"{x}")

orig_df['id_num'] = li

orig_df['iso_country_code'] = 'US'

print(orig_df.shape)
orig_df.head()

query_id_col = "id_num"  # this column in your data should be unique for every row
column_map = {query_id_col:"query_id", 
                "Address": "street_address",
                "City": "city",
                "State": "region",
                "Zip": "postal_code",
                "iso_country_code": "iso_country_code",
                # "LAT": "latitude",
                # "LON": "longitude"
              }

df_for_api = orig_df.rename(columns=column_map)
cols = list(column_map.values())
df_for_api = df_for_api[cols]

df_for_api.head()

‍

Check for Null, None, NaN, etc rows

This will alert you to any incomplete data and help you determine if you should drop those rows or not.

For instance, if you notice you have 5,000 rows missing a latitude, you could save yourself some time by dropping those columns (that was not the case for this data)

df_clean = df_for_api.copy()
possible_bad_values = ["", " ", "null", "Null", "None", "nan", "Nan"] # Any other dirty data you need to clean up?
for bad_value in possible_bad_values:
  df_clean = df_clean.replace(to_replace=bad_value, value=np.nan)

print("FYI data missing from at least 1 column in the following number of rows:")
print(df_clean.shape[0] - df_clean.dropna().shape[0])
print("Some examples of rows with missing data")
df_clean[df_clean.isnull().any(axis=1)].head()


data_jsoned = json.loads(df_clean.to_json(orient="records"))
print("number of records: ", len(data_jsoned))
print("example record:")
print(data_jsoned[0])

‍

Check Placekey Request

The cell below will query your first input to ensure everything is formatted correctly before moving on to the batch processing

See matching behavior below

single_place_example = data_jsoned[0]
print("input: \n",single_place_example)
print("\nresult: \n",pk_api.lookup_placekey(**single_place_example))

‍

Warning: this next step takes 1-2min to run in completion

This file is 3.5k rows of data processed in batches of 1,000 rows at a time

While the code runs, you can read up on how Placekey tiers the matching process:

Matching Behaviour

Our matching algorithm tries a few different queries, sequentially, and returns the best match of the first query to return with high enough score to feel assured it's a true match. Here are the queries it does, in order:

If the address you've sent in is valid, then we search for a POI at that address placekey with a name that exactly case-insensitively matches the location_name you've sent in. If this does not match (or if the address you sent in wasn't valid) but you've sent in a latitude and longitude with your query, then we search for that location_name and a fuzzy street address within 1km of your coordinates. If this still does not match but you've sent in a postal code, then we search specifically for a POI in that postal code and look for a location_name match and a fuzzy street address match If none of the above match and you have sent in a city and a region, then we require a strict match on city/region, a match on poi name, and a fuzzy match on street address. Finally, if none of the above match, we stop searching for POI and perform an address match.

Remerge on your unique ID

You can now see your newly added Placekeys as well as an 'error' column that will leave some form of breadcrumbs to allow your to troubleshoot if the query comes back blank.

Error Codes

We use standard HTTP status codes to communicate success or failure.

Code Title Description 200 OK The request was successful

400 Bad Request The request is invalid. Read the message or error fields in the response for information on how to correct it.

401 Unauthorized Your API key is invalid. Check that you haven't removed it and that you've used the right header: apikey

429 Too Many Requests You have exceeded the permitted rate-limit. Check your dashboard to see how many requests have been made recently.

50x Internal Server Error An error occurred within our API. If this occurs, you may need to contact us to resolve

Determine number of unique Placekeys

Placekey is able to map all of the addresses to their respective Placekeys no matter how the user input was typed in.

Below you will see how many of these address inputs were similar, yet different enough it cause a headache to try and do this without Placekey

# Check the value counts of our Placekey column to find out if we have any repeating locations
vc = df_join_placekey.placekey.value_counts()

vc[vc > 1]

We can take a look at the top 2 and see how their addresses inputs stack up:

"222@3bt-b8t-swk"
"224@3bt-byd-r49"

Note: Since we do not have POI names, just addresses, our Placekey on returns the Address Encoding from the WHAT portion of our Placekey

#Checking PK - zzw-225@3bt-b8t-rff
df_join_placekey[df_join_placekey['placekey'] == '222@3bt-b8t-swk']

#Checking PK - 22b-222@5z6-2rv-89z
df_join_placekey[df_join_placekey['placekey'] == '224@3bt-byd-r49']

Next we need to drop the duplicates. Luckily using Pandas and our newly Placekeyd data, it is easier than ever

df_join_placekey = df_join_placekey.drop_duplicates(subset=['placekey'])
df_join_placekey['placekey'].value_counts()

df_join_placekey.head()

‍

Placekey Tutorials