Our data analytics team received a request from the marketing team to find the number of customers in three “neighborhoods”. As the lead data engineer of the team, the first thing I did was to break down this vague request into specific problems and use Snowflake to solve them one by one. Here are the steps that I came up with:
1. Identify the center point of each neighborhood in latitude and longitude coordinates.
2. Find the latitude and longitude of every customer address (geocoding).
3. Draw non-overlapping circles around each center point. The business decides how wide each circle can be (the radius) before the circles start to overlap.
4. Calculate the distance from each customer address to each center point (geodistance).
5. If the distance to a neighborhood's center is less than its radius (i.e. the address is inside the circle), include that customer (geofencing).
The first step was straightforward: open a map, confirm the center point of each neighborhood with the business, and note the corresponding latitude and longitude coordinates. Done.
For the second task, I needed a geocoding service: something that takes an address as input and returns latitude and longitude as output. My first instinct was to search the Snowflake Marketplace to see if there was an app we could quickly install and use. Thankfully, I found one, read its documentation, and installed it.
The marketplace app required all addresses to be in a single table along with an ID column. So, I created a small sample of addresses and tested the app; both the speed and the accuracy were good. I then tested with a few thousand records before scaling up to the full set of 1.4 million unique addresses.
At this point, I realized that the quality of our customer address data was really bad, and it was hurting the output quality of the geocoding service. These are Irish addresses stored in the operational databases of multiple legal entities, each formatting them its own way. Our data pipeline was only concerned with bringing the addresses over from the source systems and linking them. There were no quality checks on the addresses and no standardization, because they had never been used for an analytics use case until now. So, I decided to parse each address and break it down into:
- house number and street
- town/city
- county
- Eircode (a.k.a. postal or ZIP code)
- country (mostly Ireland, but a few other countries that customers moved to)
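For illustration, the target structure for a single parsed address might look like the snippet below. The address and all its values are made up; only the JSON keys come from the article's schema.

```python
import json

# Hypothetical model output for a made-up Malahide address.
raw_output = """{
  "houseAndStreet": "12 Main Street",
  "townOrCity": "Malahide",
  "county": "Dublin",
  "postalCode": "K36 XY12",
  "country": "Ireland"
}"""

# json.loads fails loudly if the model added any chatter around the object
parsed = json.loads(raw_output)
print(parsed["county"])  # Dublin
```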
For the parsing task, I turned to Snowflake Cortex, which brings us to the LLM function that is best suited for this type of work: `SNOWFLAKE.CORTEX.COMPLETE`. (There are also Cortex functions for summarization, translation, sentiment analysis, etc.) The signature is simple:

```
SELECT SNOWFLAKE.CORTEX.COMPLETE('llm_id', 'prompt')
```

I first tried `llama2-70b-chat` because I saw it high up in multiple benchmarks (this domain moves so fast that it is probably not the best model today). It wasn't the best model for this particular task anyway. Despite many prompt-engineering attempts, I could not force it to give me machine-readable (JSON) output; it would always add comments and explanations. It was too "chatty", which I assume is by design, since chat is literally in its name. So I gave up and tried `mistral-7b`, and it performed much better. Adding `` ```json `` to the end of the prompt did the trick, as it made clear to the model that I wanted it to _complete_ the prompt with JSON output.
Below is an excerpt from the Snowflake notebook where I did this analysis. Notice that I had to break the prompt into multiple variables and concatenate them later, because [Snowflake SQL text variables](https://docs.snowflake.com/en/sql-reference/sql/set) have a maximum size of 256 bytes.
````
SET llm = 'mistral-7b';
SET address_prompt_1 = 'The following is an Irish postal address. ';
SET address_prompt_2 = 'Your job is to extract address components
and return the output in this JSON format: {
"houseAndStreet": "",
"townOrCity": "",
"county": "",
"postalCode": "",
"country": ""
} ';
SET address_prompt_3 = 'If a part of the address is missing,
set the value of the key to null. The county must be one of these 26 counties:
[Antrim, Carlow, Cavan, ..., ]';
SET address_prompt_4 = 'Do not add comments. The output has to be
machine-readable.
```json';

SELECT
    address_id,
    address,
    SNOWFLAKE.CORTEX.COMPLETE($llm,
        CONCAT(
            $address_prompt_1,
            address,
            $address_prompt_2,
            $address_prompt_3,
            $address_prompt_4
        )
    ) AS parsed_address
FROM MY_DB.MY_SCHEMA.ADDRESS;
````
Now that I had perfectly formatted and standardized addresses, I retried geocoding with the larger sample, and I could see the improvement already. Mistral did a really good job. In some cases, it went beyond parsing and found missing data in the address by “knowing or inferring” the fact that a particular town is in a particular county and added the county name to the output. Amazing!
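Even with the fence trick, a completion can still close the opened `` ```json `` fence or carry a stray character or two, so it is worth parsing defensively before trusting the output. A small helper along these lines (my own sketch, not part of the original pipeline) pulls the first JSON object out of whatever the model returns:

```python
import json
import re

def extract_json(completion: str) -> dict:
    """Extract the first JSON object from an LLM completion,
    tolerating surrounding code fences or stray text."""
    match = re.search(r"\{.*\}", completion, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in completion")
    return json.loads(match.group(0))

# A completion that closes the fence opened by the prompt still parses cleanly
print(extract_json('{"county": "Cork"}\n```'))  # {'county': 'Cork'}
```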
I processed the full set of addresses and identified the latitude and longitude for each one. This is a reusable and clean data set. I saved it so that operational teams can refer to it to standardize customer addresses.
The third step was also straightforward. I sat down with the business team and used a map to measure the straight-line distance between a few locations they wanted included in the results. We agreed on the radius of each neighborhood.
For the fourth step, I used Snowflake [geospatial functions](https://docs.snowflake.com/en/sql-reference/data-types-geospatial) `ST_MAKEPOINT` and `ST_DISTANCE` to convert the latitude/longitude to a geography object and calculate the geodesic distance between the objects.
```
CREATE TABLE MY_DB.MY_SCHEMA.DISTANCE_TO_POINTS AS
SELECT address_id, lon, lat,
ST_MAKEPOINT(lon, lat) AS coordinates,
ST_MAKEPOINT(-6.25, 53.40) AS neighborhood_1,
ST_MAKEPOINT(-6.37, 53.28) AS neighborhood_2,
ST_MAKEPOINT(-8.46, 51.89) AS neighborhood_3,
ST_DISTANCE(coordinates, neighborhood_1)/1000 AS dist_to_1_kms,
ST_DISTANCE(coordinates, neighborhood_2)/1000 AS dist_to_2_kms,
ST_DISTANCE(coordinates, neighborhood_3)/1000 AS dist_to_3_kms
FROM MY_DB.MY_SCHEMA.ADDRESS_WITH_LON_LAT
```
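As a rough sanity check outside Snowflake, the distances that `ST_DISTANCE` returns can be approximated with the haversine formula. This is a spherical approximation (Snowflake computes a geodesic on a more accurate earth model), so expect small deviations; the center points below are the ones from the query above.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance on a sphere of radius 6371 km."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

# Distance between the neighborhood_1 and neighborhood_2 center points
print(round(haversine_km(-6.25, 53.40, -6.37, 53.28), 1))  # roughly 15.5 km
```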
Finally, I counted the distances that were within the radius of each neighborhood and shared the results with the marketing team.
```
SET radius_1 = 10; --kilometers
SET radius_2 = 6;
SET radius_3 = 8;

WITH eligible_addresses AS (
    SELECT *,
        CASE
            WHEN dist_to_1_kms < $radius_1 THEN 'N1'
            WHEN dist_to_2_kms < $radius_2 THEN 'N2'
            WHEN dist_to_3_kms < $radius_3 THEN 'N3'
        END AS neighborhood
    FROM MY_DB.MY_SCHEMA.DISTANCE_TO_POINTS
    WHERE dist_to_1_kms < $radius_1
        OR dist_to_2_kms < $radius_2
        OR dist_to_3_kms < $radius_3
)
SELECT neighborhood, COUNT(*) AS num_address
FROM eligible_addresses
GROUP BY neighborhood
ORDER BY num_address DESC;
```
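One subtlety of the CASE expression: an address that happens to fall inside two circles is counted only once, for the first neighborhood that matches. The same first-match logic in plain Python (illustrative only; the distances in the example are made up):

```python
def assign_neighborhood(distances_km, radii_km):
    """First-match bucketing, mirroring the SQL CASE expression:
    return the label of the first neighborhood whose circle contains
    the address, or None if it is outside every circle."""
    for label, dist, radius in zip(("N1", "N2", "N3"), distances_km, radii_km):
        if dist < radius:
            return label
    return None

# Outside circle 1 (12 > 10 km) but inside circle 2 (4 < 6 km)
print(assign_neighborhood((12.0, 4.0, 50.0), (10, 6, 8)))  # N2
```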
As a bonus, I created an interactive Streamlit map in the Snowflake notebook to visualize the results.
Here is the code I used to generate the visualization.
```
import streamlit as st
from snowflake.snowpark.context import get_active_session

# Reuse the session of the Snowflake notebook
session = get_active_session()

# Only the addresses that fall inside one of the three circles
query = """
SELECT ADDRESS_ID, LAT, LON
FROM MY_DB.MY_SCHEMA.DISTANCE_TO_POINTS
WHERE dist_to_1_kms < 10
   OR dist_to_2_kms < 6
   OR dist_to_3_kms < 8
"""
data = session.sql(query).collect()

# st.map finds the LAT/LON columns by name
st.map(data)
```
Unbelievably simple. I don't even need to tell `st.map` which columns hold the latitude and longitude; it is smart enough to figure them out by itself as long as the columns are named lat[itude] and lon[gitude]. Check out the other options here: [https://docs.streamlit.io/develop/api-reference/charts/st.map](https://docs.streamlit.io/develop/api-reference/charts/st.map)
This was a challenging but fun use case that our team delivered in a few hours, thanks to the many native tools at our disposal when the data is already in Snowflake.