Our data analytics team received a request from the marketing team to find the number of customers in three “neighborhoods”. As the lead data engineer of the team, the first thing I did was to break down this vague request into specific problems and use Snowflake to solve them one by one. Here are the steps that I came up with:
1. Identify the center point of each neighborhood in latitude and longitude coordinates.
2. Find the latitude and longitude of every customer address (geocoding).
3. Draw non-overlapping circles around each center point. The business decides how wide each circle can be (the radius) before the circles start to overlap.
4. Calculate the distance from each customer address to each center point (geodistance).
5. If the distance to a neighborhood's center is less than its radius (i.e. the address is inside the circle), include that customer (geofencing).
The first step was straightforward: open a map, confirm the center point of each neighborhood with the business, and note the corresponding latitude and longitude coordinates. Done.
For the second task, I needed a geocoding service: something that takes an address as input and returns latitude and longitude as output. My first instinct was to search the Snowflake Marketplace to see if there was an app we could quickly install and use. Thankfully, I found one, read its documentation, and installed it.
The marketplace app required all addresses to be in a single table along with an ID column. So, I created a small sample of addresses and tested the app; both the speed and the accuracy were good. I then tested with a few thousand records before scaling up to the full set of 1.4 million unique addresses.
At this point, I realized that the quality of our customer address data was really bad, and it was hurting the output quality of the geocoding service. These are Irish addresses stored in the operational databases of multiple legal entities, each formatting them its own way. Our data pipeline was only concerned with bringing the addresses over from the source systems and linking them. There were no quality checks on the addresses and no standardization, because they had never been used for an analytics use case until now. So, I decided to parse each address and break it down into:
- house number and street
- town/city
- county
- Eircode (a.k.a. postal or ZIP code)
- country (mostly Ireland, but a few other countries that customers moved to)
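For illustration, the target structure for a single parsed address might look like the snippet below. The address and all its values are made up; only the JSON keys come from the article's schema.

```python
import json

# Hypothetical model output for a made-up Malahide address.
raw_output = """{
  "houseAndStreet": "12 Main Street",
  "townOrCity": "Malahide",
  "county": "Dublin",
  "postalCode": "K36 XY12",
  "country": "Ireland"
}"""

# json.loads fails loudly if the model added any chatter around the object
parsed = json.loads(raw_output)
print(parsed["county"])  # Dublin
```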
For the parsing task, I turned to Snowflake Cortex, which brings us to the LLM function that is best suited for this type of work: `SNOWFLAKE.CORTEX.COMPLETE`. (There are also Cortex functions for summarization, translation, sentiment analysis, etc.) The signature is simple:

```
SELECT SNOWFLAKE.CORTEX.COMPLETE('llm_id', 'prompt')
```

I first tried `llama2-70b-chat` because I saw it high up in multiple benchmarks (this domain moves so fast that it is probably not the best model today). It wasn't the best model for this particular task anyway. Despite many prompt-engineering attempts, I could not force it to give me machine-readable (JSON) output; it would always add comments and explanations. It was too "chatty", which I assume is by design, since chat is literally in its name. So I gave up and tried `mistral-7b`, and it performed much better. Adding `` ```json `` to the end of the prompt did the trick, as it made clear to the model that I wanted it to _complete_ the prompt with JSON output.
Below is an excerpt from the Snowflake notebook where I did this analysis. Notice that I had to break the prompt into multiple variables and concatenate them later, because [Snowflake SQL text variables](https://docs.snowflake.com/en/sql-reference/sql/set) have a maximum size of 256 bytes.
````
SET llm = 'mistral-7b';
SET address_prompt_1 = 'The following is an Irish postal address. ';
SET address_prompt_2 = 'Your job is to extract address components
and return the output in this JSON format: {
"houseAndStreet": "",
"townOrCity": "",
"county": "",
"postalCode": "",
"country": ""
} ';
SET address_prompt_3 = 'If a part of the address is missing,
set the value of the key to null. The county must be one of these 26 counties:
[Antrim, Carlow, Cavan, ..., ]';
SET address_prompt_4 = 'Do not add comments. The output has to be
machine-readable.
```json';

SELECT
    address_id,
    address,
    SNOWFLAKE.CORTEX.COMPLETE($llm,
        CONCAT(
            $address_prompt_1,
            address,
            $address_prompt_2,
            $address_prompt_3,
            $address_prompt_4
        )
    ) AS parsed_address
FROM MY_DB.MY_SCHEMA.ADDRESS;
````
Now that I had perfectly formatted and standardized addresses, I retried geocoding with the larger sample, and I could see the improvement already. Mistral did a really good job. In some cases, it went beyond parsing and found missing data in the address by “knowing or inferring” the fact that a particular town is in a particular county and added the county name to the output. Amazing!
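Even with the fence trick, a completion can still close the opened `` ```json `` fence or carry a stray character or two, so it is worth parsing defensively before trusting the output. A small helper along these lines (my own sketch, not part of the original pipeline) pulls the first JSON object out of whatever the model returns:

```python
import json
import re

def extract_json(completion: str) -> dict:
    """Extract the first JSON object from an LLM completion,
    tolerating surrounding code fences or stray text."""
    match = re.search(r"\{.*\}", completion, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in completion")
    return json.loads(match.group(0))

# A completion that closes the fence opened by the prompt still parses cleanly
print(extract_json('{"county": "Cork"}\n```'))  # {'county': 'Cork'}
```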
I processed the full set of addresses and identified the latitude and longitude for each one. This is a reusable and clean data set. I saved it so that operational teams can refer to it to standardize customer addresses.
The third step was also straightforward. I sat down with the business team and used a map to measure the straight-line distance between a few locations they wanted included in the results. We agreed on the radius of each neighborhood.
For the fourth step, I used Snowflake [geospatial functions](https://docs.snowflake.com/en/sql-reference/data-types-geospatial) `ST_MAKEPOINT` and `ST_DISTANCE` to convert the latitude/longitude to a geography object and calculate the geodesic distance between the objects.
```
CREATE TABLE MY_DB.MY_SCHEMA.DISTANCE_TO_POINTS AS
SELECT address_id, lon, lat,
ST_MAKEPOINT(lon, lat) AS coordinates,
ST_MAKEPOINT(-6.25, 53.40) AS neighborhood_1,
ST_MAKEPOINT(-6.37, 53.28) AS neighborhood_2,
ST_MAKEPOINT(-8.46, 51.89) AS neighborhood_3,
ST_DISTANCE(coordinates, neighborhood_1)/1000 AS dist_to_1_kms,
ST_DISTANCE(coordinates, neighborhood_2)/1000 AS dist_to_2_kms,
ST_DISTANCE(coordinates, neighborhood_3)/1000 AS dist_to_3_kms
FROM MY_DB.MY_SCHEMA.ADDRESS_WITH_LON_LAT
```
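As a rough sanity check outside Snowflake, the distances that `ST_DISTANCE` returns can be approximated with the haversine formula. This is a spherical approximation (Snowflake computes a geodesic on a more accurate earth model), so expect small deviations; the center points below are the ones from the query above.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance on a sphere of radius 6371 km."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

# Distance between the neighborhood_1 and neighborhood_2 center points
print(round(haversine_km(-6.25, 53.40, -6.37, 53.28), 1))  # roughly 15.5 km
```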
Finally, I counted the distances that were within the radius of each neighborhood and shared the results with the marketing team.
```
SET radius_1 = 10; --kilometers
SET radius_2 = 6;
SET radius_3 = 8;

WITH eligible_addresses AS (
    SELECT *,
        CASE
            WHEN dist_to_1_kms < $radius_1 THEN 'N1'
            WHEN dist_to_2_kms < $radius_2 THEN 'N2'
            WHEN dist_to_3_kms < $radius_3 THEN 'N3'
        END AS neighborhood
    FROM MY_DB.MY_SCHEMA.DISTANCE_TO_POINTS
    WHERE dist_to_1_kms < $radius_1
        OR dist_to_2_kms < $radius_2
        OR dist_to_3_kms < $radius_3
)
SELECT neighborhood, COUNT(*) AS num_address
FROM eligible_addresses
GROUP BY neighborhood
ORDER BY num_address DESC;
```
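One subtlety of the CASE expression: an address that happens to fall inside two circles is counted only once, for the first neighborhood that matches. The same first-match logic in plain Python (illustrative only; the distances in the example are made up):

```python
def assign_neighborhood(distances_km, radii_km):
    """First-match bucketing, mirroring the SQL CASE expression:
    return the label of the first neighborhood whose circle contains
    the address, or None if it is outside every circle."""
    for label, dist, radius in zip(("N1", "N2", "N3"), distances_km, radii_km):
        if dist < radius:
            return label
    return None

# Outside circle 1 (12 > 10 km) but inside circle 2 (4 < 6 km)
print(assign_neighborhood((12.0, 4.0, 50.0), (10, 6, 8)))  # N2
```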
As a bonus, I created an interactive Streamlit map in the Snowflake notebook to visualize the results.
Here is the code I used to generate the visualization.
```
import streamlit as st
from snowflake.snowpark.context import get_active_session

# Reuse the session of the Snowflake notebook
session = get_active_session()

# Only the addresses that fall inside one of the three circles
query = """
SELECT ADDRESS_ID, LAT, LON
FROM MY_DB.MY_SCHEMA.DISTANCE_TO_POINTS
WHERE dist_to_1_kms < 10
   OR dist_to_2_kms < 6
   OR dist_to_3_kms < 8
"""
data = session.sql(query).collect()

# st.map finds the LAT/LON columns by name
st.map(data)
```
Unbelievably simple. I don't even need to tell `st.map` which columns hold the latitude and longitude; it is smart enough to figure them out by itself as long as the columns are named lat[itude] and lon[gitude]. Check out the other options here: [https://docs.streamlit.io/develop/api-reference/charts/st.map](https://docs.streamlit.io/develop/api-reference/charts/st.map)
This was a challenging but fun use case that our team delivered in a few hours, thanks to the many native tools at our disposal when the data is already in Snowflake.