Looker Blog : Data Matters

How to Query Amazon Athena Geospatial Data

Zachary Aragosa, Enterprise Sales Engineer

Nov 14, 2018

Amazon’s Athena database supports a wide array of geospatial functionality that allows for building complex analysis with any data containing geographies or individual locations. With Looker, you can query data directly from Athena and leverage all of their geospatial functionality to give users the ability to work with massive geospatial data sets.

A particularly powerful geospatial data set that is available to the public is OpenStreetMap data. Since this data is available in a public S3 bucket, you’ll be able to very easily pull this data into Athena and start querying it. While it’ll be easy to get up and running right away running basic SELECT queries, the complex structure of the dataset means that writing manual SQL queries every time you have a new question is going to be a challenge.

This is where combining Athena’s geospatial functionality with Looker’s LookML modeling layer becomes incredibly powerful. To demonstrate some of this functionality, I thought it would be interesting to explore the natural environment of Southern California by county.

Geospatial Data Example with Amazon Athena

OpenStreetMap data contains a tag for “Nature” that describes mountains, beaches, capes, trees, etc. around the world. Here’s an example:

Location Nature Type Name Tags
52 -0.062714,-78.8328985 beach Playa del Río Saloya {natural=beach, surface=sand, name=Playa del Río Saloya}
53 -0.0647157,-77.9743917 wetland río volteado {natural=wetland, name=río volteado}
54 -0.0657521,-80.1585446 beach La Boca de tabuga {natural=beach, name=La Boca de tabuga}
55 -0.0665063,-78.3998779 peak Gualagüincha {is_in:country_code=EC, natural=peak, gns_uni=-1372397, gns_classification=MT, name=Gualagüincha, source=GNS, is_in:state=Provincia de Pichincha, is_in:country=Ecuador}


In order to compare which county has the most “Nature”, I loaded up the boundaryshapes of each SoCal county into Athena. The boundary shape field can be read as a type ‘Polygon’ in Athena and represents a set of points.


name boundaryshape
1 San Francisco 00 00 00 00 03 05 00 00 00 45 d6 1a 4a ed a0 5e c0 23 a0
2 Madera 00 00 00 00 03 05 00 00 00 84 d8 99 42 e7 22 5e c0 c0 79
3 San Mateo 00 00 00 00 03 05 00 00 00 56 0c 57 07 40 a1 5e c0 65 69


Identifying Nature

To limit the OpenStreetMap data down to just its natural elements and improve query performance, we can focus our dataset by creating a Derived Table in Looker to shrink the data set considerably.

view: nature {
    derived_table: {
        sql:
            SELECT
                tags,
                id,
                lat,
                lon,
            FROM planet
            WHERE type = 'node' and 
            tags['natural'] is not null
        ;;
    }
}

Joining Geographies

To figure out which of our natural points fall within in each county, we can join the counties table with the nature table using Athena’s ST_CONTAINS function. ST_CONTAINS returns TRUE when a specific nature point falls within the boundaries of each county’s polygon:

explore: nature {
    join: counties {
        relationship: many_to_one
        sql_on: st_contains(${counties.boundaryshape}, ST_POINT(${nature.lon}, ${nature.lat})) ;;
    }
}

Unnesting Data

Since our nature data is highly nested, we can use Athena’s un-nesting functions to parse out the details about each point of interest and codify that logic in LookML.

dimension: type {
    type: string
    sql: ${TABLE}.tags['type'] ;;
    description: "Nature Type as Defined by OpenStreetMaps"
}

Diving into Nature

From here, we can start to analyze the geographic features of southern California and was able to compare the number of beaches by county. Turns out Looker’s headquarters in Santa Cruz County has the second most beaches:

amazon_athena

San Bernardino County, not surprisingly, contains the most peaks over 1000 feet in elevation.

amazon_athena

Drilling into this data, I found that the highest peak in San Bernardino is San Gorgonio Mountain at 3,502’.


Nature Name Nature Elevation
1 San Gorgonio Mountain 3,502
2 Jepson Peak 3,416
3 Anderson Peak 3,304
4 Charlton Peak 3,291
5 Shields Peak 3,261


Creating a More Complex Geospatial Data Analysis

Say you were interested in planning a hiking vacation in California, an important decision for your trip will be picking a hotel that is closest to the highest peaks you intend to climb. Luckily, we can pull hotel information from OpenStreetMap by filtering our data on WHERE tags['tourism'] = 'hotel'.


ID Tags Lon Lat
1 595970835 {addr:housenumber=715, addr:country=US, name=The Beach Cottages, tourism=hotel, source=SanGIS Addresses Public Domain (http://www.sangis.org/), addr:street=Thomas Avenue, addr:postcode=92109, addr:city=San Diego} -117.255362 32.79297
2 595978669 {is_in:country_code=US, addr:housenumber=1558, is_in:state_code=CA, addr:country=US, name=Sheraton Carlsbad Resort and Spa, tourism=hotel, source=SanGIS Addresses Public Domain (http://www.sangis.org/), addr:street=East Balboa Court, is_in:state=California, addr:postcode=92008, is_in:country=United States of America, addr:city=Carlsbad} -117.311829 33.134073
3 596004616 {addr:housenumber=4767, addr:country=US, name=Capri by the Sea by All Seasons Resort Lodging, tourism=hotel, source=SanGIS Addresses Public Domain (http://www.sangis.org/), addr:street=Ocean Boulevard} -117.2583404 32.8004225
4 596022506 {rooms=23, internet_access=wlan, addr:state=CA, addr:country=US, internet_access:fee=no, tourism=hotel, stars=3, source=SanGIS Addresses Public Domain (http://www.sangis.org/), addr:postcode=92101, addr:city=San Diego, addr:housenumber=505, smoking=no, name=Found Hotel San Diego, addr:street=West Grape Street} -117.16751 32.725654


In Athena, I used the ST_BUFFER() function to find the natural elements nearby each hotel in the data set.

explore: hotel {
    join: nature {
        relationship: many_to_one
        sql_on: st_contains(ST_BUFFER(ST_POINT(${hotel.lon},${hotel.lat}), .25),
                            ST_POINT(${nature.lon}, ${nature.lat})) ;;
    }
}

Using LookML, you can create a metric that will allow you to count only the natural elements that have peaks greater than 1000 feet high.

measure: high_peak_count {
    type: count
    filters: {
        field: nature_type
        value: "peak"
    }
    filters: {
        field: elevation
        value: ">1000"
    }
    drill_fields: [detail*]
}

Then, we could use Looker’s explore section to build a query that counts the number of high peaks within a short drive from each hotel.

It turns out that one region in California in particular, the Big Sur area, offers 4 hotels that are each within a short drive of over 20 high peaks. The Big Sur Lodge seems like it might be a great place to take a hiking vacation.

amazon_athena

We can drill into the count of high peaks and plot the exact locations of each peak.

amazon_athena

To pinpoint exactly which mountains are closest to my hotel, we can use the ST_DISTANCE function to calculate distance as a crow flies between two points. (In degrees):

dimension: distance_hotel_to_nature {
    type: number
    value_format_name: decimal_2
    sql: ST_DISTANCE(ST_POINT(${nature.lon}, ${nature.lat} ), ST_POINT(${lon}, ${lat})) ;;
}


Name Elevation Distance Hotel to Nature
1 Manuel Peak 1,074 0.03
2 Port Summit 1,050 0.04
3 Pico Blanco 1,126 0.07
4 Ventana Double Cone 1,477 0.09
5 Mount Olmstead 1,094 0.09


Conclusion

Geospatial datasets like OpenStreetMap can help users answer many important and challenging questions about our environment. However, historically, geospatial data was notoriously cumbersome and clunky to work with. Data scientists were always required to meticulously transform the data for every new question asked by a non-technical user. Now, with a combination of Amazon Athena’s geospatial functionality and LookML, analysts can create an environment that enables non-technical end users to explore these types of data sets and answer questions on their own.

Next Previous

Subscribe for the Latest Posts