Geo Distance in RapidMiner and Python

In my previous post, I showed how you can use the Enrich by Webservice operator and OpenStreetMaps to do reverse geocoding lookups. This post will show how to calculate Geo Distance in RapidMiner between two latitude and longitude points. First using a RapidMiner and then using the GeoPy Python module.

This was a fun because it touched on my civil engineering classes. I used to calculate distances from latitude and longitude in my land surveying classes.

My first step was to select a home” location, which was 1 Penn Plaza, NY NY. Then I downloaded the latest list of earthquakes from the USGSwebsite. The last step was to calculate the distance from home to each earthquake location.

The biggest time suck for me was building all the formulas in RapidMiner’s Generate Attribute (GA) operator. That took about about 15 minutes. Then I had to backcheck the calculations with a website to make sure they matched. RapidMiner excelled in the speed of building and analyzing this process but I did notice the results were a bit off from the GeoPy python process.

There was a variance of about +/- 4km in each distance. This is because I hard coded in the earth’s diameter as 6371000 km for the RapidMiner process, but the diameter of the Earth changes based on your location. This is because the earth isn’t a sphere but more of an ellipsoid and the diameter isn’t uniform. The GeoPy great_circle calculation accounts for this by adjusting the calculation.

For a proof of concept, both work just fine.

Get the Geo Distance in RapidMiner Process

There were a few snags in my python code that took me longer to finish and I chalk this up to my novice ability at writing python. I didn’t realize that I had to create a tuple out of the lat/long columns and then use a for loop to iterate over the entire tuple list. But this was something that my friend solved in 5 minutes. Otherwise than that, the python code works well. Here’s the XML of the process:

 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.5.002">
      <operator activated="true" class="process" compatibility="6.5.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="open_file" compatibility="6.5.002" expanded="true" height="60" name="Open File" width="90" x="45" y="30">
            <parameter key="resource_type" value="URL"/>
            <parameter key="url" value=""/>
            <description align="center" color="transparent" colored="false" width="126">Open Earthquake USGS URL</description>
          <operator activated="true" class="read_csv" compatibility="6.5.002" expanded="true" height="60" name="Read CSV" width="90" x="179" y="30">
            <parameter key="column_separators" value=","/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
            <description align="center" color="transparent" colored="false" width="126">Read CSV file</description>
          <operator activated="true" class="select_attributes" compatibility="6.5.002" expanded="true" height="76" name="Select Attributes" width="90" x="313" y="30">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="latitude|longitude|mag"/>
            <description align="center" color="transparent" colored="false" width="126">Select Magnitude, Lat, and Long</description>
          <operator activated="true" class="filter_examples" compatibility="6.5.002" expanded="true" height="94" name="Filter Examples" width="90" x="447" y="30">
            <list key="filters_list">
              <parameter key="filters_entry_key" value=""/>
            <description align="center" color="transparent" colored="false" width="126">Filter for quakes &amp;gt; mag 4</description>
          <operator activated="true" class="rename" compatibility="6.5.002" expanded="true" height="76" name="Rename" width="90" x="581" y="30">
            <parameter key="old_name" value="latitude"/>
            <parameter key="new_name" value="Latitude"/>
            <list key="rename_additional_attributes">
              <parameter key="longitude" value="Longitude"/>
            <description align="center" color="transparent" colored="false" width="126">Rename columns</description>
          <operator activated="true" class="generate_attributes" compatibility="6.5.002" expanded="true" height="76" name="Generate Attributes" width="90" x="715" y="30">
            <list key="function_descriptions">
              <parameter key="Rad_Lat" value="Latitude*(pi/180)"/>
              <parameter key="Rad_Long" value="Longitude*(pi/180)"/>
              <parameter key="Lat_Home" value="40.750938"/> 
              <parameter key="Long_Home" value="-73.991594"/>
              <parameter key="Rad_Lat_Home" value="Lat_Home*(pi/180)"/>
              <parameter key="Rad_Long_Home" value="Long_Home*(pi/180)"/>
              <parameter key="Rad_Diff_Lat" value="(Latitude-Lat_Home)*(pi/180)"/>
              <parameter key="Rad_Diff_Long" value="(Longitude-Long_Home)*(pi/180)"/>
              <parameter key="a" value="(sin(Rad_Diff_Lat/2))^2 + cos(Rad_Lat) * cos(Rad_Lat_Home) * (sin(Rad_Diff_Long/2))^2"/>
              <parameter key="c" value="2 * atan2(sqrt(a), sqrt(1-a) )"/>
              <parameter key="distance_km" value="(6371000*c)/1000"/>
              <parameter key="distance_miles" value="distance_km*0.621371"/>
            <description align="center" color="transparent" colored="false" width="126">Make lots of calculations&lt;br/&gt;</description>
          <connect from_op="Open File" from_port="file" to_op="Read CSV" to_port="file"/>
          <connect from_op="Read CSV" from_port="output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Rename" to_port="example set input"/>
          <connect from_op="Rename" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>


Here’s the python process:

    import pandas as pd
    from geopy.geocoders import Nominatim
    from geopy.distance import great_circle
    geolocator = Nominatim()
    location = geolocator.geocode("1 Penn Plaza, NY, NY")
    home = (location.latitude, location.longitude) #Set Home Location
    earthquake = pd.read_csv('') #Read CSV file
    selection = (earthquake['mag'] >= 4) #Filter for earthquakes > mag 4
    earthquake = earthquake[selection].dropna(how = 'any', subset = ['latitude', 'longitude']).drop(['time', 'depth', 'mag', 'magType', 'nst','gap','dmin', 'rms','net', 'id','updated','place','type'], axis=1)
    earthquake = earthquake.convert_objects(convert_numeric=True)
    earthquake.describe(include='all') #not necessary but I like to see a description of the data I'm pushing downstream
    earthquake['combined'] = zip(earthquake.latitude, earthquake.longitude) #create tuple from pandas dataframe
    print earthquake.combined #double check the list
    print [great_circle(home, (lt,lng)) for (lt,lng) in earthquake.combined] #brackets are a short form of loop


%d bloggers like this: