Thursday, October 24, 2013

Post 3: Geocoding

Goals and Objectives
For this portion of the project, our goal was to download frac sand facility location data from a county website (www.tremplocounty.com) and from a statewide website (WisconsinWatch.org), which supplied its data as an Excel file.  We then had to normalize the data and geocode the assigned locations.  After all the results were geocoded, locations that had been geocoded by more than one user were compared to demonstrate the error that can occur in the geocoding process.

Methods
Geocoding involves assigning coordinates or an address to a specific data point.  We attempted to geocode the locations of all known frac sand facilities (proposed and functioning) in Wisconsin based on data provided by WisconsinWatch.org.  The data were in the form of an Excel table with various types of addresses given.  Before the geocoder could process the given addresses, we first had to normalize the table.  We then ran the program to see which locations could be matched and which would require further investigation, manual placement, or address format changes to be located properly.

For addresses that had a full street address and could be matched, we simply selected the appropriate location and an address was automatically assigned to the point. However, this only worked on 4 of the 14 points, so alternative methods were required. For addresses that were given in the form of a generic or PLSS description, I used Google Earth and a PLSS coordinate converter to attempt to determine a street address, which could be matched in the Geocoder.  This worked for three of my data points. For the remaining seven locations, I used the "pick the address from the map" feature in the Geocoder to manually place the point.  I then manually updated the address information in the table as well.
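The sorting of addresses into these three groups can be sketched in Python. This is only an illustration, not what the Geocoder does internally: the regular expressions, the function name triage, and the sample strings are all assumptions, since the actual address formats from the table are not reproduced in this post.

```python
import re

# Hypothetical patterns; the real table may use other formats.
# Street: an optional Wisconsin-style grid letter, a house number, then a name,
# e.g. "123 Main St" or "N1234 County Rd X".
STREET_RE = re.compile(r"^\s*[NSEW]?\d+\s+\S+")
# PLSS: a township/range pair such as "T24N R9W", or a section number.
PLSS_RE = re.compile(
    r"\bT\d+N\b.*\bR\d+[EW]\b|\bSec(tion)?\.?\s*\d+", re.IGNORECASE
)


def triage(address: str) -> str:
    """Return 'plss', 'street', or 'manual' for a raw address string."""
    if PLSS_RE.search(address):
        return "plss"      # convert to a street address before matching
    if STREET_RE.match(address):
        return "street"    # try the geocoder's automatic match
    return "manual"        # place the point by hand on the map


if __name__ == "__main__":
    # Made-up examples, one per group.
    for sample in ["123 Main St", "Sec 12, T24N R9W", "North of town on County Road"]:
        print(sample, "->", triage(sample))
```

A check like this would only suggest a starting method for each record. As described above, some PLSS descriptions could not be converted to a matchable street address and still had to be placed manually.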

Results
The initial Excel file downloaded from WisconsinWatch.org was very cluttered with confusing and unnecessary information (Fig. 1).
Figure 1.  The original table had many inconsistencies and additional information that required normalization.
In order to make the data usable, we normalized the table.  This entailed making sure all attributes were functionally dependent on the unique primary key (UNIQUE ID) and minimizing the storage of data to ensure data integrity.  To do this, we split the address into separate facility address, city, zip code, and state fields (Fig. 2).  We then eliminated all fields except for "UNIQUE ID," "Facility Type," "Facility Address," "City," "Community (City)," "Zip Code," and "State."  This removed the excess fields and made all categories directly dependent on the "UNIQUE ID" field.  Finally, we made sure that all facility addresses were in the same format.
Figure 2.  After normalization, the information is much clearer and more usable.
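As a rough sketch of the address-splitting step, the Python below breaks a combined address string into the separate fields named above. We did this work by hand in Excel, so the code is only an illustration. The function name split_address, the assumed "street, city, ST 12345" input format, and the sample address are all made up for the example, and the "Community (City)" field is left out for brevity.

```python
import re

# Assumed format of the last comma-separated part: "WI 54612".
STATE_ZIP_RE = re.compile(r"([A-Za-z]{2})\s+(\d{5})")


def split_address(unique_id, facility_type, raw_address):
    """Split 'street, city, ST 12345' into one field per attribute.

    Anything that does not fit the assumed format is kept whole in
    "Facility Address" so it can be researched and fixed by hand.
    """
    record = {
        "UNIQUE ID": unique_id,
        "Facility Type": facility_type,
        "Facility Address": raw_address.strip(),
        "City": "",
        "Zip Code": "",
        "State": "",
    }
    parts = [p.strip() for p in raw_address.split(",")]
    if len(parts) == 3:
        match = STATE_ZIP_RE.fullmatch(parts[2])
        if match:
            record["Facility Address"] = parts[0]
            record["City"] = parts[1]
            record["State"] = match.group(1).upper()
            record["Zip Code"] = match.group(2)
    return record


if __name__ == "__main__":
    # Made-up address, not taken from the WisconsinWatch.org table.
    print(split_address(1, "Mine", "123 Main St, Arcadia, WI 54612"))
```

Records that come back with empty city, zip code, and state fields are the ones that would need the Google Earth and company website research described below.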
After normalizing the data and filling in any informational gaps using Google Earth and the mining company websites, the frac sand facility locations were all geocoded properly using both geocoder matching and manual placement (Fig. 3).
Figure 3.  After normalization of the data and filling of informational gaps, all 14 frac sand facilities were properly geocoded.
Geocoding results varied significantly depending on the user.  Errors occurred at all stages of the geocoding process and were both inherent and operational in nature.  For example, our original source data were inherently flawed, as the setup for providing the information was insufficient for collecting all the necessary address information.  In addition, there was operational error during our data compilation stage.  There were clearly errors made during attribute data input, as some of the addresses placed the facilities in communities that were incorrect or only near the correct city.  Also, each user digitized the locations differently.  This likely resulted from users interpreting the Google Earth images of frac sand facility locations differently.  The differences ranged from a few hundred to over twenty thousand meters (Fig. 4).
Figure 4.  Insufficient source data and differences in digitization resulted in significant variation in users' geocoding results. These two points, which are supposed to be the same location, were generated by two different users and are more than 2700 meters apart.
To assess how significantly these sources of error affected the final data, we used the point distance tool. This provided the distance to the point nearest my designated location for a given facility (Fig. 5).  Of the five facility locations I compared, only one matched exactly.  However, this point was matched directly by the Geocoder. The other four points that I compared, which were placed manually, varied from a separation of ~400 meters to ~25,000 meters.  Unfortunately, this data processing tool has inherent error, as it does not find points based on matching primary keys, but uses proximity alone.
Figure 5.  The point distance tool was used to assess the impact of inherent and operational error on the final data by measuring the distance from a specified facility location (INPUT_FID) to the nearest point generated by another user (NEAR_FID).
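To show why pairing points by proximity alone can be misleading, here is a small Python sketch that contrasts it with pairing on the primary key. It is a stand-in, not the point distance tool itself: it uses great-circle (haversine) distances on latitude and longitude with an assumed mean Earth radius, and the function names and coordinates are invented for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points (degrees)."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def distance_by_key(mine, theirs):
    """Distance between points that share the same UNIQUE ID."""
    shared = mine.keys() & theirs.keys()
    return {k: haversine_m(*mine[k], *theirs[k]) for k in shared}


def nearest_by_proximity(mine, theirs):
    """For each of my points, the (distance, id) of the closest other point."""
    return {
        k: min((haversine_m(*p, *q), other) for other, q in theirs.items())
        for k, p in mine.items()
    }


if __name__ == "__main__":
    # Made-up coordinates: the other user's point for facility 1 is far away,
    # but their point for facility 2 happens to sit close to my facility 1.
    mine = {1: (44.0, -91.0)}
    theirs = {1: (44.2, -91.0), 2: (44.01, -91.0)}
    print(distance_by_key(mine, theirs))       # compares facility 1 with facility 1
    print(nearest_by_proximity(mine, theirs))  # pairs facility 1 with facility 2
```

In the made-up case above, the proximity search reports a small distance because it pairs my point with a different facility, which would hide a large disagreement about the facility actually being compared.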
Conclusion
Though the original data table was quite unclear and lacking in information, we were able to generate adequate geocoding results by using normalization and some help from Google Earth and additional research.  However, the final product was imperfect, containing significant inherent and operational error.  This problem is not uncommon.  To verify the accuracy of location placement, each point should be compared to reference data with a higher degree of accuracy.  If no such data are available for the frac sand facility locations, coordinates should be collected manually at each site and then compared to reference data points in the area to confirm the accuracy of each location.
