Sunday, June 22, 2008

scraping data from excel files

I am still in the midst of writing my scraper for pulling data out of Excel files. There are two types of encrypted Excel files that I need to scrape. Both contain similar data but are structured differently.

The first problem I encountered is that the data positions are not standardized and are scattered all over the place. For example, in the first type of file the data runs from line 10 to 31, while in the second type it runs from line 11 to 31. The form looks neat to the human eye but is a coordinate-hunting nightmare programmatically! While both files contain similar data, I found that a single scraping method is insufficient because of the tiny differences in where the data sits. I cannot use ODBC to run SQL statements on them because they are encrypted, so I am stuck doing coordinate targeting.

After much fiddling with data positions, I finally decided to write a Form object that takes a record_type as an argument. When I instantiate it with a record_type of, say, 'A', it loads the profile for type 'A': a dictionary that records where most of the data is located on the form. From there, the object knows where all the important bits of information are.
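
Here is a minimal sketch of how such a profile-driven Form might look in Python. It is an illustration, not the actual scraper: the field names, the column numbers, and the get_cell callback (standing in for whatever reads a cell from the opened workbook) are all assumptions. Only the row ranges, 10 to 31 and 11 to 31, come from the files described above.

```python
# Hypothetical profiles, one per record type. The field names and column
# numbers are invented for illustration; only the row ranges come from the post.
PROFILES = {
    "A": {"first_row": 10, "last_row": 31, "columns": {"name": 1, "amount": 3}},
    "B": {"first_row": 11, "last_row": 31, "columns": {"name": 2, "amount": 4}},
}


class Form:
    """Reads records from a sheet using the layout profile for its record type."""

    def __init__(self, record_type):
        if record_type not in PROFILES:
            raise ValueError(f"unknown record type: {record_type!r}")
        self.record_type = record_type
        self.profile = PROFILES[record_type]

    def records(self, get_cell):
        """Yield one dict per data row.

        get_cell is an assumed callback, get_cell(row, col) -> value, that
        hides how the (encrypted) workbook is actually opened and read.
        """
        profile = self.profile
        for row in range(profile["first_row"], profile["last_row"] + 1):
            yield {
                field: get_cell(row, col)
                for field, col in profile["columns"].items()
            }


# Usage with a fake sheet: a dict keyed by (row, col) stands in for the file.
cells = {(10, 1): "alice", (10, 3): 12.5}
rows_a = list(Form("A").records(lambda row, col: cells.get((row, col))))
rows_b = list(Form("B").records(lambda row, col: cells.get((row, col))))
```

The scraping code stays the same for both file types; supporting a third layout would only mean adding another entry to the dictionary.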

Has anyone done something similar in a better way?