Sunday, June 22, 2008

scraping data from excel files

I am currently still in the midst of writing my excel scraper to scrape data from excel files. There are two types of encrypted excel files that I need to scrape from. Both of them contain similar data but are structured differently.

The first problem that I encountered is that the data position are not standard and scattered all around the place, for example in the first type of file the data starts reading from line 10 to 31 while the second type of file starts reading from 11 to 31. The form looks neat and good to the human eye but is a coordinate hunting nightmare programmatically! While both of the files contain similar data, I found that one method to scrape data is insufficient because of the tiny differences in the location of the data. I cannot use odbc methods to run sql statements on them as they are encrypted, so I am stuck doing coordinate targeting.

After much doodling around with positions of data, I finally decided to write a Form object that takes a record_type an argument. When I instantiate a record_type of say 'A' type it will then load up the profile of the 'A' type object which is a dictionary that contains all the information of where most of the data are on the form. This dictionary is my implementation of a profile. So from there my object will know where all the important bits of information are based on the profile from the dictionary.

Anyone have done something similar in a better way?

6 comments:

Siew Kam Onn said...

Check out opyExcelerator for python-Excel programming

lowks said...

thanks siew I checked that out by I don't think it works for encrypted excel files

Joost Moesker said...

I recently build a tool to integrate supplier stock lists into our ERP back end. I used xlrd to read the files and join the worksheets into one stream using the itertool. Next i build a pipeline of generic filtering operations which can be applied to clean-up the data.

You can start out by stripping all the empty rows using:

After which you strip the empty columns by transposing the data and applying the same function.

transpose = lambda stream: izip(*stream)
strip_null_columns = lambda stream: transpose(ifilter(any, transpose(stream)))
strip_null_rows = lambda stream: ifilter(any, stream)
strip_null_space = lambda stream: strip_null_columns(strip_null_rows(stream))

Next you can extract the header by reading the first row containing only strings and filter all intermediate headers from the stream.

Of course this approach only works when the source data is basically tabular with some visual formatting.

Samat Jain said...

The "best" way I think to do this is buy a copy of Microsoft Excel and Microsoft Windows, and set up some kind of web service that uses Excel's COM interfaces to read the files in question.

djay said...

I think I did something similar with scraping tabular data from pdfs. I had to guess positions. If your interested in that code let me know.

lowks said...

djay:

Yes of course I am interested. Any help I can get would be good.