I am still in the midst of writing my Excel scraper. There are two types of encrypted Excel files that I need to scrape. Both contain similar data but are structured differently.
The first problem I encountered is that the data positions are not standard and are scattered all over the place. For example, in the first type of file the data runs from line 10 to 31, while in the second type it runs from line 11 to 31. The form looks neat and good to the human eye but is a coordinate-hunting nightmare programmatically! While both files contain similar data, I found that a single scraping method is insufficient because of the tiny differences in where the data sits. I cannot use ODBC to run SQL statements on the files because they are encrypted, so I am stuck doing coordinate targeting.
After much doodling around with the positions of the data, I finally decided to write a Form object that takes a record_type as an argument. When I instantiate it with a record_type of, say, 'A', it loads the profile for the 'A' type, which is a dictionary holding the locations of most of the data on the form. This dictionary is my implementation of a profile. From there, my object knows where all the important bits of information are.
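To make the idea concrete, here is a minimal sketch of a profile-driven Form. Only the row ranges (10 to 31 and 11 to 31) come from the description above; the record type name 'B', the field names, the column numbers and the read_cell callback are all hypothetical and stand in for whatever the real forms contain.

```python
# Hypothetical profiles: the row ranges come from the post, everything
# else (type 'B', field names, column numbers) is made up for illustration.
PROFILES = {
    "A": {"first_row": 10, "last_row": 31, "columns": {"name": 1, "amount": 2}},
    "B": {"first_row": 11, "last_row": 31, "columns": {"name": 1, "amount": 2}},
}


class Form:
    """Knows where the data lives on a form, based on its record type."""

    def __init__(self, record_type):
        try:
            self.profile = PROFILES[record_type]
        except KeyError:
            raise ValueError("unknown record type: %r" % (record_type,))
        self.record_type = record_type

    def rows(self, read_cell):
        """Yield one dict per data row.

        read_cell(row, col) is any callable returning the value of a cell,
        so the coordinate logic stays separate from how the file is opened.
        """
        profile = self.profile
        for row in range(profile["first_row"], profile["last_row"] + 1):
            yield {
                field: read_cell(row, col)
                for field, col in profile["columns"].items()
            }
```

Supporting a third layout would then mean adding one more entry to PROFILES rather than writing another scraping method.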
Has anyone done something similar in a better way?
Sunday, June 22, 2008
Friday, June 20, 2008
surrounded by win32com's python and not liking it
I am using the win32com library to write an application that loads data from encrypted Excel files kept in a specific folder into a MySQL database. Most of it is in place, except that I cannot find much documentation for the win32com library. One part of my code really irks me, and it's the Excel visibility setting. Every time the application processes an Excel file, it has to open the file on screen and close it again. When I set Visible to 0 or False, my application doesn't run correctly anymore. Some of my code is as follows:
import win32com.client
# start (or attach to) Excel through COM
xl = win32com.client.Dispatch('Excel.Application')
xl.Visible = 1
Grr! I wonder why this doesn't work if you set Visible to 0. Can't this run in the background? Woe!
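One thing that might be worth trying is sketched below. It is not a confirmed fix: it assumes the failure comes from a prompt that Excel raises while hidden, which the post does not establish. The function name read_range_hidden and its parameters are hypothetical. Visible, DisplayAlerts, Workbooks.Open (positional arguments FileName, UpdateLinks, ReadOnly, Format, Password), Worksheets, Range, Value and Close are members of the Excel object model.

```python
def read_range_hidden(xl, path, password, cell_range):
    """Read a block of cells from the first worksheet without showing Excel.

    xl is an Excel.Application COM object, for example the result of
    win32com.client.Dispatch('Excel.Application'). Passing it in keeps
    this function free of any direct win32com import.
    """
    xl.Visible = False        # keep the Excel window off screen
    xl.DisplayAlerts = False  # assumption: a hidden prompt is what stalls the run
    # Positional arguments: FileName, UpdateLinks, ReadOnly, Format, Password
    workbook = xl.Workbooks.Open(path, 0, True, None, password)
    try:
        return workbook.Worksheets(1).Range(cell_range).Value
    finally:
        workbook.Close(False)  # close without saving changes
```

If the application still misbehaves with Visible set to False, the cause is something else and this sketch will not help.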
Labels:
python
Friday, June 13, 2008
Enterprise project lessons
Being the technical lead for an enterprise project teaches you a thing or two. I was wearing those shoes recently for one of our projects, and here is what I learnt:
1. If you are going to use a framework of any sort, you had better know it inside out, left to right and top to bottom. When funky modifications are asked for, or when your application starts to slow down, you need to know exactly, or at least roughly, where to start tweaking. This is also why my partner feels that the bigger the project, the closer to the metal the framework has got to be.
2. If the scope of the project is large and you suspect you might not have enough developers, never never never choose a tool that leaves you with only one choice when it comes to developers. In our case we had to give in and go the PHP route for this particular project. My company has shunned PHP in the past, choosing not to use it for any project, but we had to bow to pressure this time because it was just too hard to find capable and dependable Python developers where we are. Once we caved in and accepted PHP, we found a pool of capable PHP programmers.
3. I can never overemphasize the next one. Never never never never underestimate the scope of a project, especially if it pays a lot. When the customer says "I just need a web page to capture data", always ask for more information. It might not be just the application that is complex; the setup itself might turn out to be the challenge. In our case, we had to deploy multiple instances of our application in a few remote places so that data entry people could key in data. At the end of the day we had to merge all the data back into a central database.
4. For any project, always look for a capable PM. It might lower your bottom line, but trust me, it's worth it. A good PM will save you 3 am calls while letting you multitask in the background and do what you do best. Fortunately for us, we have found such a candidate in my old business associate, though we cannot use him for this project for fear that our customer might freak out at seeing a new PM on the job!
5. Always have an old-school Unix or Linux hand who knows all the old text-mangling tricks loaded and ready. In this case it was my partner. Even when doing seemingly unrelated stuff like web applications, never underestimate the usefulness of these guys. We had to move about 50,000 records trapped in Excel into the database and his skills proved invaluable! (Not that I am that far behind now, but I had to concentrate on getting the web end up.)
In the end though, through all the late nights and sweat, this project has taught me invaluable lessons in project management as well as resource planning.
Sunday, June 8, 2008
python unicode sucks
I have had some bad experiences with Python unicode on my current project, and suffice to say I am a tad nervous about using Python for my next project. The unicode support is just horrendous and very limited. I tried googling for unicode+ and nearly every time it came back with multiple hits! I read about a version of Python in CVS that has unicode support built in, but then I noticed the word 'cvs' **shudder**. My experience might not be that complete, so I would appreciate input from other senior snake wranglers out there.
Thursday, June 5, 2008
django pluggables

I have struck a goldmine! Bored today with another Plone innards hacking session, I began hunting around on the internet for a Django plugins site. I came across this site called http://www.djangoplugables.com/ ! This is exactly what I was looking for, with a cute little plug-in socket as a logo! I am astounded by the speed at which Django has grown!
I have coded some applications with Django in the past and I can say I like the feel of using it. It has what I need in a web application framework and is closer to the Python metal than Plone or Zope will ever hope to be. Don't get me wrong, I still have a soft spot for Zope, and I truly hope that Zope 3's modularity will be more elegant than the monolithic monstrosity that Zope 2.x is.
Looking at the pluggables site today has given me much more confidence in adopting Django as one of the stable technologies for delivering web applications in my company's offerings. Having used both, I can attest to why a senior Python programmer would love Django. I hope, though, that the Django community gets it right and does not start down the path of sticking stuff together to create another Zope 2.x!
Wednesday, June 4, 2008
index vs. metadata
Newbies hacking up scripts for the first time in a Zope or Plone environment might be confused by some of the terminology. Heck, I know I was! One thing that took me some time to grok, not to mention numerous emails bugging people on the mailing list, is the difference between portal catalog indexes and metadata. I will just say it in my own words here as I understand it now.
Indexes are keywords you add to the portal catalog so that you can build queries on them, e.g. query['Title'] = 'lowkster'. Here, 'Title' would be an index. The easiest way to see all the indexes in your portal_catalog is to use the ZMI. Go to the root of your site in the ZMI and then open your portal catalog using the link in the left-hand menu. There you will find a tab called 'Indexes', which lists all the indexes in your portal's catalog. You can also add a new index to your portal catalog from this page.
Metadata, on the other hand, is the stuff that you want to show in your query results. For example, you query your portal catalog and each result carries a metadata value called 'Books'. You can then display that value in your page template straight from the result. To set one of the fields in your content type as a metadata field, you have to add the variable 'isMetaData=True' to your field definition. You can also add new metadata to the portal catalog using the ZMI.
Anyway, forgive me if my way of explaining indexes and metadata is wrong. This is just my way of understanding the difference between the two. Feel free to correct me.
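The distinction can be sketched with a toy catalog in plain Python. This is not the Zope or Plone API: the ToyCatalog class, its methods, and the sample 'Body' field are all hypothetical. It only mirrors the idea above, that indexes are what you may search on and metadata is what comes back with each result.

```python
class ToyCatalog:
    """Toy model only: indexes decide what is searchable, metadata what is returned."""

    def __init__(self, indexes, metadata):
        self.indexes = tuple(indexes)    # names you can query on
        self.metadata = tuple(metadata)  # names copied into every result
        self._entries = []

    def add(self, obj):
        """Record the indexed values and the metadata values of one object."""
        indexed = {name: obj.get(name) for name in self.indexes}
        result = {name: obj.get(name) for name in self.metadata}
        self._entries.append((indexed, result))

    def search(self, **query):
        """Return the metadata of every entry matching all query terms."""
        unknown = [name for name in query if name not in self.indexes]
        if unknown:
            # A choice made for this toy: refuse to query on a non-index.
            raise KeyError("not an index: %s" % ", ".join(unknown))
        return [
            result
            for indexed, result in self._entries
            if all(indexed[name] == value for name, value in query.items())
        ]


catalog = ToyCatalog(indexes=["Title"], metadata=["Books"])
catalog.add({"Title": "lowkster", "Books": 3, "Body": "not catalogued"})
results = catalog.search(Title="lowkster")
```

Here 'Title' can be queried but is not part of the result, 'Books' is part of the result but cannot be queried, and 'Body' is neither, so it would have to be read from the object itself.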