Sunday, November 7, 2010

CSVs, unicodes and BOMs

Nothing like a little project spec to make the ugliness start climbing out of the woodwork. I was tasked with writing a parser to massage some data destined for import into a database. As I worked with the csv library, a few things became painfully evident to me.

There is no support for unicode in the csv module, and as I gathered from here, it doesn't look as though there will be any time in the near future ...

Just tinkered around a little with the neighbor over the fence to see what they offered ... here are both of them side by side ...

>>> from csv import reader
>>> data = reader(open("data.csv","r"))
>>> for item in data:
...     print item

 'NZ(\xe8\xaf\xba\xe7\xbb\xb4\xe4\xbf\xa1) >Food(\xe9\xa3\x9f\xe5\x93\x81\xe8\xa1\x8c\xe4\xb8\x9a\xe7\xbb\x84)', '\xe8\xaf\xba\xe7\xbb\xb4\xe4\xbf\xa1\xe9\xa3\x9f\xe5\x93\x81\xe8\xa1\x8c\xe4\xb8\x9a\xe7\xbb\x84\xe6\x9c\x80\xe7\xbb\x88\xe7\x94\xa8\xe6\x88\xb7', 'TRUE']                          
['NZ', 'NZ(\xe8\xaf\xba\xe7\xbb\xb4\xe4\xbf\xa1) >Technical(\xe5\xb7\xa5\xe4\xb8\x9a\xe8\xa1\x8c\xe4\xb8\x9a\xe7\xbb\x84)', '\xe8\xaf\xba\xe7\xbb\xb4\xe4\xbf\xa1\xe5\xb7\xa5\xe4\xb8\x9a\xe8\xa1\x8c\xe4\xb8\x9a\xe7\xbb\x84\xe6\x9c\x80\xe7\xbb\x88\xe7\x94\xa8\xe6\x88\xb7', 'TRUE']                     
['NZ', 'NZ(\xe8\xaf\xba\xe7\xbb\xb4\xe4\xbf

irb(main):002:0> require 'csv'                                                                      
=> true                                                                                             
irb(main):003:0>'data.csv','r') do |row|                                     
irb(main):004:1* puts row                                                                           
irb(main):005:1> end
诺维信) >Technical(工业行业组) >Starch(淀粉糖行业)                                               

Seems like Ruby has got unicode out of the box ... hmm, I wonder how much effort it would take to have that unicode support in Python's csv module? Those reading my last 2 posts are probably wondering if I am pro-Ruby or something. I am not. I love Python and use it for most of my tasks, and I love the recent developments in it ... it's just that there are a few "nigglies" with it. I want it to improve beyond these few little bumps.
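For what it's worth, here is a rough sketch of the manual workaround: decode each cell after the reader hands back its byte strings. It assumes the file is UTF-8 (the byte sequences in the output above look like UTF-8), and decode_row is just a name I made up for illustration, not something from the csv module.

```python
def decode_row(row, encoding="utf-8"):
    """Turn a list of byte-string cells into a list of unicode strings.

    Assumption: every cell is a byte string in the given encoding,
    which is what Python 2's csv reader returns for a UTF-8 file.
    """
    return [cell.decode(encoding) for cell in row]


# Two cells written out by hand as bytes for illustration.
sample = [b"NZ", b"\xe8\xaf\xba\xe7\xbb\xb4\xe4\xbf\xa1"]
decoded = decode_row(sample)
```

It is not much code, but you have to remember to call it on every row, which is exactly the kind of thing I would like done for me.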

That being said, I love the functionality I get from DictReader in Python's csv module, as it lets me address my columns by name. That is useful in situations where you have to reorder the columns. It seems you can do that with Ruby too, with a little work.
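To show what I mean by addressing columns by name, here is a small sketch of reordering with DictReader. The column names and rows are made up for illustration (they are not the real headers of my data), and reorder is a hypothetical helper.

```python
import csv

# Hypothetical sample lines standing in for a real CSV file.
lines = [
    "country,group,active",
    "NZ,Food,TRUE",
    "NZ,Technical,TRUE",
]


def reorder(rows, columns):
    """Yield each row's values in the requested column order."""
    # DictReader takes the first line as the header and maps names to values.
    for row in csv.DictReader(rows):
        yield [row[name] for name in columns]


reordered = list(reorder(lines, ["active", "country", "group"]))
```

Since the lookup is by name, the order of the columns in the source file stops mattering.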

The next one that got my goat was this thing about BOMs, or Byte Order Marks: fecal matter left by Excel on the headers. Not really a problem with Python but rather with Excel; still, I had to deal with it. I ended up sanitizing my headers and stripping off the damn BOM. There was a library to do that, but I thought that was a bit of overkill.
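The stripping itself can be as small as the sketch below. It assumes the BOM is the UTF-8 one (the bytes EF BB BF) sitting at the very start of the first header; a file saved as UTF-16 would carry a different mark. strip_bom is a hypothetical helper name, and this is not necessarily the exact code I ended up with.

```python
import codecs


def strip_bom(raw):
    """Return the byte string without a leading UTF-8 BOM, if there is one."""
    # Assumption: only the UTF-8 BOM is handled here.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# A made-up first header cell with a BOM stuck on the front.
dirty_header = codecs.BOM_UTF8 + b"country"
clean_header = strip_bom(dirty_header)
```

Only the first header cell needs this treatment, since the BOM appears once at the start of the file.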

Anyway back to the grindstone ....


Unknown said...

Can't you just cast the value to unicode?
Wrap the item in unicode(value).

lowkster said...

sweemeng: You can, but wouldn't it be nice if it was all done in the background?

m said...

use py3k?

lowkster said...

m: unfortunately not yet.

Paul said...

You want if you're reading files and want Unicode to come out of the read* methods.