There is no support for Unicode in Python's csv module, and as I gathered from here, it doesn't look as though there will be any time in the near future ...
Just tinkered around a little with the neighbor over the fence to see what they offered ... here are both of them side by side ...
>>> data = reader(open("data.csv","r"))
>>> for item in data:
...     print item
...
'NZ(\xe8\xaf\xba\xe7\xbb\xb4\xe4\xbf\xa1) >Food(\xe9\xa3\x9f\xe5\x93\x81\xe8\xa1\x8c\xe4\xb8\x9a\xe7\xbb\x84)', '\xe8\xaf\xba\xe7\xbb\xb4\xe4\xbf\xa1\xe9\xa3\x9f\xe5\x93\x81\xe8\xa1\x8c\xe4\xb8\x9a\xe7\xbb\x84\xe6\x9c\x80\xe7\xbb\x88\xe7\x94\xa8\xe6\x88\xb7', 'TRUE']
['NZ', 'NZ(\xe8\xaf\xba\xe7\xbb\xb4\xe4\xbf\xa1) >Technical(\xe5\xb7\xa5\xe4\xb8\x9a\xe8\xa1\x8c\xe4\xb8\x9a\xe7\xbb\x84)', '\xe8\xaf\xba\xe7\xbb\xb4\xe4\xbf\xa1\xe5\xb7\xa5\xe4\xb8\x9a\xe8\xa1\x8c\xe4\xb8\x9a\xe7\xbb\x84\xe6\x9c\x80\xe7\xbb\x88\xe7\x94\xa8\xe6\x88\xb7', 'TRUE']
['NZ', 'NZ(\xe8\xaf\xba\xe7\xbb\xb4\xe4\xbf
irb(main):002:0> require 'csv'
=> true
irb(main):003:0> CSV.open('data.csv','r') do |row|
irb(main):004:1* puts row
irb(main):005:1> end
诺维信) >Technical(工业行业组) >Starch(淀粉糖行业)
诺维信淀粉糖行业最终用户
TRUE
...
Seems like Ruby has got Unicode out of the box ... hmm, I wonder how much effort it would take to have that Unicode support in Python? Those reading my last 2 posts are probably wondering if I am pro-Ruby or something. I am not. I love Python and use it for most of my tasks, and I love the recent developments in it ... it's just that there are a few "nigglies" with it. I want it to improve beyond these few little bumps.
That being said, I love the functionality I get when I use DictReader from Python's csv module, as it lets me address my columns by name. Useful in situations where you have to reorder the columns. Seems that you can do that too with Ruby with a little work.
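A minimal sketch of that by-name access, assuming Python 3 (where the csv module reads text directly) and made-up column names and sample data that are not from my actual file:

```python
import csv
import io

# Hypothetical sample data; the column names are invented for illustration.
SAMPLE = "region,group,active\nNZ,诺维信,TRUE\n"


def read_rows(text):
    """Return each CSV row as a dict keyed by the header names."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(SAMPLE)
for row in rows:
    # Columns are looked up by name, so their order in the file does not matter.
    print(row["group"], row["region"], row["active"])
```

If the columns get reordered in the file, the lookups by name stay the same, which is the whole appeal.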
The next one that got my goat was this thing about BOMs, or Byte Order Marks: fecal matter left by Excel on the headers. Not really a problem with Python but rather with Excel; still, I had to deal with it. Ended up sanitizing my headers and stripping off the damn BOM. There was a library to do that, but I thought that was a bit of overkill.
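Roughly what that header sanitizing can look like, as a sketch only: it assumes Python 3, and the helper name `clean_headers` and the sample data are hypothetical, not the code I actually used.

```python
import csv
import io

BOM = "\ufeff"  # the byte order mark as it appears once the text is decoded


def clean_headers(headers):
    """Strip a leading BOM and surrounding whitespace from each header name."""
    return [h.lstrip(BOM).strip() for h in headers]


# Hypothetical data standing in for an Excel export that starts with a BOM.
raw = BOM + "region,group\nNZ,Food\n"

reader = csv.reader(io.StringIO(raw))
headers = clean_headers(next(reader))
rows = [dict(zip(headers, row)) for row in reader]
print(rows)
```

When reading a real UTF-8 file in Python 3, opening it with `encoding="utf-8-sig"` drops the BOM during decoding, so the manual strip is only needed when the text has already been decoded some other way.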
Anyway back to the grindstone ....
5 comments:
can't just cast the value in unicode?
wrap the item into unicode(value)
sweemeng: You can, but wouldn't it be nice if it was all done in the background?
use py3k?
m: unfortunately not yet.
You want codecs.open if you're reading files and want Unicode to come out of the read* methods.