Sunday, June 8, 2008

python unicode sucks

I have had some bad experiences with python unicode on my current project, and suffice to say I am a tad nervous about using python for my next project. The unicode support is just horrendous and is very limited. I tried googling for unicode+ and nearly every time it came back with multiple hits! I read about a version of python in cvs that has unicode support built in, but then I noticed the word 'cvs' **shudder**. My experiences might not be that complete, so I would appreciate input from other senior snake wranglers out there.

13 comments:

ajung said...

What a pointless blog posting. You're ranting about unicode and its poor functionality in Python without providing facts... Sorry, but the content and value of this posting is close to zero.

Anonymous said...

I have a completely different experience. Python has the best unicode support I have seen so far. The only problem I have met so far is that python 2.3 does not support some Japanese and Chinese encodings (big5 most probably), but python 2.5 does not have this problem. I'm not sure what you are missing, but there is nothing exotic about python:

1. 'string'.decode('utf-8') - you will get a unicode string

2. u'string' - already a unicode string

3. u'string'.encode('utf-8') - the unicode string will be encoded as a utf-8 string.

4. You might need to add some indication about the file encoding at the beginning of the file. E.g.:

# -*- coding: utf-8 -*-
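The four points above can be sketched roughly as follows. One assumption to flag: this is written in Python 3 syntax so that it runs on a current interpreter, where the bytestring is spelled `b'...'` (type `bytes`) and the unicode string is plain `str`. In the Python 2.x discussed here, the spellings are the ones listed above.

```python
# -*- coding: utf-8 -*-
# Assumption: Python 3. bytes plays the role of the Python 2 "str",
# and str plays the role of the Python 2 "unicode".

raw = b'caf\xc3\xa9'                # UTF-8 encoded bytes for "cafe" with an acute e

text = raw.decode('utf-8')          # 1. bytestring -> unicode string
literal = u'caf\u00e9'              # 2. already a unicode string (the u prefix is optional in Python 3)
encoded = literal.encode('utf-8')   # 3. unicode string -> UTF-8 bytestring

print(text == literal)              # both hold the same code points
print(encoded == raw)               # encoding gives back the original bytes
```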

Hope that helps ;-) Please feel free to e-mail me at dalius at sandbox.lt if you have any questions.

Anonymous said...

Unicode support in Python can be confusing, but it's all there and it's very full featured. The single most important thing to understand is the difference between a regular string and a unicode string. What really helped me was that I stopped thinking about regular strings as strings at all, and instead started calling them "bytestrings" in my head. A bytestring consists of bytes; a unicode string consists of codepoints. To turn a unicode string into a bytestring you .encode() it with an encoding (just stick to utf8). To turn a utf8 bytestring into a unicode string you .decode() it.
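A small sketch of the bytes-versus-codepoints distinction, assuming Python 3 (where `bytes` is the bytestring and `str` is the unicode string). The sample text is just an illustration:

```python
# Assumption: Python 3. The sample string is arbitrary.
text = 'na\u00efve \u65e5\u672c'    # 8 code points: n a i-diaeresis v e space + 2 CJK characters
data = text.encode('utf-8')         # the same text as a UTF-8 bytestring

print(len(text))    # counts code points
print(len(data))    # counts bytes: non-ASCII characters take 2 or 3 bytes each

# Indexing follows the same split: characters on one side, raw byte values on the other.
third_char = text[2]    # a one-character unicode string
third_byte = data[2]    # an integer byte value, the first half of a 2-byte sequence
```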

This is the tutorial I found most useful:

http://evanjones.ca/python-utf8.html

Anonymous said...

What is limited about it?

The confusing thing is that most people tend to think of unicode as an encoding, and ascii as something normal. Python does not. In Python unicode is normal, and ascii is an encoding.

Hence, to convert TO unicode you *decode* and to convert FROM unicode you *encode*. Most people get that wrong in the beginning.

After that, unicode support in python is very easy to use, and I have had no trouble at all as long as I remember to decode early and encode late.
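To make the direction concrete, here is a minimal sketch (Python 3 syntax assumed) in which ASCII is treated as just one encoding among several. The variable names are illustrative only:

```python
# Assumption: Python 3, where str is the unicode type.
text = 'r\u00e9sum\u00e9'            # a unicode string containing two non-ASCII characters

# FROM unicode you encode; the same text yields different bytes per encoding.
as_utf8 = text.encode('utf-8')
as_latin1 = text.encode('latin-1')

# ASCII is an encoding too, and it simply cannot represent this text.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# TO unicode you decode, naming the encoding the bytes were written in.
round_trip = as_latin1.decode('latin-1')
```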

stan said...

The "CVS" thing you stumbled on might be Python 3.0. As was mentioned by the other commenters, Python 2.x already has unicode support, but it is a separate type from the standard string type. In Python 3.0, the default string type is unicode, simplifying the problem. Python 3.0, however, is a clean, backward-incompatible break from Python 2.x, so you should not use it for a while (and certainly not without reading up on it in detail first).

And, the Python code repository has migrated from CVS to Subversion. :)

Simon said...

The default character encoding in Python 2.x is ASCII. This means that when your application is interacting with the world, you need to make sure that the text is converted to UTF-8.
In Python 3, the default character encoding will be UTF-8.

For now, you can change the site-wide encoding from ASCII to UTF-8. This might break something, so if you see something strange it's nice to report it.

The file is /usr/lib/python2.5/site.py, look into def setencoding().
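Before editing site.py, it may be worth checking what the interpreter currently reports and whether passing the encoding explicitly is enough. A hedged sketch, written for Python 3 (where the default is already UTF-8 and the `setencoding()` hook mentioned above is not part of site.py):

```python
import sys

# Reports the interpreter-wide default used for implicit conversions.
default = sys.getdefaultencoding()
print(default)

# Naming the encoding at each boundary avoids depending on that default at all.
text = 'sm\u00f8rrebr\u00f8d'
payload = text.encode('utf-8')       # explicit, regardless of the default
restored = payload.decode('utf-8')
```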

Kumar McMillan said...

I agree, Unicode in Python does "suck" in that it's confusing and cumbersome to deal with. However, it lets you work in Unicode just fine once you get it. And Python 3.0 will fix most of the suck.

I put a presentation together recently to help developers (mainly my colleagues at work) understand all the major nuances and gotchas of Unicode in Python without hitting them over the head with all the technical details:

http://farmdev.com/talks/unicode/

Larry said...

I too had Unicode problems but it was *not* Python. I was using SQLAlchemy with sqlite3 and the hashes of my data before and after entry into the DB didn't agree. As it turned out, all my ASCII/utf-8 strings got converted to Unicode upon entry into the DB.

I found that a simple str(x) converted me back to utf-8 until I discovered how to tell sqlite3 to behave, i.e. to disable automatic Unicode conversion.
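The effect can be reproduced with the standard library's sqlite3 module alone. This is a sketch under assumptions: it uses Python 3 and plain sqlite3 rather than SQLAlchemy, the table and column names are made up, and `text_factory` is the connection attribute that controls what type TEXT values come back as.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE notes (body TEXT)')   # hypothetical table

original = 'caf\u00e9'.encode('utf-8')           # the UTF-8 bytes we hashed beforehand
conn.execute('INSERT INTO notes VALUES (?)', (original.decode('utf-8'),))

# Default behaviour: TEXT columns come back as unicode strings, not bytes.
as_text = conn.execute('SELECT body FROM notes').fetchone()[0]

# Ask sqlite3 to hand back the raw UTF-8 bytes instead.
conn.text_factory = bytes
as_bytes = conn.execute('SELECT body FROM notes').fetchone()[0]

hash_before = hashlib.sha256(original).hexdigest()
hash_after = hashlib.sha256(as_bytes).hexdigest()
print(hash_before == hash_after)

conn.close()
```

Hashing the encoded form on both sides (or the decoded form on both sides) is what keeps the digests comparable.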

Unknown said...

For me, we should consider the following situations when using unicode:
a) The encoding of the source program file.
b) The encoding of an input flat file read by a program and encoding of the output file. I see that some files have ascii and utf-8 encoding at the same time.
c) The encoding of a database.
d) In Windows we should consider the different encodings of a file (program or data) under the DOS command prompt, which is cp850, and the same objects under Windows, which is cp1252 for South America.
All of these generate confusion.
What do you think?
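Points (b) and (d) can be illustrated with a short sketch. Assumptions: Python 3, cp850 for the console and cp1252 for Windows applications (chosen for illustration), and a throwaway file name.

```python
import os
import tempfile

text = 'Se\u00f1or Jos\u00e9'

# The same text has different byte representations in the two code pages.
console_bytes = text.encode('cp850')
windows_bytes = text.encode('cp1252')

# Decoding with the wrong code page does not fail loudly; it yields wrong characters.
misread = console_bytes.decode('cp1252', errors='replace')

# For files, stating the encoding on both write and read removes the guesswork.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'names.txt')        # hypothetical file name
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    with open(path, 'r', encoding='utf-8') as f:
        restored = f.read()
```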

ajung said...

The problem with unicode in Python 2.X is not Python but the developers. Lots of developers neither know what unicode means nor how to deal with it. In addition, a lot of people think that UTF8 and unicode are the same (which is obviously not the case).
Lots of people perform internal processing on utf-8 encoded strings and think that they are working with unicode (no, they aren't).

You can work easily and successfully with unicode within Python if you use unicode *strings* all over within your model and perform operations based on unicode strings. Conversion to some encoding takes place only on the input/output side of the module.
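A minimal sketch of why processing utf-8 encoded bytes is not the same as processing unicode, assuming Python 3 (`bytes` for the encoded form, `str` for unicode):

```python
# Assumption: Python 3.
text = 'M\u00fcnchen'            # 7 code points
data = text.encode('utf-8')      # 8 bytes, because the u-umlaut takes two

# Slicing the unicode string keeps whole characters.
good = text[:2]

# Slicing the bytes can cut a character in half, leaving invalid UTF-8.
cut = data[:2]
try:
    cut.decode('utf-8')
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

# Case conversion on bytes only touches ASCII letters; on unicode it handles all of them.
upper_text = text.upper()
upper_bytes = data.upper()
```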

Unknown said...

Your problem is not with Python's unicode support, but with the fact that you are mixing encoded strings in different encodings and/or unicode strings. That is a recipe for disaster in any language.

In general: when your application receives text from the outside world, *immediately* decode it to unicode strings. At that point you *should* still have access to the encoding information (e.g. from the browser, the imported document, or the UI), which you won't have later.

Encoding text is *only* done just before sending it somewhere outside the control of your application again (again, a browser, a document, the UI) and *only* at the last moment.

At any other time, your strings should be unicode. (Or ascii, if they don't contain any non-ascii characters.)
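The rule can be sketched as three small functions. Everything here is hypothetical and for illustration only (the function names, the latin-1 input, the utf-8 output), and it assumes Python 3:

```python
def read_request(raw, declared_encoding):
    """Boundary in: decode immediately, while the encoding is still known."""
    return raw.decode(declared_encoding)


def shout(text):
    """Internal logic: sees unicode strings only, never encoded bytes."""
    return text.upper() + '!'


def write_response(text, target_encoding='utf-8'):
    """Boundary out: encode at the last moment, just before the text leaves."""
    return text.encode(target_encoding)


# Bytes arrive in one encoding and leave in another; the middle never has to care.
incoming = 'se\u00f1or'.encode('latin-1')
response = write_response(shout(read_request(incoming, 'latin-1')))
```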

If you deviate from this, you *will* run into trouble, and the problem is that if you're developing an English-only application, this might be later rather than sooner. But run into it you *will*, so later is worse than sooner.

Arguably, a statically typed language with explicit type declarations and different types for unicode strings and encoded strings could handle this even better in that it could throw the errors earlier. Python has no way of telling whether you *meant* to use a unicode string but didn't.
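For comparison, keeping the two kinds of string as strictly separate types can be tried out in Python 3, which is an assumption beyond the 2.x discussed here. The check happens at run time rather than at compile time, but the mix-up is at least reported where it occurs:

```python
# Assumption: Python 3, where str (unicode) and bytes do not mix implicitly.
try:
    combined = 'abc' + b'def'
    mixed_ok = True
except TypeError:
    mixed_ok = False

# An explicit decode states the intent and makes the operation legal.
combined = 'abc' + b'def'.decode('ascii')
```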

Brian McKendrick said...

My favorites are the Python 'apologists' who staunchly defend Python and its horrid implementation of Unicode - best described as broken.

What's terrible, and the biggest problem for me, is how many of the core/standard libs still choke when passed a unicode string rather than ascii. Other languages have handled this much more effectively - most of the time, you don't even have to think about it.

Anonymous said...

I have to agree with bhmkendrick, he is completely correct except in everything. :)

How many languages can you give examples of that have gone from non-unicode to unicode support completely backwards-compatibly and transparently?

Remember, you said *most*. :)

Sure, it's a bit embarrassing that not all of the standard library supports unicode objects. I'll give you that.