Python, Japanese display, and character encodings, part 2

This part is the same as last time.

(Tested on Windows XP, with Python installed from the "Python 2.5.1 Windows installer".)

First, save the code you write as UTF-8. Then put the following at the top of that file (it has to be on the first or second line).
# -*- coding: utf-8 -*-
Which editor do you use? Hidemaru, Notepad, vim, Meadow, or maybe Python Scripter or eclipse? Whichever it is, set the encoding to UTF-8 when you save the file.

Today I'll poke at it a little more.

# -*- coding: utf-8 -*-

jstr = "パイソン"
kstr = "パイソン"

print jstr
print kstr

if jstr == kstr:
    print "same"
else:
    print "not same"

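The output is garbled, but the two are same. That is to be expected, I suppose: both are plain byte strings holding the same UTF-8 bytes.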
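Why garbled, yet still same? My guess is that the Japanese Windows console expects cp932 (Shift_JIS) while the strings hold UTF-8 bytes. Below is a minimal sketch of that misreading. The names text, utf8_bytes, other_bytes and garbled are my own, and the cp932 console is an assumption about my setup. It is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: reproduce the garbling by reading UTF-8 bytes as if they were cp932.
# Assumption: the console uses cp932, as Japanese Windows normally does.

text = u"パイソン"

# In a UTF-8 source file, a plain "..." literal in Python 2 holds these bytes.
utf8_bytes = text.encode('utf-8')
other_bytes = u"パイソン".encode('utf-8')

# What a cp932 console would make of those bytes ('replace' avoids errors).
garbled = utf8_bytes.decode('cp932', 'replace')

print(garbled != text)            # True: the console shows different characters
print(utf8_bytes == other_bytes)  # True: identical bytes still compare equal
```

So the garbling is a display problem only. The comparison never looks at how the bytes are drawn on screen.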

Now make jstr unicode.

# -*- coding: utf-8 -*-

jstr = u"パイソン"
kstr = "パイソン"

print jstr
print kstr

if jstr == kstr:
    print "same"
else:
    print "not same"

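jstr is displayed without garbling. We did this last time too.
And the two are not same. (Python 2.5 also prints a UnicodeWarning for this comparison and treats the two as unequal.)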
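As far as I understand it, the reason is that jstr now holds characters while kstr still holds bytes. A minimal sketch follows. The names text and utf8_bytes are mine, and utf8_bytes stands in for the plain "パイソン" literal so that the snippet should behave the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: a unicode string and its UTF-8 bytes are different objects.

text = u"パイソン"                 # unicode string
utf8_bytes = text.encode('utf-8')  # byte string

print(len(text))           # 4  (characters)
print(len(utf8_bytes))     # 12 (bytes: 3 per katakana character in UTF-8)
print(text == utf8_bytes)  # False
```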

Next, convert kstr to unicode and compare.

# -*- coding: utf-8 -*-

jstr = u"パイソン"
kstr = "パイソン"

kstr = unicode(kstr, 'utf-8')  # convert to unicode

print jstr
print kstr

if jstr == kstr:
    print "same"
else:
    print "not same"

Both are displayed without garbling, and the two are same.
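If this comparison comes up often, the conversion could be wrapped in a small helper. This is only a sketch: to_unicode and same_text are hypothetical names of my own, not library functions, and assuming UTF-8 as the default encoding is my choice. It should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers (not part of the Python library): to_unicode, same_text.

UNICODE_TYPE = type(u"")  # unicode on Python 2, str on Python 3


def to_unicode(value, encoding='utf-8'):
    """Return value as a unicode string, decoding byte strings with encoding."""
    if isinstance(value, UNICODE_TYPE):
        return value
    return value.decode(encoding)


def same_text(a, b, encoding='utf-8'):
    """Compare two strings after converting both to unicode."""
    return to_unicode(a, encoding) == to_unicode(b, encoding)


jstr = u"パイソン"
kstr = jstr.encode('utf-8')  # stands in for the plain "パイソン" literal

print(same_text(jstr, kstr))  # True
```

The point of the design is to pick one side (unicode here) and convert everything to it before comparing, instead of letting Python guess.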

Now go the other way: encode jstr from unicode to UTF-8 and compare.

# -*- coding: utf-8 -*-

jstr = u"パイソン"
kstr = "パイソン"

jstr = jstr.encode('utf-8')

print jstr
print kstr

if jstr == kstr:
    print "same"
else:
    print "not same"

Both are garbled, and the two are same.

Summary.

My own assumption that "UTF-8 is unicode, right?" was what kept me from getting along with Python.
UTF-8 is one of the encodings (encoding schemes) of Unicode.

# -*- coding: utf-8 -*-

jstr = u"日本語"    # this is unicode
kstr = "日本語"     # this is a UTF-8 byte string

lstr = unicode(kstr, 'utf-8')  # lstr is unicode
mstr = kstr.decode('utf-8')    # mstr is unicode
nstr = jstr.encode('utf-8')    # nstr is a UTF-8 byte string
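To make the "UTF-8 is just one encoding of Unicode" point concrete, here is a small sketch that encodes the same unicode string in several ways. The list of codecs is my own pick for illustration, and byte_lengths is a name I made up. It should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: one unicode string, several byte representations.

text = u"日本語"

byte_lengths = {}
for name in ('utf-8', 'utf-16-le', 'cp932', 'euc_jp'):
    encoded = text.encode(name)
    # Decoding with the same codec gives back the original unicode string.
    assert encoded.decode(name) == text
    byte_lengths[name] = len(encoded)
    print("%s: %d bytes" % (name, len(encoded)))
```

The bytes differ from codec to codec (9 bytes in UTF-8, 6 in the others), but the unicode string they decode to is the same every time.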