Python regex not matching decoded unicode string

March 15, 2011

I've compiled a regex using:

import re
number_re = re.compile(ur'(?<![-_\.])\b([0-9]+|[0-9]+[0-9-_\.]*[0-9]+)\b(?![-_\.])', re.UNICODE)

and it manages to match 1990-1991 in the following string:

mystring = 'フットボールリーグ1990-1991'
match = number_re.search(mystring)
>>> <_sre.SRE_Match object at 0x25e1918>
match.group()
>>> '1990-1991'

But when the string is decoded (or when it's passed to a function):

mystring = 'フットボールリーグ1990-1991'.decode('utf-8')
>>> u'\u30d5\u30c3\u30c8\u30dc\u30fc\u30eb\u30ea\u30fc\u30b01990-1991'
match = number_re.search(mystring)

the match no longer occurs. I'm guessing it has to do with the word boundaries '\b' not matching because the text looks like one continuous string, but I'm not sure.

I think I've met all the unicode requirements (compiled with the 're.UNICODE' flag, and put 'ur' in front of the regex string). The last thing I'm going to try is the Python regex library, which is said to be good, but I'd like to know what's wrong with my current stuff! :)

There isn't a word boundary (\b) between the letter グ and the number 1; they are both alphanumerics. When a Unicode-aware regex is being used, this is handled correctly, hence no match. If you don't want to treat katakana and other non-ASCII letters as alphanumerics, remove the re.UNICODE flag, which controls this behaviour.
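The effect of the flag can be reproduced in a short sketch. It assumes Python 3 rather than the Python 2 of the question: there, str patterns are Unicode-aware by default, so re.ASCII plays the role of leaving out re.UNICODE. The names unicode_re and ascii_re are made up for illustration.

```python
import re

# The pattern from the question, unchanged.
PATTERN = r'(?<![-_\.])\b([0-9]+|[0-9]+[0-9-_\.]*[0-9]+)\b(?![-_\.])'

text = 'フットボールリーグ1990-1991'

# Python 3 assumption: str patterns are Unicode-aware by default.
unicode_re = re.compile(PATTERN)
# re.ASCII restricts \w and \b to ASCII, like omitting re.UNICODE in Python 2.
ascii_re = re.compile(PATTERN, re.ASCII)

unicode_match = unicode_re.search(text)  # グ is a word character: no \b before the 1
ascii_match = ascii_re.search(text)      # グ is not a word character: \b before the 1

print(unicode_match)
print(ascii_match.group() if ascii_match else None)
```

The second number, 1991, is not matched on its own in the Unicode-aware case either, because the lookbehind rejects a start position directly after the hyphen.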

When you send a byte string to a regex compiled from a Unicode string, it is automatically decoded. For some reason it seems to be decoded as ISO-8859-1 (rather than, say, sys.getdefaultencoding())... I don't know why that is, but implicit encode/decode in general is evil and should be avoided.
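One way to keep the decoding explicit is sketched here. It assumes Python 3, which refuses to mix a str pattern with bytes input instead of guessing an encoding; the pattern is simplified and the variable names are illustrative only.

```python
import re

number_re = re.compile(r'[0-9]+')    # simplified pattern, for illustration only
raw = 'リーグ1990'.encode('utf-8')    # bytes, as they might arrive from a file or socket

try:
    number_re.search(raw)            # str pattern applied to bytes input
    mixed_ok = True
except TypeError:
    mixed_ok = False                 # Python 3 rejects the mix rather than decoding implicitly

text = raw.decode('utf-8')           # decode once, explicitly, with a known encoding
match = number_re.search(text)
print(mixed_ok, match.group())
```

Decoding at the point where the bytes enter the program means the regex only ever sees text, so there is no hidden encoding step to go wrong.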

The UTF-8 byte sequence for グ, when mis-decoded as ISO-8859-1, comes out as ã[control char]°. The degree sign is not an alphanumeric, so there is a word boundary before the 1, hence the match.
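The mis-decoding can be checked by hand. This is a minimal sketch assuming Python 3, where the decode has to be spelled out explicitly; the variable names are hypothetical.

```python
import re

raw = 'グ'.encode('utf-8')              # the three UTF-8 bytes E3 82 B0
misdecoded = raw.decode('iso-8859-1')   # one character per byte: ã, a control char, °

# Which of the three resulting characters count as word characters (\w)?
word_flags = [re.match(r'\w', ch) is not None for ch in misdecoded]

# With ° directly before the digits, \b finds a boundary in front of the 1.
boundary_match = re.search(r'\b1990', misdecoded + '1990')

print(raw, word_flags, boundary_match is not None)
```

Only the first of the three characters is a word character, so the text ends in a non-word character right before the digits.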
