Python regex not matching decoded Unicode string
I've compiled a regex using:
number_re = re.compile(ur'(?<![-_\.])\b([0-9]+|[0-9]+[0-9-_\.]*[0-9]+)\b(?![-_\.])', re.UNICODE)
and it manages to match 1990-1991 in the following string:
mystring = 'フットボールリーグ1990-1991'
match = number_re.search(mystring)
>>> <_sre.SRE_Match object at 0x25e1918>
match.group()
>>> '1990-1991'
But when the string is decoded (or when it's passed in from a function):
mystring = 'フットボールリーグ1990-1991'.decode('utf-8')
>>> u'\u30d5\u30c3\u30c8\u30dc\u30fc\u30eb\u30ea\u30fc\u30b01990-1991'
match = number_re.search(mystring)
the matching no longer occurs. I'm guessing it has to do with the boundaries '\b' not matching because it looks like one continuous string, but I'm not sure.
I think I've put in all the Unicode requirements (compiled with the 're.UNICODE' flag, and put 'ur' in front of the regex string). The last thing I'm going to try is the Python regex library, which people say is good, but I'd like to know what's wrong with my current stuff! :)
On \b: there isn't a word boundary between the letter グ and the number 1, because they are both alphanumerics. When a Unicode-aware regex is being used, this is correctly handled, hence no match. If you don't want to treat katakana and other non-ASCII letters as being alphanums, remove the re.UNICODE flag, which controls this behaviour.
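To see the effect of the flag in isolation, here is a rough sketch. It is written for Python 3, not the Python 2 used in the question: there, str patterns are Unicode-aware by default and re.ASCII plays the role of leaving out re.UNICODE. The pattern is the one from the question; the variable names are only for illustration.

```python
import re

# Pattern copied from the question; only the flags differ between the two regexes.
PATTERN = r'(?<![-_\.])\b([0-9]+|[0-9]+[0-9-_\.]*[0-9]+)\b(?![-_\.])'

text = 'フットボールリーグ1990-1991'

# Python 3 default for str patterns: \w and \b follow Unicode rules,
# so グ counts as a word character and there is no boundary before the 1.
unicode_re = re.compile(PATTERN)

# re.ASCII restricts \w and \b to ASCII, so グ is a non-word character
# and a boundary exists between it and the 1.
ascii_re = re.compile(PATTERN, re.ASCII)

unicode_match = unicode_re.search(text)
ascii_match = ascii_re.search(text)

print(unicode_match)        # None
print(ascii_match.group())  # 1990-1991
```

So the Unicode-aware regex is behaving as designed; the earlier "successful" match on the byte string is the surprising case.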
When you send a byte string to a regex compiled from a Unicode string, it is automatically decoded. For some reason it seems to be decoded as ISO-8859-1 (rather than, say, sys.getdefaultencoding())... I don't know why that is, but implicit encode/decode in general is evil and to be avoided.
The UTF-8 byte sequence for グ, when mis-decoded as ISO-8859-1, comes out as ã[control char]°. The degree sign is not an alphanum, so you get a match.
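The mis-decoding can be reproduced explicitly. The sketch below is Python 3 and does the ISO-8859-1 decode by hand, so it only imitates what the implicit Python 2 conversion appears to do; the variable names are made up for the example.

```python
import re

ga = 'グ'
utf8_bytes = ga.encode('utf-8')               # the three bytes E3 82 B0
misdecoded = utf8_bytes.decode('iso-8859-1')  # 'ã', the control char U+0082, '°'

# The last mis-decoded character is the degree sign, which is not a word character.
last_char = misdecoded[-1]
is_word_char = re.match(r'\w', last_char) is not None

# With the degree sign in front of the digits, \b finds a boundary ...
boundary = re.search(r'\b1990', misdecoded + '1990')
# ... but with the real katakana letter in front, it does not.
no_boundary = re.search(r'\b1990', ga + '1990')

print(repr(misdecoded))
print(is_word_char, boundary is not None, no_boundary is not None)
```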