Matching URLs in a markup language via regex
I'm working on code that matches URLs (they don't have to be valid) in a markup language. A URL can be inserted directly or placed between [ ]. Some examples:
1-http://en.wikipedia.org/wiki/main_page
2-[http://en.wikipedia.org/wiki/main_page title]
3-[http://en.wikipedia.org/wiki/(main_page) title]
4-(http://en.wikipedia.org/wiki/main_page)
5-[http://en.wikipedia.org/wiki/main_page]
I need 3 regexes: one for URLs in brackets (no. 2, 3, 5), one for URLs not in brackets (no. 1, 4), and one that covers both. The first and second were easy, and I did them:

    notinside = r'\]\s<>"'
    notatend = r'\]\s\.:;,<>"\|\)'
    regex = r'(?P<url>http[s]?://[^%(notinside)s]*?[^%(notatend)s]' \
            r'(?=[%(notatend)s]*\'\')|http[s]?://[^%(notinside)s]*' \
            r'[^%(notatend)s])' % {'notinside': notinside, 'notatend': notatend}

But the problem begins with the third one. When a URL starts right after a parenthesis (or bracket), as in number 4, the regex shouldn't match the ")" at the end of the URL. But people do use ")" at the end of URLs, and when such a URL is put in brackets the regex must match the ")", as in number 3. I can't write two separate regexes for the third one and combine the results.
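To make the problem concrete, here is a small sketch that runs the regex above against examples 3 and 4. The names `PATTERN` and `first_url` are only for illustration and are not part of the original code.

```python
import re

notinside = r'\]\s<>"'
notatend = r'\]\s\.:;,<>"\|\)'
# Same pattern as above; adjacent string literals are joined before "%" is applied.
PATTERN = re.compile(
    r'(?P<url>http[s]?://[^%(notinside)s]*?[^%(notatend)s]'
    r'(?=[%(notatend)s]*\'\')|http[s]?://[^%(notinside)s]*'
    r'[^%(notatend)s])' % {'notinside': notinside, 'notatend': notatend}
)


def first_url(text):
    """Return the first URL the pattern finds in text, or None."""
    m = PATTERN.search(text)
    return m.group('url') if m else None


# Example 4: the trailing ")" is dropped, which is what I want.
print(first_url('(http://en.wikipedia.org/wiki/main_page)'))
# http://en.wikipedia.org/wiki/main_page

# Example 3: the ")" belongs to the URL, but it is dropped as well.
print(first_url('[http://en.wikipedia.org/wiki/(main_page) title]'))
# http://en.wikipedia.org/wiki/(main_page
```

The second result is the unwanted one: inside square brackets the closing ")" should stay part of the URL.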
Another thing: I'm going to use this in free software code, so please indicate that it's OK to publish your code under the MIT license. Thank you.
Description
This regex will:
- match URLs found inside square brackets, inside round brackets, or with no brackets
- capture each type of bracketed match in a different capture group
\[(https?:\/\/(?:(?!\]).)*)\]|\((https?:\/\/(?:(?!\)).)*)\)|(https?:\/\/(?:(?!\s|$|\z).)*)
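As a rough sketch, this is how the pattern could be used from Python, which is what the question uses. Two things here are my assumptions and not part of the expression above: I dropped the `\z` alternative (the live example is on Rubular, which is Ruby, and as far as I know not every Python version accepts `\z`), and the names `URL_RE`, `find_urls` and `sample` are made up for the example.

```python
import re

# Python adaptation of the pattern above, with the "\z" alternative removed.
URL_RE = re.compile(
    r'\[(https?:\/\/(?:(?!\]).)*)\]'     # group 1: text inside [ ]
    r'|\((https?:\/\/(?:(?!\)).)*)\)'    # group 2: text inside ( )
    r'|(https?:\/\/(?:(?!\s|$).)*)'      # group 3: bare URL
)

KINDS = {1: 'square', 2: 'round', 3: 'bare'}


def find_urls(text):
    """Return a (kind, captured_text) pair for every match in text."""
    # Each alternative has exactly one group, so lastindex identifies it.
    return [(KINDS[m.lastindex], m.group(m.lastindex))
            for m in URL_RE.finditer(text)]


sample = "\n".join([
    "1-http://1en.wikipedia.org/wiki/main_page",
    "2-[http://2en.wikipedia.org/wiki/main_page title]",
    "3-[http://3en.wikipedia.org/wiki/(main_page) title]",
    "4-(http://4en.wikipedia.org/wiki/main_page)",
    "5-[http://5en.wikipedia.org/wiki/main_page]",
])

for kind, text in find_urls(sample):
    print(kind, text)
```

Note that the square-bracket group captures everything up to the closing "]", so the title comes along with the URL in examples 2 and 3; splitting on the first space afterwards would separate them.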

Example
Live example: http://www.rubular.com/r/g7o1xdogb5
Sample text
1-http://1en.wikipedia.org/wiki/main_page
2-[http://2en.wikipedia.org/wiki/main_page title]
3-[http://3en.wikipedia.org/wiki/(main_page) title]
4-(http://4en.wikipedia.org/wiki/main_page)
5-[http://5en.wikipedia.org/wiki/main_page]

Matches
[0][0] = http://1en.wikipedia.org/wiki/main_page
[0][1] =
[0][2] =
[0][3] = http://1en.wikipedia.org/wiki/main_page

[1][0] = [http://2en.wikipedia.org/wiki/main_page title]
[1][1] = http://2en.wikipedia.org/wiki/main_page title
[1][2] =
[1][3] =

[2][0] = [http://3en.wikipedia.org/wiki/(main_page) title]
[2][1] = http://3en.wikipedia.org/wiki/(main_page) title
[2][2] =
[2][3] =

[3][0] = (http://4en.wikipedia.org/wiki/main_page)
[3][1] =
[3][2] = http://4en.wikipedia.org/wiki/main_page
[3][3] =

[4][0] = [http://5en.wikipedia.org/wiki/main_page]
[4][1] = http://5en.wikipedia.org/wiki/main_page
[4][2] =
[4][3] =

Alternative
I'm not sure how lookbehinds work in MediaWiki, but you could try this:
(?<=\[)https?:\/\/(?:(?!\]).)*(?=\])|(?<=\()https?:\/\/(?:(?!\)).)*(?=\))|https?:\/\/(?:(?!\s|$|\z).)*
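A minimal sketch of the lookaround version in Python, under the same assumptions as before: the `\z` alternative is removed, and `LOOKAROUND_RE` and `sample` are names I made up for the example. Python only supports fixed-width lookbehinds, which is fine here because each one checks a single character.

```python
import re

# The brackets are checked by lookarounds, so they are not part of the match.
LOOKAROUND_RE = re.compile(
    r'(?<=\[)https?:\/\/(?:(?!\]).)*(?=\])'     # preceded by [ and followed by ]
    r'|(?<=\()https?:\/\/(?:(?!\)).)*(?=\))'    # preceded by ( and followed by )
    r'|https?:\/\/(?:(?!\s|$).)*'               # bare URL
)

sample = "\n".join([
    "1-http://1en.wikipedia.org/wiki/main_page",
    "2-[http://2en.wikipedia.org/wiki/main_page title]",
    "3-[http://3en.wikipedia.org/wiki/(main_page) title]",
    "4-(http://4en.wikipedia.org/wiki/main_page)",
    "5-[http://5en.wikipedia.org/wiki/main_page]",
])

# With no capture groups, findall returns the whole match (group 0) each time.
matches = LOOKAROUND_RE.findall(sample)
for url in matches:
    print(url)
```

The trade-off compared with the three-group version is that you can no longer tell from the match alone which kind of bracket, if any, surrounded the URL.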

Given the same sample text, this puts all the matches into capture group 0.
Live example: http://www.rubular.com/r/2o9aebq1oz
License and free use
Stack Overflow policy says: user contributions licensed under cc-wiki with attribution required.