Matching URLs in a markup language via regex


I'm working on code that matches URLs (they don't have to be valid) in a markup language. A URL can appear bare or be placed between [ ]. Some examples:
1-http://en.wikipedia.org/wiki/main_page
2-[http://en.wikipedia.org/wiki/main_page title]
3-[http://en.wikipedia.org/wiki/(main_page) title]
4-(http://en.wikipedia.org/wiki/main_page)
5-[http://en.wikipedia.org/wiki/main_page]
I need three regexes: one for URLs in brackets (no. 2, 3, 5), one for URLs not in brackets (1, 4), and one that covers both. The first and second were easy, and I did it like this:

# Characters that may not appear inside a URL, and characters that may not end one.
notinside = r'\]\s<>"'
notatend = r'\]\s\.:;,<>"\|\)'
regex = r'(?P<url>http[s]?://[^%(notinside)s]*?[^%(notatend)s]' \
        r'(?=[%(notatend)s]*\'\')|http[s]?://[^%(notinside)s]*' \
        r'[^%(notatend)s])' % {'notinside': notinside, 'notatend': notatend}

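But the problem begins with the third one. When a URL is preceded by a parenthesis (or bracket), as in number 4, the regex shouldn't match the ")" at the end of the URL, because people often put a ")" right after a URL. When the URL is inside brackets, however, the regex must match the ")", as in number 3. I can't write two separate regexes for the third one and combine the results.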
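To make the conflict concrete, here is a small sketch that runs the regex from the question against samples 3 and 4. The helper name `first_url` is made up for illustration, and the pattern is the one quoted above with `(?P<url>...)` spelled in the form Python expects.

```python
import re

# Character sets copied from the question.
notinside = r'\]\s<>"'
notatend = r'\]\s\.:;,<>"\|\)'

# Adjacent string literals are joined first, then % fills in both sets.
pattern = re.compile(
    r'(?P<url>http[s]?://[^%(notinside)s]*?[^%(notatend)s]'
    r'(?=[%(notatend)s]*\'\')|http[s]?://[^%(notinside)s]*'
    r'[^%(notatend)s])' % {'notinside': notinside, 'notatend': notatend}
)


def first_url(text):
    """Return the first URL the pattern finds in text, or None (hypothetical helper)."""
    match = pattern.search(text)
    return match.group('url') if match else None


# Sample 4: the trailing ")" is correctly left out.
print(first_url('4-(http://en.wikipedia.org/wiki/main_page)'))
# Sample 3: the ")" that belongs to the URL is dropped as well.
print(first_url('3-[http://en.wikipedia.org/wiki/(main_page) title]'))
```

Because ")" is in `notatend`, the pattern cannot tell a closing parenthesis that wraps the URL from one that is part of it, which is why the bracketed case needs different handling.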

Another thing: I'll post this in free software code, so please indicate that it's OK to publish your code under the MIT license. Thank you.

Description

This regex will:

  • match URLs found inside square brackets, inside round brackets, or with no brackets
  • capture each type of bracketed match in a different capture group

\[(https?:\/\/(?:(?!\]).)*)\]|\((https?:\/\/(?:(?!\)).)*)\)|(https?:\/\/(?:(?!\s|$|\z).)*)


Example

Live example: http://www.rubular.com/r/g7o1xdogb5

Sample text

1-http://1en.wikipedia.org/wiki/main_page
2-[http://2en.wikipedia.org/wiki/main_page title]
3-[http://3en.wikipedia.org/wiki/(main_page) title]
4-(http://4en.wikipedia.org/wiki/main_page)
5-[http://5en.wikipedia.org/wiki/main_page]

Matches

[0][0] = http://1en.wikipedia.org/wiki/main_page
[0][1] =
[0][2] =
[0][3] = http://1en.wikipedia.org/wiki/main_page

[1][0] = [http://2en.wikipedia.org/wiki/main_page title]
[1][1] = http://2en.wikipedia.org/wiki/main_page title
[1][2] =
[1][3] =

[2][0] = [http://3en.wikipedia.org/wiki/(main_page) title]
[2][1] = http://3en.wikipedia.org/wiki/(main_page) title
[2][2] =
[2][3] =

[3][0] = (http://4en.wikipedia.org/wiki/main_page)
[3][1] =
[3][2] = http://4en.wikipedia.org/wiki/main_page
[3][3] =

[4][0] = [http://5en.wikipedia.org/wiki/main_page]
[4][1] = http://5en.wikipedia.org/wiki/main_page
[4][2] =
[4][3] =
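Since the question's code is Python, the three-group pattern could be used roughly as follows. This is a sketch, not a tested port: the names `URL_PATTERN`, `SAMPLE` and `classify_urls` are invented here, and `\z` is dropped because it is Ruby syntax that older Python versions reject (`$` already covers the end of the input).

```python
import re

# The three alternatives from the answer, one capture group each.
URL_PATTERN = re.compile(
    r'\[(https?:\/\/(?:(?!\]).)*)\]'     # group 1: URL inside [ ]
    r'|\((https?:\/\/(?:(?!\)).)*)\)'    # group 2: URL inside ( )
    r'|(https?:\/\/(?:(?!\s|$).)*)'      # group 3: bare URL
)

SAMPLE = """1-http://1en.wikipedia.org/wiki/main_page
2-[http://2en.wikipedia.org/wiki/main_page title]
3-[http://3en.wikipedia.org/wiki/(main_page) title]
4-(http://4en.wikipedia.org/wiki/main_page)
5-[http://5en.wikipedia.org/wiki/main_page]"""


def classify_urls(text):
    """Return (kind, captured_text) pairs, where kind names the group that matched."""
    results = []
    for match in URL_PATTERN.finditer(text):
        square, round_, bare = match.groups()
        if square is not None:
            results.append(('square', square))
        elif round_ is not None:
            results.append(('round', round_))
        else:
            results.append(('bare', bare))
    return results


for kind, captured in classify_urls(SAMPLE):
    print(kind, captured)
```

Note that the square-bracket group captures everything up to the closing "]", so it still includes the " title" text. If only the URL is wanted, splitting the capture on the first whitespace (for example `captured.split(None, 1)[0]`) is one possible follow-up step.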

Alternative

I'm not sure how lookbehinds work in MediaWiki, but you could try this:

(?<=\[)https?:\/\/(?:(?!\]).)*(?=\])|(?<=\()https?:\/\/(?:(?!\)).)*(?=\))|https?:\/\/(?:(?!\s|$|\z).)*


Given the same sample text, this puts all the captures into group 0, without the surrounding brackets.

Live example: http://www.rubular.com/r/2o9aebq1oz
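In Python, the lookaround version might look like the sketch below. Python's `re` module supports fixed-width lookbehinds such as `(?<=\[)`, so the pattern carries over almost unchanged. As an assumption on my part, `\z` is again left out, and `LOOKAROUND_PATTERN` and `find_urls` are illustrative names.

```python
import re

# No capture groups: the brackets are checked by lookarounds and never consumed,
# so the whole match (group 0) is the text between them.
LOOKAROUND_PATTERN = re.compile(
    r'(?<=\[)https?:\/\/(?:(?!\]).)*(?=\])'     # preceded by [ and followed by ]
    r'|(?<=\()https?:\/\/(?:(?!\)).)*(?=\))'    # preceded by ( and followed by )
    r'|https?:\/\/(?:(?!\s|$).)*'               # bare URL, up to whitespace or end
)


def find_urls(text):
    """Return every match as a plain string (findall yields group 0 when there are no groups)."""
    return LOOKAROUND_PATTERN.findall(text)


sample = """1-http://1en.wikipedia.org/wiki/main_page
2-[http://2en.wikipedia.org/wiki/main_page title]
3-[http://3en.wikipedia.org/wiki/(main_page) title]
4-(http://4en.wikipedia.org/wiki/main_page)
5-[http://5en.wikipedia.org/wiki/main_page]"""

for url in find_urls(sample):
    print(url)
```

The trade-off against the three-group version is that the result no longer says which kind of bracket surrounded the URL.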

License and free use

Stack Overflow policy says: user contributions are licensed under cc-wiki with attribution required.

