perl - Regex to capture <img> tags fails when "src" value is different


I use a regex to extract <img src="img.jpg"> tags.

Here is the regex:

my @accept = $message_body =~ /<img src=\"\S*\">/gi;

Now the regex fails when the img tag looks like this: <img src="cid:img.jpg">

Can anyone tell me why?

Description

The greediness of \"\S*\" means it will match as many non-space characters as possible, quote characters included, before the last " it can reach in the string. Change it to \".*?\" so that it matches only the characters up to the next ".
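To see the difference the quantifier makes, here is a minimal Perl sketch. The input string is a contrived, hypothetical one (it is not taken from the question), chosen so that a second "> follows the tag with no whitespace in between. The variable names are illustrative as well.

```perl
use strict;
use warnings;

# Hypothetical input: a later "> follows the tag with no whitespace in between.
my $html = '<img src="cid:img.jpg">"caption">';

# Greedy \S* grabs every non-space character, then backs off only as far
# as the LAST "> it can still reach.
my ($greedy) = $html =~ /<img src="(\S*)">/i;

# Lazy .*? grows one character at a time and stops at the FIRST ">.
my ($lazy) = $html =~ /<img src="(.*?)">/i;

print "greedy: $greedy\n";   # greedy: cid:img.jpg">"caption
print "lazy:   $lazy\n";     # lazy:   cid:img.jpg
```

A negated class such as [^"]* is another common way to stop at the closing quote, and it is what the overhauled expression uses for the src value.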

I would overhaul the expression to avoid some other difficult HTML edge cases.

This expression will:

  • match img tags that have a src attribute
  • capture the src attribute value
  • avoid messy HTML edge cases, such as:
    • a > or something that looks like an attribute inside an embedded JavaScript function
    • attributes whose names merely end in src, like hrefsrc="somevalue"
  • although it is not needed for this problem because you're only looking for a single attribute, the (?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"]) construct allows multiple attributes to appear in any order inside the img tag.

<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"])(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*\s?>


Example

Live example: http://www.rubular.com/r/brmdy0ya0s

Sample text

Note how the second image tag has several of the difficult edge cases.

<img src="cid:img.jpg"> <img hrefsrc="notme.jpg" onmouseover=' src="notthemeeither.jpg" ; if ( 6 > x ) { funrotator(src) ; } ; ' src="cid:difficulttofind.jpg"> 

Matches

[0][0] = <img src="cid:img.jpg">
[0][1] = cid:img.jpg

[1][0] = <img hrefsrc="notme.jpg" onmouseover=' src="notthemeeither.jpg" ; if ( 6 > x ) { funrotator(src) ; } ; ' src="cid:difficulttofind.jpg">
[1][1] = cid:difficulttofind.jpg
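The same result can be reproduced from Perl. The following is a sketch rather than a drop-in solution: it applies the expression to the sample text and reuses the $message_body and @accept names from the question. The $img_re name, the /x layout, and the comments are additions for illustration.

```perl
use strict;
use warnings;

# Sample text from above, kept literal with a single-quoted heredoc.
my $message_body = <<'HTML';
<img src="cid:img.jpg"> <img hrefsrc="notme.jpg" onmouseover=' src="notthemeeither.jpg" ; if ( 6 > x ) { funrotator(src) ; } ; ' src="cid:difficulttofind.jpg">
HTML

# The expression from the answer, spread out with /x so each part can be
# commented. $img_re is an illustrative name.
my $img_re = qr{
    <img\b(?=\s)                                    # tag name, then whitespace
    (?=                                             # look ahead without consuming
        (?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*? # skip other attributes and values
        \ssrc=['"]([^"]*)['"]                       # capture the real src value
    )
    (?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*          # consume the rest of the tag
    \s?>                                            # closing bracket
}xi;

# In list context, a /g match with one capture group returns every capture.
my @accept = $message_body =~ /$img_re/g;

print "$_\n" for @accept;
# cid:img.jpg
# cid:difficulttofind.jpg
```

Because the src value is taken from capture group 1, @accept holds only the attribute values, not the full tags. For anything beyond a quick extraction, a real HTML parser is usually the more robust choice.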
