perl - Regex to capture <img> tags fails when "src" value is different
I use a regex to extract <img src="img.jpg"> tags.
Here is the regex:
my @accept = $message_body =~ /<img src=\"\S*\">/gi;
Now the regex fails when the img tag looks like this: <img src="cid:img.jpg">
Can anyone tell me why?
Description
The greediness of \"\S*\" means it will match as many non-space characters as possible before the last " that appears in the string. Change it to \".*?\" to match all characters up to the next ".
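As a rough sketch of the difference in Perl (the message body below is made up for illustration and is not from the question), the greedy and non-greedy forms can be compared side by side. The capture-group variant at the end is an optional extra, assuming you want only the src values rather than the whole tags.

```perl
use strict;
use warnings;

# Hypothetical message body, for illustration only.
my $message_body = '<p><img src="img.jpg"> and <img src="cid:img.jpg"></p>';

# Greedy .* runs to the last "> in the string, so both tags end up in one match.
my @greedy = $message_body =~ /<img src=".*">/gi;

# Non-greedy .*? stops at the first "> it can reach, so each tag matches separately.
my @accept = $message_body =~ /<img src=".*?">/gi;

# With a capture group, /g in list context returns only the captured src values.
my @src = $message_body =~ /<img src="(.*?)">/gi;

print scalar(@greedy), " greedy match\n";
print "$_\n" for @accept;
print "$_\n" for @src;
```

This simple form still assumes src is the first and only attribute, which is what the overhauled expression is meant to relax.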
I would overhaul the expression to avoid other difficult HTML edge cases.
This expression will:
- match all img tags that have a src attribute
- capture the src attribute value
- avoid messy HTML edge cases like:
  - attribute values that contain a > or what looks like the attribute inside an embedded JavaScript function
  - attributes that end in src, like hrefsrc="somevalue"
- although not used in this problem because you're only looking for a single attribute, the
  (?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"])
  construct allows multiple attributes to appear in any order inside the img tag.
<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"])(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*\s?>
Example
Live example: http://www.rubular.com/r/brmdy0ya0s
Sample text
Note how the second image tag has some of the difficult edge cases.
<img src="cid:img.jpg"> <img hrefsrc="notme.jpg" onmouseover=' src="notthemeeither.jpg" ; if ( 6 > x ) { funrotator(src) ; } ; ' src="cid:difficulttofind.jpg">
Matches
[0][0] = <img src="cid:img.jpg">
[0][1] = cid:img.jpg
[1][0] = <img hrefsrc="notme.jpg" onmouseover=' src="notthemeeither.jpg" ; if ( 6 > x ) { funrotator(src) ; } ; ' src="cid:difficulttofind.jpg">
[1][1] = cid:difficulttofind.jpg
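To try the same thing from Perl rather than Rubular, something like the following should work. It is only a sketch: the variable names are made up, the /x layout and comments are added for readability, and the /i flag is an assumption carried over from the /gi used in the question.

```perl
use strict;
use warnings;

# The expression from above, spread over several lines with /x.
my $img_re = qr{
    <img\b(?=\s)                                     # an img tag followed by whitespace
    (?=                                              # look ahead for a real src attribute
        (?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?  # skip other attributes and their values
        \ssrc=['"]([^"]*)['"]                        # group 1 holds the src value
    )
    (?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*           # consume the attributes, quoted values included
    \s?>                                             # end of the tag
}xi;

# The sample text from above.
my $html = <<'HTML';
<img src="cid:img.jpg"> <img hrefsrc="notme.jpg" onmouseover=' src="notthemeeither.jpg" ; if ( 6 > x ) { funrotator(src) ; } ; ' src="cid:difficulttofind.jpg">
HTML

# In list context, /g returns group 1 from every match.
my @src = $html =~ /$img_re/g;

print "$_\n" for @src;
```

This should print cid:img.jpg and cid:difficulttofind.jpg, matching the [0][1] and [1][1] entries listed above.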