Saturday, January 31, 2009

Pimp my code ;)

In my previous (hastily written) entry, I tried to show how difficult it is to extract a URL you don't know in advance from a string of text with the traditional FIND / MID functions in ColdFusion, and alluded to using Regular Expressions as a possible solution.

This is going to be a pretty neat trick, particularly since I know less about RegEx than I do about unicorns. ;)

I had hoped to find a copy of Ben Forta's book on the subject during a trip to my local bookstore, but no such luck. Oddly enough, there didn't seem to be any books available on RegEx in the store-- everything had to be ordered online. They did happen to have the O'Reilly Pocket Reference to Regular Expressions, though-- and in it I found a two-page list of "recipes," RegExes put together for specific purposes, such as extracting email addresses, URLs, etc.

So, I copied the recipe for the URL down, convinced I'd found the solution to our problem.

Yes, I know-- naive of me. ;)

The problem with the O'Reilly recipe is that it uses characters that have special meaning in ColdFusion, such as the pound sign. Once I figured out how to escape the pound sign, ColdFusion told me the parentheses were unbalanced. Troubleshooting the recipe was getting to be a major headache, so I wound up Googling for a RegEx specific to ColdFusion for extracting URLs . . .

and wound up at Ben Forta's blog. (Yeah, I know-- like I should be surprised!?)
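For what it's worth, the pound-sign fix itself is simple: double it inside the ColdFusion string. Here's a tiny sketch using a made-up hashtag pattern (this is NOT the O'Reilly recipe, and hashtagRegex, sample, and hit are names I invented for the example):

<!--- "##" inside a ColdFusion string is one literal pound sign --->
<cfset hashtagRegex = "##\w+">
<cfset sample = "Learning ##coldfusion today">
<cfset hit = REFind(hashtagRegex, sample, 1, true)>
<!--- Only print something if the pattern actually matched --->
<cfif hit.POS[1] GT 0>
    <cfoutput><p>#Mid(sample, hit.POS[1], hit.LEN[1])#</p></cfoutput>
</cfif>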

Anyways, Mr. Forta offered a RegEx that works with both JavaScript and ColdFusion to validate a URL-- which I've used in the REFindNoCase function example below. It's far from perfect, gentlemen-- at this point my code fragment can only find and extract the first URL in a text string, and I'll need to figure out some kind of while loop to make certain every URL has been extracted. But it's a start, right?

We start with our test string; feel free to customize it for your own testing purposes:

<cfset tweet = "I like google http://www.google.com better than I like Yahoo ( http://www.yahoo.com ), but that's just me!">

Just displaying the string when the page runs, so everyone can see what we start with.

<cfoutput><p>#tweet#</p></cfoutput>

If you use the last two optional parameters of the REFindNoCase function, you can tell the function at what point in the string it should begin its search (1st character by default) as well as tell it to return a structure that contains two bits of information we need to extract the URL: the position of the first character in the match, and the total length of the matched string.

<cfset results = REFindNoCase("https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?", tweet, 1, "True")>

Then we just use the POS and LEN bits with our MID function to print out the match!

<p><cfoutput>#Mid(tweet,results.POS[1], results.LEN[1])#</cfoutput></p>
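As for the "some kind of while loop" I mentioned earlier, here's a rough, untested sketch of how it might go. It relies on REFindNoCase reporting a POS of 0 when nothing matches, and on the third parameter to restart the search just past the previous match. The names urlRegex, urls, startPos, and found are ones I made up for this example:

<cfset urlRegex = "https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?">
<cfset urls = ArrayNew(1)>
<cfset startPos = 1>
<cfloop condition="startPos LTE Len(tweet)">
    <!--- Search again, starting just past the previous match --->
    <cfset found = REFindNoCase(urlRegex, tweet, startPos, true)>
    <!--- A POS of 0 means there are no more URLs to find --->
    <cfif found.POS[1] EQ 0>
        <cfbreak>
    </cfif>
    <!--- Save this URL, then move the starting point past it --->
    <cfset ArrayAppend(urls, Mid(tweet, found.POS[1], found.LEN[1]))>
    <cfset startPos = found.POS[1] + found.LEN[1]>
</cfloop>
<cfdump var="#urls#">

If you're on ColdFusion 8 or later, I believe REMatchNoCase(urlRegex, tweet) returns the same array of matches in a single call, but I haven't leaned on it here.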

We could theoretically pass the extracted string to an API (to TinyURL, for instance), or ROT-13 it, or perform whatever arcane manips we want at this point, and then a simple REPLACE function will be sufficient to insert the modified URL back into the original string-- after all, once we've extracted the URL, we know exactly what it is now.
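A minimal sketch of that extract, modify, and replace round trip might look like the following. UCase is only a stand-in for whatever you'd really do to the URL (a call to a shortening service, say), and originalUrl, modifiedUrl, and newTweet are hypothetical names:

<cfset originalUrl = Mid(tweet, results.POS[1], results.LEN[1])>
<!--- Placeholder transformation; a real API call would go here --->
<cfset modifiedUrl = UCase(originalUrl)>
<!--- "one" replaces only the first occurrence of the extracted URL --->
<cfset newTweet = Replace(tweet, originalUrl, modifiedUrl, "one")>
<cfoutput><p>#newTweet#</p></cfoutput>

I went with "one" rather than "all" as the scope so that a URL appearing twice in the string isn't changed behind your back.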

But, there is one interesting wrinkle I should mention. I've dumped the results variable that the REFindNoCase function generates so we can look inside of it (see below):

<cfdump var="#results#">

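See how results actually contains two entries for POS and LEN? My first guess was that the RegEx was finding two overlapping matches, but that's not what is going on. Because we passed True as the fourth parameter, REFindNoCase returns the subexpressions as well: the first entry describes the whole match (http://www.google.com), and the entries after it describe what the parenthesized groups in the RegEx captured. The first group, ([-\w\.]+), grabs the host name, which is why the second entry points at www.google.com.

Replacing the 1 in the indexes of the arrays above with 2 bears this out: the MID call prints www.google.com instead of the full URL.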
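Rather than swapping the index by hand, you can walk the arrays and print every entry. This is just a sketch; as far as I can tell, entry 1 is the whole match and any later entries are the parenthesized subexpressions of the RegEx. The loop variable i is my own:

<cfoutput>
<cfloop index="i" from="1" to="#ArrayLen(results.POS)#">
    <!--- Skip entries with nothing to show (no match, or an empty capture) --->
    <cfif results.POS[i] GT 0 AND results.LEN[i] GT 0>
        <p>#i#: #Mid(tweet, results.POS[i], results.LEN[i])#</p>
    </cfif>
</cfloop>
</cfoutput>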
