Acmlm's Board - I2 Archive

Acmlm's Board - I2 Archive - Programming - Grabbing URL's with PHP?

windwaker

Posted on 05-21-05 05:35 AM

I'm trying to build something in PHP that'll go to another site of mine. However, I need to be able to grab the URLs from the page it's viewing (because I need it to differentiate between images and links).

Any ideas?

Ramsus

Posted on 05-21-05 12:46 PM

Couldn't you just use a regex like /<a href="(http:\/\/.*?)">/ ?

EDIT: In case you're not familiar with using regular expressions in PHP, you'd use the following code with a buffer full of HTML (in this case, $buffer) to get an array of URLs:

<?php

preg_match_all("/<a href=\"(http:\/\/.*?)\">(.*?)<\/a>/", $buffer, $links);
// $links[0] is an array filled with all of the anchor tags
// $links[1] is an array filled just with the URLs from those tags
// $links[2] is an array filled with the names of the links
foreach ($links[1] as $link) {
    echo "URL: $link \n";
}
?>

(edited by Ramsus on 05-20-05 08:15 PM)
(edited by Ramsus on 05-20-05 08:19 PM)
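
Since the original question was about telling images apart from links, here is a minimal sketch along the same lines as the snippet above. It assumes the page marks links with <a href=...> and images with <img src=...>, and it accepts either quote style. The sample $buffer contents and the variable names $anchors, $images, $linkUrls and $imageUrls are made up for illustration.

```php
<?php
// Hypothetical sample page; in practice $buffer would hold the fetched HTML.
$buffer = '<a href="http://example.com/">Home</a> <img src="http://example.com/logo.png">'
        . " <a href='http://example.com/about'>About</a>";

// Group 1 is the opening quote (single or double), group 2 is the URL,
// and \1 requires the same quote character to close the attribute.
preg_match_all('/<a\s[^>]*?href=(["\'])(.*?)\1/i', $buffer, $anchors);
preg_match_all('/<img\s[^>]*?src=(["\'])(.*?)\1/i', $buffer, $images);

$linkUrls  = $anchors[2]; // URLs taken from anchor tags
$imageUrls = $images[2];  // URLs taken from image tags

foreach ($linkUrls as $url) {
    echo "LINK:  $url\n";
}
foreach ($imageUrls as $url) {
    echo "IMAGE: $url\n";
}
```

Unquoted attribute values and tags split across unusual markup are not handled here, so treat it as a starting point rather than a complete HTML scanner.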

windwaker

Posted on 05-22-05 04:45 AM

Ah, I see. I've used regular expressions before, but I've never really written them on my own; where'd you learn to do that?

Ramsus

Posted on 05-22-05 05:04 AM

I read a man page one day (a few years ago, I think) and played around with it. Just check out the perlre manual page. Regexes are definitely a must if you do web development, since they're about the easiest tool for filtering and validating user input. Perl in taint mode even makes you pass input and other external data through a regex (and pull out the captured part) before you can use it for anything that touches the outside world.
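
To make the input-filtering point concrete, here is a small sketch of whitelist-style validation with preg_match(). The rule itself (3 to 16 letters, digits or underscores) and the function name is_valid_username are assumptions chosen only for the example.

```php
<?php
// Hypothetical rule: a username is 3-16 letters, digits, or underscores.
function is_valid_username($input)
{
    // \A and \z anchor the pattern to the very start and end of the string,
    // so a trailing newline cannot slip past the way it can with $.
    return preg_match('/\A[A-Za-z0-9_]{3,16}\z/', $input) === 1;
}

var_dump(is_valid_username('windwaker'));   // bool(true)
var_dump(is_valid_username("bad name\n"));  // bool(false)
```

The idea is to describe what valid input looks like and reject everything else, instead of trying to list every bad character.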

windwaker

Posted on 05-22-05 05:49 AM

:O

Thanks tons, man! This is what I've been looking for for quite a while.

kode54

Posted on 06-01-05 06:43 AM

Here's a regular expression I kind of borrowed from Invision Power Board months ago, and later modified. As you can see, it uses the case-insensitive (i) modifier and also PHP's eval (e) modifier, which treats the replacement string as a piece of PHP code to evaluate instead of merely a replacement. Normally the result of that code is substituted into the string that preg_replace() returns, but collecting the matches into an array as a side effect works as well:

$urls = array();

preg_replace('#(^|\s|"|'."'".')((http|https|news|ftp)://\w+[^\s\[\]"'."'".']+)#ie', '\$urls[] = "\2"', $input_text);

Token 2 (the complete link) in every match will be pushed into $urls. Token 1 is only there for the original code to preserve the preceding whitespace, but I added various quotation marks since I encountered various IRC logs where clients or servers added the quotes or other characters.
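
Worth noting for anyone reading this later: the e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so the call above no longer runs on current PHP. A rough equivalent that needs no evaluated code is preg_match_all() with the same pattern, reading token 2 out of the match array. The sample $input_text and the names $pattern and $matches are assumptions for illustration.

```php
<?php
// Hypothetical input; stands in for $input_text above.
$input_text = "see http://example.com/a and \"ftp://files.example.com/pub\" or 'https://example.org/x'";

// Same pattern as above, minus the e modifier. Group 2 is the complete link.
$pattern = '#(^|\s|"|\')((http|https|news|ftp)://\w+[^\s\[\]"\']+)#i';

preg_match_all($pattern, $input_text, $matches);
$urls = $matches[2];

foreach ($urls as $url) {
    echo "URL: $url\n";
}
```

Because nothing is evaluated, the question of whether matched text can end up being run as code does not come up with this form.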

It is probably not a good idea to run this expression over the same data again and again; it is better to process the data once and record somewhere that you processed it. Well, since you're processing a web page, that should mean less complexity than what I was doing. (Thousands of lines of IRC logs, all processed from a MySQL server, every time the page is loaded... No I won't demonstrate.)

Also, if you know what you are doing, and you will always be processing properly formed X/HTML content, it may be more secure to parse the pages with the XML extension and locate all anchor tags. Then worry about the Regex if/when you need fulltext scanning.
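
As a sketch of the parser route, the following uses PHP's DOM extension (DOMDocument) to list anchors and images. That is a swapped-in choice: the post only says "the XML extension", and DOMDocument::loadHTML() is used here because it also tolerates HTML that is not well-formed XML. The sample $html and the variable names are made up for the example.

```php
<?php
// Hypothetical sample markup; $html would normally hold the fetched page.
$html = '<html><body><a href="http://example.com/">Home</a>'
      . '<img src="http://example.com/logo.png" alt="logo"></body></html>';

$doc = new DOMDocument();
// Real pages are often malformed, so collect parser warnings internally
// instead of letting them be printed.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Collect the href of every anchor that has one.
$links = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')) {
        $links[] = $a->getAttribute('href');
    }
}

// Collect the src of every image that has one.
$images = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    if ($img->hasAttribute('src')) {
        $images[] = $img->getAttribute('src');
    }
}
```

With this approach, telling images from links falls out of which tag the URL came from, with no pattern to maintain.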

HyperLamer

Posted on 06-01-05 07:18 AM

Treating user input as code could lead to some nasty security flaws though. You'd need to make sure you sealed up any possible holes.

kode54

Posted on 06-01-05 07:33 AM

That does not treat user input as code. It merely processes a token of the data which the expression finds using its own code. I don't think it's vulnerable to double-quotes faking out the processor code either. Even if that were possible, the expression cuts off at the first single or double quote character. There may yet be a vulnerability somewhere in there, but since IPB is using almost the same code, I presume it to be safe.

Just to clarify the preg_replace "e" flag: it specifies that your replacement string, which in this case is a string constant, is a piece of PHP code to be evaluated once per match. I will have to look it up again, as I am not sure whether said code can "echo" or otherwise manipulate the standard output and have that piped in as the replacement text, which ends up in the return value of preg_replace(). My example simply ignores the return value and stores the data in an array created outside of the function. In fact, you could do as my code does and go on to foreach() over an array of strings, or while() over a SQL result set.
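
One way to see the "code runs once per match" idea without the e flag is preg_replace_callback(): the callback's return value becomes the replacement text, and anything else it does (such as filling an array) is a side effect. This is a sketch under assumptions: the simplified URL pattern, the sample text and the bracket formatting are invented for the example, and closures need PHP 5.3 or later. It also sidesteps a risk that, as far as I can tell, does apply to evaluated replacements: matched text placed inside a double-quoted string in evaluated code may be subject to PHP's {${...}} interpolation.

```php
<?php
// Hypothetical input text.
$input_text = 'docs at http://example.com/docs and http://example.org/faq';
$urls = array();

$output = preg_replace_callback(
    '#\bhttps?://[^\s"\'<>]+#i',       // simplified pattern, for illustration only
    function ($m) use (&$urls) {
        $urls[] = $m[0];               // side effect: remember the URL
        return '[' . $m[0] . ']';      // return value: what replaces the match
    },
    $input_text
);

echo $output, "\n";
```

Here $urls ends up holding both addresses, while $output is the original text with each address wrapped in brackets.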

As I said, Regex isn't the only way. You may want to experiment with the XML extension; it may prove to be faster for handling just anchors and/or img tags than scanning the raw text with a regular expression.

windwaker

Posted on 06-01-05 08:47 AM

Well, while we're on the subject of PHP...

How do you do something like...
$functionname = "mycustomfunction";
execute_function($functionname);

Where execute_function is a function that runs a function called mycustomfunction()?

kode54

Posted on 06-01-05 10:14 AM

The PHP manual has a section on the function handling functions. call_user_func may be what you want, but create_function may also be handy for declaring a function on the spot, or even for generating function code on the fly, in the event that you find it more efficient to generate one function to be executed repeatedly.
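
A short sketch of the options, using the function name from the question. The function body, the greeting text and the variables $a through $e are made up for illustration. Note that create_function() was deprecated in PHP 7.2 and removed in PHP 8.0, so the sketch uses an anonymous function (PHP 5.3 and later) in its place.

```php
<?php
// Hypothetical function, named after the one in the question.
function mycustomfunction($name = 'world')
{
    return "hello, $name";
}

$functionname = 'mycustomfunction';

// 1. Variable function: PHP calls whatever function the string names.
$a = $functionname();

// 2. call_user_func / call_user_func_array: same idea, with arguments passed along.
$b = call_user_func($functionname, 'windwaker');
$c = call_user_func_array($functionname, array('kode54'));

// 3. Anonymous function stored in a variable, standing in for create_function().
$shout = function ($text) {
    return strtoupper($text) . '!';
};
$d = $shout($a);

// is_callable() checks a name before calling it; this one does not exist.
$e = is_callable('nosuchfunction');
```

If the function name ever comes from user input, checking it against a fixed list of allowed names before calling is the safer habit.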
