Reading a website in PHP (340 views)
Reading a website with PHP can be done in various ways. In this tutorial I will explain how to use the fsockopen() function to read www.google.com.
Over the last few weeks I made tutorials about security. Now let's start with some cool things: sockets. A socket is a connection to another source. This source will be google.com in our tutorial, but it can be all kinds of things. You can even make an IRC bot by using these techniques.
Before we start coding, I want to let you know a few things about reading websites with PHP. Most people who are more experienced with PHP might tell you to use another technique, because there are much easier ways to read a website.
For example, let's check out this code:
```php
<?php

// read a website
$website = file_get_contents('http://www.combined-minds.net/index.php');

// print the website's code
echo $website;

?>
```
This example reads the website perfectly.
But be warned: many web servers don't allow this code, because file_get_contents() can only read URLs when the allow_url_fopen setting is switched on. That's why we'll learn another way.
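To find out whether a server allows the short way, you could check the allow_url_fopen setting first. This is only a small sketch: url_reading_allowed() is a name made up for this example, not a PHP function.

```php
<?php

// Returns true when PHP is allowed to open URLs with file functions.
// url_reading_allowed() is a made-up helper name for this sketch.
function url_reading_allowed()
{
    // ini_get() gives the setting back as a string (or false);
    // FILTER_VALIDATE_BOOLEAN turns values like "1" or "On" into true
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

if(url_reading_allowed())
{
    echo "file_get_contents() can read URLs on this server\n";
}
else
{
    echo "URL reading is switched off, so we need another way\n";
}
```

When the second message shows up, file_get_contents() will fail on URLs on that server.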
Opening a website via fsockopen() isn't that hard, and we'll do it the easiest way. There are many extras you can learn with fsockopen(), such as saving errors in variables.
First step
First we need to open a socket to www.google.com, and use port 80 for this.
```php
<?php

$website = fsockopen('www.google.com', 80);

?>
```
OK cool, we have a socket. Now let's check whether the socket is connected. fsockopen() returns FALSE when a connection couldn't be established, so we can simply put the variable into an if statement.
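A minimal sketch of such a check could look like the code below. It also uses the extra fsockopen() arguments for the error number, the error message and a timeout, which go a bit further than the tutorial needs. The helper name open_website() and the 10 second timeout are assumptions made for this example.

```php
<?php

// Tries to open a socket and reports why it failed when it does.
// open_website() is a made-up helper name for this sketch.
function open_website($host, $port = 80, $timeout = 10)
{
    // fsockopen() fills $errno and $errstr when the connection fails;
    // the @ hides PHP's own warning so we can print our own message
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);

    if(!$socket)
    {
        echo "Couldn't open $host: $errstr ($errno)\n";
        return false;
    }

    // the connection is open and ready for fwrite() and fgets()
    return $socket;
}
```

You would call it as $website = open_website('www.google.com'); and then test $website in an if statement, just like the return value of fsockopen() itself.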
Second step
Now we need to speak some HTTP: we should let the server know what we're looking for. The request we need to send over this socket is the following:
GET / HTTP/1.1
Host: www.google.com
Connection: Close
The first line, the GET / part, indicates which file we want to read. This time we want to read the front page, so we leave the / as it is. When you want to open another page, you use GET /anotherfolder/anotherfile.html HTTP/1.1.
The second line is simply the host. In this tutorial it's www.google.com. When opening another website, don't add the filename to it.
The last line asks the server to close the connection once it has sent the response.
When sending these lines to the HTTP server, you need to end each one with a carriage return and a newline. After the last line, Connection: Close, you need to add them twice. The example shows exactly what I mean.
```php
<?php

$website = fsockopen('www.google.com', 80);

// check if the website is found
if(!$website)
{
    echo 'Couldn\'t open google!';
}
else
{
    // write to the http server
    fwrite($website, "GET / HTTP/1.1\r\n");
    fwrite($website, "Host: www.google.com\r\n");
    fwrite($website, "Connection: Close\r\n\r\n");
}

?>
```
For those who don't know what carriage returns (\r) and newlines (\n) are: together they end a line, so the server knows when the end of the line is reached. The HTTP protocol asks for both, whatever operating system the server runs on. Some servers also accept a bare newline, but you shouldn't rely on that.
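To request another page or another host, the three lines can be put together by a small function. This is a sketch under my own assumptions: build_get_request() is a made-up helper, and the path is the example path used earlier in this tutorial.

```php
<?php

// Builds the request text for any host and path.
// build_get_request() is a made-up helper name for this sketch.
function build_get_request($host, $path = '/')
{
    $lines = array(
        "GET $path HTTP/1.1",
        "Host: $host",
        "Connection: Close",
    );

    // every line ends with \r\n, and one extra \r\n ends the request
    return implode("\r\n", $lines) . "\r\n\r\n";
}

// print the request with the line endings made visible
$request = build_get_request('www.google.com', '/anotherfolder/anotherfile.html');
echo addcslashes($request, "\r\n") . "\n";
```

The result can be sent with a single fwrite($website, $request) instead of three separate calls.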
Third step
The final big step is to actually read the website. Let's cut the talk and check out an example; after that I will explain what I've done.
```php
<?php

$website = fsockopen('www.google.com', 80);

// check if the website is found
if(!$website)
{
    echo 'Couldn\'t open google!';
}
else
{
    // write to the http server
    fwrite($website, "GET / HTTP/1.1\r\n");
    fwrite($website, "Host: www.google.com\r\n");
    fwrite($website, "Connection: Close\r\n\r\n");

    // a variable for storing the html code
    $html = '';

    // read the website
    while(!feof($website))
    {
        // store the html into a variable
        $html .= fgets($website, 128);
    }

    // when we're done we'll close the socket
    fclose($website);
}

?>
```
Creating the $html variable speaks for itself, so let's go to the while loop. In this loop the website is read; the reading is done by the fgets() function inside the loop. The !feof() check tests whether the website has been read completely. While it hasn't, the loop continues; once feof() sees that everything has been read, the loop stops.
In the fgets() call, notice the number 128. This is the length limit: every call reads at most 127 bytes (one less than the number) and stops earlier when it reaches the end of a line. So the site isn't read in one go. That's why we use feof() to check whether we're done, and the while loop to keep reading.
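To see how the reading is cut into pieces, you can run the same loop on a stream that needs no network. The sketch below uses php://memory in place of the socket. The helper name read_everything() and the test data are made up for this example.

```php
<?php

// Reads a stream to the end, the same way the tutorial's loop does.
// read_everything() is a made-up helper name for this sketch.
function read_everything($stream, $length = 128)
{
    $data  = '';
    $calls = 0;

    while(!feof($stream))
    {
        // fgets() stops at a line end or after $length - 1 bytes
        $piece = fgets($stream, $length);

        if($piece === false)
        {
            break; // nothing left to read
        }

        $data .= $piece;
        $calls++;
    }

    return array($data, $calls);
}

// php://memory stands in for the socket, so no network is needed
$stream = fopen('php://memory', 'w+');
fwrite($stream, str_repeat('a', 300) . "\nshort line\n");
rewind($stream);

list($data, $calls) = read_everything($stream);
fclose($stream);

echo strlen($data) . " bytes read in $calls calls\n";
```

The long first line is longer than one fgets() call can return, so it takes several calls, while the short line fits in a single call.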
In the end we close the socket. Now you can do whatever you want with the $html variable, for it contains the HTML of the website. Note that the HTTP response headers are also included, so don't be too surprised when you see some lines you don't see when viewing a website's source in your browser.
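If you only want the HTML, you can cut the headers off at the first empty line. The sketch below does this and also picks the status code out of the first header line. split_response() is a made-up helper, and the response text is a shortened, invented example, not real output from Google.

```php
<?php

// Splits a raw HTTP response into its headers and its html.
// split_response() is a made-up helper name for this sketch.
function split_response($response)
{
    // an empty line (\r\n\r\n) separates the headers from the body
    $parts = explode("\r\n\r\n", $response, 2);

    $headers = $parts[0];
    $body    = isset($parts[1]) ? $parts[1] : '';

    // the first header line looks like "HTTP/1.1 200 OK"
    $status = 0;
    if(preg_match('#^HTTP/\d\.\d (\d{3})#', $headers, $match))
    {
        $status = (int) $match[1];
    }

    return array($status, $headers, $body);
}

// a shortened, made-up response instead of a real one from the socket
$response = "HTTP/1.1 302 Found\r\n"
          . "Location: http://www.google.nl/\r\n"
          . "Connection: close\r\n"
          . "\r\n"
          . "<html>The document has moved</html>";

list($status, $headers, $body) = split_response($response);

echo "Status: $status\n";
echo "Html: $body\n";
```

A status of 200 means the page itself was sent. A status like 302 means the server wants you to fetch the address in the Location header instead. Keep in mind that a server answering an HTTP/1.1 request may send the body in chunks (Transfer-Encoding: chunked), which this sketch doesn't decode.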
Replies on Reading a website in PHP:
By m038 on Wednesday 21 March 2007 8:22
As far as I have been testing, this method unfortunately does NOT work for google; somehow I always seem to get 302 Moved. But the url of course exists.
Still looking to solve this.

First of all, welcome to Combined Minds!
About your problem, that is very correct my friend. Google probably wants to redirect you to your own language's page.
Try the script on this website, and you will definitely see all the html.
You also see the headers, right?
I have a test page located at http://zk.zonax.net/test.php where I run the script. It tries to redirect to google.nl, because the server is located in the Netherlands. That is why you get the 302.
Hope I helped you enough!