Tag Archives: Apache

The httpd Apache Web Server

Search Engine Optimization Checklist

Got a site online and want to make it really SEO friendly?

Here is a checklist to work through before you put your site live.

SEO Checklist

  • Find keywords and phrases, and try to include them in the title element.
  • Make sure the keywords and phrases occur in the page, but do not overdo their occurrence.
  • Use meta tags to define keywords and a description.
  • Use keywords and phrases in heading elements like h1, h2, ...
  • Structure the content with heading elements instead of changing font sizes and colors; those styles can be applied to the headings as well.
  • Do not keep a splash page. I hate it, and so do search engines.
  • The URL should ideally contain keywords related to that page (see Apache URL Rewriting).
  • Avoid using GET parameters to change the content of a page.
  • Avoid JavaScript URLs; if you do use them, remember to add actual hrefs that lead to that page or similar content.
  • Make sure none of the links are created by JavaScript; they should be available in plain HTML.
  • Adhere to semantic markup: all menu links should ideally go into a list using li.
  • Avoid writing CSS inline; keep it in files and include them using the link href="style.css" ... tag.
  • Use semantic markup like em tags for emphasis (italics) and strong for bold text.
  • Try placing the actual content of the page at the beginning of the markup, and use CSS to change its position.
  • Image tags should ideally have an alt attribute that describes the image; including keywords in it will be helpful.
  • The image filename should ideally describe the image.

URL Rewriting for content pages

Many content sites (blogs, news sites) that you see these days have a specific URL for each page, e.g. News.com.

Many of these sites use something like /news/75-news-title.html. You can’t have an actual distinct file for each piece of content, so the most common solution is to rewrite the URL.

The idea is to grab the content id (or whatever identifies the content) from the URL and pass it as a GET parameter’s value to a specific page.

In the above scenario, check the URL: /news/75-news-title.html. What is commonly done is that the content_id, the key by which the content is looked up in the content table, is placed in the URL along with the title.

As in this case
Content ID: 75
Title Text: news-title (The hyphens are to make it readable instead of %20 for space)

So let’s assume that we have a page called news.php, which takes a GET parameter named newsid. All we have to do now is write the URL rewrite rule using Apache’s mod_rewrite engine.

We’ll use the RewriteCond and RewriteRule directives.

<IfModule mod_rewrite.c>
    RewriteEngine On

    RewriteCond %{REQUEST_URI} /news/([0-9]+).*\.html$
    RewriteRule (.*) /news.php?newsid=%1 [L]
</IfModule>

Now let us look at how we built the thing.

  • First we used the server variable REQUEST_URI to match the pattern against the request. A variable is referenced using the %{SERVER_VARIABLE} format.
  • The RewriteCond is basically an if condition: only if the condition is true will the RewriteRule that follows it be applied. That means the pattern must match for the rule to work.
  • The regular expression pattern accepts an integer value after /news/. After that integer any text can come, but the URL must end with .html, as enforced by the $ at the end.
  • Now if the condition holds, we need to write the rule for it, so we use RewriteRule. The first argument is .*, which means accept any URL.
  • The second argument is the actual mapping to news.php with the newsid parameter. Note that we’ve used %1, which means the first back reference of the RewriteCond regex pattern.
  • Since our pattern was /news/([0-9]+).*\.html$ and had just one capturing group in it, that group, i.e. ([0-9]+), is referenced by %1 in the RewriteRule directive.
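
The same mapping can also be written without a RewriteCond, by capturing directly in the RewriteRule pattern and using $1 instead of %1. This is only a sketch, assuming every news URL has the form /news/<id>-<title>.html; the QSA flag is my own addition, which keeps any query string the visitor sent.

```apache
<IfModule mod_rewrite.c>
    RewriteEngine On

    # $1 refers to the first group of the RewriteRule pattern itself
    # (%1 would refer to a group in a preceding RewriteCond).
    # The optional leading slash lets the same pattern work in httpd.conf
    # (path starts with "/") and in a .htaccess file in the DocumentRoot
    # (leading slash already stripped).
    RewriteRule ^/?news/([0-9]+)[^/]*\.html$ /news.php?newsid=$1 [L,QSA]
</IfModule>
```

With this rule, a request for /news/75-news-title.html is served by /news.php?newsid=75.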

The magician, .htaccess file

Long before anything like web.config or web.xml was used/invented, Apache had this wonderful file “.htaccess”

This file, as you would expect, controls the web application’s behaviour. The possibilities with this file are endless: from password-protected directories to complex URL rewrites, all can be done using this file.

.htaccess

The file’s name is just the extension “htaccess”, with nothing before the dot. This comes from the *nix convention that files starting with a period “.” are hidden.

This file can be placed in any directory of your web application. Let’s say your DocumentRoot is /domains/ruturaj.net. If you place the .htaccess file in the main DocumentRoot, any configuration present in that .htaccess file also applies to all the subfolders of ruturaj.net.

So if I put the following code in the .htaccess file,

DirectoryIndex rutu-default.php

all the subdirectories of the ruturaj.net directory will have rutu-default.php as the default index page.

But to ensure that the .htaccess file is read and applied, you need to tell Apache.
To tell Apache which file is the per-directory configuration file, you set the AccessFileName directive in httpd.conf. AccessFileName specifies which file acts as the “.htaccess” file; by default, its value is “.htaccess”.

AccessFileName .htaccess

There is also another directive, AllowOverride, which tells Apache whether to read and apply the AccessFileName file. AllowOverride is only valid inside a Directory section (which may itself sit inside your VirtualHost), so set the following there:

AllowOverride All

This will enable the implementation of the .htaccess file.
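
Put together, the two directives might sit in httpd.conf as sketched below. The directory path is the DocumentRoot used in the example above; the commented-out narrower setting is my own suggestion, not something this setup requires.

```apache
# httpd.conf (server configuration, not .htaccess)
AccessFileName .htaccess

<Directory /domains/ruturaj.net>
    # Let .htaccess files below this directory override everything
    AllowOverride All

    # A narrower alternative: permit only selected groups of directives,
    # e.g. FileInfo (covers mod_rewrite) and Indexes (covers DirectoryIndex)
    # AllowOverride FileInfo Indexes
</Directory>
```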

Search Engine Referer Keyword Tracking

You really want to analyze your sources of traffic. Most of the time you install and use one of the free software packages available on the net. But if you are a programmer, you will want to know how to track these visitors, search engine keywords, etc. yourself.

Here I’ll be showing the programmer’s point of view on developing a solution.

The most important pieces of information for tracking search engine referrals are the HTTP_REFERER and HTTP_USER_AGENT variables.

I’m assuming you have Apache as the web server and PHP as the scripting language, with my favourite MySQL as the database server.

There are two ways that you can track this information:

  • Apache access logs
  • Database logging
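
For the first option, Apache can already separate search-engine visits into their own log. A rough sketch, assuming mod_setenvif is loaded; the variable name from_search, the log file name and the referer patterns are my own illustrative choices.

```apache
# Mark requests whose Referer header looks like a search engine
SetEnvIfNoCase Referer "google\."  from_search
SetEnvIfNoCase Referer "yahoo\."   from_search
SetEnvIfNoCase Referer "msn\."     from_search

# Log time, referer, requested path and user agent,
# but only for the requests marked above
CustomLog logs/search_referer_log "%t \"%{Referer}i\" -> %U \"%{User-Agent}i\"" env=from_search
```

The search keywords typically sit in the query string of the logged referer, so a script can extract and count them later.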

Keyword Hits
So the final result would look like this:

Keyword      Hit Count
keyword 1    100
keyword 2    70
keyword 3    60

Search Engine Referers
Search engine hit counts:

Search Engine    Ref. Count
Google           100
Yahoo            70
MSN              60

In this tutorial, I’ll be focusing on the MySQL logging. So let’s begin with it.

VirtualHosts, a little bit more…

You can make a domain run on a port other than 80, which is the default port for HTTP. In the previous examples of VirtualHost configurations I haven’t specified the port, so it is implicitly 80.

If you want to run the website on a different port, you need to make sure Apache is listening on that port. To do that, you set the Listen directive:

Listen 8080

Alternatively you can also specify the IP on which it should listen.

Listen 67.66.65.64:8080

Now if you want to run a name-based VirtualHost on a specific port, make sure that you set the NameVirtualHost directive to that port as well.

NameVirtualHost 67.66.65.64:8080

Once you’ve set the NameVirtualHost, you need to set the actual VirtualHost configuration as well.
There is just one change to be made: the port is added to the address.

<VirtualHost 67.66.65.64:8080>
...
</VirtualHost>

Important: All the domains (ruturaj.net, www.ruturaj.net, yourname.com) must resolve to the IP address on which NameVirtualHost is defined. Without that, the configuration does not make any sense.
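
Putting the three directives together, a complete port-8080 setup could look like the sketch below. The ServerName, ServerAlias and DocumentRoot values are borrowed from the ruturaj.net example used elsewhere in these posts and are only placeholders.

```apache
# Make Apache accept connections on this address and port
Listen 67.66.65.64:8080

# Declare that name-based virtual hosts live on the same address and port
NameVirtualHost 67.66.65.64:8080

# The address here must match the NameVirtualHost line exactly
<VirtualHost 67.66.65.64:8080>
    ServerName ruturaj.net
    ServerAlias www.ruturaj.net
    DocumentRoot /www/domains/ruturaj.net
</VirtualHost>
```

The site would then be reachable at http://ruturaj.net:8080/.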

Setting VirtualHosts

VirtualHosts
The most important part of setting up Apache is setting the hosts, or VirtualHosts. The term “VirtualHost” comes from the fact that one single host or computer is hosting many hostnames. Apache has long supported this type of hosting: it picks up the Host header from a standard HTTP request and uses it to select the website associated with that host. This is known as name-based virtual hosting, which is the most common of all the hosting types. The other one is IP-based hosting, which requires each domain to have a separate IP.

What I will show you is how to set up a name-based VirtualHost.

Now, a simple GET request for my root page would look like this:

GET / HTTP/1.1
Host: www.ruturaj.net

Now Apache picks up “www.ruturaj.net” from the request header and maps it to the virtual host that is configured for www.ruturaj.net.

Let’s assume you have an IP, 67.66.65.64, that you need to set up for virtual hosting. First, you need to tell Apache that this IP is used for name-based virtual hosting.

NameVirtualHost 67.66.65.64

Now that you are done with setting the IP for virtual hosting, you need to configure the VirtualHosts.

Let us take ruturaj.net as the domain that needs to be set up. So here it goes:

<VirtualHost 67.66.65.64>
  ServerName ruturaj.net
  ServerAlias www.ruturaj.net
  DocumentRoot /www/domains/ruturaj.net
  CustomLog logs/ruturaj.net-access_log combined
  ErrorLog logs/ruturaj.net-error_log
  DirectoryIndex index.php
  ServerAdmin ruturaj@ruturaj.net
</VirtualHost>

Now let us review the configuration:

  • ServerName: This is the main server name; it should be the domain name.
  • ServerAlias: This is an alias, e.g. www.ruturaj.net should mean the same as ruturaj.net over HTTP.
    You can set anything like default.ruturaj.net as well. Just make sure that default.ruturaj.net points to 67.66.65.64.
  • DocumentRoot: This is the main directory that the ruturaj.net domain points to; it is the file system path to the directory.
  • CustomLog: This is the access log for ruturaj.net. Remember, we defined the “combined” log format in httpd.conf, and we are using it here. If you want a different format, you can define it with LogFormat before the CustomLog directive.
  • ErrorLog: Any errors while serving are logged in this file.
  • DirectoryIndex: Defines the default document for a directory, e.g. when you request http://ruturaj.net/ it tells the server to serve “index.php”. You can set it to whatever you want: default-page.html, default.pl, etc.
  • ServerAdmin: Just specify the email address; it is shown when there is a server error.

So now if you want to add a configuration for host “johnsmith.com”…

<VirtualHost 67.66.65.64>
  ServerName johnsmith.com
  ServerAlias www.johnsmith.com
  DocumentRoot /www/domains/johnsmith.com
  CustomLog logs/johnsmith.com-access_log combined
  ErrorLog logs/johnsmith.com-error_log
  DirectoryIndex index.php
  ServerAdmin admin@johnsmith.com
</VirtualHost>

The httpd.conf file

The httpd.conf file is the main configuration file of Apache. It lives in “apache-install-dir/conf”.

Now let’s take a look at some important and useful parameters.

ServerName
This parameter sets the default server name. It should generally be the FQDN (Fully Qualified Domain Name) of the machine, or the IP if the machine doesn’t have an FQDN.

Directory
This section encloses settings that apply to the given directory. You specify the physical directory as the argument. So if you have a directory /websites/mywebsite/somedir, you would do the following:

<Directory /websites/mywebsite/somedir>
... your settings
...
</Directory>
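
As an illustration of what might go inside such a block, here is a sketch using the Options and AllowOverride directives. The particular settings are my own choice for the example, not a recommendation for every site.

```apache
<Directory /websites/mywebsite/somedir>
    # No directory listings, but do follow symbolic links
    Options -Indexes +FollowSymLinks

    # Ignore any .htaccess files found in this directory tree
    AllowOverride None
</Directory>
```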

AllowOverride
AllowOverride lets the user override some of the settings by using their own file. That file is the magical .htaccess file. In the default httpd.conf it is usually set to None, which means the user can’t override the settings by placing a .htaccess file in the directory. But you can change the AllowOverride None setting to AllowOverride All.

Options
This directive takes several options; I’ll explain some of them.
Indexes: This allows a directory listing. You must have come across something like this:
(Screenshot: Directory Listing)

FollowSymLinks: This allows Apache to follow symbolic links. Symbolic links are nothing but links in *nix systems, e.g. “files” in /etc/ can point to /files/myfiles/files.
You can set both of these options at once:

Options +Indexes -FollowSymLinks

The above setting will allow directory listing but won’t follow symbolic links. So “+” applies a setting and “-” removes it.

AccessFileName
I talked about the magic file .htaccess. This is the place where you specify the name of that file. By default it is “.htaccess”.
The leading “.” (period) makes it a hidden file on *nix systems.

Denying files
Denying access to files over the web is the job of the server. In Apache, we can do exactly that by using the Files directive.

<Files ~ "^\.ht">
    Order allow,deny
    Deny from all
    Satisfy All
</Files>

Note the ~ sign: it is used when you give a regular expression to match the files. Once the files are selected, access to them can be denied using the Deny directive.
The above regex denies all files whose names start with “.ht”.
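
The same idea extends to other files you never want served. A small sketch using FilesMatch, the regex form of Files; the list of extensions is hypothetical and should be adapted to whatever actually lives in your DocumentRoot.

```apache
# Block direct web access to backup, include and SQL dump files
<FilesMatch "\.(bak|inc|sql)$">
    Order allow,deny
    Deny from all
</FilesMatch>
```

This keeps the Order/Deny style used above; Apache 2.4 configurations express the same thing with Require all denied.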

Access Logs
To create access logs, we need to specify the format of the log and the file path.
First we need to set the LogFormat directive.
The most common is the “combined” log, which logs the client IP, user, time, request line, status code, response size, referer and user agent.

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

Note: the log format has been given the name “combined”. Feel free to create different formats for your needs and name them accordingly.
Then we need to set the filename of the log:

CustomLog /usr/local/apache/logs/access_log combined

The first parameter of the CustomLog directive sets the filename of the log; the second is the name of the log format that we defined earlier.
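
As an example of defining your own format, the sketch below adds a second log that records how long each request took. The format name timing and the file name timing_log are made up for illustration.

```apache
# A custom format: client IP, time, request line, status,
# and the time taken to serve the request in microseconds (%D)
LogFormat "%h %t \"%r\" %>s %D" timing

# Write a second log file using that format, alongside the main access log
CustomLog /usr/local/apache/logs/timing_log timing
```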

Server-Status
When you want to look at the current status of the server, i.e. whom it is responding to, what pages it is serving, how many server processes are running, and so on, there is no better way than to set up server-status.
(Screenshot: server-status page)

To enable it …

<Location /server-status>
    SetHandler server-status
    Order deny,allow
    Deny from all
    Allow from 192.168.0.84
</Location>

Check the configuration: it allows only the IP 192.168.0.84 to view the stats, and all others are forbidden. You can set your own IP as you wish.
If you want even more info, you can turn on the extended status:

ExtendedStatus On
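
Two details are easy to miss here, so a short sketch: the handler comes from mod_status, and ExtendedStatus is set at the server level rather than inside the Location block. The module path shown is the usual default and is an assumption about your install.

```apache
# mod_status provides the server-status handler; it must be loaded
LoadModule status_module modules/mod_status.so

# ExtendedStatus belongs in the main server configuration,
# outside any <Location> or <VirtualHost> block
ExtendedStatus On
```

Once enabled, requesting /server-status?refresh=5 from the allowed IP reloads the page every 5 seconds.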

Apache beginnings

For those who have reached here but still don’t know what httpd is:
Apache is a web server. For all the web pages, websites, blogs and image galleries that are hosted on the web, there needs to be a server that “serves” these documents (pages, images, files) to the client (the user’s browser).

Apache got its name from … well … it’s nothing but “A patchy server”. Apache httpd is an open-source project, programmed by many programmers all over the world. Every time a bug fix or a new feature was required, the main code was just “patched”. And hence it got its name, Apache.

Apache, being a standard web server, runs on port 80, the standard HTTP port. Before you go ahead, let me warn you: changing the settings of Apache can change the way a website behaves, and to edit its settings you need root or Administrator access.

To control Apache, you basically need to edit two important files: “httpd.conf” and “.htaccess”.