I claim no expertise in web marketing, particularly because I find the concept abstract and something of a voodoo pseudo-science. From my perspective, it’s in the same realm as astrology. Both make for interesting stories. What works here may not work there… for reasons that usually escape me completely. It’s all very emotional, which by nature makes it unpredictable and subject to interpretation.

Some of my experience has been in applying web marketing concepts that others have developed. I’ll avoid delving into the various requirements and solutions, as they’ve varied over the years and aren’t really the point of this post. They all centre around spending money in some form or another to get the right traffic to your site and eventually sell something to someone.

While it strikes me that figuring out “The Right Thing To Do” has consumed a ton of money and produced a lot of sweat, less attention is paid to the “Wrong Thing To Do”. Perhaps one of the big obvious no-nos is in the linking strategy of your web presence: you don’t want to spend your advertising dollars getting a prospect to your site only to have them land on your competition’s site via a third party linked from yours. I’ve even attended marketing scrums that entertained the notion of outlawing all off-site links in order to avoid this potential problem. This sort of ban is a surefire means of mitigating the problem but, depending on the nature of the web business, would probably be short-lived. In some cases, partner programs require reciprocal links that could open the door to the “Wrong Thing”.

So how, with a minimal amount of effort, do you scan your site, and the sites you link to, to a given depth of recursion to determine whether you are indirectly linking to your competition?

This exercise isn’t going to be for everyone. Some strategies will be centred around strong inter-partner link programs, while companies in a smaller niche market will want to have a look at this. Interested companies will need to know their competition, be able to identify their web aliases, and know how they present on the web.

For this case study we’ll examine a specific website. The selected site is www.f-prot.com, the home of virus scanning and cleanup tools. This site was chosen due to the nature of its market. F-Prot competes with all other virus scanners, such as Norton, McAfee, Trend and countless others. It stands to reason that F-Prot would want to maximize sales from the traffic it receives by keeping prospects on-site, with the easiest path to the money-taking part.

The Concept:

A bot will be seeded with F-Prot’s main URL. The bot will pull the content of the page and extract all the links found on it. The extracted links will be fed back into a work queue, where they will be processed in sequence using the same tactic until the bot has traversed the site and is well on its way to traversing the first 2 or 3 recursive tiers of the linked sites.

The data collected from this crawl will contain the information required to re-create the hierarchy of inter-site relationships. This hierarchy will then be expressed as an organization tree graphic, which will provide an at-a-glance means of visualizing those relationships and determining whether an organization is providing a path that potentially refers traffic to its competition.
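The work-queue idea described above can be sketched in a few lines. This is a minimal illustration in Python rather than the PHP scripts used for the actual crawl; the `crawl` function, the `fetch_links` callable and the toy link graph are all hypothetical names and data introduced for the example, and real page fetching is replaced by a dictionary lookup.

```python
from collections import deque

def crawl(seed, fetch_links, max_depth=2):
    """Breadth-first crawl from seed; returns a list of (from_url, to_url) edges.

    fetch_links is a hypothetical callable: url -> iterable of absolute URLs.
    """
    seen = {seed}
    edges = []
    queue = deque([(seed, 0)])          # the work queue: (url, tier)
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue                    # known as a link target, but not followed further
        for target in fetch_links(url):
            edges.append((url, target))
            if target not in seen:      # queue each page only once
                seen.add(target)
                queue.append((target, depth + 1))
    return edges

# Toy link graph standing in for real HTTP fetches (illustrative data only).
TOY_WEB = {
    "http://seed.example/": ["http://seed.example/buy", "http://partner.example/"],
    "http://partner.example/": ["http://rival.example/"],
    "http://rival.example/": ["http://deeper.example/"],
}

edges = crawl("http://seed.example/", lambda url: TOY_WEB.get(url, []), max_depth=2)
for src, dst in edges:
    print(src, "->", dst)
```

With `max_depth=2` the rival page is discovered as a link target, but its own links are never followed, which is the "given depth of recursion" cut-off.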

The environment:

I code PHP on a daily basis and would like to re-use my script libraries, so these will be command line invoked scripts for the most part. I also want an excuse to polish up on my application of XMLRPC.

The Data model: A MySQL database with 5 tables

  1. Hosts (represents a given hostname)
    1. id integer
    2. hostname varchar(255)
    3. protocol integer (could be an enum, but basically 1 = HTTP, 2 = HTTPS for now)
    4. nocrawl integer
  2. Documents
    1. id integer
    2. host_id integer
    3. ref mediumtext
    4. timestamp integer
    5. dispatched integer
    6. title mediumtext
  3. Links
    1. id integer
    2. from_doc_id integer
    3. to_doc_id integer
  4. Words
    1. id integer
    2. language_id integer (1 = English for now)
    3. word varchar(128)
    4. last_tidy integer
  5. Document_Words
    1. id integer
    2. document_id integer
    3. word_id integer
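
For readers who want to experiment with this model, here is a rough sketch of the five tables. It is an assumption-laden stand-in, not the original schema: it uses Python’s built-in SQLite module instead of MySQL, maps mediumtext to TEXT, assumes `ref` holds the document’s path, and the primary keys, defaults and sample rows are my own additions. The query at the end shows how the Links table yields host-to-host relationships.

```python
import sqlite3

SCHEMA = """
CREATE TABLE hosts (
    id       INTEGER PRIMARY KEY,
    hostname VARCHAR(255) NOT NULL,
    protocol INTEGER NOT NULL DEFAULT 1,   -- 1 = HTTP, 2 = HTTPS
    nocrawl  INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE documents (
    id         INTEGER PRIMARY KEY,
    host_id    INTEGER NOT NULL,
    ref        TEXT NOT NULL,              -- assumed: path of the document on its host
    timestamp  INTEGER,
    dispatched INTEGER NOT NULL DEFAULT 0,
    title      TEXT
);
CREATE TABLE links (
    id          INTEGER PRIMARY KEY,
    from_doc_id INTEGER NOT NULL,
    to_doc_id   INTEGER NOT NULL
);
CREATE TABLE words (
    id          INTEGER PRIMARY KEY,
    language_id INTEGER NOT NULL DEFAULT 1, -- 1 = English
    word        VARCHAR(128) NOT NULL,
    last_tidy   INTEGER
);
CREATE TABLE document_words (
    id          INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL,
    word_id     INTEGER NOT NULL
);
"""

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)

# Illustrative rows only: one seed host linking to one other host.
db.execute("INSERT INTO hosts (id, hostname) VALUES (1, 'seed.example'), (2, 'rival.example')")
db.execute("INSERT INTO documents (id, host_id, ref) VALUES (1, 1, '/'), (2, 2, '/')")
db.execute("INSERT INTO links (from_doc_id, to_doc_id) VALUES (1, 2)")

# Which hosts link to which other hosts?
rows = db.execute("""
    SELECT DISTINCT src.hostname, dst.hostname
    FROM links
    JOIN documents AS d_from ON d_from.id = links.from_doc_id
    JOIN documents AS d_to   ON d_to.id   = links.to_doc_id
    JOIN hosts AS src ON src.id = d_from.host_id
    JOIN hosts AS dst ON dst.id = d_to.host_id
    WHERE src.id <> dst.id
""").fetchall()
print(rows)
```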

The pieces:

An XMLRPC server that exposes a number of functions to the worker (client).

  • get_link(): Get a link for processing. The link returned will be a random link from any host whose ‘nocrawl’ field is equal to 0, for a document whose ‘dispatched’ field is less than or equal to 0.
  • process($doc_id, $title, $links, $words): Handle the work performed by the worker. Here, any new links (via $links) and hosts are added to the respective tables (if they don’t already exist) and are scheduled for processing. As a side effect, the process method requires that a title and a words array be passed as well.

An XMLRPC client that implements proxy methods to the server defined above. Using this approach, the “Workers” can operate in a distributed fashion without the need for database connectivity.

A worker script that uses the XMLRPC client.

The worker script will call get_link(), fetch the content of the page and parse it, creating 3 pieces of data: an array of all the words found in the document, an array of all the links found on the page (expanded to absolute URLs), and finally the contents of the page’s <title> tag.
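
The parsing step can be approximated as follows. This is a sketch in Python using only the standard library, not the worker’s actual PHP; `PageParser` and `parse_page` are hypothetical names, and the word splitting (lower-case runs of a to z) is an assumption about what counts as a word. Text inside script and style tags is not filtered out here.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

class PageParser(HTMLParser):
    """Collects the title, absolute link targets and body text of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self.text_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Expand relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        else:
            self.text_parts.append(data)

def parse_page(base_url, html):
    """Return (title, links, words) for one fetched page."""
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.text_parts).lower())
    return parser.title.strip(), parser.links, words

SAMPLE = ('<html><head><title>Home</title></head><body>'
          '<a href="/buy">Buy now</a> <a href="http://partner.example/x">Partner</a>'
          '</body></html>')
title, links, words = parse_page("http://seed.example/", SAMPLE)
print(title, links, words)
```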

When the worker completes parsing, it calls the process() method on the client. The server services the request and stores the specified data within the previously referenced table structure, adding or updating as needed.
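
To make the server side of that exchange concrete, here is an in-memory sketch of the two exposed functions. It is written in Python with dictionaries standing in for the MySQL tables; the `CrawlStore` class and its `add_document` helper are hypothetical, and setting ‘dispatched’ to 1 when a link is handed out is my assumption about how a document leaves the queue.

```python
import random
from urllib.parse import urlsplit

class CrawlStore:
    """In-memory stand-in for the database-backed server (illustrative only)."""

    def __init__(self):
        self.hosts = {}      # hostname -> {"nocrawl": 0 or 1}
        self.documents = {}  # doc_id -> {"url", "host", "dispatched", "title"}
        self.doc_ids = {}    # url -> doc_id
        self.links = set()   # (from_doc_id, to_doc_id)
        self.words = {}      # doc_id -> set of words seen in that document

    def add_document(self, url):
        """Return the id for url, creating host and document entries if needed."""
        if url in self.doc_ids:
            return self.doc_ids[url]
        host = urlsplit(url).hostname or ""
        self.hosts.setdefault(host, {"nocrawl": 0})
        doc_id = len(self.documents) + 1
        self.documents[doc_id] = {"url": url, "host": host, "dispatched": 0, "title": ""}
        self.doc_ids[url] = doc_id
        return doc_id

    def get_link(self):
        """Hand out a random undispatched document on a crawlable host, or None."""
        candidates = [
            doc_id for doc_id, doc in self.documents.items()
            if doc["dispatched"] <= 0 and self.hosts[doc["host"]]["nocrawl"] == 0
        ]
        if not candidates:
            return None
        doc_id = random.choice(candidates)
        self.documents[doc_id]["dispatched"] = 1  # assumption: 1 marks "handed out"
        return doc_id, self.documents[doc_id]["url"]

    def process(self, doc_id, title, links, words):
        """Store a worker's results; unseen link targets become new work."""
        self.documents[doc_id]["title"] = title
        self.words[doc_id] = set(words)
        for url in links:
            self.links.add((doc_id, self.add_document(url)))

store = CrawlStore()
seed_id = store.add_document("http://seed.example/")
job = store.get_link()
store.process(seed_id, "Home",
              ["http://seed.example/buy", "http://rival.example/"],
              ["welcome", "home"])
print(len(store.documents), "documents,", len(store.links), "links")
```

In Python, an object like this could be exposed to remote workers with the standard library’s `xmlrpc.server.SimpleXMLRPCServer` and its `register_instance` method, which mirrors the server and proxy-client split described above.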

Executing the worker on a single seed page that contains one or more links recursively seeds the documents table with additional work. As work completes, new work is almost always added, so the system has more than enough to keep it going. What we end up with is a database that contains a hierarchy of everything that was linked from the seed site, and anything underneath, in a recursive fashion.

Output Products:

  • The complete seed-site outgoing link hierarchy.
  • An Internet-biased word dictionary.
  • A relationship between Words, Documents and Hosts
  • A quasi-search engine feature.

By themselves, the output products are useful in addressing the initial need of generating a list of linked sites to a given depth of recursion by simply using SQL and some PHP to generate some reports.
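
As an illustration of such a report, the sketch below takes host-to-host links (as they might be pulled from the Links, Documents and Hosts tables) and lists the shortest path from the seed host to each known competitor within a given number of hops. It is Python rather than SQL and PHP, and the function name, hostnames and data are all made up for the example.

```python
from collections import deque

def paths_to_competitors(host_edges, seed, competitors, max_depth=3):
    """Return {competitor: shortest host path from seed} within max_depth hops."""
    graph = {}
    for src, dst in host_edges:
        graph.setdefault(src, set()).add(dst)

    found = {}
    parents = {seed: None}              # also serves as the visited set
    queue = deque([(seed, 0)])
    while queue:
        host, depth = queue.popleft()
        if host in competitors and host != seed:
            # Walk the parent chain back to the seed to rebuild the path.
            path, node = [], host
            while node is not None:
                path.append(node)
                node = parents[node]
            found[host] = path[::-1]
        if depth == max_depth:
            continue
        for nxt in sorted(graph.get(host, ())):
            if nxt not in parents:
                parents[nxt] = host
                queue.append((nxt, depth + 1))
    return found

# Illustrative host-level links only.
HOST_EDGES = [
    ("seed.example", "partner.example"),
    ("seed.example", "blog.example"),
    ("partner.example", "rival.example"),
    ("rival.example", "other-rival.example"),
]
COMPETITORS = {"rival.example", "other-rival.example"}

report = paths_to_competitors(HOST_EDGES, "seed.example", COMPETITORS, max_depth=2)
for competitor, path in report.items():
    print(competitor, "reached via", " -> ".join(path))
```

Raising `max_depth` to 3 in this toy data also surfaces the second competitor, which sits one hop behind the first.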

More interesting is expressing the site hierarchy using a high-performance charting tool. The brains at AT&T sorted this out moons ago with something called Graphviz. It’s a portable graphing tool made up of various rendering programs that use a scripting language to define the on-graph relationships. The script is fed into the desired executable, and the output format of your choosing can be specified.
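
A script in Graphviz’s DOT language can be generated from the collected host links with very little code. The following is a hedged sketch in Python, not the script used for the images in this post; `to_dot`, the layout attributes and the sample hostnames are my own choices.

```python
def to_dot(host_edges, seed=None):
    """Build a DOT digraph with one edge per distinct (source, target) host pair."""
    def quote(name):
        # DOT identifiers containing dots or dashes must be double-quoted.
        return '"' + name.replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = ["digraph sites {", "    rankdir=LR;", "    node [shape=box];"]
    if seed is not None:
        # Highlight the seed site so it is easy to spot in a large graph.
        lines.append(f"    {quote(seed)} [style=filled, fillcolor=lightblue];")
    for src, dst in sorted(set(host_edges)):
        lines.append(f"    {quote(src)} -> {quote(dst)};")
    lines.append("}")
    return "\n".join(lines)

dot_script = to_dot(
    [("seed.example", "partner.example"),
     ("partner.example", "rival.example"),
     ("seed.example", "partner.example")],   # duplicate edge, emitted once
    seed="seed.example",
)
print(dot_script)
```

Saved to a file such as sites.dot, the output can be rendered with one of the Graphviz programs, for example `dot -Tpng sites.dot -o sites.png`.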

After a few hours of letting the bot crawl, I decided to play with Graphviz and all the existing data collected from the F-Prot crawl, just to get a feel for how things were progressing. The following is one of the first images generated (caution, it’s huge). F-Prot.com seeded Link Map.

The bots are well on their way but are getting sidetracked and sucked into walking through a lot of off-seed content. I’ve let the bots run pretty much unmoderated to see what emerges “organically”.

To that end, I created a script that generates a Graphviz script for all the sites the bot has crawled. Here is the result at 10% of the original size.

Sites seen by the phpcrawl bot

The output is pretty neat to look at when resized. It’s almost a form of emerging Internet art, which obviously was not my original intent. For what it’s worth, the rendered JPEG is 29 MB in size at 11860 x 12380 pixels, and took a good 30 minutes to render on a borrowed quad-CPU Sun X2200 with 4 GB of memory.

That’s about it for now; the bots must go on. I’m going to have to start shopping for partners to provide bandwidth and processing power, as my lowly cable connection is truly being violated. So far the bots have fetched and parsed 611545 documents from 15343 hosts while operating part-time (on my idle CPU and network hours).

Posted by Paul Skinner, filed under PHP, Projects. Date: March 23, 2008, 9:31 pm
