Monday, 4 January 2016

Again, the directory Yahoo!

In November 2014 I wrote about the Yahoo! directory. One suggestion on Twitter was to copy the whole directory. After I wrote that blog post, I tried to copy the Yahoo! directory with Wget. That seems simple, but it isn't. For several weeks my computer worked day and night to fetch the directory. Because of an automatic Windows update, I wasn't able to fetch the whole directory. It's a pity, but I think that I have more than enough data.

After the fetch, I tried to analyse the links. That was hard, because the structure of the HTML code wasn't always the same, and I had made the mistake of converting all filenames to lowercase. That wasn't smart, because now I've got a lot of broken links. I then tried to parse the pages with the YahooAnalyser (some Python code I wrote to analyse the content of the old Yahoo! directory). The end result was ... nothing, besides the finding that this approach was far too complex.

Recently I decided to take another approach. Instead of analysing the complex HTML code, I deleted it, so that in the end I had simple lists of links, and I converted all uppercase characters to lowercase. After that it was possible to use a broken link checker like Xenu's Link Sleuth to analyse the fetched copy of the Yahoo! directory.
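The simplification step could look roughly like the Python sketch below. It is only an illustration under assumptions, not the actual YahooAnalyser code: the names LinkLister and simplify_page are hypothetical, and it assumes that every link of interest sits in the href attribute of an a tag.

```python
from html.parser import HTMLParser


class LinkLister(HTMLParser):
    """Collect the href of every <a> tag and ignore all other markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def simplify_page(html_text):
    """Reduce a fetched page to a plain, lowercased list of links."""
    parser = LinkLister()
    parser.feed(html_text)
    parser.close()
    # Lowercasing mirrors the step described above; it is an assumption
    # that this is applied to the whole link and not only to local paths.
    return [link.lower() for link in parser.links]
```

Writing the returned list to a file, one link per line, gives the kind of simple link list that a link checker can work through.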

My findings are:

  • I was able to fetch 56,359 folders (I assume that there are roughly 60,000 folders);
  • I was able to fetch 568,744 external links, so the average number of links per folder is about 10;
  • Of these 568,744 fetched links there were 92,746 duplicate links, so 84% of the links were unique;
  • Of the 475,998 unique links, 365,751 returned the status 200 OK. That's 77% OK and 23% not OK;
  • The top reasons a link is broken are:
    • No such host (9%);
    • No connection (8%);
    • Not found (2%).
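The percentages above follow directly from the counts. As a quick check, the arithmetic can be redone in a few lines of Python; the variable names are only illustrative, and the numbers are the ones from the list.

```python
# Counts taken from the findings above.
folders = 56_359
fetched_links = 568_744
duplicate_links = 92_746
ok_links = 365_751

unique_links = fetched_links - duplicate_links   # links left after removing duplicates
links_per_folder = fetched_links / folders       # average external links per folder
unique_share = unique_links / fetched_links      # fraction of fetched links that are unique
ok_share = ok_links / unique_links               # fraction of unique links with 200 OK
broken_share = 1 - ok_share                      # everything that is not 200 OK

print(f"unique links:     {unique_links:,}")
print(f"links per folder: {links_per_folder:.1f}")
print(f"unique:           {unique_share:.0%}")
print(f"200 OK:           {ok_share:.0%}")
print(f"not OK:           {broken_share:.0%}")
```

The three reasons listed add up to 19%, so about 4% of the unique links failed for other reasons.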

A year ago my estimate was that there were 55,000 to 75,000 categories, so that's roughly the same. However, I overestimated the number of links. At first I thought there were 1,000,000 to 3,000,000 links listed on Yahoo! Now I believe that there were roughly 500,000 unique links.

The figure of 23% broken links isn't entirely fair, because a year passed between my fetch and this analysis. However, in my opinion a broken link percentage of 20% or higher isn't acceptable. So I repeat last year's conclusion: "It is a pity, but it is logical that the Yahoo! directory is shut down".

