

Evaluation of the harvested data

To gain insight into the material retrieved during a snapshot, we implemented a module that compiles statistics. Both runs, the one using the Nedlib-crawler and the one using Combine, were incomplete. However, the latter was considerably larger and therefore yields more accurate numbers. To convey a picture of the dimensions this repository is dealing with, an excerpt of the statistics based on the Combine-crawl is presented here. Incomplete crawls also provide useful insight, yet numbers about ten times as high may be expected for a complete snapshot.

Table 4.1 shows numbers for the various domains from which documents have been extracted. For each domain it lists the number of hosts that have been accessed, the number of documents that have been acquired, and the total size in kilobytes of all files downloaded. As expected, most documents have been collected from the .at-domain. The numbers for the standardised second-level domains .ac.at, .co.at, .gv.at, and .or.at are not included in the numbers for the .at-domain but are listed separately. It is striking that they have relatively few registered hosts, which suggests that they are not widely accepted by the general public. A comparison of .ac.at and .co.at shows that, even though the academic sector has somewhat fewer hosts (1.798 versus 2.091), its data volume is more than four times that of the commercial sector. Quite popular in Austria is the .cc-domain of the Cocos (Keeling) Islands, an island group in the Indian Ocean. In contrast, .tv, which is the country code of Tuvalu, an island group in the South Pacific Ocean, and at the same time an abbreviation for television, was discovered only recently and is expected to grow, especially with the introduction of private television.
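The grouping behind Table 4.1 can be illustrated with a minimal Python sketch. It is not the statistics module used in the project; the function names, the record layout (URL plus size in kilobytes), and the sample records are assumptions made for illustration only. The sketch counts the four standardised second-level domains separately from the remaining .at-domain, as described above.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Standardised Austrian second-level domains that are listed separately
# from the remaining .at-domain (taken from the text above).
SECOND_LEVEL_AT = ("ac.at", "co.at", "gv.at", "or.at")


def domain_of(host):
    """Return the statistics bucket for a host name, e.g. 'ac.at' or 'at'."""
    host = host.lower().rstrip(".")
    for sld in SECOND_LEVEL_AT:
        if host == sld or host.endswith("." + sld):
            return sld
    # Otherwise fall back to the top-level domain (last label of the host).
    return host.rsplit(".", 1)[-1]


def domain_statistics(records):
    """Aggregate (url, size_in_kilobytes) records per domain.

    Returns a dict mapping domain -> (hosts, documents, size).
    The record layout is a hypothetical one chosen for this sketch.
    """
    hosts = defaultdict(set)
    documents = defaultdict(int)
    size = defaultdict(int)
    for url, kilobytes in records:
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip records without a usable host name
        domain = domain_of(host)
        hosts[domain].add(host)
        documents[domain] += 1
        size[domain] += kilobytes
    return {d: (len(hosts[d]), documents[d], size[d]) for d in documents}


if __name__ == "__main__":
    # Hypothetical sample records, not taken from the crawl.
    sample = [
        ("http://www.ifs.tuwien.ac.at/~aola/", 12),
        ("http://www.ifs.tuwien.ac.at/~aola/index.html", 8),
        ("http://www.example.at/a.html", 5),
    ]
    for domain, row in sorted(domain_statistics(sample).items()):
        print(domain, *row)
```

Distinct hosts are kept in a set per domain, so a host that serves many documents is counted only once in the #hosts column.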

Table 4.2 lists the extensions of the acquired files, which indicate their data format. For each extension the number of files and the total size of those files are listed. The first block of the table details the most prevalent extensions for the HTML data format, first each separately, then summed up. Besides the extensions .html, .htm, .shtml, and .shtm, the entry "automatic" is listed. This refers to URLs that do not point directly to a file but rather to a directory. On such a request the web-server returns a default file found in that directory. Our web-server, for example, redirects http://www.ifs.tuwien.ac.at/~aola/ to the URL http://www.ifs.tuwien.ac.at/~aola/index.html.
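How a URL might be assigned to one of the rows of Table 4.2 is sketched below in Python. This is an illustrative assumption, not the project's actual code: the function names and the record layout are hypothetical, and the sketch treats only URLs whose path is empty or ends in a slash as "automatic", since a directory URL without a trailing slash cannot be recognised from the URL alone.

```python
import posixpath
from collections import Counter
from urllib.parse import urlsplit

# Rows that the table sums up as the HTML data format.
HTML_GROUP = ("html", "htm", "shtml", "shtm", "automatic")


def extension_of(url):
    """Return the lower-cased file extension of a URL.

    URLs that point to a directory (empty path or trailing slash) are
    reported as 'automatic'; files without an extension yield ''.
    """
    path = urlsplit(url).path  # query string and fragment are ignored
    if path == "" or path.endswith("/"):
        return "automatic"
    _, ext = posixpath.splitext(posixpath.basename(path))
    return ext[1:].lower()


def extension_statistics(records):
    """Count documents and sum sizes per extension.

    `records` is an iterable of (url, size_in_kilobytes) pairs,
    a hypothetical layout chosen for this sketch.
    """
    documents = Counter()
    size = Counter()
    for url, kilobytes in records:
        ext = extension_of(url)
        documents[ext] += 1
        size[ext] += kilobytes
    return documents, size


def html_total(counter):
    """Sum a per-extension counter over the HTML group."""
    return sum(counter[ext] for ext in HTML_GROUP)
```

For example, `extension_of("http://www.ifs.tuwien.ac.at/~aola/")` falls into the "automatic" row, whereas the same URL with `index.html` appended is counted under html.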

Furthermore, the table clearly shows that Adobe's PDF format is more popular than PostScript. The dominance of the JPEG format over other image types is also obvious. This is due to the high compression rate JPEG offers, which is a crucial feature considering the low download rates many users have to cope with.

In addition, a large number of unusual extensions have been discovered, such as .d15 or .grv. The MIME-type of a document could give information about its type, yet many files remain unrecognised [Arv01]. For those unknown formats it is difficult, perhaps impossible, to find an appropriate long-term preservation strategy. The same difficulties apply to access provision.


Table 4.1: second run - statistics (excerpt) - domains

domain   #hosts  #documents  size (kilobyte)
at       38.883   2.116.940       77.191.623
ac.at     1.798     311.798       21.299.944
co.at     2.091     124.459        4.674.595
gv.at       262      54.035        3.325.528
or.at       547      61.998        2.188.627
com         797      79.553        2.165.194
edu          14          60            9.954
int           1       1.582           14.962
net         211      24.772          789.394
org         133      10.997          635.357
cc          124      56.083        1.676.642
de          104       1.310          131.809
hu            1          59            1.134
tv            2          32              217
...         ...         ...              ...
total    45.178   2.846.544      114.183.012



Table 4.2: second run - statistics (excerpt) - extensions

extension     #documents  size (kilobyte)
html             595.848        7.903.787
htm              798.765        8.712.431
shtml             32.700          583.452
shtm               3.656           89.194
"automatic"      104.212          894.742
=> sum (html+htm+shtml+shtm+automatic)
               1.535.181       18.183.606
txt               11.175          253.011
pdf               49.913       20.288.111
ps                 2.757        1.694.369
wav                1.669        1.480.466
mp3                5.005        7.314.008
avi                  576        1.299.784
mpg/mpeg           1.352        4.058.790
jpg/jpeg          99.423        7.872.700
gif               14.181          831.244
tif/tiff             997        1.588.893
zip               13.167        9.867.170
tgz/gz             5.273        1.925.112
exe               10.078        8.267.007
cgi               77.208          852.861
jsp               16.341          243.450
asp              289.657        4.838.417
pl                73.007          826.735
php              251.732        4.653.314
xls                1.722          262.933
doc               11.884        2.031.507
rtf                2.345          259.631
d15                    4               52
di                     1               25
es                     1               12
fas                    8              248
grv                    1                9
kop                    2               30
...                  ...              ...



Andreas Aschenbrenner