
Pragana's Tcl Guide


Old notes


wwwoffle cache extracting

If you spend time browsing the internet looking for stuff, it may be helpful to run a proxy server, like squid or wwwoffle. I use the latter, because it is simple and fast. Every week or so, I save all "visited places" stored in the wwwoffle cache to a cdrom and clean out its space, before my hard disk gets full. Reinstalling a saved cache is no fun, because its archive is a big thing. Searching for something is difficult too, because the cached files have names like "Uxdfgfdx" or "Dfhjjhgfrty". Only the site domains are plain text (directories). So... time to write a tcl script to sort those things out!

What do those files in the cache dirs mean?

Wwwoffle's cache contains only directories named after the domains, holding files that begin with D or U. Each Uxxxxxx file holds the URI of its corresponding Dxxxxxx file, with the same xxxxxx suffix. The Dxxxxxx file is a complete http response, with a header followed by the html, text, gif, or whatever kind of information was received in answer to the GET protocol request.
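The U/D pairing can be sketched with a few lines of tcl. (The cache path below is just a made-up example; substitute a real site directory from your own cache.)

```tcl
# Hypothetical site directory inside the wwwoffle cache:
set cachedir /var/spool/wwwoffle/http/www.example.com

foreach u [glob -nocomplain $cachedir/U*] {
    # the first line of a Uxxxxxx file is the original URI
    set f [open $u]
    set uri [gets $f]
    close $f
    # the data file shares the xxxxxx suffix, with D instead of U
    set d [file join [file dirname $u] D[string range [file tail $u] 1 end]]
    puts "$uri is stored in $d"
}
```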

Our procedure to make "local" files from the cached version can be stated as:

1. For each Uxxxxxx file, read the URI it contains to learn the original file name.
2. Open the matching Dxxxxxx file, skip its http header, and copy the body to a local file named after the last component of the URI.
3. For html files, rewrite the SRC= and BACKGROUND= attributes so they point to the local picture files.

Of course, this code is not bullet-proof, but if the html is correct, only IMG tags will be touched.
To use the program, change the cachedir variable to something suitable (the site directory inside your wwwoffle cache). If you like, also change the destdir (destination directory) variable. If you prefer a widgetized version, make an entry widget for those variables. This is left as an exercise!
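To see what the link rewriting does, here is the same regsub the script uses, applied to a made-up sample line. It strips everything up to the last slash inside the quoted URL, leaving only the file name:

```tcl
# Sample html line (invented for illustration):
set line {<IMG SRC="http://members.xoom.com/user/pics/logo.gif" ALT="logo">}
# Keep only the last path component inside the quotes:
regsub {(.*SRC=)\".*/(.*)\"(.*)} $line {\1"\2"\3} line
puts $line
```

After the substitution, the line reads `<IMG SRC="logo.gif" ALT="logo">`, so the browser looks for the picture in the same local directory as the page.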

Here's the code:

#!/bin/sh
#
# Utility to filter the wwwoffle cache, creating a local
# set of html pages together with the included pictures.
# \
exec wish "$0" "$@"

set cachedir /var/spool/wwwoffle/http/members.xoom.com
set destdir [pwd]
#
# Converts the Dxxxxxx files according to the contents of the
# corresponding Uxxxxxx files; saves the result in the destination dir.
#
proc cvfiles { } {
    global cachedir destdir
    foreach f [glob -nocomplain $cachedir/U*] {
        regexp {(.*)/U(.*)} $f match prefix suffix
        # the Uxxxxxx file holds the URI of the cached document
        set inf [open $f]
        set destname [gets $inf]
        close $inf
        puts "file: D$suffix --> [file tail $destname]"
        ### open the data file and discard its http header
        set inf [open $prefix/D$suffix]
        set newfn $destdir/[file tail $destname]
        set outf [open $newfn w]
        fconfigure $inf -translation binary
        fconfigure $outf -translation binary
        # header lines end in \r\n, so the blank separator line has length 1
        while {[string length [gets $inf]] > 1} { puts -nonewline . }
        fcopy $inf $outf
        close $inf
        close $outf
        ### change links to the pictures in the html files
        if {[string match {*.html} $newfn]} {
            set inf [open $newfn]
            set outf [open tmp_html w]
            while {![eof $inf]} {
                set line [gets $inf]
                if {[string match {*SRC=*} $line]} {
                    regsub {(.*SRC=)\".*/(.*)\"(.*)} $line {\1"\2"\3} line
                }
                if {[string match {*BACKGROUND=*} $line]} {
                    regsub {(.*BACKGROUND=)\".*/(.*)\"(.*)} $line {\1"\2"\3} line
                }
                puts $outf $line
            }
            close $inf
            close $outf
            file rename -force tmp_html $newfn
        }
    }
}

cvfiles
exit

That's all, folks. Happy hacking!

