Pragana's Tcl Guide


Old notes


Accessing the web with sockets

Tcl provides facilities for accessing sockets with simple function calls. Taken together, the small code fragments below make a simple web page retriever.
We keep a configuration file holding just the URIs we want to fetch, and a parser for it is easily built with the regexp command:
proc parse_file {script} {
    set fd [open $script r]
    # gets returns -1 at end of file, so we stop cleanly there
    while {[gets $fd line] >= 0} {
        if {![regexp {http://([^/]*)(.*)} $line match url fn]} {
            continue
        }
        if {[string length $fn] == 0} {
            set fn "noname"
        }
        get_webpage $url $fn
    }
    close $fd
}
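To see what the regexp above extracts, here is a small standalone check; the sample URL is only an illustration:

```tcl
# Split a URL into host and path the same way parse_file does.
set line "http://www.tcl.tk/man/index.html"
regexp {http://([^/]*)(.*)} $line match url fn
puts $url   ;# www.tcl.tk
puts $fn    ;# /man/index.html
```

The first capture group stops at the first slash after "http://", so it holds the host; the second group keeps everything else as the path.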
If no filename is given (the address ends with a slash "/" or has no path at all), we save the file as "noname", so it gets a name anyway. We also check whether this file already exists, so we don't fetch a page that is already present.
The procedure get_webpage will create a socket connection, similar to an open file, and send over this connection the command "GET <filename>", which is how a page is requested in http.  The standard port for web page requests (http) is 80.  Other services use other ports.
The variable eventLoop will be set when the page is fully read. This will be done by the fileevent callback procedure.
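The fileevent/vwait pattern can be tried without any network at all. This sketch uses a local pipe (Tcl 8.6's chan pipe) in place of the socket, and reuses the eventLoop variable name from the article to show how vwait blocks until the readable callback sets it:

```tcl
# One end of the pipe plays the role of the socket: when data
# arrives, the readable callback fires and sets the variable
# that vwait is waiting on.
lassign [chan pipe] rd wr
fileevent $rd readable {
    set got [gets $rd]
    set eventLoop "done"
}
puts $wr "hello"
flush $wr
vwait eventLoop        ;# returns once the callback has run
close $rd
close $wr
puts $got              ;# hello
```

Note that vwait enters the event loop, so the callback only runs while vwait (or the event loop) is active.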

We consider a previously fetched page to be OK when the file exists and is not empty; in that case we skip it rather than fetch it again.
This simple procedure does the job:

proc file_ok {fn} {
    # Returns 1 when the page should be fetched:
    # the file is missing or empty.
    if {![file exists $fn]} {
        return 1
    }
    if {[file size $fn] > 0} {
        return 0
    }
    return 1
}
Then, here is the main routine:
proc get_webpage {url fn} {
    global eventLoop sock savefd savedir retries max_retries
    set retries 0
    puts "transferring: $url$fn"
    # parse_file passes "noname" when the URL had no path;
    # in that case we must still request "/" from the server
    if {$fn eq "" || $fn eq "noname"} {
        set path /
    } else {
        set path $fn
    }
    # save under the last path component, or "noname" for a bare "/"
    set savename [file tail $path]
    if {$savename eq "" || $savename eq "/"} {
        set savename noname
    }
    if { [file_ok "$savedir/$savename"] } {
        while { $retries < $max_retries } {
            set savefd [open "$savedir/$savename" w]
            set sock [socket $url 80]
            fconfigure $sock -buffering line
            fileevent $sock readable [list read_sock $sock]
            puts $sock "GET $path"
            vwait eventLoop
            close $sock
            close $savefd
            if {[file size "$savedir/$savename"] > 0} {
                break
            }
            incr retries
        }
    }
}
To manage the reading of lines from the connection, we have registered a fileevent with the callback "read_sock". This procedure will be called each time the channel (stored in the "sock" variable above) becomes readable.
In case we don't get anything (perhaps because of a timeout), we count our retries and repeat the procedure.
Here is the callback procedure:
proc read_sock {rsock} {
    global eventLoop savefd
    if {[eof $rsock]} {
        set eventLoop "done"
    } elseif {[gets $rsock line] >= 0} {
        # only write complete lines; gets returns -1 on a partial line
        puts $savefd $line
    }
}
To finish our simple utility, we just have to define some variables and execute the parse_file command:
set savedir "/directory/where/to/save"
set max_retries 5
parse_file "webget.links"
exit

Remember to create the "webget.links" file and put one URL on each line.
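For instance, a webget.links file might look like this (these addresses are only placeholders):

```
http://www.tcl.tk/
http://www.tcl.tk/man/index.html
```

The first line has no filename after the host, so that page would be saved as "noname".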
You may want to extend this to fetch ftp files as well. This is left as an exercise :)
 

