Just a quick note.
I have written something similar in Perl, though there are a few details
that need to be considered when acting as a tunnel. The main problem I
found was what to do when Content-Length was not provided. Since you
are parsing the HTML you have to read the whole body, and because of
persistent connections that normally means reading a fixed number of
bytes (the Content-Length supplied in the header). If no Content-Length
is given you can read until the connection is terminated...
OR (and I still need to add this fix) sometimes the end of the HTTP body
is signalled by chunked encoding instead, i.e. a Transfer-Encoding:
chunked parameter in the header.
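
In rough terms the three cases look something like this (just a sketch,
not my actual code, and it assumes the response headers have already
been parsed into a hash %headers with lowercased keys):

use strict;
use warnings;

# Sketch only: decide how to read the body once the headers are known.
sub read_body {
    my ($sock, %headers) = @_;
    my $body = '';

    if (($headers{'transfer-encoding'} || '') =~ /chunked/i) {
        # Chunked: repeat "<hex length>CRLF<data>CRLF" until a zero-length chunk.
        while (defined(my $size_line = <$sock>)) {
            my $size = hex(($size_line =~ /^([0-9a-fA-F]+)/)[0] || '0');
            last if $size == 0;
            read($sock, my $chunk, $size);   # real code should loop on short reads
            $body .= $chunk;
            <$sock>;                         # swallow the CRLF after the chunk data
        }
    }
    elsif (defined $headers{'content-length'}) {
        # Fixed length: read exactly Content-Length bytes.
        read($sock, $body, $headers{'content-length'});
    }
    else {
        # No length information at all: read until the server closes the connection.
        local $/;
        $body = <$sock>;
    }
    return $body;
}

(A real version also has to cope with short reads and the trailer after
the final zero-length chunk, but that is the general idea.)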
If anyone here has anything to add to this then please do, as my HTTP
knowledge is not the best!
Tim
ps
Depending on your needs I could dig out my Perl code, but I was thinking of
a rewrite or even (if I feel mad) a Squid hack!
On Sun, 2002-07-28 at 15:38, andrew cooke wrote:
>
> Thanks. I don't want to become involved in altering Squid itself because
> it's such a big project that to do anything useful would take a lot of time
> (not bad in itself, but time I would like to spend on the application rather
> than the enabling technology).
>
> I looked at Dan's Guardian, but I think it would also need patching (no
> support for plugins etc.) and the licence and censorship implications made me
> uneasy. Then I started wondering why DG needed Squid at all and realised
> that it is probably acting as a tunnel - the browser is connecting to DG
> expecting a proxy, and DG forwards transparently to Squid.
>
> This architecture (tunnel in front of Squid) is something I can implement
> myself (I've years of experience writing multithreaded code using sockets),
> while leaving Squid to handle most of the nasty HTTP details (changing down
> versions, managing persistence etc., that were worrying me). Of course I'll
> still need some parsing of data to separate headers and data, but the logic
> should (I hope) be much simpler (just blank lines and content lengths).
>
> From another POV, the tunnel can process the body and Squid can process the
> headers.
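> 
> Roughly the shape I have in mind is something like this (an untested
> sketch rather than real code - the listening port is arbitrary, Squid is
> assumed on its default port 3128, and the processing step is only a comment):
> 
> #!/usr/bin/perl
> # Tunnel sketch: accept from the browser, relay the request to Squid, and
> # split Squid's reply into headers and body at the blank line.
> use strict;
> use warnings;
> use IO::Socket::INET;
> 
> my $listen = IO::Socket::INET->new(LocalPort => 8118, Listen => 5,
>                                    ReuseAddr => 1) or die "listen: $!";
> 
> # One request per connection, to keep the sketch simple.
> while (my $browser = $listen->accept) {
>     my $squid = IO::Socket::INET->new(PeerAddr => 'localhost:3128')
>         or die "connect to squid: $!";
> 
>     # Relay the request headers verbatim (GET only here; a real tunnel
>     # would also have to forward any request body).
>     while (defined(my $line = <$browser>)) {
>         print $squid $line;
>         last if $line =~ /^\r?\n$/;
>     }
> 
>     # Collect the response headers up to the blank line.
>     my $headers = '';
>     while (defined(my $line = <$squid>)) {
>         $headers .= $line;
>         last if $line =~ /^\r?\n$/;
>     }
> 
>     # Simplest case only: Content-Length tells us how much body to read.
>     my ($length) = $headers =~ /^Content-Length:\s*(\d+)/mi;
>     my $body = '';
>     read($squid, $body, $length) if defined $length;
> 
>     # ... word counts, link extraction etc. would happen on $body here ...
> 
>     print $browser $headers, $body;
>     close $squid;
>     close $browser;
> }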
>
> So thanks again - that pointed me in the right direction.
>
> Cheers,
> Andrew
>
> On Saturday 27 July 2002 10:14 pm, you wrote:
> > Squid currently has no facilities for processing content beyond the HTTP
> > headers, in plug-in form or otherwise. There have been a few hacks
> > along the way that do some specialized form of filtering (like stripping
> > the anim bit from GIFs, or stripping out javascript), but those projects
> > never really went anywhere and have long been unsupported.
> >
> > Robert has done some promising work on generic content processing in
> > Squid, but ran into some roadblocks that he didn't have time to address.
> > You may want to start from there and tackle the issues he ran into,
> > if you have the time and inclination.
> >
> > ICAP provides support for similar things in limited circumstances (it is
> > targeted at content providers who want to customize or aggregate
> > content or provide additional services nearer to the client). Geetha
> > (and Ralf? I think) has been doing lots of cool stuff in that area, but
> > I don't think it will address your needs at all in its existing form.
> >
> > Dan's Guardian does content processing, and so might provide a good
> > starting point (note the request attached to its GPL license, however,
> > before embarking on any commercial work with it). It is a standalone
> > proxy these days, obviously much simpler in implementation than
> > Squid...I do not know how compliant it is with regard to the HTTP
> > protocols, but I haven't heard any particularly alarming things about it
> > and Dan seems to be a skilled programmer, so I'd suspect it is a good
> > choice for your project.
> >
> > andrew cooke wrote:
> > > Hi,
> > >
> > > Is there a simple way to process files that are requested through Squid?
> > >
> > > I'd like to try constructing a database containing links, word counts
> > > etc, for pages that I view. The simplest way I can think of to do this
> > > is to point my browser at a proxy and process data there. Squid seems
> > > the obvious choice for a proxy (but see last point below).
> > >
> > > Looking for similar functionality in other code working with Squid, I
> > > found the Viralator which checks downloads for viruses
> > > (http://viralator.loddington.com/). It intercepts requests using Squirm,
> > > pulls the file using wget, and then resupplies it (after scanning) via
> > > Apache. This seems very complicated (and may only work correctly for
> > > downloads rather than page views - I'm not clear about the details yet)
> > > (although I could drop Apache when working on the machine hosting Squid).
> > >
> > > Instead, I was wondering if Squid had support for plugin modules (that
> > > might be intended to support filters, for example), but I haven't been
> > > able to find anything.
> > >
> > > Another approach might be to scan the files cached by Squid (ie as files
> > > on the local disk, not streamed data). But this presumably won't work
> > > with dynamic pages and it might be difficult to associate URLs with files
> > > (also, it forces caching when, for single person use, proxy-only might be
> > > sufficient). And how would this be triggered for new files?
> > >
> > > Does anyone have any suggestions on the best way forwards? Perhaps
> > > there's a simpler proxy that I could use instead? There are certainly a
> > > lot of simple http proxies out there, but I'm not sure how closely they
> > > follow the spec.
> > >
> > > Any help appreciated,
> > > Thanks,
> > > Andrew
>
> --
> http://www.acooke.org