From: "Chris Tillman"
Date: Wed, 25 Aug 2004 07:08:32 -0700
To: Eric Praetzel
Subject: Re: h-costume archives

On Mon, Aug 23, 2004 at 09:07:44AM -0400, Eric Praetzel wrote:
>
> > I was wanting to do a full text search in the H-COSTUME archives. I noted
>
> You could still use the old search engine at:
>
> http://sca.uwaterloo.ca/Fashion/full-search.cgi
>
> But you'll likely find it faster to search them at home - esp. if you're good
> with search tools like grep.

Thanks for posting the cleaned archives. I concatenated all the digests
from 1993 through 2003 in order into one big file (~262M), and then used
grep to remove almost all the headers except From, Date, and Subject;
and many of the repeated list instructions and footers.

The command I used was

grep -v -f lines_to_remove

where lines_to_remove was

^[[:space:]].*SMTP id
^[>]* *Received:
H-COSTUME[@+-]
h-costume[@+-]
indra.com
fashion\@sca.uwaterloo.ca
Send subscription
^[>]* *Newsgroups:
^[>]* *Organization:
For archives of this digest
^[>]* *X-[^:]*:
^[>]* *Content-[^:]*:
^[>]* *Content-Type:
^[>]* *Message-[^:]*:
^[>]* *Errors-[^:]*:
^[>]* *Sender:
^[>]* *Precedence:
^[>]* *Originator:
^[>]* *Priority:
^[>]* *User-Agent:
^[>]* *Status:
^[>]* *List-[^:]*:
^[>]* *Importance:
^[>]* *[Cc][Cc]:
^[>]* *References:
^[>]* *Reply-To:
^[>]* *In-[Rr]eply-[Tt]o:
^[>]* *M[Ii][Mm][Ee]-[^:]*:
send mail to majordomo\@
UNSUBSCRIBE H-COSTUME
^[>]* *Send h-costume mailing
^[>]* *To subscribe or unsubscribe
^[>]* *or, via email, send a message
^[>]* *You can reach the person
^[>]* *When replying, please edit
^[>]* *than "Re: Contents of h-costume
^h-costume mailing list$
charset=
^[>]*[[:space:]]\+[A-Za-z]*,* *[0-9]\+ [A-Za-z]\+ [0-9]\+ [0-9]\+:[0-9]\+:[0-9]\+ -[0-9]\+
[Gg]et .*FREE
[Jj]oin .*FREE
[Ss]end .*FREE
[Ss]ign.*FREE
FREE.*Internet
FREE.*software
FREE.*access
Yahoo! .*FREE
FREE.*[Ee]mail
win a FREE
FREE.*download
email.*FREE
months.*FREE
FREE.*limited
Trial.*FREE

The resulting 'hypercleaned' text file is about 130M. If you'd like to
have a copy, I could zip it and transfer it to your ftp server. (Or we
could arrange a time when my computer could be on and connected for you
to ftp from it.)

--
Debian GNU/Linux Operating System
By the People, For the People
Chris Tillman (a people instance)
toff.tuc @ cox.net
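
For readers who want to try the grep -v -f approach described in the message on a small scale, here is a minimal sketch. Only the grep -v -f invocation and the three patterns are taken from the message. The sample digest text, the file name digest.txt, and the temporary directory are assumptions made up for illustration, and this is not necessarily how the original 262M file was processed.

```shell
#!/bin/sh
# Hypothetical scratch directory and file names; only "grep -v -f" and the
# three patterns below come from the message.
workdir=$(mktemp -d)

# A tiny made-up stand-in for the concatenated digests.
cat > "$workdir/digest.txt" <<'EOF'
From: someone@example.com
Received: from mail.example.com by relay.example.com
X-Mailer: ExampleMailer 1.0
Subject: Tudor sleeves
> Content-Type: text/plain
Has anyone drafted a pattern for these?
EOF

# Three patterns from the message's list, one basic regular expression per line.
cat > "$workdir/lines_to_remove" <<'EOF'
^[>]* *Received:
^[>]* *X-[^:]*:
^[>]* *Content-[^:]*:
EOF

# -f reads the patterns from a file; -v keeps only lines matching none of them.
cleaned=$(grep -v -f "$workdir/lines_to_remove" "$workdir/digest.txt")
printf '%s\n' "$cleaned"

rm -rf "$workdir"
```

With this sample input, the From and Subject lines and the one line of body text are kept, while the Received, X-Mailer, and quoted Content-Type lines are dropped. The leading ^[>]* * in each pattern is what lets it also catch headers that were quoted in a reply.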