gusl: (Default)
[personal profile] gusl
A question for systems people:

If a plain-text log file gets too big for emacs to open (in my case 379MB), how can I search through it? Is it possible to chop such a file into several pieces?

grep works, but it doesn't let me see the context surrounding my hits.

UPDATE: grep -An -Bn pattern file shows n lines after (-A) and before (-B) each hit. Still, not as good as a text editor...
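A runnable sketch of those context flags, with a made-up file name and contents:

```shell
# Stand-in for the real log (hypothetical contents)
printf 'one\ntwo\nERROR here\nfour\nfive\n' > sample.log

# Two lines of context before (-B) and after (-A) each hit
grep -A 2 -B 2 'ERROR' sample.log
```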

(no subject)

Date: 2007-10-18 04:50 pm (UTC)
From: [identity profile] jcreed.livejournal.com
use "less"? Hit the slash key to search.

(no subject)

Date: 2007-10-18 04:55 pm (UTC)
From: [identity profile] gustavolacerda.livejournal.com
Thanks! That is helpful.

(no subject)

Date: 2007-10-18 05:05 pm (UTC)
From: [identity profile] gustavolacerda.livejournal.com
If 'less' can handle it, why won't emacs? Why can't emacs do it like 'less'?

(no subject)

Date: 2007-10-18 06:09 pm (UTC)
From: [identity profile] jcreed.livejournal.com
This is a fair question, and I don't really know the answer.

(no subject)

Date: 2007-10-18 07:33 pm (UTC)
From: [identity profile] cdtwigg.livejournal.com
You would hope that emacs would be smart enough to page in parts of the file at a time, but it's entirely possible it isn't. I know for sure that less holds only the current chunk of the file in memory and uses an indexing scheme to map between line numbers and disk byte offsets (if you open a big enough file, it will actually stall while building this index unless you tell it otherwise).

Some programs also have the ext2 file-size limit built in, so even if you're accessing files on filesystems without this limit, they can't handle them. Apache, for example, refused to serve files bigger than 2GB until very recently (like, within the last year or so, if I remember correctly).

(no subject)

Date: 2007-10-18 04:58 pm (UTC)
From: [identity profile] avocado-tom.livejournal.com
wc -l <filename.txt>
head --lines=<half the number of lines returned by wc> <filename.txt> > <file_1.txt>
tail --lines=<remaining number of lines> <filename.txt> > <file_2.txt>

(no subject)

Date: 2007-10-18 05:17 pm (UTC)
From: [identity profile] gustavolacerda.livejournal.com
Thanks! Maybe I should make a script to split a file into several chunks, given a chunk size.

Is there a command similar to 'head' and 'tail', but where you can specify a range in the middle?
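For the chunk-splitting part, coreutils already ships a tool for exactly this: split. A minimal sketch with a stand-in file and a made-up chunk size:

```shell
# Stand-in for the real log (hypothetical name and contents)
printf 'a\nb\nc\nd\ne\n' > big.log

# Cut into pieces of 2 lines each, named big_aa, big_ab, big_ac
split --lines=2 big.log big_
```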

(no subject)

Date: 2007-10-18 05:39 pm (UTC)
From: [identity profile] avocado-tom.livejournal.com
Not that I know of. You can actually just use...

head --lines=<X> <file.txt> | tail --lines=<X> > <file_middle.txt>

But you're starting to get into the realm where writing a quick perl script would probably be easier / more functional. If my perl weren't super rusty, I'd offer to do it, but... it is, and I'm crazy busy. :-)

(no subject)

Date: 2007-10-18 05:40 pm (UTC)
From: [identity profile] avocado-tom.livejournal.com
oh, that "lines" arg for tail should be Y, where Y < X
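Putting those two comments together: to get lines A through B, take the first B lines, then the last B-A+1 of those. A runnable sketch with made-up numbers:

```shell
# Ten numbered lines as a stand-in file
printf '1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n' > file.txt

# Lines 4 through 7: take the first 7 lines, then the last 4 of those
head --lines=7 file.txt | tail --lines=4 > file_middle.txt
```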

(no subject)

Date: 2007-10-18 07:33 pm (UTC)
From: [identity profile] inferno0069.livejournal.com
Here, sed would be less overpowered than perl and maybe faster: sed -n '{start},{end}p' < input > output, where {start} and {end} are line numbers and both are included. You could also do things like sed -n '/^Oct 18/p' to extract lines that start with "Oct 18", which could be useful if your logfile has a format like that of my /var/log/messages.
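Both sed forms as a runnable sketch, with stand-in file names and contents:

```shell
# Stand-in input (hypothetical contents)
printf 'a\nb\nc\nd\ne\n' > input.log

# Print lines 2 through 4, inclusive
sed -n '2,4p' input.log > middle.log

# Print only lines starting with "Oct 18"
printf 'Oct 18 boot\nOct 19 halt\n' > messages.log
sed -n '/^Oct 18/p' messages.log
```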

(no subject)

Date: 2007-10-18 05:39 pm (UTC)
From: [identity profile] gwillen.livejournal.com
I recommend vim... it surprises me that Emacs will die on a large file, but I know that vim will not.

(no subject)

Date: 2007-10-18 06:45 pm (UTC)
ikeepaleopard: (Default)
From: [personal profile] ikeepaleopard
In my experience, vim dies on long lines but not on large files. It probably has something to do with the data structures it builds up to make moving around fast.

(no subject)

Date: 2007-10-18 07:05 pm (UTC)
From: [identity profile] gwillen.livejournal.com
Yeah, vim's data structures are definitely line-oriented.

Now, if we've learned anything from the sad tale of Endo the alien, it's that _ropes_ are the right data structure for a text editor.... :-D

(no subject)

Date: 2007-10-18 07:11 pm (UTC)
From: [personal profile] chrisamaphone
:D

(no subject)

Date: 2007-10-18 08:03 pm (UTC)
gregh1983: (Default)
From: [personal profile] gregh1983
Wow... is the English lexicon really starting to look that much like German?

(no subject)

Date: 2007-10-18 08:18 pm (UTC)
ikeepaleopard: (Default)
From: [personal profile] ikeepaleopard
I know Word uses some weird out-of-order data structure, which (I'm less sure about this) has to be defragmented periodically.

(no subject)

Date: 2007-10-18 05:40 pm (UTC)
From: [identity profile] gwillen.livejournal.com
Also, if you're willing to split the file up by byte ranges instead of lines:

dd if=oldfilename of=filepartN bs=1024 count=K skip=J

will copy K kilobytes, starting at kilobyte J, from oldfilename into filepartN.
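A worked sketch with made-up sizes (a 4-kilobyte stand-in file, copying its second kilobyte, i.e. K=1, J=1):

```shell
# Build a 4-kilobyte stand-in file (hypothetical name)
dd if=/dev/zero of=old.log bs=1024 count=4 2>/dev/null

# Copy 1 kilobyte starting at the 1-kilobyte mark
dd if=old.log of=filepart2 bs=1024 count=1 skip=1 2>/dev/null
```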

(no subject)

Date: 2007-10-18 07:10 pm (UTC)
From: [identity profile] dachte.livejournal.com
This may, however, be somewhat suboptimal for whatever poor line (likely) ends up being chopped into parts. It's probably worth the cost in most cases though :)

(no subject)

Date: 2007-10-18 07:12 pm (UTC)
From: [identity profile] darius.livejournal.com
When that comes up I use an emacs workalike I wrote myself, http://www.accesscom.com/~darius/hacks/alph.tar.gz

Not as fast as grep or anything like as featureful as emacs, but it doesn't choke on big files or long lines.

(no subject)

Date: 2007-10-18 07:29 pm (UTC)
From: [identity profile] darius.livejournal.com
Oh, also you can say just grep -5 instead of grep -A5 -B5. I'd never even heard of -A and -B before.

(no subject)

Date: 2007-10-18 08:38 pm (UTC)
From: [identity profile] gwillen.livejournal.com
That's interesting... I thought -C5 would be what you're saying -5 is. The -A/-B/-C options are definitely specific to GNU grep...

(no subject)

Date: 2007-10-18 08:42 pm (UTC)
From: [identity profile] darius.livejournal.com
You're probably right; I didn't look up the docs on -A or -B, or -C for that matter. Just tried them out after Gustavo's note.
