Denny’s OpenBSD Newbies blog

April 30, 2010

More On Text Processing

Filed under: openbsd, by denny @ 6:57 pm

I kept fooling around with a text file the other night, trying to get an
exact character count. The file had 11 lines and I came up with 436
characters, counting them manually. That included whitespace between
the letters and numbers. I ran wc(1) on it to see what it would say:

$ wc -m file
          447

That was 11 more than I had counted. How come? It turns out wc counts
the line feed at the end of each line as a character, too, and the file
has 11 lines. Okay, how can I get back to the count I got by hand?
I can use sed(1) to do it:

$ sed -e 's/.$//' file | wc -m
            436
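Here's a small sketch for checking what each command really strips, using a throwaway two-line file. The mktemp file and the variable names are my own assumptions, not part of the original session. It compares the sed approach with deleting the line feeds directly via tr(1):

```sh
# Hypothetical sample: two lines, six characters plus two line feeds.
sample=$(mktemp)
printf 'ab c\nde\n' > "$sample"

# $(( ... )) strips the leading padding that BSD wc puts before the number.
total=$(( $(wc -m < "$sample") ))                   # everything, line feeds included
no_lf=$(( $(tr -d '\n' < "$sample" | wc -m) ))      # line feeds deleted
sed_count=$(( $(sed 's/.$//' "$sample" | wc -m) ))  # last character of each line deleted,
                                                    # line feeds still in the output

echo "$total $no_lf $sed_count"   # prints: 8 6 6
rm -f "$sample"
```

The two counts agree here only because every line has at least one character for sed to remove. With an empty line in the file they would drift apart.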

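Actually, the sed command works the same without the -e switch. It got
thrown in the mix and then I realized I didn’t need it. Strictly speaking,
s/.$// deletes the last character of each line rather than the line feed
itself. The count comes out the same because each of the 11 lines loses
exactly one character. We usually don’t think of whitespace as a
character, but it is. To see how many characters there are minus
whitespace, I can use tr(1):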

$ tr -d '[:blank:]' < file | wc -c
         373

I know the total number of characters including whitespace and line feeds
is 447, so subtracting tells me how many blanks (spaces and tabs) there
are. Okay, bc(1) to the rescue:

$ echo "447-373" | bc
         74
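The same subtraction can be done without typing the numbers in by hand. This is only a sketch: the sample file and variable names are made up for illustration, and it uses the shell's own $(( )) arithmetic in place of bc:

```sh
# Hypothetical sample containing two spaces and one tab.
sample=$(mktemp)
printf 'a b\tc\nd e\n' > "$sample"

total=$(( $(wc -m < "$sample") ))                             # all characters
no_blank=$(( $(tr -d '[:blank:]' < "$sample" | wc -m) ))      # spaces and tabs removed
blanks=$(( total - no_blank ))                                # what bc computed above

echo "$total - $no_blank = $blanks"   # prints: 10 - 7 = 3
rm -f "$sample"
```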

If I use [:space:] instead of [:blank:], tr takes out the line feeds too:

$ tr -d '[:space:]' < file | wc -c
         362

You add the original 11 line feeds to that and you’re back at 373.
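That relationship can be checked on any file, roughly like this. The sample file and names are hypothetical, and the check assumes the file has no carriage returns, form feeds or vertical tabs, since [:space:] deletes those as well:

```sh
# Hypothetical sample: four lines (one of them empty), three spaces.
sample=$(mktemp)
printf 'a b\nc  d\n\ne\n' > "$sample"

no_blank=$(( $(tr -d '[:blank:]' < "$sample" | wc -m) ))   # line feeds still counted
no_space=$(( $(tr -d '[:space:]' < "$sample" | wc -m) ))   # line feeds gone too
lines=$(( $(wc -l < "$sample") ))                          # number of line feeds

echo "$no_space + $lines = $(( no_space + lines )), no_blank = $no_blank"
# prints: 5 + 4 = 9, no_blank = 9
rm -f "$sample"
```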

You can also count the characters minus the line feeds and whitespace
while editing the file in vim. The n flag makes :s report the number of
matches without changing anything:

:%s/\S/&/gn
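For the same count outside the editor, something along these lines should work, assuming your grep supports the -o option (it prints each match on its own line). The sample file and variable names are just for illustration:

```sh
# Hypothetical sample with five non-whitespace characters.
sample=$(mktemp)
printf 'a b\nc  d\n\ne\n' > "$sample"

# One output line per non-whitespace character, then count the lines.
nonspace=$(( $(grep -o '[^[:space:]]' "$sample" | wc -l) ))
# Cross-check against the tr approach.
via_tr=$(( $(tr -d '[:space:]' < "$sample" | wc -m) ))

echo "$nonspace $via_tr"   # prints: 5 5
rm -f "$sample"
```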

So, the adventures in text processing continue. :-) Any input on these
methods, and any additional ones, is always welcome and appreciated.

Cheers!

2 Comments »

  2. Thanks, you might also want to remove the newlines with a "tr -d '\n'".

    Comment by Michael Kreikenbaum — August 4, 2010 @ 4:22 am
