Re: How do I count the occurrence of each word?

On Sat, Aug 18, 2012 at 6:44 PM, John Taylor-Johnston
<jt.johnston@xxxxxxxxxxxxxx> wrote:
> I want to parse this text and count the occurrence of each word:
>
> $text = http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html;
> #Can I do this?
> $stripping = strip_tags($text); #get rid of html
> $stripping = strtolower($stripping); #put in lowercase
>
> ----------------
> First of all I want to start AFTER the expression "News Releases" and stop
> BEFORE the next occurrence of "-30-"
>
> #This may occur an undetermined number of times on
> http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html
>
>
> ----------------
> Second, do I put $stripping into an array to separate each word by each
> space " "?
>
> $stripping = implode(" ", $stripping);
>
> ----------------
> Third how do I count the number of occurrences of each word?
>
> Sample Output:
>
> determined = 4
> fire = 7
> patrol = 3
> theft = 6
> witness = 1
> witnessed = 1
>
> ----------------
> <?php
> $text = http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html
> #echo strip_tags($text);
> #echo "\n";
> $stripping = strip_tags($text);
>
> #Get text between "News Releases" and stop before the next occurrence of
> "-30-"
>
> #$stripping = str_replace("\r", " ", $stripping);# getting rid of \r
> #$stripping = str_replace("\n", " ", $stripping);# getting rid of \n
> #$stripping = str_replace("  ", " ", $stripping);# getting rid of the
> occurrences of double spaces
>
> #$stripping = strtolower($stripping);
>
> #Where do I go now?
> ?>
>
>
> --
> PHP General Mailing List (http://www.php.net/)
> To unsubscribe, visit: http://www.php.net/unsub.php
>

This is usually a first-year CS programming problem (word frequency
counts), complicated a little by the need to extract the text first.
You've started off fine by stripping tags and converting to lower case.
You'll also want to convert or strip HTML entities, and to decide how to
treat plurals, words like "you're", "Charlie's" and "it's", and whether
something like RFC822 (mixed letters and numbers) counts as a word.
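
As a rough sketch of that preparation step, something like the following
could pull out the text between each "News Releases" and the next "-30-".
The function name extract_sections() and the sample HTML are made up for
illustration; the real page would first have to be read into a string
(for example with file_get_contents()), which is not shown here.

```php
<?php
// Hypothetical helper: returns the lower-cased plain text of every section
// that starts after $start and ends before the next $end.
function extract_sections($html, $start = 'News Releases', $end = '-30-')
{
    // Strip tags, then decode entities such as &amp;, so the markers are
    // matched against the visible text only.
    $plain = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

    // Non-greedy match: stop at the first end marker after each start marker.
    // The /s modifier lets "." match newlines as well.
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/s';
    preg_match_all($pattern, $plain, $matches);

    // $matches[1] holds the text captured between the markers.
    return array_map('strtolower', $matches[1]);
}

// Sample input standing in for the fetched page (an assumption).
$html = '<h1>News Releases</h1><p>Fire &amp; Theft</p>-30-<p>ignored</p>';
print_r(extract_sections($html));
```

If the text contains accented characters, mb_strtolower() (from the
mbstring extension) is probably a better choice than strtolower().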

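When you've arranged all that, splitting on white space is trivial
(PREG_SPLIT_NO_EMPTY drops the empty strings that leading or trailing
white space would otherwise produce):

$words = preg_split('/[[:space:]]+/', $text, -1, PREG_SPLIT_NO_EMPTY);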

and then you just run through the words building an associative array
by incrementing the count of each word as the key to the array:

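$freq = array();
foreach ($words as $word) {
    // Start unseen words at zero to avoid an undefined-index notice.
    if (!isset($freq[$word])) { $freq[$word] = 0; }
    $freq[$word]++;
}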
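
PHP also has a built-in that should give the same result as the loop,
array_count_values(). A minimal sketch, using a made-up sample string
in place of the extracted text:

```php
<?php
// Made-up sample text, for illustration only.
$text  = "fire theft fire patrol theft fire";
$words = preg_split('/[[:space:]]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

// array_count_values() returns an array of word => number of occurrences,
// with keys in order of first appearance.
$freq = array_count_values($words);
print_r($freq);
```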

For output, you may want to sort the array, for example alphabetically
by word:

ksort($freq);
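
To print the counts in the "word = count" form from the sample output,
a sketch along these lines might do. The helper name format_counts() and
the counts passed to it are invented for the example.

```php
<?php
// Hypothetical helper: one "word = count" line per entry, sorted by word.
function format_counts(array $freq)
{
    // Alphabetical by word; arsort($freq) would instead put the
    // highest count first.
    ksort($freq, SORT_STRING);

    $out = '';
    foreach ($freq as $word => $count) {
        $out .= "$word = $count\n";
    }
    return $out;
}

// Made-up counts, for illustration only.
echo format_counts(array('theft' => 6, 'fire' => 7, 'witness' => 1, 'patrol' => 3));
```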
