Counting occurrences of strings within strings

Somebody asked how to count the number of occurrences of a string within a string. For example, given the following data, I want to generate new variables countSS, countSM, and countSG that contain the number of occurrences of “SS”, “SM”, and “SG” in the variable awards.

*————————————————————————————*
clear
input id str40 awards
1    "SS; SS; SM; SG"
2    "SM; SG"
3    "SG; SG; SG; SS"
4    "SS; SS; SG; SG; SS; SM; SG"
end
list
*————————————————————————————*

Here is one solution using the macro extended function -subinstr- (-help extended_fcn-).

*————————————————————————————*
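local tocount SS SM SG
local N = _N
foreach t of local tocount {
    gen count`t' = 0
    forvalues i = 1/`N' {
        * copy the value of awards in observation i into a local macro
        local a = awards[`i']
        * substitute the string with itself; count() stores the number of matches in c2
        local c : subinstr local a "`t'" "`t'", all count(local c2)
        replace count`t' = `c2' in `i'
    }
}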
*————————————————————————————*
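
With the example data above, the loop should give the counts below. They can be confirmed with -assert-; this is a quick check added for illustration, not part of the original solution, and it assumes the count variables were created by the code above.

*————————————————————————————*
* expected counts for the four example observations
assert countSS == 2 & countSM == 1 & countSG == 1 in 1
assert countSS == 0 & countSM == 1 & countSG == 1 in 2
assert countSS == 1 & countSM == 0 & countSG == 3 in 3
assert countSS == 3 & countSM == 1 & countSG == 3 in 4
list id awards countSS countSM countSG
*————————————————————————————*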

————————————————–

*Thanks to Jacob Reynolds (jlreynol@nps.edu) for the question. For the best advice on Stata, though, Statalist is the best place to ask :). See Stuck? Hello Statalist.

6 Responses

  1. The number of occurrences can be got from a comparison of lengths before and after blanking out.

    gen noccur_SS = (length(awards) - length(subinstr(awards, "SS", "", .))) / length("SS")

    In this case we know that the length of “SS” is 2. I wrote it out like this to lead up to the more general rule (mixing now Stata and pseudocode)

    (length(original) - length(original_with_substr_blanked)) / length(substr)

    Thus you don’t need a loop over observations. I think you do need to do this separately for each substring.

  2. There are also two [sic] -egen- functions for this within -egenmore- from SSC. Neither of them uses the trick above. I’d prefer to believe that the reason for that was that -subinstr()- wasn’t available when the functions were written, both about ten years ago, but I can’t rule out without checking that the authors (one of them me) just overlooked this simpler way to do it.

  3. Nick & Mitch,
    That last comment about comparing lengths was the best ticket. I was able to count the awards like I needed by generating as many counting variables as req’d (g pa_XX); total of 14.

    I wish I could have gotten the more “eloquent” code above to work, but the comparison line is more my speed in thesis work…maybe when I come back for a PhD :)

    Thank you for your time and attention to this guys!

    Jake

    • I always like a simpler solution. Not knowing any better, I had come up with a complex one. ‘Eloquence’, I think, is not about complexity but simplicity. Nick’s solution is an example. :)
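
The length-comparison rule from the first response still has to be applied once per substring, but that can be a short loop over the substrings rather than over observations. Here is a minimal sketch, assuming the awards variable from the example data; the noccur_ variable names are only illustrative.

*————————————————————————————*
* one -generate- per substring, no loop over observations
foreach s in SS SM SG {
    gen noccur_`s' = (length(awards) - length(subinstr(awards, "`s'", "", .))) / length("`s'")
}
list id awards noccur_*
*————————————————————————————*

Like the macro-based loop, this counts the substring wherever it appears, so “SS” would also be counted inside a longer code such as “SSX”. That is harmless for the example data, where every award is exactly two characters.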
