Functions: inlist()

I once was asked what is wrong about the code similar to the one below:

gen asean4 = 1 if countryname == “Indonesia” | “Malaysia” | “Philippines” | “Thailand”

This is a common mistake. Understandably, the assumption that repeating the left side of the expression, in this case ‘countryname’, is redundant is not far-off. Alas, Stata requires it and the correct syntax is:

gen asean4 = 1 if countryname == “Indonesia” | countryname == “Malaysia” ///
| countryname == “Philippines” | countryname == “Thailand”

But we can do better by using the built-in function inlist(). Learning a little bit more about Stata’s built-in functions can be very convenient (sometimes necessary)—shorter codes, faster processing, more facebook time. Using inlist(), the equivalent code is:

gen asean4 = 1 if inlist(countryname, “Indonesia”, “Malaysia”, “Philippines”, “Thailand”)

inlist() may also be used for numeric values. For example:

gen asean4 = 1 if inlist(countrycode, 360, 458, 608, 764)

The difference between using numeric and string values is in the number of allowable elements in the list (number of countries in our example). For numeric values, 254 elements are allowed and for string values, only 9. See -help inlist-.

Stata from Scratch for DSWD and NAPC staff

Here are photos taken during the 2-day introduction to Stata for staff of the Department of Social Welfare and Development (DSWD) and National Anti-Poverty Commission (NAPC).

Thanks to the World Bank for the opportunity to introduce Stata to this group of young—some young at heart =)—and brilliant people.

DSWD can now use the Stata they purchased :) I hope I have encouraged them to learn more on their own.

o save me -memory

‘o save me’ memory cries
heed and keep data light
byte is all that’s necessary
if variable v is 0 or unity

choose the optimal data type
to save memory and keep data light
but if you’re not sure what is best
trust Stata and use -compress-

-compress- reduces memory
by demoting the type of v
if v is string with maximum length three
-compress- stores v as str3 not str20
if v is 1 to 32740
-compress- stores v as int not floatie

 

Links: Introduction to Stata

1. Baum, Schaffer, and Stillman answer the question “What is Stata, and why should it be the package of choice for applied econometric research?” in “Using Stata for Applied Research: Reviewing its Capabilities”.

2. UCLA’s Stata resources and UCLA’s Stata class notes with accompanying videos. Learning Stata made easier…

3. This site by German Rodriguez provides an introduction to Stata with emphasis on data management and graphics.

4. Here is a link to a 42-page introduction to Stata—from basics to Mata—by Austin Nichols (2007).

5. Following the examples on this page is helpful for those who want to learn the basics of Stata graphics

-destring- complication

Source: http://xkcd.com/208/

Converting string to numeric variables is easy with -destring- (-help destring-). But when -destring- returns “income contains nonnumeric characters; no generate,” it is an unwelcome complication. This tells you that there is a nonnumeric character in a variable that you expect to be all numeric, but it does not tell you what the character(s) is(are) exactly (like the doctor telling you ‘you are sick [full stop]‘). There are two ways to deal with this. First is to use the force option, which converts all nonnumeric strings into missing values. This must be done with CAUTION. Second is to use the ignore() option to specify nonnumeric characters to take out. This must also be done with CAUTION. But to use ignore(), you must know what the specific nonnumeric characters are.

Nonnumeric characters are often easy to spot if you are working with a small dataset or the same character(s) appear in all observations. In this case you can -browse- or -list- the data or use -tab- (if you have few distinct values). Manually looking for the nonnumeric characters becomes a complication, however, if you have a huge dataset and the character(s) appear only in very few cases (for example a single “-” in the middle of a dataset with thousands of distinct observations).

Why are there nonnumeric charcters in a suppose-to-be numeric variable in the first place? There could be embedded spaces or the codes used to indicate missing values are not among the Stata’s 27 numeric missing values. There could be other reasons, including encoding errors. I usually encounter different codes for missing values including “na” (and all its variants), “no data”, or “-”, and this used to give me a headache until I figured out what what ‘regular expressions’ are (-help regexm-).

Below is an illustration.

clear
input str6 income
“9747″
“1,234″
“938.9″
“8344″
“2398″
“-”
“53822″
“na”
“$28477″
“n/a”
end

What we want is a command that will show us what the unwanted characters are, that is, nonnumeric characters excluding the decimal point “.” (except when you expect series of decimal points such as “..”). The condition in the following -tab- command does so.

tab income if regexm(income, “[^0-9 .]“)

destring income, ignore(“$” “-” “,” “na” “n/a”) gen(n_income)
list

It is tempting to overuse -regexm-, but it is not necessary in cases where the characters are obvious.

See also Stata’s FAQ: What are regular expressions and how can I use them in Stata? (Kevin S. Turner, StataCorp).

Error in reading DBF

In Fun with maps in Stata I noted an error that occurs when trying to convert shapefiles from PhilGIS to Stata data files using -shp2dta- (Kevin Crow). It turns out that the column widths are more than 244 (even though the strings in the cells are not long). The easiest way to get around this is to open the dbf file using another program and change the column widths.[1] In my case, I used MS Excel.[2] If you are using Excel 2003, there is a ‘save as dbf’ option. But for some reason, this option is not available in Excel 2007 (what were the geniuses at Microsoft thinking?). Here is where the ExcelToDBF add-in, developed by Andrea Bonfiglio, comes to the rescue.[3] Once you have download and installed the add-in, the ‘DBF IV’ option will be added to the ‘Save As’ menu. To illustrate, here is a screencast.

—————————-
[1] Some Statalist posts hint that one can use Mata. Didn’t try that.
[2] Al Tongco of PhilGIS also suggested the free spreadsheet program OpenOffice Calc.
[3] I have also tried 2 other add-ins (SaveDBFIV and XLSX2DBF) but they returned an error when I tried to convert a huge file.

Bootstrapping Gini

Income inequality in the Philippines, as measured by the Gini coefficient, declined from 46.05 to 44.84 between 2003 and 2009.[1] Is the observed difference in the the Gini coefficient a real reduction in inequality in income distribution or is it only due to sampling variations?

A friend asked me a question related to this weeks ago. She asked if I know a Stata command that tests the significance between the difference of two Gini coefficients. Lazy to think about it, I just shrug the problem with a ‘no’ and never bothered to search for a solution. This problem came back to me while reading the bootstrapping section of NetCourse 151–Lecture 3. Indeed, when I searched further, a number of literature have used bootstrapping for this purpose. Examples include Mills and Zandvakili (1997) and Biewen (2002).[2][3]

Using Stephen Jenkins’s -ineqdeco- (“ssc install ineqdeco”) to calculate for the Gini coefficients, I did the following bootstrapping exercise for a hypothetical dataset:

/*************************************************************/
/* Program to calculate difference in Gini coefficients */
clear
cap program  drop ginidiff
program ginidiff, rclass
qui ineqdeco income if year == 1
return scalar gini1 = r(gini)
qui ineqdeco income if year == 2
return scalar gini2 = r(gini)
return scalar diff = return(gini1) – return(gini2)
end

/* Generate hypothetical dataset */
set obs 200
gen year = 1 in 1/100
replace year = 2 in 101/200
set seed 56453
gen income = runiform()*1000 if year == 1
set seed 86443
replace income = runiform()*1000 if year == 2

/* Apply bootstrap */
assert income < .    /* Make sure no missing values */
set seed 873023
bootstrap r(diff), reps(1000) nodots : ginidiff

program drop ginidiff
exit
/*************************************************************/

Whether this straightforward application of bootstrapping is the best solution is another story (and the more important one). For a discussion of other proposed methods and of the limitations of bootstrapping in this context, see for example, Palmitesta, P. et al (2000) and Van Kerm (2002).[4][5]

——————————–
[1] Thanks to Shiel Velarde (World Bank Manila) for the Gini estimates. These numbers are based on the Family Income and Expenditure Survey (FIES).

[2] Mills, J. and Zandvakili, S. (1997). “Statistical Inference via Bootstrapping for Measures of Inequality”. Journal of Applied Econometrics 12 (2): 133–150. (Working Paper version can be downloaded here)

[3] Biewen, M. (2002). “Bootsrap inference for inequality, mobility and poverty measurement.” Journal of Econometrics 108 (2002): 317–342.

[4] Palmitesta, P. et al (2000). “Confidence Interval Estimation for Inequality Indices of the Gini Family.” Computational Economics
16(1-2):137-147.

[5] Van Kerm, P. (2002). “Inference on inequality measures: A Monte Carlo experiment.” Journal of Economics 9(Supplement1): 283-306. (Working Paper version can be downloaded here)

And we’re rolling, rolling; rolling on the river*

I just learned about -rolling- today. Thanks to a friend for asking about moving averages and standard deviations yesterday. The problem was how to generate a new variable that contains the average and standard deviation of the previous 10 period. For example, the generated data for 1961 would be the average and the standard deviation for the period 1951 to 1960. I knew -tssmooth ma- can be used for moving averages, but I was not aware of a similar command for standard deviations so I did the following exercise for the moving standard deviation yesterday:

/* Create hypothetical data */
clear
set obs 50
gen year = 1951 if _n==1
replace year = year[_n-1] + 1 if _n!=1
set seed 528
gen data1 = runiform()
set seed 285
gen data2 = runiform()

/* Calculate moving standard deviation */
sort year
foreach d of varlist data* {
qui gen sd`d’ = .
local N = _N
local i = 1
local j = 10
forvalues k=11/`N’{
qui sum `d’ in `i’/`j’
qui replace sd`d’ = r(sd) if _n==`k’
local i = `i’ + 1
local j = `j’ + 1
}
}

I should have googled first. If I had, I should have found Nick Cox’s reply to the Statalist post “calculating moving standard deviation” by Ravi Yatawara where he suggested -rolling-. By reshaping the data into panel format and applying -xtset-, I can now use -rolling-.

/* Create the same hypothetical data as above */
/* Organize the data and apply xtset */
reshape long data, i(year) j(group 1 2)
xtset group year

/* Calculate moving standard deviation */
rolling sd=r(sd), window(10) keep(group) clear: sum data
gen year = end + 1
keep group year sd

It is also possible to generate more than one statistics. For example, if I also want to calculate the moving average, I can write:

rolling sd=r(sd) mean=r(mean), window(10) keep(group) clear: sum data

See -help rolling- for more options.

The most important advantage of -rolling- (aside from its simplicity), I think, is that you do not have to worry about the order of your data because -xtset- or -tsset- already took charge of that. Note that by using -in- in my unnecessary code, I have to make sure that the data is sorted by year, otherwise I will be getting the standard deviations for the wrong time periods.

Lesson: Google first!

——————————————–
*”Proud Mary” [not "Rolling" as I used to think] by Tina Turner.

Stata-MySQL a first encounter

In late April, I had my first encounter with .myd, .myi, and .frm files. The challenge was to read these directly from Stata so that users can easily work on the data using Stata. I was told that these are SQL databases. I initially thought that this can easily be read using a straightforward use of -odbc- command. But the file extensions .myd, .myi, and .frm are not among the choices avilable in Windows ODBC Data Source Administrator*. Google…google…google.

Thanks to Andrew Dyck’s post Connect to MySQL database using Stata. His Step 0 and Step 1 is exactly the second half of what I needed to do. But, as I had no idea how SQL works, it took me a while to understand what I needed to do in the first half. I confused the term ‘MySQL server’ to a specific physical machine other than my own PC. Eventually I figured I need MySQL server installed. For those who have their first encounters like me, here is the first half: Step -2 and Step -1.

Step -2: Download/Install the MySQL server**
Not knowing any better, I just used all default settings during the installation, and  typed “pass” as password when asked. I did not bother to change the default username “root”.

You did OK if you can open the MySQL Command Line Client (screen shot below) from the start menu. As you can see (or maybe not because they are too small) from the screenshot, I typed “show databases;” to see the databases that the system has access to. I found that only the default MySQL databases are in the list. Why can’t it see my database? Because my folder was in the desktop.

I guess it is like Stata. If you put your data or ado-file in the wrong place, Stata will not be able to find it. Where is the right place?

Step -1: Paste the folder with your data in the right place
That right place can be found in my.ini file (in my case, my.ini file is in “C:\Program Files\MySQL\MySQL Server 5.5″).

#Path to the database root
datadir=”C:\ProgramData\MySQL\MySQL Server 5.5\data”

Note that C:\ProgramData\ is, by default, a hidden folder. If you can’t find it, it is either you have not changed the settings in folder options or you have no admin rights.

Can you change this data directory? Yes. First, you need to stop MySQL from running (closing the MySQL Client window is not enough). You can use Windows task manager to end the program mysqld.exe. Second, move the entire data folder from old path to the new path. Third, change the data directory path in my.ini. In my case, my new data directory path is:

datadir=”C:/Data/”

Now, you can move your folder (with the .myd, .myi, and .frm files) to this new data directory. Reboot.

Follow Andrew Dyck’s Step 0 and Step 1. Note that the connectionstring() used in Step 1 was only added to -odbc- late last year (see Stata’s blog). Make sure your Stata is up-to-date.

It also helps to learn a few SQL syntax so that you can load only selected lines or those that meet specific conditions, or load everything (see MySQL documentation). To illustrate, we use the built-in MySQL database information_schema.

Example 1: Load rows 1-20 of table TABLES
odbc load, exec(“SELECT * FROM TABLES LIMIT 20“) conn(“DRIVER={MySQL ODBC 5.1 Driver};SERVER=localhost;DATABASE=information_schema;UID=root;PWD=pass;”) clear

Example 2: Load observations where the variable ROW_FORMAT==”Fixed”
odbc load, exec(“SELECT * FROM TABLES WHERE ROW_FORMAT=’Fixed’”) conn(“DRIVER={MySQL ODBC 5.1 Driver};SERVER=localhost;DATABASE=information_schema;UID=root;PWD=pass;”) clear

Example 3: Load everything from table TABLES
odbc load, exec(“SELECT * FROM TABLES“) conn(“DRIVER={MySQL ODBC 5.1 Driver};SERVER=localhost;DATABASE=information_schema;UID=root;PWD=pass;”) clear

————————————
*Control Panel–>Administrtive Tools–>Data Sources (ODBC)

** Complications arise if you have no admin rights to the PC. I am using a 64-bit Windows 7 machine; I am not sure if this works the same in other platforms as well.

Website tables into Stata

That last post bit me (in a good way)…speaking of returns! The downside: no time for blogging. During that time I have also set aside the Stata course work I enrolled in. The lecture notes are easy to follow, and the exercises are very helpful. I just started going through lecture 3, but no more course leaders to ask when I get stuck. On the upside, I have also learned something new while working on other things. The subject of this post is one example.

This was my problem: I need barangay (village) information (population, code, type) for all municipalities. The information is available in NSCB’s PSGC (Philippine Standard Geographic Codes) website, but these are organized by municipality. It would not have been a problem if NCSB provided a data file (in text or any data format) that is readily downloadable. Copy-pasting the tables is out of the question—the Philippines has 1,496 municipalities! Thanks to Google for pointing me to Eric Booth’s reply to Friedrich Huebler’s query on reading HTML files with Stata at Statalist.

Using Eric Booth’s example as guide, below I illustrate how I grabbed the barangay table for a single municipality. In honor of Manny Pacquiao‘s win over Shane Mosley last Sunday, I will use as example the municipality of Kiamba, where Manny Pacquiao resides as Congressman of the province of Sarangani.* The output I need is a Stata dataset with all the 19 barangays of Kiamba, their corresponding PSGC, barangay type (whether urban or rural), and population (as in the table below).

===========================================================

/* Copy web page and save to a text file. The PSGC code for Kiamba is “128003000″, where the first 2 digits correspond to the region code, the next 2 digits correspond to the province code. */

local m “128003000″
local nscb “http://www.nscb.gov.ph/activestats/psgc/municipality.asp?”
local r = substr(“`m’”,1,2)
local p = substr(“`m’”,3,2)
copy “`nscb’muncode=`m’&regcode=`r’&provcode=`p’”  “m`m’.txt”, text

/* Delete unnecessary HTML codes and replace quotation marks (\Q) with “%”. I tried not to replace the quotation marks and learned that there is a reason why Eric Booth got rid of them. */
filefilter “m`m’.txt” “xm`m’.txt”, from(“</td>”) to(“”) replace
filefilter “xm`m’.txt” “m`m’.txt”, from(“\Q”) to(“%”) replace
erase “xm`m’.txt”

/* Load text file */
intext using m`m’.txt, gen(nscb) clear

/* Drop unnecessary rows and columns */
drop in 1/160
keep nscb1

/* Flag lines to keep */
local brgy “<td width=%255% class=%dataCell%>”
local psgc “<td width=%130% align=%center% class=%dataCell%>”
local popn “<td width=%115% align=%center% class=%dataCell%>”

/* -findval- (Stas Kolenikov) is from SSC archive. Type -ssc install findval- */

findval “`brgy’”, substr gen(flag_brgy)
findval “`psgc’”, substr gen(flag_psgc)
findval “Urban”, substr gen(flag_urban)
findval “Rural”, substr gen(flag_rural)
findval “`popn’”, substr gen(flag_popn)

drop if (flag_brgy + flag_urban + flag_rural + flag_psgc + flag_popn == 0)

/* Create required variables */
gen brgy = subinstr(nscb1, “`brgy’”, “”, 1) if flag_brgy
gen psgc = subinstr(nscb1, “`psgc’”, “”, 1) if flag_psgc
gen btyp = “Urban” if flag_urban
replace btyp = “Rural” if flag_rural
gen popn = real(subinstr(nscb1, “`popn’”, “”, 1)) if flag_popn

label var brgy “Barangay name”
label var psgc “PSGC codes”
label var btyp “Urban/Rural”
label var popn “Population”

/* Align data */
keep brgy psgc btyp popn
replace psgc = psgc[_n+1]
replace btyp = btyp[_n+3]
replace popn = popn[_n+4]
drop if brgy == “”

/* Delete trailing spaces */
replace brgy = ltrim(brgy)
replace psgc = ltrim(psgc)
replace btyp = ltrim(btyp)

/* Checks */
assert mi(psgc) == 0 /* No missing code */
assert length(psgc) == 8 | length(psgc) == 9 /* Length of codes within range */
bysort psgc: assert _N == 1    /* Barangay codes unique */

list

===========================================================

I then loop the codes over all muncipality codes for which I have data. This may not be the most elegant way to do this as Eric Booth pointed out but it works, and that is all that matters (most of the time).

———————————
*If it were a ‘running backwards’ contest, Shame Mosley would have won it against Manny Pacquiao. Alas, it was boxing.

*Kiamba is also where we use to spend summer vacations with Apong Baket and Apong Lakay (grandmother and grandfather, respectively, in Ilocano, a language used in the northern part of the Philippines).

*The code above can be downloaded here.

Follow

Get every new post delivered to your Inbox.

Join 48 other followers