Converting timeseries with -tscollap-

Kit Baum’s -tscollap- is another gem from the unsecured treasure vault of user-written Stata commands. It compacts timeseries (and panel) data, creating a new dataset of means, sums, end-of-period values, and more (see -help tscollap-). For example, you have a monthly trade data that you want to add up to generate annual data or monthly data on broad money from which you only want to pick end-of-year observation.

Why not just use -collapse-, a built-in command that also converts data into means, sums, etc.? Sure, you may use -collapse-. But for timeseries data -tscollap- offers convenience. For one, a new tsset variable freq_freq (e.g., q_q, y_y, or whatever you have indicated in the generate option if you did) is generated, i.e., you need not  -tsset- or -xtset- your data again before using time series commands  or functions.   

Let us compare -collapse- and -tscollap- below using the quarterly macro data lutkepohl2.dta from Stata website. First, we use -collapse- to calculate annual data for investment, income, and consumption. To use -collapse-, we need a grouping variable—in this case a variable that indicates year—over which the sum will be calculated.

webuse lutkepohl2, clear  // load data from Stata website    
gen year = year(dofq(qtr)) // extract the year component from variable qtr
collapse (sum) inv inc consump, by(year)

Notice that variable year is not read as a time variable (type -tsset- to display the current tsset settings). Thus, we must -tsset- the data again before we can use functions and commands for timeseries analysis. The following illustrates the use of -tscollap-:

webuse lutkepohl2, clear     
tscollap (sum) inv inc consump, to(y)

-tscollap- gives the same annual data but with the bonus of automatically generating a new time variable. In the example above, variable y_y is the time variable.

-tscollap- is available from SSC. Type -ssc install tscollap- to install.

I have one question though. Why was it named -tscollap- and not -tscollapse-? Brevity? Just curious.

Functions: inlist()

I once was asked what is wrong about the code similar to the one below:

gen asean4 = 1 if countryname == “Indonesia” | “Malaysia” | “Philippines” | “Thailand”

This is a common mistake. Understandably, the assumption that repeating the left side of the expression, in this case ‘countryname’, is redundant is not far-off. Alas, Stata requires it and the correct syntax is:

gen asean4 = 1 if countryname == “Indonesia” | countryname == “Malaysia” ///
| countryname == “Philippines” | countryname == “Thailand”

But we can do better by using the built-in function inlist(). Learning a little bit more about Stata’s built-in functions can be very convenient (sometimes necessary)—shorter codes, faster processing, more facebook time. Using inlist(), the equivalent code is:

gen asean4 = 1 if inlist(countryname, “Indonesia”, “Malaysia”, “Philippines”, “Thailand”)

inlist() may also be used for numeric values. For example:

gen asean4 = 1 if inlist(countrycode, 360, 458, 608, 764)

The difference between using numeric and string values is in the number of allowable elements in the list (number of countries in our example). For numeric values, 254 elements are allowed and for string values, only 9. See -help inlist-.

Stata from Scratch for DSWD and NAPC staff

Here are photos taken during the 2-day introduction to Stata for staff of the Department of Social Welfare and Development (DSWD) and National Anti-Poverty Commission (NAPC).

Thanks to the World Bank for the opportunity to introduce Stata to this group of young—some young at heart =)—and brilliant people.

DSWD can now use the Stata they purchased :) I hope I have encouraged them to learn more on their own.

o save me -memory

‘o save me’ memory cries
heed and keep data light
byte is all that’s necessary
if variable v is 0 or unity

choose the optimal data type
to save memory and keep data light
but if you’re not sure what is best
trust Stata and use -compress-

-compress- reduces memory
by demoting the type of v
if v is string with maximum length three
-compress- stores v as str3 not str20
if v is 1 to 32740
-compress- stores v as int not floatie

 

Links: Introduction to Stata

1. Baum, Schaffer, and Stillman answer the question “What is Stata, and why should it be the package of choice for applied econometric research?” in “Using Stata for Applied Research: Reviewing its Capabilities”.

2. UCLA’s Stata resources and UCLA’s Stata class notes with accompanying videos. Learning Stata made easier…

3. This site by German Rodriguez provides an introduction to Stata with emphasis on data management and graphics.

4. Here is a link to a 42-page introduction to Stata—from basics to Mata—by Austin Nichols (2007).

5. Following the examples on this page is helpful for those who want to learn the basics of Stata graphics

-destring- complication

Source: http://xkcd.com/208/

Converting string to numeric variables is easy with -destring- (-help destring-). But when -destring- returns “income contains nonnumeric characters; no generate,” it is an unwelcome complication. This tells you that there is a nonnumeric character in a variable that you expect to be all numeric, but it does not tell you what the character(s) is(are) exactly (like the doctor telling you ‘you are sick [full stop]‘). There are two ways to deal with this. First is to use the force option, which converts all nonnumeric strings into missing values. This must be done with CAUTION. Second is to use the ignore() option to specify nonnumeric characters to take out. This must also be done with CAUTION. But to use ignore(), you must know what the specific nonnumeric characters are.

Nonnumeric characters are often easy to spot if you are working with a small dataset or the same character(s) appear in all observations. In this case you can -browse- or -list- the data or use -tab- (if you have few distinct values). Manually looking for the nonnumeric characters becomes a complication, however, if you have a huge dataset and the character(s) appear only in very few cases (for example a single “-” in the middle of a dataset with thousands of distinct observations).

Why are there nonnumeric charcters in a suppose-to-be numeric variable in the first place? There could be embedded spaces or the codes used to indicate missing values are not among the Stata’s 27 numeric missing values. There could be other reasons, including encoding errors. I usually encounter different codes for missing values including “na” (and all its variants), “no data”, or “-”, and this used to give me a headache until I figured out what what ‘regular expressions’ are (-help regexm-).

Below is an illustration.

clear
input str6 income
“9747″
“1,234″
“938.9″
“8344″
“2398″
“-”
“53822″
“na”
“$28477″
“n/a”
end

What we want is a command that will show us what the unwanted characters are, that is, nonnumeric characters excluding the decimal point “.” (except when you expect series of decimal points such as “..”). The condition in the following -tab- command does so.

tab income if regexm(income, “[^0-9 .]“)

destring income, ignore(“$” “-” “,” “na” “n/a”) gen(n_income)
list

It is tempting to overuse -regexm-, but it is not necessary in cases where the characters are obvious.

See also Stata’s FAQ: What are regular expressions and how can I use them in Stata? (Kevin S. Turner, StataCorp).

Error in reading DBF

In Fun with maps in Stata I noted an error that occurs when trying to convert shapefiles from PhilGIS to Stata data files using -shp2dta- (Kevin Crow). It turns out that the column widths are more than 244 (even though the strings in the cells are not long). The easiest way to get around this is to open the dbf file using another program and change the column widths.[1] In my case, I used MS Excel.[2] If you are using Excel 2003, there is a ‘save as dbf’ option. But for some reason, this option is not available in Excel 2007 (what were the geniuses at Microsoft thinking?). Here is where the ExcelToDBF add-in, developed by Andrea Bonfiglio, comes to the rescue.[3] Once you have download and installed the add-in, the ‘DBF IV’ option will be added to the ‘Save As’ menu. To illustrate, here is a screencast.

—————————-
[1] Some Statalist posts hint that one can use Mata. Didn’t try that.
[2] Al Tongco of PhilGIS also suggested the free spreadsheet program OpenOffice Calc.
[3] I have also tried 2 other add-ins (SaveDBFIV and XLSX2DBF) but they returned an error when I tried to convert a huge file.

Follow

Get every new post delivered to your Inbox.

Join 98 other followers

%d bloggers like this: