STOP

/**
********************************************************************************
*
*                          The Workflow and Stata
*                          UC Berkeley Workshop
*
*          © Vernon Gayle, University of Edinburgh.
*          Professor Vernon Gayle (vernon.gayle@ed.ac.uk)
*
********************************************************************************

Latest Update: Friday 13th May 2016, Berkeley CA

********************************************************************************

The following topics will be covered (but not in this order):

Better understanding the workflow
Organising your .do files efficiently
Managing survey and administrative data files
Organising variables and measures
Producing publication style outputs

The workshop is intended for people who have prior experience of the package.
It is aimed at experienced Stata users who want to increase their expertise.

Please adjust your expectations - this should be a two day course. It will NOT
be possible to learn everything in one afternoon. Please be patient. Computers
often go wrong. Please ask for help. Not all of your questions will be
answered, but I will help as much as I can. Good luck.

********************************************************************************

The four pillars of wisdom

Effectiveness:   minimising information loss and errors in analyses and output
Efficiency:      automation, maximising features in software
Transparency:    showing what you did, why, when and how
Reproducibility: producing the same results every time, whoever runs the
                 analysis and wherever it is run - for example when editing or
                 rewriting a dissertation, or re-submitting papers

********************************************************************************

Citing this .do file

Gayle, V. (2016). The Workflow and Stata, University of Edinburgh.

© Vernon Gayle, University of Edinburgh.
Professor Vernon Gayle (vernon.gayle@ed.ac.uk)

**/

********************************************************************************

**********************************************
* IT IS IMPORTANT THAT YOU READ THE COMMENTS *
* AND FOLLOW THE STATA .DO FILE LINE BY LINE *
**********************************************
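/** a minimal sketch of recording session information at the top of a .do file,
in the spirit of the transparency and reproducibility pillars above (this
assumes Stata 14 or later - adjust the version number to the release you use) **/

version 14
display "Run on: " c(current_date) " at " c(current_time)
display "Stata release: " c(stata_version) ", user: " c(username)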
* some preliminary settings to help the session run smoothly *

clear all
macro drop _all
graph drop _all
set more off

cd c:\temp
pwd

********************************************************************************
********************************************************************************
*
*                            A Little Role Play
*
********************************************************************************

* this is a file on women and employment *

webuse womenwk, clear

tab educ, gen(ed)

label variable age "Age in years"

rename ed1 no_ed
rename ed2 low_education
rename ed3 medium_education
rename ed4 high_education

label variable no_ed "No education"
label variable low_education "Low education"
label variable medium_education "Medium education"
label variable high_education "High education"

* a simple regression model *

regress wage low_education medium_education high_education age

estimates store reg1

* using esttab to get the results in a publication ready format in Word *

#delimit ;
esttab reg1 using c:\temp\regress1.rtf,
   cells(b(star fmt(%9.3f)) se(par))
   stats(r2 r2_a N, fmt(%9.3f %9.3f) labels(R-Squared AdjR-Squared n))
   starlevels(* .10 ** .05 *** .01)
   stardetach
   label
   mtitles("Regression Model")
   nogaps
   replace
   ;
#delimit cr

* this is a coefplot - we might return to these later *

#delimit ;
coefplot,
   vertical
   baselevels
   drop(_cons age)
   xline(0)
   ytitle("Regression Coefficient" " ")
   xtitle(" " "Education Level")
   title("Regression Model of Women's Hourly Wage",
         size(medium) justification(right))
   subtitle("(educational level)", size(medsmall) justification(right))
   scheme(s1mono)
   note(" " "Source: womenwk dataset; n = 1,343; Adjusted R-Squared = .26")
   name(myplot, replace)
   ;
#delimit cr

* keep the graph window open *

* export the graph to a file *

graph export c:\temp\myplot.png, replace

/** rtfappend, rtflink and rtfclose below are user-written commands for working
with rtf files; if they are not installed on your machine you will need to find
and install them first (e.g. findit rtfappend) **/

tempname handle0
rtfappend `handle0' using c:\temp\regress1.rtf, replace
capture noisily {
    file write `handle0' "\line"
    file write `handle0' "\line"
    rtflink `handle0' using myplot.png
    file write `handle0' "\line"
    file write `handle0' _n "{\line}" _n "{\pard text can be added here {\ul "
    file write `handle0' "}\par}"
    file write `handle0' "\line" _n "{\pard more text can be added here too "
    file write `handle0' ".\par}" _n "\line" _n
    rtfclose `handle0'
}

/** the graph is now in the Word document regress1.rtf in folder c:\temp\
it has been appended under the regression table **/

********************************************************************************
********************************************************************************
*
*              Setting Up Stata and Your Directory Structure
*
********************************************************************************

* This section is about organising preliminary settings in Stata *

* clear the computer memory *

clear all
macro drop _all

/** more causes Stata to display --more-- and pause until any key is pressed.
It is usually more convenient to have this function switched off **/

set more off

* keep clear log files containing your output *

* first close any log files that might already be running *

capture log close

/** we use the capture command because Stata will not report an error
if there is no log file to close **/
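/** a minimal sketch of what capture does: it swallows the error and stores the
return code in _rc (the variable name below is made up, so the drop fails) **/

capture drop a_variable_that_does_not_exist
display "return code was " _rc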
/** Getting your directory structure in a consistent form is critical to
efficient working. Most people will already have established a directory
structure on their own machines or network areas. We are not suggesting that
you change your structure to any particular format.

However we ARE suggesting that you put some thought into your directory
structure and consider how you could make it more CONSISTENT and how it might
be improved to assist your workflow.

In the example below we will organise a simple but effective directory structure...

working
data_raw
data_clean
codebooks
logs
do_files
documents
figures
tables
trash
temp

Here are a few commands that will help you... **/

* display the path of the current working directory *

pwd

**********************************************
*      FIND A DRIVE THAT WORKS FOR YOU       *
*                                            *
**********************************************

* change the working directory *

cd e:\
pwd

* make a new directory *

mkdir "e:\new_directory"

* take a look on the drive to check that the directory has been created *

* now remove this directory *

rmdir "e:\new_directory"

/** you can run the following block of commands
or decide on your own directory structure... **/

mkdir "e:\working"
mkdir "e:\data_raw"
mkdir "e:\data_clean"
mkdir "e:\codebooks"
mkdir "e:\logs"
mkdir "e:\do_files"
mkdir "e:\documents"
mkdir "e:\figures"
mkdir "e:\tables"
mkdir "e:\trash"
mkdir "e:\temp"

/** Task: Write a paragraph justifying the directory structure that you have chosen. **/

********************************************************************************
********************************************************************************
*
*                           Locating Directories
*
********************************************************************************

/** Locating files using macro commands is an extremely efficient practice.
It tells Stata where to look for files on your machine or network. **/

* make sure you run all of the following commands *

global path1 "e:\working\"
/** the location of a working directory - where you can save newly created
data files and output **/

global path2 "e:\do_files\"
* the location where your .do files will be saved *

global path3 "e:\data_raw\"
* the location where your raw (i.e. unprocessed) data is stored *

global path4 "e:\data_clean\"
* the location where your clean (i.e. processed) data is saved *

global path5 "e:\logs\"
* the location where your log files are saved *

global path6 "e:\codebooks\"
* the location where your codebooks are saved *

global path7 "e:\temp\"
* the location of a temporary folder where you can save intermediate files *

********************************************************************************

* using global macros and paths *

clear

use $path3\ /** add the file name and .dta here, followed by , clear **/

summarize

/** at this stage you might feel a sensation like fish scales falling from
your eyes, or you might hear a sound like pennies falling from heaven...

defining paths as macros provides vital help for switching between machines,
working in collaboration with colleagues and keeping track of where files
came from and where they end up! **/

* erasing a file *

erase /** $path3\ followed by the file name and .dta **/

********************************************************************************
********************************************************************************
*
*                     Organising Variables and Measures
*
********************************************************************************

use $path3\adrc_s_training_data4.dta, clear

* take a look at a summary of the variables *

summarize

* take a look at a description of the data *

describe

* take a quick look at the data *

browse

* how many cases (rows) are in the dataset? *

count
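/** count can also be combined with if conditions - a minimal sketch, assuming
(as later in this file) that sex is coded 1 for males in this dataset **/

count if sex==1
count if age>=65 & !missing(age)
display "rows in memory = " _N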
* take a look at the first three cases *

list in 1/3

* reorder the dataset with the variables id sex age marstat first *

order id sex age marstat

list in 1/3

* save a codebook of the data *

capture log using $path6\codebook_20150923_student_v1.txt, replace text

codebook, compact

capture log close

/** take a look at the compact codebook that has been created as a txt file
in path6 "e:\codebooks\" **/

* here is an alternative format for your codebooks *

codebook, header

/** Right at the top of the output there is some additional information... e.g.

Dataset:                f:\data_raw\adrc_s_training_data4.dta
Last saved:             4 Aug 2015 08:54
Label:                  [none]
Number of variables:    49
Number of observations: 5,048
Size:                   328,120 bytes ignoring labels, etc. **/

* start a log file *

capture log using $path5\day1_log_20150923_student_v1.txt, replace text

* sort the data by id *

sort id

* list the id values for the first 10 cases *

list id in 1/10

* number labels *

tab sex

* remove the number labels from the dataset *

numlabel _all, remove

tab sex

* most of the time you will want the number labels - put them back *

numlabel _all, add

* generate a new indicator variable for males (i.e. male=1) *

gen males=.
tab males
replace males=1 if sex==1
replace males=0 if sex==2
tab males sex, missing

* add a label to the variable *

label variable males "gender"
tab males

* define a set of labels (called sexlabel) *

label define sexlabel 0 "female" 1 "male"

* attach the value labels (called sexlabel) to the new males variable *

label values males sexlabel
tab males

* you might also want the number labels *

numlabel sexlabel, add
tab males

* renaming a variable *

rename rgsc rgsoc_class

* cloning an existing variable *

/** clonevar has various possible uses

you may desire that a temporary variable appear to the user exactly like an
existing variable

you might want a slightly modified copy of an original variable, so the
natural starting point is a clone of the original **/

tab ethnic

clonevar ethnic2=ethnic

recode ethnic2 (1=0) (2/max=1)

tab ethnic ethnic2, missing

* construct an indicator for girls under age 16 *

gen girlu16=.
replace girlu16=1 if males==0 & age<16

* remember 2 + 2 = 4, that is 2 plus 2 becomes 4 *
* in maths the single equals sign = means becomes *
* the double equals sign == means equivalent to *

tab girlu16, missing

* generate a variable for age squared *

generate agesq = age^2

scatter agesq age

* take a look at some of the other maths functions that Stata can perform *

help math functions

* drop the new variables that you have created *

drop males ethnic2 girlu16 agesq

* keep is the antonym of drop *

********************************************************************************

* constructing dummy variables *

tab ethnic, missing

capture drop white
gen white=(ethnic==1)

/** check that there are the correct number of white people.
where are the four missing cases? **/

* there are several manual ways to create dummy variables, here are two *

capture drop white
gen white=.
replace white=1 if ethnic==1
replace white=0 if ethnic>1 & ethnic<9

tab white ethnic, missing

capture drop white
gen white=.
replace white=1 if ethnic==1
replace white=0 if ethnic>1
replace white=. if ethnic==.

tab white ethnic, missing
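/** a third manual variant, sketched here as an aside: recode with the
generate() option builds the indicator in a single step (this assumes, as
above, that code 1 identifies white and codes 2 upwards the other categories;
missing values are left as missing) **/

capture drop white2
recode ethnic (1=1) (2/max=0), generate(white2)
tab white2 ethnic, missing
capture drop white2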
* a far better way *

capture drop white

tabulate ethnic, gen(eth)

tab1 eth*, missing

* now clear the memory *

clear

* now clear Stata's main window *

cls

********************************************************************************

* subsets of data *

use id sex ethnic age marstat using $path3\adrc_s_training_data4.dta, clear

* take a look at a summary of the variables *

summarize

* take a look at a description of the data *

describe

* take a quick look at the data *

browse

* keep only males *

keep if sex==1

tab sex

clear

* here is another way of getting a subset of the data *

use id sex ethnic age marstat if sex==1 using ///
    $path3\adrc_s_training_data4.dta, clear

tab sex

* making a dataset of summary statistics *

use $path3\adrc_s_training_data4.dta, clear

mean age, over(ethnic)

collapse age, by(ethnic)

browse

use $path3\adrc_s_training_data4.dta, clear

collapse (mean) mean_age=age (count) n=id, by(ethnic)

browse

* another example of collapsing data *

use $path3\adrc_s_training_data4.dta, clear

count

tabulate sex, generate(sexdum)

collapse (count) n=id (sum) girls=sexdum1 boys=sexdum2, by(ethnic)

browse

use $path3\adrc_s_training_data4.dta, clear

contract sex ethnic

browse

********************************************************************************

* expanding collapsed datasets *

clear

input ethnic gender n
1 1 2133
1 2 2186
2 1 52
2 2 52
3 1 49
3 2 48
4 1 72
4 2 71
5 1 40
5 2 41
6 1 39
6 2 45
7 1 28
7 2 46
8 1 78
8 2 64
end

count

summarize

expand n

browse

* you might want to drop the first 16 rows of the original data *

drop if _n<17

count

tab ethnic gender, missing

********************************************************************************

* looking at the egen command *

* egen - extensions to generate *

use id sex ethnic age marstat using $path3\adrc_s_training_data4.dta, clear

summarize age

egen agecat_10 = cut(age), at(10,20,30,40,50,60,70,80,90,100)

tab agecat_10

table agecat_10, contents(min age max age)

/** if you prefer, you can ask cut() to choose the cutoffs to form groups with
approximately the same number per group.
we request the creation of 4 (roughly) equally sized groups **/

egen agecat_4 = cut(age), group(4) label

table age agecat_4

********************************************************************************

* looking for variables *

cls

use $path3\adrc_s_training_data4.dta, clear

/** lookfor helps you find variables by searching for a string among all
variable names and labels. **/

lookfor ethnic

********************************************************************************

* missing values *

tab ethnic, missing

* change missing value (.) to a number (999) *

mvencode ethnic, mv(999)

tab ethnic, missing

* change missing values (999) to missing (.) *

mvdecode ethnic, mv(999)

tab ethnic, missing

* take a look at another variable with two missing value codes *

tab workmode, missing

mvdecode workmode, mv(-9 7)

tab workmode, missing
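/** a quick sketch of auditing missing values after recoding (misstable
requires Stata 11 or later; the variable names assume the training dataset
used above) **/

misstable summarize workmode ethnic
count if missing(workmode)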
********************************************************************************

* dealing with long numbers (e.g. id numbers) *

clear

* store the values as long integers so that precision is not lost *

input long id
123456788
123456789
123456791
123456792
123456793
123456794
123456799
123456100
end

list

* now use the format command *

format id %9.0f

list

********************************************************************************

* Temporary Operations *

cls

use $path3\adrc_s_training_data4.dta, clear

preserve

drop if sex==1

tab sex, missing

restore

tab sex, missing

********************************************************************************
********************************************************************************
*
*                               Stata Help
*
********************************************************************************

/** Advice on finding help from Stata...

Let's say you are trying to find out how to do something. With over 2,000 help
files and 11,000 pages of PDF documentation, Stata has probably explained how
to do whatever you want. The documentation is filled with worked examples that
you can run on supplied datasets.

Whatever your question, try the following first:

1. Select Help from the Stata menu and click on Search....
   [We are Stata programmers so never use the menu!]

2. Type some keywords about the topic that interests you, e.g. "logistic
   regression".

3. Look through the resulting list of available resources, including help
   files, FAQs, Stata Journal articles, and other resources.

4. Select the resource whose description looks most helpful. Usually, this
   description will be a help file and will include a "(help ...)" reference,
   or, as in our example, perhaps "(help logistic)". Click on the blue link
   "logistic".

5. Let's assume you have selected the "(help logistic)" entry. You are
   probably not interested in the syntax at the top of the file, but you
   would like to see some examples. Select Jump To from the Viewer menu (in
   the top right corner of the help file now on your screen), and click on
   Examples. You will be taken to example commands that you can run on
   example datasets. Simply cut and paste those commands into Stata to see
   the results.

6. If you are new to the logistic command and want both an overview and
   worked examples with discussion, from the Also See menu, click on
   [R] logistic with the PDF icon. Or at the top of the help file, click on
   the blue title of the entry ([R] logistic). Your PDF viewer will be opened
   to the full documentation of the logistic command.

7. There is a lot of great material in this documentation for both experts
   and novices. As with the help file, you will often want to begin first
   with the Remarks and examples section. Simply click on the Remarks and
   examples link at the top of the logistic entry. A complete discussion of
   the logistic command can be found in the remarks, along with worked
   examples that run on supplied datasets and are explained in detail.

That should get you ready to use the command on your own data. If that does
not help, try the many other Stata resources; see Resources for learning more
about Stata. **/
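/** the same searches can also be run from the command line rather than the
menus - a minimal sketch (each command opens its results in the Viewer) **/

help logistic
search multilevel model
findit esttab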
********************************************************************************

/** PLEASE PLEASE PLEASE - spend some time reflecting on the new ideas and
commands that you have encountered

Remember - COMMENT COMMENT COMMENT **/

********************************************************************************

clear

sysuse auto.dta, clear

codebook, compact

* a table of summary statistics *

/** you might need to install estout first

ssc install estout

on an Edinburgh University teaching machine you might have to run the
following line of code first

sysdir set PLUS "c:\Workspace\stata\plus"
ssc install estout **/

estpost summarize mpg trunk turn

esttab using "$path7\table1.rtf", cell("count(f(0)) mean(f(2)) sd(f(0))") ///
   title(Table 1: Summary Statistics) addnotes(Notes: Auto.dta) ///
   nonumbers noobs replace

* the output is written to $path7\table1.rtf - click on the link in the output *

* a two-way table *

webuse citytemp2, clear

tabulate region agecat

estpost tab region agecat

esttab using "$path7\table2.rtf", cell("b(f(0))") ///
   nonumbers mtitles(" Age Group") ///
   collabels(none) ///
   title(Table 2: Census Region by Age Group) addnotes(Notes: Citytemp2.dta) ///
   noobs unstack replace

* the output is written to $path7\table2.rtf - click on the link in the output *

/** this is just a taste of the possibilities for producing publication ready
outputs using Stata **/

********************************************************************************

/** Well done. You have covered a lot of material.

The workshop was aimed at experienced Stata users who want to increase their
expertise, and it should really have been a two day course - it is not
possible to learn everything in one afternoon, so please be patient with
yourself.

Good luck. **/

********************************************************************************

/** Further Reading and Resources

Kohler, U. and Kreuter, F. (2009) Data Analysis Using Stata (Second Edition),
College Station, Texas, Stata Press. ISBN 9781597180467.
(A very good book, ideal for researchers who are new to Stata software).

Long, J.S. (2009) The Workflow of Data Analysis Using Stata, College Station,
Texas, Stata Press. ISBN 9781597180474.
(A great book on the practice of data analysis and data management).

Pevalin, D. and Robson, K. (2009) The Stata Survival Manual, McGraw-Hill.
ISBN 978-0-335-22388-6.
(Many students find this book very accessible).

Rabe-Hesketh, S. and Everitt, B. (2007) A Handbook of Statistical Analyses
Using Stata, Boca Raton, Chapman & Hall. ISBN 1584887567.
(A comprehensive resource).

Singer, J.D. and Willett, J.B. (2003) Applied Longitudinal Data Analysis:
Modelling Change and Event Occurrence, New York, Oxford University Press.
ISBN 0-19-515296-4.
(Wide coverage illustrating a selection of relatively advanced analytical
strategies - though not as much pragmatic guidance as the title might suggest).

Skrondal, A. and Rabe-Hesketh, S. (2004) Generalized Latent Variable
Modelling: Multilevel, Longitudinal and Structural Equations Models, New York,
Chapman and Hall. ISBN 1-58488-000-7.
(An advanced, dense text which summarises a wide array of statistical models
which may be used for longitudinal analyses, highlighting the connections
between them).

Treiman, D.J. (2009) Quantitative Data Analysis - Doing Social Research to
Test Ideas, San Francisco, Jossey-Bass. ISBN 9780470380031.
(An excellent book).
Stata home page
http://www.stata.com/

Stata Bookstore
http://www.stata.com/bookstore/books-on-stata/

UCLA Academic Technology Services
http://www.ats.ucla.edu/stat/stata/

Princeton Stata resources
http://data.princeton.edu/stata/

The website of the ESRC 'Longitudinal Data Analysis for Social Science
Researchers' project (much of my earlier Stata training material is available
on this site)
www.longitudinal.stir.ac.uk

Stata on Twitter
http://twitter.com/#!/stata

Stata Journal
http://www.stata-journal.com/ **/

********************************************************************************

/** © Vernon Gayle, University of Edinburgh.
Professor Vernon Gayle (vernon.gayle@ed.ac.uk)

This file has been produced by Vernon Gayle. Any material in this file must
not be reproduced, published or used for teaching without permission from
Professor Gayle.

Over the last decade much of the Stata material that Professor Gayle has
developed has been produced in close collaboration with Professor Paul
Lambert, Stirling University. However, Professor Gayle is responsible for any
errors in this file.

Citing this .do file

Gayle, V. (2016). The Workflow and Stata, University of Edinburgh.

© Vernon Gayle, University of Edinburgh.
Professor Vernon Gayle (vernon.gayle@ed.ac.uk) **/

********************************************************************************

* End of file *