
Statistical / Econometric software discussion



doomy
01-30-2015, 11:39 PM
Hey guys. :)

I couldn't find a similar thread using the search engine, so I thought I'd start one. The question is - Which statistical / econometric package or programming language are you mostly using on a daily basis?

I have some experience with Microfit and EViews but am not particularly proficient with either. So before really dedicating myself to "mastering" a single programme, I thought I'd ask for your opinions, hoping to create a discussion.

What is best for time series analysis? Cross-sectional data? Panel data? R? MATLAB? Stata? Python (SciPy etc.)?

Which syntax do you prefer - C, C++, Python? Which software do you feel wins in terms of visual representation of data?

Assuming that uhm... *cough* *cough* there were no budget constraints, which one do you consider to be superior or to have that extra utility you need? Additionally, from your experience, proficiency in which software is most sought after as a job skill? From what I have seen in job descriptions (mostly in the financial sector), EViews and R are mentioned fairly frequently.

Obviously, knowing your econometric theory is the most important thing when conducting any empirical research, but it's 2015, and learning a new interface and syntax can be annoying and time-consuming. However, sticking to something outdated can also prove to be a bad decision in the long run. What's your opinion on these thoughts?

libre147
01-31-2015, 12:21 AM
I use Stata and SAS. These programs handle time series and panel data well. I find the syntax of these programs pretty straightforward and have no problem switching between them, provided you have a clean and ready data set. I don't think syntax changes are problematic, as they do not occur very frequently. Even when they do, you can still figure them out very quickly.

In terms of data manipulation, I prefer a combination of Stata, SAS, and Excel. Sometimes it is more time-efficient to switch among different platforms (that you already know how to use well) instead of searching for specific code to clean your data. Certain tasks can be done very quickly in Excel, some cannot. It is up to you to decide what to use and when to use it. Again, I am speaking from my experience only.

fakeo
01-31-2015, 08:52 AM
In terms of data manipulation, I prefer a combination of Stata, SAS, and Excel. Sometimes it is more time-efficient to switch among different platforms (that you already know how to use well) instead of searching for specific code to clean your data. Certain tasks can be done very quickly in Excel, some cannot. It is up to you to decide what to use and when to use it. Again, I am speaking from my experience only.

We've covered this before, and it's a terrible idea. Excel data cleaning isn't reproducible. Don't do it. For each project, the best approach is to stick to one programming language and do everything there. This includes downloading the data files (if applicable), cleaning them, plotting, and then doing your analyses. I use R, which is perfect for all of this. Visualization is amazing with ggplot2.
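To make that concrete, here's a minimal sketch of what a single-language R workflow can look like (the URL and variable names below are hypothetical placeholders, not a real data source):

library(ggplot2)

# Download and clean (hypothetical URL and columns)
raw <- read.csv("https://example.org/macro_panel.csv", stringsAsFactors = FALSE)
clean <- subset(raw, !is.na(gdp) & !is.na(unemployment))
clean$log_gdp <- log(clean$gdp)

# Plot with ggplot2
ggplot(clean, aes(x = unemployment, y = log_gdp)) +
  geom_point() +
  geom_smooth(method = "lm")

# Analysis
fit <- lm(log_gdp ~ unemployment, data = clean)
summary(fit)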

I don't know how easy it is to do all of this in Stata and other popular software. But it is a fact that in economics Stata is the most popular statistical software. In many other fields (e.g. statistics, biostatistics, finance), R is much more popular. R is also great because the vast majority of online courses use it. So if you're serious about mastering a piece of software and want to do it through self-study, R might be a convenient choice.

doomy
01-31-2015, 11:56 AM
I usually tend to process and transform (aggregations, logs, deflators, etc.) my data in Excel too. Loading a ready-to-use .xls into any statistical software seems quite straightforward, and I am quite used to Excel.

I never really thought about what fakeo said regarding reproducibility. It might be a serious time-wasting mistake I've been making.

libre147
01-31-2015, 03:12 PM
We've covered this before, and it's a terrible idea. Excel data cleaning isn't reproducible.

I don't agree with this generalization. Granted, unless you are cleaning your data only through simple steps like deleting columns or observations, the Excel steps may not be reproducible and Excel is not very time-effective. Certain tasks like merging, collapsing, expanding, or calculations based on preset data (e.g. distances between zip codes) may not be well suited to Excel. But some other tasks, such as line-by-line conditioning and weighted word-picking, can be done pretty quickly and visually in Excel. Although I am a SAS and Stata user, I do find Excel efficient enough in some cases to clean your data and explain what you have done in a clear and communicative manner.

The point I am making here is that if you can afford to be flexible, you can integrate many platforms to do your task. It is up to your preferences, your time, how complicated your data assignments are, how the data assignment is being reported, etc., to decide which method/program is the most efficient.

mcsokrates
02-01-2015, 02:18 AM
I'm usually a big proponent of using R, and I'd say I do the bulk of my work in R these days. But it should be said that when it comes to modelling, sometimes Stata is the path of least resistance. An example from my own work: fitting IV models with fixed effects and clustered standard errors is actually better done with the lfe package in R than in Stata. But let's say your advisor wants to see a first-difference model for robustness. Well, you can do that using the plm package, but not with clustered standard errors. So you can try to add that functionality to the package, which could take a day or two of programming time, or you can call write.dta(), run a couple of lines of Stata code (xtivreg2 y (x=z), fd clu(cluvar)), and you're done. This can even be done reproducibly within R using a call to system().
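For anyone curious, a rough sketch of that hand-off (assuming the foreign package is available in R and xtivreg2 is installed in Stata; the panel identifiers id/year, the file names, and the exact batch-mode call are assumptions that will vary by setup):

library(foreign)

write.dta(df, "panel.dta")                 # hand the data frame to Stata

do_lines <- c(
  "use panel.dta, clear",
  "xtset id year",                         # hypothetical panel identifiers
  "xtivreg2 y (x = z), fd cluster(cluvar)" # first-difference IV with clustered SEs
)
writeLines(do_lines, "fd_model.do")

system("stata -b do fd_model.do")          # batch flag/binary name depends on your install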

Lesson: your time is valuable. Use the best tools for the job at hand.

arrm
02-02-2015, 10:18 PM
For each project, the best approach is to stick to one programming language and do everything there. This includes downloading the data files (if applicable), cleaning them, plotting, and then doing your analyses. I use R, which is perfect for all of this. Visualization is amazing with ggplot2.

I strongly disagree. Different languages have different strengths, and it's a shame to limit yourself to one of them. I personally use Python for web scraping and coding custom procedures, Stata for most data cleaning and more basic analyses, and R for plotting and package availability. It works great, and I get to leverage the best parts of everything while avoiding as many of the weaknesses as possible. Just stick to compatible formats (keep data in .csv files) and set up batch/script files to automate as much as possible. It's also essential to have a great text editor that can edit and run everything so you don't have to switch setups for each language. Sublime Text (http://www.sublimetext.com/3) is fast, powerful, flexible, and extensible, and will easily handle every text-editing need you ever have with the right setup.
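A rough sketch of one way to wire such a pipeline together is a small driver script that runs each stage in order (shown here in R, though a plain batch file or Python script does the same job; the script names, file names, and column names are all hypothetical):

# Driver for a hypothetical scrape -> clean -> plot pipeline,
# with .csv files passed between the languages.
system("python scrape_prices.py")      # Python: scrape raw data to raw_prices.csv
system("stata -b do clean_prices.do")  # Stata: clean it and write clean_prices.csv

library(ggplot2)
prices <- read.csv("clean_prices.csv")
prices$date <- as.Date(prices$date)

p <- ggplot(prices, aes(x = date, y = price)) + geom_line()
ggsave("prices.pdf", plot = p)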

Of course, there are significant fixed costs to learning multiple languages and getting a setup that lets you smoothly manage your workflow, so it's not for everyone. And if you're collaborating you probably want to stick to languages that your collaborators are familiar with. But knowing multiple languages and being able to pick the right one for each task can be very valuable, and I think grad school/pre-grad school is the best time to pay those fixed costs if you want to. It helps that learning new languages is fairly easy once you know a couple; once you get the basic syntax and idioms you can mostly just rely on your search and reference skills.

Whatever you do, just don't use Excel. Reproducibility is an issue since you can't track your steps, you can't use source control for collaboration and version management, .xls and .xlsx are horrid data formats that only cause difficulties with other programs, it doesn't work well in server environments, you can't automate tasks with batch files and scripts, and it's completely incompatible with larger data sets so you'll probably be forced off it at some point either way. I only use it to glance at data occasionally, but you should switch to something else for anything more complicated or important than that.

PhDPlease
02-03-2015, 12:36 AM
I used Stata for my research assistant work prior to grad school and am continuing to use it. I mainly do regression analysis with a lot of data cleaning beforehand (merging/appending, formatting, and so on). I have found Stata sufficient for my needs, and I think learning C or C++ would be overkill for my interests. Stata is sufficient for the needs of many people, is not too difficult to learn, and has a user-friendly interface, so my sense is that it will remain popular and that more advanced languages will be used by those with particularly advanced needs but will not come to dominate the field as a whole.

fakeo
02-03-2015, 10:55 AM
I strongly disagree. Different languages have different strengths, and it's a shame to limit yourself to one of them.

Yes, of course you can leverage the strengths of each language. But in reality everything you mentioned (web scraping, data cleaning) is trivial to do in R, for instance, and also in Python. So yeah, you can switch between platforms and all of that, but unless one language is
- either much more efficient in carrying out a certain task (and you have a task at hand that requires a lot of computational power and efficiency is key),
- or much simpler in terms of its syntax/implementation of the task you want to perform,

I see no reason to use more than one language for a given task. But yeah, I agree there are situations in which you might want to use multiple languages.

Econhead
02-03-2015, 01:21 PM
Disclaimer: I don't do econometrics, nor am I interested in it. (By that I mean that I am not specializing in it, nor do I intend to, nor have I read through any recent (in the last 10 years) research in econometrics. If this is surprising, it's because I'm not in a Ph.D. program yet, and the grad work that I have done so far hasn't required it. Spare time has been focused on my research interests instead.)

One question:

In the econometric literature, do people ever say "Well, we started out using Python, did XYZ, then switched to another program to run ABC, and finally ended up running DEF in STATA and SAS to compare the results"? It seems like reproducibility would be more difficult when not using a single program, even if each individual step could be traced through several programs.

I agree that it is more efficient to use the right program for the right technique/project goal. But simply in terms of producing research for a journal, is this practical? It seems like using multiple programs for a single project would be more of a "do things check out / follow my intuition" exercise than something practical for publication.

Please advise if anyone can offer insight.

fakeo
02-03-2015, 02:50 PM
To be honest, I don't think this multi-software thing is advisable in econometric work. But I'm just guessing here. Most econometric work I see is trivial from a computational point of view. In most cases, no matter what software you use, the analysis can be carried out quickly. I think this would be more of a concern if someone did more computationally intensive work. So I'm still of the opinion that within economics staying with one single software/language is the right thing to do for most purposes.

rawls234
02-03-2015, 03:00 PM
I think that the reason people write things in multiple languages could be A) different research assistants or authors breaking up the work and choosing languages independently, or B) simply optimizing for time without a focus on reproducibility. Honestly, there should be no reproducibility issues, as you can just post your code on GitHub (if there isn't already, there should be a requirement to post any code used in a paper on GitHub, but that's purely my opinion).

Also, if you really think that web scraping is faster in R than Python, I would like some further justification for that. Using BeautifulSoup, web scraping in Python is absolutely painless, and I've found it rather annoying to do in R. Web scraping in general can be really easy or really difficult, but if you're just scraping a single page you can usually hardcode some stuff to get the desired results, though I wouldn't call the endeavor as a whole trivial :). Python is way faster than R, so if there's no reason to use R over Python, I use Python every time.

C++ is a devil of a language, but it's super fast if you know what you're doing with it. The classic Python ("CPython") compiles into it, and I wouldn't be surprised if R does too; if you look at performance tests online (or run your own) you'll see how much faster it is, but I see no reason to ever use it for stuff like this. I actually work for a company that does "data science as a service", and our main statistical engine is written in C++; you basically need to reinvent the wheel every time you want to do anything. For instance, if you want to use simplex or ridge regression in C++, you basically have to implement it yourself, and, as with most software development, it's probably not too hard to get it working for 80% of the cases, but there are a lot of nasty implementation details that will cause the other 20% of cases to not work. Unless the performance difference is absolutely killer, I'm always willing to take a performance hit to use a more widely used and reliable implementation in another language, especially if the code you're writing can be parallelized and executed on a compute cluster (which some R packages have made super easy to do, and Python has list comprehensions, etc.).
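On that last point, here's a tiny sketch of the kind of drop-in parallelism base R offers through the parallel package (the bootstrap statistic and data below are made up for the example):

library(parallel)

df <- data.frame(x = rnorm(1000))
df$y <- 2 * df$x + rnorm(1000)

# One bootstrap replication: resample rows, refit, return the slope
one_bootstrap <- function(i, data) {
  resampled <- data[sample(nrow(data), replace = TRUE), ]
  coef(lm(y ~ x, data = resampled))["x"]
}

cl <- makeCluster(max(1, detectCores() - 1))
boot_coefs <- parSapply(cl, 1:2000, one_bootstrap, data = df)
stopCluster(cl)

quantile(boot_coefs, c(0.025, 0.975))   # bootstrap interval for the slope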

rawls234
02-03-2015, 03:10 PM
Here's a relevant paper I found. Seems interesting. http://economics.sas.upenn.edu/~jesusfv/comparison_languages.pdf

fakeo
02-03-2015, 03:11 PM
Also, if you really think that web scraping is faster in R than Python, I would like some further justification for that.

For the record, I never said R was faster at web scraping. I just said it can do it. If you're doing data science, then that's another thing. But at least the kind of econometric papers I generally read really don't contain anything that even comes close to being computationally intensive. So jumping between languages seems like an unnecessary complication (from a reproducibility point of view).

I wholeheartedly agree that all code should be posted on GitHub. Unfortunately, in economics this is the exception rather than the rule.

arrm
02-03-2015, 03:20 PM
Yes, of course you can leverage the strengths of each language. But in reality everything you mentioned (web scraping, data cleaning) is trivial to do in R, for instance, and also in Python. So yeah, you can switch between platforms and all of that, but unless one language is
- either much more efficient in carrying out a certain task (and you have a task at hand that requires a lot of computational power and efficiency is key),
- or much simpler in terms of its syntax/implementation of the task you want to perform,

I see no reason to use more than one language for a given task. But yeah, I agree there are situations in which you might want to use multiple languages.

Personally, I have a strong preference against doing extensive coding in R. I dislike its syntax and idioms and it's a very slow language, but the biggest issue is that debugging in R is an extremely uninformative experience. Half of the procedures never error out and simply force their way through, filling your data with missing values and other nonsense, and the error messages you do get don't tell you much about the context of the error. If you ever get used to Python's detailed tracebacks and error messages it's very hard to go back, especially when you're coding non-trivial procedures that require debugging through multiple layers. As far as I know, R's internet packages are also underdeveloped; for instance, I've never seen an R package that can automatically handle cookies.

Really, I'm just not a fan of coding in R. When I use it I'm much less efficient, have many more bugs that are more difficult to find, and I have a lot less fun. But the number of packages written in it and the beauty of ggplot2 force me to use it occasionally. And I do really like Stata for a lot of basic data cleaning steps. It has nice and easy syntax to handle almost all of the standard operations that end up being a lot more verbose and complicated in other languages.


In the econometric literature, do people ever say "Well, we started out using Python, did XYZ, then switched to another program to run ABC, and finally ended up running DEF in STATA and SAS to compare the results"? It seems like reproducibility would be more difficult when not using a single program, even if each individual step could be traced through several programs.

I agree that it is more efficient to use the right program for the right technique/project goal. But simply in terms of producing research for a journal, is this practical? It seems like using multiple programs for a single project would be more of a "do things check out / follow my intuition" exercise than something practical for publication.

I am not familiar with the econometric literature. But as far as I know, people rarely talk about implementation in detail; the discussion is usually focused on the statistics/algorithm. If they do talk about it, they only discuss the implementation of the procedure itself, so it's impossible to know what they used to clean the data or handle other little steps. Source code is rarely released and almost never paid attention to.

I don't see how reproducibility would be an issue. I'm not suggesting you excessively intermix languages, like running half a procedure in one language and half in another. Having your data processing/cleaning in one language, your estimation procedure in another, and your figure creation in a third is very easy to manage.


To be honest, I don't think this multi-software thing is advisable in econometric work. But I'm just guessing here. Most econometric work I see is trivial from a computational point of view. In most cases, no matter what software you use, the analysis can be carried out quickly. I think this would be more of a concern if someone did more computationally intensive work. So I'm still of the opinion that within economics staying with one single software/language is the right thing to do for most purposes.

For one, I think that's a bit of a chicken-and-egg problem. Do people stick to a slow, inefficient language like R because their work is computationally trivial, or do people shy away from computationally intensive methods because they only know slow languages? At a time when a lot of cutting-edge statistics and machine learning work is being run on GPU farms and large clusters, most econometrics work is still being run in R or Matlab on a single processor.

Also, I was mostly talking about the average researcher. In my experience the majority of economics researchers are fairly programming-illiterate, which is part of why Stata is so popular. They generally aren't even using a "real" programming language like R or Python. But that limits you to mostly pre-programmed procedures; if you want to run anything custom, it's going to be very difficult in Stata. I think that's part of why most empirical researchers stick to very basic methods.

Computation is not the only reason you might want to switch languages. Python's sensible packaging, nice syntax and idioms, easy self-documentation, reasonable error patterns, and detailed tracebacks mean that I can often get two or three times as much work done in the same amount of time compared to R. And figuring out how to do any non-trivial web scraping in R is very painful, while Python has a very well-developed and easy-to-use internet infrastructure. But if you want to run a cutting-edge statistical procedure, you'll probably have to code it from scratch in Python, while there's almost certainly an R package that does most of the work for you. Knowing multiple languages will make you a more productive coder and will save you a lot of time if you can pay the up-front costs of learning them.

Econhead
02-03-2015, 03:43 PM
Also, I was mostly talking about the average researcher. In my experience the majority of economics researchers are fairly programming-illiterate, which is part of why Stata is so popular. They generally aren't even using a "real" programming language like R or Python. But that limits you to mostly pre-programmed procedures; if you want to run anything custom, it's going to be very difficult in Stata. I think that's part of why most empirical researchers stick to very basic methods.

Granted, I have no experience (yet) at programs better than T70, but my experience after talking with professors and listening to presentations at conferences is that most researchers who aren't involved with econometrics don't know the material well enough to really do much beyond "relatively basic methods." I also keep getting told that "when I went through my Ph.D. program we basically just learned least squares." This is from individuals who received their degrees from Ivies, but 20-30 years ago. (Note: I'm not saying this IS true, just that it wouldn't surprise me given the discussions I have had and the way material has been presented at conferences I've been to. I fully accept that this could be wrong.)

arrm
02-03-2015, 03:44 PM
C++ is a devil of a language, but it's super fast if you know what you're doing with it. The classic Python ("CPython") compiles into it, and I wouldn't be surprised if R does too; if you look at performance tests online (or run your own) you'll see how much faster it is, but I see no reason to ever use it for stuff like this. I actually work for a company that does "data science as a service", and our main statistical engine is written in C++; you basically need to reinvent the wheel every time you want to do anything. For instance, if you want to use simplex or ridge regression in C++, you basically have to implement it yourself, and, as with most software development, it's probably not too hard to get it working for 80% of the cases, but there are a lot of nasty implementation details that will cause the other 20% of cases to not work. Unless the performance difference is absolutely killer, I'm always willing to take a performance hit to use a more widely used and reliable implementation in another language, especially if the code you're writing can be parallelized and executed on a compute cluster (which some R packages have made super easy to do, and Python has list comprehensions, etc.).

On a side note, Python doesn't compile to C++ at all. The CPython interpreter is written in C (not C++), but the language is interpreted and never compiled to machine code (well, technically it compiles to an intermediate bytecode, but that doesn't buy much speed). Of course, Python can use libraries written in C as long as you provide the right APIs, tools like Cython exist to translate Python-like code to C, and packages like numpy and scipy provide APIs to C-speed implementations of common operations.

mcsokrates
02-03-2015, 05:53 PM
I think that's part of why most empirical researchers stick to very basic methods.


I'm sympathetic to the idea that (especially) applied micro people should be better programmers (and use R/Python more often). But the simplicity of methods is, I think, primarily (completely?) a legitimate philosophical decision. Applied micro people are interested first and foremost in identifying causal effects, and the profession has decided that the identifying assumptions for "basic methods" (i.e. the MHE toolkit) are more believable than those for more complicated methods. I would argue that programming ability is endogenous to this choice - researchers with poor programming skills are, at the margin, more likely to specialize in applied micro (versus, say, doing Bayesian estimation of DSGEs), knowing that the accepted toolkit is relatively basic.

HungryGriot
02-05-2015, 10:09 PM
This is a helpful thread. Similar to the OP, I have strong Excel and Stata training but no true programming training. I plan to take the time to master one language prior to enrolling this fall. When it comes to the actual coding/language, is there a base syntax that translates between programs? I.e., if I spend months learning Matlab, will I have some proficiency that carries over to R? I have some on-the-job training with SQL as well.

arrm
02-06-2015, 05:54 AM
This is a helpful thread. Similar to the OP, I have strong Excel and Stata training but no true programming training. I plan to take the time to master one language prior to enrolling this fall. When it comes to the actual coding/language, is there a base syntax that translates between programs? I.e., if I spend months learning Matlab, will I have some proficiency that carries over to R? I have some on-the-job training with SQL as well.
General coding skills will obviously transfer over, but language-specific syntax and idioms will not. Though, in my experience, it's not as much of an issue as you might think. Learning the basic syntax doesn't take too much time, and a surprisingly large share of your knowledge will come in the form of the ability to search for help and read documentation. If you have a good idea of the basic tools and how to use them, learning the syntax to call them isn't too hard. Every new language you learn also makes learning the next one easier and faster.

For instance, most of my training and experience has been with Python and R. But I can pick up Matlab and code competently in it, even if I might miss some language idioms or specific functions that would make me or my code more efficient. It would take me significantly more time to pick up something like C again, but all of the languages that most people would consider in econ occupy the same general language space (high-level, dynamic, scripting languages), so most things are fairly similar.

HungryGriot
02-06-2015, 01:30 PM
^^ Many thanks!

behavingmyself
02-06-2015, 05:30 PM
Here's a relevant paper I found. Seems interesting. http://economics.sas.upenn.edu/~jesusfv/comparison_languages.pdf
These kinds of speed considerations are important for people (most commonly in macro) who are trying to computationally solve complicated theoretical models, or people (commonly in structural labor or IO) who have to estimate the parameters of a complex model from data. Picking the wrong language can waste enormous amounts of time waiting for the code to run.

People who are doing simple statistical analyses (like most labor economists) generally don't need to worry about their code taking a week to run, so they prioritize attributes like ease of use and compatibility. Typically you will see these types of people using R or especially Stata.

To get a sense of what people use, one can just look at published papers. Many journals now require authors to post their code and (if possible) data. So just go to the AER site and click the data files for whatever papers you find interesting.

ColonelForbin
02-06-2015, 11:08 PM
These kinds of speed considerations are important for people (most commonly in macro) who are trying to computationally solve complicated theoretical models, or people (commonly in structural labor or IO) who have to estimate the parameters of a complex model from data. Picking the wrong language can waste enormous amounts of time waiting for the code to run.

People who are doing simple statistical analyses (like most labor economists) generally don't need to worry about their code taking a week to run, so they prioritize attributes like ease of use and compatibility. Typically you will see these types of people using R or especially Stata.

To get a sense of what people use, one can just look at published papers. Many journals now require authors to post their code and (if possible) data. So just go to the AER site and click the data files for whatever papers you find interesting.


This is a great comment. The thing about economists is that we aren't really in the business of making code efficient. Efficiency is helpful, but it isn't necessary in the way you could imagine it being for someone programming a huge relational database for Amazon.

CPU time is (essentially) free. If you're going to solve some huge model, you can let it run while you do other research for two weeks. It's not ideal, but neither is spending two months learning how to program it in a faster language, unless of course you're going to do this sort of thing time and time again. It's nice to have instant results, but I don't think that justifies learning a brand new programming language. Now -- this is really just an argument about fixed costs and marginal costs. I think it makes sense to become a guru in something, or a few somethings. You'll be able to conceptualize your ideas better in that language, spending less time learning code and more time being an economist.

There isn't much you cannot do in R, Stata, and MATLAB (maybe Mathematica too) in the realm of econ. Of course, there will be idiosyncratic cases. But academic economists get paid quite well -- if you need something really crazy, just hire a freelancer. Maybe even a computer science student who might be interested in economics.

scholar
02-07-2015, 01:24 AM
First of all, learning basic syntax and learning how to use a computational environment efficiently to get work done are not the same thing. The latter can take many years. Moreover, it pays off a lot to learn one tool deeply rather than many tools superficially.


Personally, I have a strong preference against doing extensive coding in R. I dislike its syntax and idioms and it's a very slow language, but the biggest issue is that debugging in R is an extremely uninformative experience.

This basically says that you have very poor R skills, if any at all. Learn how to use the language properly and use efficient algorithms. R cannot be slower in doing any standard regression modelling, because it uses BLAS for doing the underlying computations. So even if you reprogram your estimations in Fortran, there will be no real computational speed advantage. A very common mistake that beginners make is using loops rather than using linear algebra to implement the same calculation.
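To illustrate the loops-versus-linear-algebra point, here is a small self-contained example with simulated data (the dimensions are arbitrary):

set.seed(1)
n <- 1e5
X <- cbind(1, matrix(rnorm(n * 5), n, 5))   # design matrix with an intercept
y <- X %*% c(2, 1, -1, 0.5, 0, 3) + rnorm(n)

# Loop-based X'X: thousands of tiny R-level operations, slow
xtx_loop <- matrix(0, ncol(X), ncol(X))
for (i in 1:n) xtx_loop <- xtx_loop + tcrossprod(X[i, ])

# Vectorized X'X: a single call that goes straight to BLAS
xtx_blas <- crossprod(X)

# OLS via the normal equations, all in linear algebra
beta_hat <- solve(crossprod(X), crossprod(X, y))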

arrm
02-07-2015, 04:01 AM
This basically says that you have very poor R skills, if any at all. Learn how to use the language properly and use efficient algorithms. R cannot be slower in doing any standard regression modelling, because it uses BLAS for doing the underlying computations. So even if you reprogram your estimations in Fortran, there will be no real computational speed advantage. A very common mistake that beginners make is using loops rather than using linear algebra to implement the same calculation.

I actually do understand all of that, but unfortunately not everything is standard regression modeling or can be done with relatively simple linear algebra manipulations. The stuff outside of BLAS often dominates your computation time; it all depends on the nature of what you're doing.

I also find programming in Python much more fun and efficient, though this is largely a personal choice, especially for more complicated work. I just can't stress enough how much more productive Python's error messages and stack traces make debugging. In comparison, half the time R decides not to throw an error and simply forces the operation through with nonsense results, and the other half the error barely gives you any idea of what's actually going wrong or the context for where and why.

Though I should note that I am actually not going into Economics academia. I've ended up as more of a Statistics/Machine Learning person even though the last several years of my life were spent prepping for Econ graduate school. So my use-cases are not necessarily standard Econ use-cases.