Paul Novosad

Stata-Tex: Create custom LaTeX tables from Stata.
It's often necessary to produce an output table that doesn't fit any of the formats provided by standard tools such as outreg and esttab. Some examples:

Showing p-values for differences of coefficients
Putting certain coefficients in bold or colors
Putting different outcome variables in the same row of a regression table
Multi-panel tables with different formatting in each panel

Stata-tex allows you to separate the LaTeX table template from the table data. This lets you set up and compile exactly the LaTeX table you want, with placeholders for the data. Then you can generate the data separately, and transfer it into the LaTeX table automatically.
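
To make the template/data separation concrete, here is a minimal Python sketch of the idea. It is not stata-tex's own code: the `<<key>>` placeholder syntax, the `fill_template` function and the example keys and numbers are all hypothetical, chosen only for illustration.

```python
import re

def fill_template(template, values):
    """Replace every <<key>> placeholder in a LaTeX template with its value.

    The <<key>> syntax is an assumption made for this sketch; it is not
    necessarily the placeholder format that stata-tex itself uses.
    """
    def substitute(match):
        key = match.group(1)
        if key not in values:
            # Fail loudly so an unfilled cell never reaches the compiled table.
            raise KeyError(f"no value supplied for placeholder {key!r}")
        return values[key]

    return re.sub(r"<<(\w+)>>", substitute, template)

# The template controls all formatting (bold, parentheses, row layout).
template = r"""\begin{tabular}{lc}
Treatment & \textbf{<<beta_treat>>} \\
          & (<<se_treat>>) \\
p-value, treatment = control & <<p_diff>> \\
\end{tabular}"""

# The data would be produced separately, e.g. written out by Stata after
# estimation. These numbers are made up.
values = {"beta_treat": "0.042", "se_treat": "0.011", "p_diff": "0.003"}

filled = fill_template(template, values)
print(filled)
```

Because the formatting lives entirely in the template, rerunning the estimation only changes the key-value data, not the table layout.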

stata-tex on GitHub

Masala Merge: Fuzzy matching of Hindi (or any) names.
This is Python and Stata code for fuzzy merging of Hindi names. The algorithm is based on the Levenshtein edit distance, which counts the substitutions, deletions and insertions required to get from one word to another. We modified this to lower the cost of certain substitutions that are common in Hindi names, e.g. KS->X would have a cost of 2 in Levenshtein (one substitution plus one deletion); we assign it a cost of 0.2. Modifying this code for another language consists only of changing this list of costs.
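
The modified distance can be sketched in Python as below. This is an illustration of the technique, not the masala-merge source: the function name, the `cheap_subs` argument and the choice to apply each cheap substitution in both directions are assumptions. Only the KS->X cost of 0.2 comes from the description above.

```python
def weighted_edit_distance(a, b, cheap_subs=None):
    """Levenshtein distance with extra low-cost multi-character substitutions.

    cheap_subs maps (source, target) string pairs to a cost, e.g.
    {("KS", "X"): 0.2}. Each pair is applied in both directions, which is
    an assumption of this sketch.
    """
    rules = []
    for (src, dst), cost in (cheap_subs or {}).items():
        rules.append((src, dst, cost))
        rules.append((dst, src, cost))

    n, m = len(a), len(b)
    # d[i][j] = cheapest way to turn a[:i] into b[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = float("inf")
            if i > 0:
                best = min(best, d[i - 1][j] + 1)      # delete a[i-1]
            if j > 0:
                best = min(best, d[i][j - 1] + 1)      # insert b[j-1]
            if i > 0 and j > 0:
                same = a[i - 1] == b[j - 1]
                best = min(best, d[i - 1][j - 1] + (0 if same else 1))
            # Cheap substitutions: a[:i] ends with src and b[:j] ends with dst.
            for src, dst, cost in rules:
                if a[:i].endswith(src) and b[:j].endswith(dst):
                    best = min(best, d[i - len(src)][j - len(dst)] + cost)
            d[i][j] = best
    return d[n][m]

plain = weighted_edit_distance("LAKSMI", "LAXMI")
tuned = weighted_edit_distance("LAKSMI", "LAXMI", {("KS", "X"): 0.2})
print(plain, tuned)
```

With no cost table the function reduces to ordinary Levenshtein distance, so adapting it to another language only means supplying a different `cheap_subs` dictionary.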

In addition to calculating edit distances, the program uses a default set of "smart" thresholds to determine which fuzzy matches to accept and which ones to reject. The premise is that you can tolerate a higher edit distance when matching very long words. The smart thresholds also reject a match if the next nearest candidate is very close: even an exact match should be considered uncertain if there's another very close match.

We calibrated costs, common substitutions and smart distance thresholds by analyzing results from a set of 500,000 known village name matches from the 1991 and 2001 population censuses. Higher or lower thresholds may be desirable depending on what you are trying to match. We use very conservative thresholds; we put a much higher cost on incorrect matches than on missed matches. You can raise or lower all thresholds proportionately with the fuzziness() parameter. The default is 1.
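
A rough Python sketch of an acceptance rule of this kind follows. The function name and every numeric default (`base`, `per_char`, `margin`) are hypothetical and are not the calibrated values used by masala-merge. Scaling only the distance threshold by `fuzziness`, and not the runner-up margin, is also an assumption.

```python
def accept_match(best, runner_up, word_length, fuzziness=1.0,
                 base=0.4, per_char=0.1, margin=0.5):
    """Decide whether to accept the closest candidate match.

    best        -- edit distance to the closest candidate
    runner_up   -- edit distance to the second-closest candidate, or None
    word_length -- length of the word being matched
    fuzziness   -- multiplier on the distance threshold (1.0 = default)

    base, per_char and margin are made-up illustrative numbers.
    """
    # Longer words tolerate a larger edit distance.
    threshold = fuzziness * (base + per_char * word_length)
    if best > threshold:
        return False
    # Reject ambiguous matches: the runner-up must be clearly worse,
    # even when the best match is exact.
    if runner_up is not None and runner_up - best < margin:
        return False
    return True

print(accept_match(0.2, 3.0, word_length=10))                 # clear winner
print(accept_match(0.0, 0.2, word_length=10))                 # ambiguous exact match
print(accept_match(1.2, 3.0, word_length=4))                  # too far for a short word
print(accept_match(1.2, 3.0, word_length=4, fuzziness=2.0))   # looser threshold
```

A conservative setup like the one described above corresponds to small thresholds and a large margin: more true matches are missed, but fewer incorrect matches are accepted.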

There might be a better fuzzy matching program out there - if so, please let me know about it! On location name matches, masala-merge consistently outperforms Stata's reclink. But reclink's string similarity algorithm is going to do better, for example, if you want to match "Dell Inc." to "Dell Incorporated".

If you want to optimize this for another language, please let me know about it! Just a few lines need to be changed (but you need to figure out the common substitutions), and I'd be happy to post the new language function here.

masala-merge on GitHub

Paul Novosad

Associate Prof. of Economics
6106 Rockefeller Hall, Room 301
Hanover, NH 03755
Email: paul.novosad AT dartmouth.edu