Stata-Tex: Create custom LaTeX tables from Stata.
It's often necessary to produce an output table that doesn't fit any of the formats provided by standard tools such as outreg and esttab. Some examples:
- Showing p-values for differences of coefficients
- Putting certain coefficients in bold or colors
- Putting different outcome variables in the same row of a regression table
- Multi-panel tables with different formatting in each panel
Stata-tex allows you to separate the LaTeX table template from the table data. This lets you set up and compile exactly the LaTeX table you want, with placeholders for the data. Then you can generate the data separately, and transfer it into the LaTeX table automatically.
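The template/data separation described above can be illustrated with a minimal Python sketch. It is not Stata-tex's own code: the `<<key>>` placeholder syntax, the `fill_template` function and the numbers are all hypothetical, chosen only to show the idea of compiling the table layout once and dropping values in later.

```python
import re


def fill_template(template: str, values: dict) -> str:
    """Replace each <<key>> placeholder with values[key].

    The <<key>> syntax is a hypothetical choice for this sketch.
    An unknown key raises KeyError so a missing estimate is not
    silently left in the compiled table.
    """
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder {key!r}")
        return values[key]

    return re.sub(r"<<(\w+)>>", substitute, template)


# Table layout, written and debugged once in LaTeX.
template = r"""\begin{tabular}{lc}
Treatment & <<beta_treat>> \\
          & (<<se_treat>>) \\
p-value, difference & <<p_diff>> \\
\end{tabular}"""

# Illustrative (made-up) values; in practice these would be exported from Stata.
values = {"beta_treat": "0.125", "se_treat": "0.031", "p_diff": "0.042"}

filled = fill_template(template, values)
print(filled)
```

Because the layout lives entirely in the template, rerunning the analysis only changes the values, not the table design.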
Masala Merge: Fuzzy matching of Hindi (or any) names.
This is Python and Stata code for fuzzy merging Hindi names. The algorithm is based on the Levenshtein edit distance, which counts the number of substitutions, deletions and insertions required to get from one word to another. We modified this to lower the cost of certain substitutions that are common in Hindi, e.g. KS->X would have a cost of 2 in Levenshtein; we assign it a cost of 0.2. Modifying this code for another language consists only of changing this list of costs.
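As a rough illustration of the modified edit distance, here is a minimal Python sketch. It is not the masala-merge implementation: the function name `masala_distance`, the `cheap_subs` table format, and the choice to treat each rule as symmetric are assumptions made for this example. Only the KS->X rule and its cost of 0.2 come from the description above.

```python
def masala_distance(a: str, b: str, cheap_subs: dict) -> float:
    """Levenshtein distance with discounted multi-character substitutions.

    cheap_subs maps (source, target) string pairs to a reduced cost,
    e.g. {("KS", "X"): 0.2}. Each rule is applied in both directions
    (an assumption of this sketch). All other single-character
    substitutions, insertions and deletions cost 1.
    """
    rules = {}
    for (src, dst), cost in cheap_subs.items():
        rules[(src, dst)] = cost
        rules[(dst, src)] = cost

    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] = cheapest way to turn a[:i] into b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            cur = d[i][j]
            if cur == inf:
                continue
            if i < n:  # delete a[i]
                d[i + 1][j] = min(d[i + 1][j], cur + 1)
            if j < m:  # insert b[j]
                d[i][j + 1] = min(d[i][j + 1], cur + 1)
            if i < n and j < m:  # match (free) or ordinary substitution
                step = 0 if a[i] == b[j] else 1
                d[i + 1][j + 1] = min(d[i + 1][j + 1], cur + step)
            # discounted substitutions such as KS -> X
            for (src, dst), cost in rules.items():
                if a.startswith(src, i) and b.startswith(dst, j):
                    ni, nj = i + len(src), j + len(dst)
                    d[ni][nj] = min(d[ni][nj], cur + cost)
    return d[n][m]


hindi_costs = {("KS", "X"): 0.2}
print(masala_distance("LAKSMI", "LAXMI", hindi_costs))  # discounted rule applies
print(masala_distance("LAKSMI", "LAXMI", {}))           # plain Levenshtein
```

With an empty cost table the function reduces to ordinary Levenshtein distance, so adapting the sketch to another language would mean supplying a different `cheap_subs` table.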
In addition to calculating edit distances, the program uses a default set of "smart" thresholds to determine which fuzzy matches to accept and which ones to reject. The premise is that you can tolerate a higher edit distance when matching very long words. The smart thresholds also reject matches if the next nearest match is very close - even an exact match should be considered uncertain if there's another very close match.
We calibrated costs, common substitutions and smart distance thresholds by analyzing results from a set of 500,000 known village name matches from the 1991 and 2001 population censuses. Higher or lower thresholds may be desirable depending on what you are trying to match. We use very conservative thresholds; we put a much higher cost on incorrect matches than on missed matches. You can raise or lower all thresholds proportionately with the fuzziness() parameter. The default is 1.
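The acceptance logic can be sketched in Python as follows. This is only a guess at the shape of the rule: the function `accept_match`, the constants 0.35, 0.05 and 0.5, and the decision to scale only the distance threshold by `fuzziness` are hypothetical and are not the calibrated values or behavior of masala-merge.

```python
def accept_match(best, second_best, word_length, fuzziness=1.0):
    """Decide whether to accept the closest candidate match.

    best        -- edit distance to the closest candidate
    second_best -- edit distance to the runner-up, or None if there is none
    word_length -- length of the word being matched
    fuzziness   -- multiplier on the distance threshold (default 1)
    """
    # Hypothetical threshold: longer words tolerate a larger edit distance.
    max_distance = fuzziness * (0.35 + 0.05 * word_length)
    # Hypothetical margin: the runner-up must be clearly worse than the best.
    min_gap = 0.5

    if best > max_distance:
        return False
    if second_best is not None and second_best - best < min_gap:
        return False
    return True


print(accept_match(0.2, 3.0, word_length=6))   # close match, distant runner-up
print(accept_match(0.0, 0.2, word_length=6))   # exact match, but runner-up too close
print(accept_match(1.0, 4.0, word_length=6))   # too far for a short word
print(accept_match(1.0, 4.0, word_length=6, fuzziness=2.0))  # looser threshold
```

The second call shows the conservative side of the design: an exact match is still rejected when another candidate is nearly as close.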
There might be a better fuzzy matching program out there - if so, please let me know about it! On location name matches, masala-merge consistently outperforms Stata's reclink. But reclink's string similarity algorithm is going to do better, for example, if you want to match "Dell Inc." to "Dell Incorporated".
If you want to optimize this for another language, please let me know about it! Just a few lines need to be changed (but you need to figure out the common substitutions), and I'd be happy to post the new language function here.