Judging text entry compositions via Amazon Mechanical Turk
----------------------------------------------------------
http://keithv.com/software/judging

This directory contains scripts and HTML code that assist in crowdsourced
judging of a set of sentences. It is intended for use in text entry
evaluations in which a freeform short message composition task is used. We
have built this method to work with Amazon Mechanical Turk using a standard
HTML + JavaScript page. The human intelligence task (HIT) is designed to work
with the standard facilities provided by Amazon, without the need for an
external web server.

You will need a computer with Perl. The scripts require the Text::CSV_XS
module for parsing comma-separated files. Typically you can install this
package on a Linux system with:

% sudo perl -MCPAN -e 'install Text::CSV_XS'

Process overview
----------------
Workers are shown a set of single-line text compositions, one composition at
a time. The worker then rates each text on a 3-point scale:

  2 - Completely correct, no errors in spelling, grammar, punctuation or spacing.
  1 - Needs correction.
  0 - Makes no sense and/or is impossible to correct.

If a text is scored 1, the worker is asked to make their best effort to
correct it. Multiple workers can be asked to judge and correct each text.

A certain number of texts with known corrections can be injected. After the
workers finish their judging, the accuracy of each worker is assessed against
these injected known corrections. Workers above an accuracy threshold are
then used to generate a judged character error rate (CER) for each
composition. The judged CER of a composition is the mean of the CERs from all
judges in the pool. If a judge gave an entry a 0, that judge's CER is taken
to be 100%. If a judge gave an entry a 2, that judge's CER is taken to be 0%.
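The sketch below illustrates how a judged CER could be aggregated under these
rules. It is only an illustration, not one of the released scripts: it
assumes a judge's CER is the character-level edit distance between the
original text and that judge's correction, divided by the length of the
correction, and the exact computation in JudgeStats.pl may differ.

  #!/usr/bin/perl
  # Illustration only: aggregate a judged CER for a single composition.
  use strict;
  use warnings;

  # Character-level Levenshtein distance between two strings.
  sub edit_distance {
      my ($s, $t) = @_;
      my @s = split //, $s;
      my @t = split //, $t;
      my @prev = (0 .. scalar(@t));
      for my $i (1 .. scalar(@s)) {
          my @cur = ($i);
          for my $j (1 .. scalar(@t)) {
              my $cost = ($s[$i - 1] eq $t[$j - 1]) ? 0 : 1;
              my $best = $prev[$j] + 1;                                         # deletion
              $best = $cur[$j - 1] + 1      if ($cur[$j - 1] + 1 < $best);      # insertion
              $best = $prev[$j - 1] + $cost if ($prev[$j - 1] + $cost < $best); # match/substitution
              push @cur, $best;
          }
          @prev = @cur;
      }
      return $prev[-1];
  }

  # Each judgment is [score, corrected text]: a 2 contributes 0% CER, a 0
  # contributes 100% CER, and a 1 contributes the CER of the correction
  # against the original (assumed normalization: length of the correction).
  sub judged_cer {
      my ($orig, @judgments) = @_;
      my @cers;
      foreach my $judgment (@judgments) {
          my ($score, $fixed) = @$judgment;
          if    ($score == 2) { push @cers, 0.0; }
          elsif ($score == 0) { push @cers, 100.0; }
          else  { push @cers, 100.0 * edit_distance($orig, $fixed) / length($fixed); }
      }
      my $sum = 0.0;
      $sum += $_ foreach @cers;
      return $sum / scalar(@cers);
  }

  my $orig = "teh cat sat on the mat";
  printf("judged CER = %.1f%%\n",
         judged_cer($orig,
                    [ 1, "the cat sat on the mat" ],
                    [ 1, "the cat sat on the mat." ]));

With these two hypothetical judges, the first correction is 2 edits away from
the 22-character original (about 9.1% CER) and the second also adds a period
(3 edits, about 13.0% CER), so the script prints a judged CER of about 11.1%.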
Detailed steps
--------------
1) Produce a tab-delimited text file (e.g. exp_data.txt) containing 2 columns:

   column 1 - unique ID of the composition to be judged
   column 2 - text of the composition to be judged

2) Convert the tab-delimited file of entries to be judged into a
   comma-delimited file containing the sets of entries to be judged in a
   single HIT. Optionally, a second file can be used to provide erroneous
   sentences with known corrections. This allows you to assess the quality
   of a particular worker. We have provided the file known_corrections.txt
   containing the known corrections used in our study.

   For example, to create a CSV where each HIT contains 30 sentences, 10 of
   them drawn from a file of known corrections:

   % perl JudgeInput.pl exp_data.txt 30 known_corrections.txt 10 > set_30_10.csv

3) Create a new HIT on Amazon Mechanical Turk, see:
   https://requester.mturk.com/mturk/resources

   Paste in the HTML code from judge_comp.html. When publishing the HIT, use
   the CSV output from JudgeInput.pl; each row of the CSV generates a HIT for
   one set of texts.

4) Download the resulting CSV file after the workers have finished. Create a
   tab-delimited output file that has one line per judgment made by each
   worker:

   % perl JudgeResults.pl BatchFromAmazon.csv exp_data.txt known_corrections.txt > results.txt

   NOTE: JudgeResults.pl drops any work that is marked as rejected and keeps
   everything else. You may therefore want to review the output of this step
   to decide whether to reject any submissions and allow those HITs to be
   fielded again.

5) Compute aggregate statistics for each original composition, eliminating
   judgments from workers who scored too low on the known corrections. For
   example, to use only workers who (over all the HITs they did) got >= 70%
   accuracy on the known corrections in their sets:

   % perl JudgeStats.pl results.txt 70 known_corrections.txt > stats.txt

   The stats output file has the following columns:

   id - unique ID of the original text that was judged
   num - how many workers above the accuracy cutoff were used to calculate
      the stats
   avg_judge - average judged score, 2 = completely correct,
      1 = correctable, 0 = uncorrectable
   sd_judge - standard deviation of the judged score
   avg_cer - average character error rate (CER) between the judges'
      corrections and the original text
   sd_cer - standard deviation of the CER
   median_cer - median of the CER
   avg_cer_lower - average CER, ignoring case
   sd_cer_lower - standard deviation of the CER, ignoring case
   median_cer_lower - median of the CER, ignoring case
   avg_cer_lower_nopunc - average CER, ignoring case, stripping punctuation
      (except apostrophe)
   sd_cer_lower_nopunc - standard deviation of the CER, ignoring case,
      stripping punctuation (except apostrophe)
   median_cer_lower_nopunc - median of the CER, ignoring case, stripping
      punctuation (except apostrophe)
   orig - the original text the worker was given to correct
   fixed - fixed versions by the judges (one column for each judge)

   (A rough sketch of the case and punctuation normalization described by
   the *_lower and *_lower_nopunc columns appears at the end of this file.)

We recommend using the median as the overall estimate of the error rate of
the compositions. This tends to produce an error rate consistent with the
majority of the judges, as it eliminates outlier judges, e.g. judges who
aggressively reword or add punctuation when not strictly necessary.

Have fun!

Keith Vertanen
Per Ola Kristensson

Revision history
----------------
June 20, 2013    First release of judging resource.
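Example: case and punctuation normalization
-------------------------------------------
The *_lower and *_lower_nopunc statistics above are described as ignoring
case and stripping punctuation other than apostrophes. The snippet below is a
rough sketch of that kind of normalization. It is not taken from
JudgeStats.pl; details such as whitespace handling and the treatment of
digits or non-ASCII characters are assumptions.

  #!/usr/bin/perl
  # Illustration only: lowercase a text and strip punctuation except
  # apostrophes, as described for the *_lower_nopunc columns.
  use strict;
  use warnings;

  sub normalize_lower_nopunc {
      my ($text) = @_;
      my $result = lc($text);        # ignore case
      $result =~ s/[^a-z0-9' ]//g;   # keep only letters, digits, apostrophes, spaces
      $result =~ s/ +/ /g;           # collapse runs of spaces left behind
      $result =~ s/^ +| +$//g;       # trim leading/trailing spaces
      return $result;
  }

  print normalize_lower_nopunc("Don't worry -- it's FINE!"), "\n";
  # prints: don't worry it's fine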