Judging text entry compositions via Amazon Mechanical Turk
----------------------------------------------------------
http://keithv.com/software/judging

This directory contains scripts and HTML code that assist in crowdsourced
judging of a set of sentences. It is intended for use in text entry
evaluations in which a freeform short message composition task is used. We
have built this method to work with Amazon Mechanical Turk using a standard
HTML + JavaScript page. The human intelligence task (HIT) is designed to work
with the standard facilities provided by Amazon, without the need for an
external web server.

You will need a computer with Perl. The scripts require the Text::CSV_XS
module for parsing comma-separated files. Typically you can install this
package on a Linux system with:

% sudo perl -MCPAN -e 'install Text::CSV_XS'

Process overview
----------------
Workers are shown a set of single-line text compositions, one composition at
a time. The worker then rates each text on a 3-point scale:

  2 - Completely correct, no errors in spelling, grammar, punctuation or spacing.
  1 - Needs correction.
  0 - Makes no sense and/or is impossible to correct.

If a text is scored 1, the worker is asked to make their best effort to
correct it. Multiple workers can be asked to judge and correct each text.

A certain number of texts with known corrections can be injected. After the
workers finish their judging, the accuracy of each worker is assessed against
these injected known corrections. Workers above an accuracy threshold are
then used to generate a judged character error rate (CER) for each
composition. The judged CER of a composition is the mean of the CERs from all
judges in the pool. If a judge gave an entry a 0, that judge's CER is taken
to be 100%. If a judge gave an entry a 2, that judge's CER is taken to be 0%.
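The sketch below illustrates how a judged CER could be aggregated under these
rules. It is only an illustration, not one of the released scripts: it
assumes a judge's CER is the character-level edit distance between the
original text and that judge's correction, divided by the length of the
correction, and the exact computation in JudgeStats.pl may differ.

  #!/usr/bin/perl
  # Illustration only: aggregate a judged CER for a single composition.
  use strict;
  use warnings;

  # Character-level Levenshtein distance between two strings.
  sub edit_distance {
      my ($s, $t) = @_;
      my @s = split //, $s;
      my @t = split //, $t;
      my @prev = (0 .. scalar(@t));
      for my $i (1 .. scalar(@s)) {
          my @cur = ($i);
          for my $j (1 .. scalar(@t)) {
              my $cost = ($s[$i - 1] eq $t[$j - 1]) ? 0 : 1;
              my $best = $prev[$j] + 1;                                         # deletion
              $best = $cur[$j - 1] + 1      if ($cur[$j - 1] + 1 < $best);      # insertion
              $best = $prev[$j - 1] + $cost if ($prev[$j - 1] + $cost < $best); # match/substitution
              push @cur, $best;
          }
          @prev = @cur;
      }
      return $prev[-1];
  }

  # Each judgment is [score, corrected text]: a 2 contributes 0% CER, a 0
  # contributes 100% CER, and a 1 contributes the CER of the correction
  # against the original (assumed normalization: length of the correction).
  sub judged_cer {
      my ($orig, @judgments) = @_;
      my @cers;
      foreach my $judgment (@judgments) {
          my ($score, $fixed) = @$judgment;
          if    ($score == 2) { push @cers, 0.0; }
          elsif ($score == 0) { push @cers, 100.0; }
          else  { push @cers, 100.0 * edit_distance($orig, $fixed) / length($fixed); }
      }
      my $sum = 0.0;
      $sum += $_ foreach @cers;
      return $sum / scalar(@cers);
  }

  my $orig = "teh cat sat on the mat";
  printf("judged CER = %.1f%%\n",
         judged_cer($orig,
                    [ 1, "the cat sat on the mat" ],
                    [ 1, "the cat sat on the mat." ]));

With these two hypothetical judges, the first correction is 2 edits away from
the 22-character original (about 9.1% CER) and the second also adds a period
(3 edits, about 13.0% CER), so the script prints a judged CER of about 11.1%.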
Detailed steps
--------------
1) Produce a tab-delimited text file (e.g. exp_data.txt) containing 2 columns:

   column 1 - unique ID of the composition to be judged
   column 2 - text of the composition to be judged

2) Convert the tab-delimited file of entries to be judged into a
   comma-delimited file containing the sets of entries to be judged in a
   single HIT. Optionally, a second file can be used to provide erroneous
   sentences with known corrections. This allows you to assess the quality
   of a particular worker. We have provided the file known_corrections.txt
   containing the known corrections used in our study.

   For example, to create a CSV where each HIT contains 30 sentences, 10 of
   them drawn from a file of known corrections:

   % perl JudgeInput.pl exp_data.txt 30 known_corrections.txt 10 > set_30_10.csv

3) Create a new HIT on Amazon Mechanical Turk, see:
   https://requester.mturk.com/mturk/resources

   Paste in the HTML code from judge_comp.html. When publishing the HIT, use
   the CSV output from JudgeInput.pl; each row of the CSV generates a HIT for
   one set of texts.

4) Download the resulting CSV file after the workers have finished. Create a
   tab-delimited output file that has one line per judgment made by each
   worker:

   % perl JudgeResults.pl BatchFromAmazon.csv exp_data.txt known_corrections.txt > results.txt

   NOTE: JudgeResults.pl drops any work that is marked as rejected and keeps
   everything else. You may therefore want to review the output of this step
   to decide whether to reject any submissions and allow those HITs to be
   fielded again.

5) Compute aggregate statistics for each original composition, eliminating
   judgments from workers who scored too low on the known corrections. For
   example, to use only workers who (over all the HITs they did) got >= 70%
   accuracy on the known corrections in their sets:

   % perl JudgeStats.pl results.txt 70 known_corrections.txt > stats.txt

   The stats output file has the following columns:

   id - unique ID of the original text that was judged
   num - how many workers above the accuracy cutoff were used to calculate
      the stats
   avg_judge - average judged score, 2 = completely correct,
      1 = correctable, 0 = uncorrectable
   sd_judge - standard deviation of the judged score
   avg_cer - average character error rate (CER) between the judges'
      corrections and the original text
   sd_cer - standard deviation of the CER
   median_cer - median of the CER
   avg_cer_lower - average CER, ignoring case
   sd_cer_lower - standard deviation of the CER, ignoring case
   median_cer_lower - median of the CER, ignoring case
   avg_cer_lower_nopunc - average CER, ignoring case, stripping punctuation
      (except apostrophe)
   sd_cer_lower_nopunc - standard deviation of the CER, ignoring case,
      stripping punctuation (except apostrophe)
   median_cer_lower_nopunc - median of the CER, ignoring case, stripping
      punctuation (except apostrophe)
   orig - the original text the worker was given to correct
   fixed - fixed versions by the judges (one column for each judge)

   (A rough sketch of the case and punctuation normalization described by
   the *_lower and *_lower_nopunc columns appears at the end of this file.)

We recommend using the median as the overall estimate of the error rate of
the compositions. This tends to produce an error rate consistent with the
majority of the judges, as it eliminates outlier judges, e.g. judges who
aggressively reword or add punctuation when not strictly necessary.

Have fun!

Keith Vertanen
Per Ola Kristensson

Revision history
----------------
June 20, 2013    First release of judging resource.
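Example: case and punctuation normalization
-------------------------------------------
The *_lower and *_lower_nopunc statistics above are described as ignoring
case and stripping punctuation other than apostrophes. The snippet below is a
rough sketch of that kind of normalization. It is not taken from
JudgeStats.pl; details such as whitespace handling and the treatment of
digits or non-ASCII characters are assumptions.

  #!/usr/bin/perl
  # Illustration only: lowercase a text and strip punctuation except
  # apostrophes, as described for the *_lower_nopunc columns.
  use strict;
  use warnings;

  sub normalize_lower_nopunc {
      my ($text) = @_;
      my $result = lc($text);        # ignore case
      $result =~ s/[^a-z0-9' ]//g;   # keep only letters, digits, apostrophes, spaces
      $result =~ s/ +/ /g;           # collapse runs of spaces left behind
      $result =~ s/^ +| +$//g;       # trim leading/trailing spaces
      return $result;
  }

  print normalize_lower_nopunc("Don't worry -- it's FINE!"), "\n";
  # prints: don't worry it's fine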