I posted my first task on Amazon Mechanical Turk today, paying many unknown workers around the world $4 an hour to classify tweets. Specifically, I asked whether tweets containing the word "job" are:
- related to the user's job
- a job offer
- another use of the word, e.g. "#AmericanIdol I love Mariah, Nicki, Keith and Randy r doing a fabulous job"
The process wasn't entirely easy. Some lessons:
- Master workers agreed on 70 of the 88 items. Some of the edge cases were the result of poor instructions, although thankfully very little money was wasted. A test run was a good idea.
- $0.02 per classification task gets quick results. I will try $0.01 and see what difference it makes.
- Amazon gets $0.009 / HIT (Human Intelligence Task).
- Any sentiment analysis of job-tweets is mainly affected by non-employment tweets, making it a noisy indicator of job satisfaction.
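The numbers above are easy to sanity-check. Here is a minimal Ruby sketch of the arithmetic; the method names are made up for illustration, and it assumes Amazon's fee is charged on top of the worker reward rather than deducted from it.

```ruby
# Fraction of items on which the workers chose the same label.
def agreement_rate(agreed, total)
  agreed.to_f / total
end

# Cost of one paid classification.
# ASSUMPTION: Amazon's fee is added on top of the worker reward.
def cost_per_classification(reward, fee)
  reward + fee
end

rate = agreement_rate(70, 88)          # figures from the test run
cost = cost_per_classification(0.02, 0.009)

puts format("Agreement: %.1f%%", rate * 100)
puts format("Cost per classification: $%.3f", cost)
puts format("Amazon's share of that cost: %.0f%%", 0.009 / cost * 100)
```

Under that assumption, the fee makes up a noticeable part of each classification's cost, so lowering the reward would not shrink the total bill in proportion.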
Next up: getting more data for the classifier, then sentiment analysis for those tweets that are really job-related.
Tech note: about 12% of tweets had smileys or other characters that Ruby treated as valid UTF-8 but that Amazon choked on. These hex values on a line seemed to trip it up: 62 72 69 6E 67 69 6E 67 20 68 6F and 65 72 20 68. I'm not sure what to do about these. If you have ideas, please let me know.
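One thing worth trying, sketched in Ruby below. Decoded, those two hex runs are plain ASCII ("bringing ho" and "er h"), so the character that actually causes trouble may sit elsewhere on the line. The rest of the sketch rests on a guess that is not confirmed against Amazon's documentation: that the problem is characters outside the Basic Multilingual Plane (code points above U+FFFF, which take four bytes in UTF-8 and include many smileys). The helper names `decode_hex`, `suspicious_chars` and `sanitize` are made up for illustration, and the input is assumed to be tagged as UTF-8.

```ruby
# Turn a run of space-separated hex bytes back into text.
def decode_hex(run)
  run.split.map { |h| h.to_i(16) }.pack("C*").force_encoding("UTF-8")
end

# List the characters in a line whose code point is above U+FFFF.
# ASSUMPTION: these 4-byte characters are what the upload rejects.
def suspicious_chars(line)
  line.scrub("?").each_char.select { |c| c.ord > 0xFFFF }
end

# Replace invalid byte sequences and 4-byte characters with a placeholder.
def sanitize(line, placeholder = "?")
  line.scrub(placeholder).gsub(/[\u{10000}-\u{10FFFF}]/, placeholder)
end

puts decode_hex("62 72 69 6E 67 69 6E 67 20 68 6F").inspect
puts decode_hex("65 72 20 68").inspect

tweet = "doing a fabulous job \u{1F600}"
puts suspicious_chars(tweet).length
puts sanitize(tweet)
```

Running `suspicious_chars` over the rejected lines first would show whether the guess holds before anything is stripped. Replacing rather than deleting keeps the tweet length stable, at the cost of losing the smiley, which may matter for the later sentiment analysis.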