SOCcer (Standardized Occupation Coding for Computer-assisted Epidemiological Research) is a publicly available application developed to help epidemiological researchers incorporate occupational risk into their studies. The application is not intended to replace expert coders; rather, it prioritizes the job descriptions that would benefit most from expert review. Low-scoring job descriptions are more likely to require expert review than high-scoring job descriptions. Coding is performed by an ensemble classifier, which combines the results of multiple classifiers to produce a single classifier that performs better than any individual classifier in the ensemble.
If you publish results that use SOCcer, please reference:
Russ DE, Ho K-Y, Colt JS, Armenti KR, Baris D, Wong-Ho C, Davis F, Johnson A, Purdue MP, Karagas MR, Schwarz K, Schwenn M, Silverman DT, Stewart PA, Johnson CA, Friesen MC.
Computer-based coding of free-text job descriptions to efficiently identify occupations in epidemiological studies. Occup Environ Med 2016;73(6):417-24.
SOCcer can use different models, depending on the type of data that you have. The table below describes the available models.
model | description |
---|---|
model 1.0 | This model codes job descriptions to the SOC 2010 classification system as described in Computer-Based Coding of Free-Text Job Descriptions to Efficiently and Reliably Incorporate Occupational Risk Factors into Large-Scale Epidemiological Studies. This model uses the variables JobTitle, SIC, and JobTask. This model can be used even if parts of the Job Description are not available (e.g. SIC or JobTask is missing). The classifiers will assign a '0' for the missing information in the calculation of the overall SOCcer score. |
SOCAssign is an application that supports expert review of the top 10 SOCcer assignments for each job description, so that an expert can provide a final SOC-2010 assignment. SOCAssign reads SOCcer output. Before importing, the SOCcer results can be preprocessed to focus the expert review on a subset of job descriptions, such as those with SOC codes tied for the highest score or those with low SOCcer scores. For each job description, SOCAssign allows up to 3 SOC-2010 codes to be selected. The codes can be selected from the SOCcer output list, selected from a list of all SOC-2010 codes, or entered manually. A validation check ensures that only valid SOC-2010 codes can be entered.
Currently, the input for SOCcer is a comma-separated file with three columns: job title, SIC, and job tasks. SOCcer strictly enforces the format of the input file. The input file must begin with the following header line (the case must match as well):
JobTitle,SIC,JobTask
After the header line there is a separate row for each job description. There MUST be three comma-separated values on each line. If the job title or task contains a comma, the value must be in quotes or else there will be an error. An example of a valid job description is:
"Teacher, high school", 8211, "formulate lesson plans, teach 11th grade match"
Leaving out the quotation marks would cause the line to appear to have five values and will return an error. SOCcer will list all line numbers with errors and require you to fix the input before proceeding. A value may be blank (missing information), but it must still be included. Valid examples with missing information are:
"Teacher, high school", , "formulate lesson plans, teach 11th grade match"
,8211, "formulate lesson plans, teach 11th grade match"
"Teacher, high school",,
Warning: Microsoft Word and other word processors often use smart quotes (“ ”) instead of standard quotation marks (", ASCII character 34). This will cause problems with the word count and must be fixed before SOCcer can proceed.
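The input rules above can be checked before upload. The following is a minimal Python sketch, not part of SOCcer itself: the function names (straighten_quotes, write_soccer_input, find_bad_records) are hypothetical, and SOCcer's own validation may differ in detail. It replaces smart double quotes with ASCII quotation marks, writes rows with exactly three values (quoting any value that contains a comma), and reports records that do not have three values.

```python
import csv
import io

# Header line required by SOCcer (case-sensitive).
HEADER = ["JobTitle", "SIC", "JobTask"]

# Map smart double quotes to the standard quotation mark (ASCII 34).
SMART_DOUBLE_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"'})


def straighten_quotes(text):
    """Replace smart double quotes with standard quotation marks."""
    return text.translate(SMART_DOUBLE_QUOTES)


def write_soccer_input(rows, handle):
    """Write (job_title, sic, job_task) rows as a three-column CSV.

    None stands for missing information and is written as an empty
    value, so every line still carries three values.
    """
    writer = csv.writer(handle, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(HEADER)
    for row in rows:
        if len(row) != 3:
            raise ValueError(f"expected 3 values, got {len(row)}: {row!r}")
        writer.writerow(["" if value is None else value for value in row])


def find_bad_records(handle):
    """Return 1-based record numbers that break the format.

    Record 1 must equal the header; every later record must have
    exactly three values. skipinitialspace tolerates a space after
    each comma, as in the examples above.
    """
    reader = csv.reader(handle, skipinitialspace=True)
    bad = []
    for number, fields in enumerate(reader, start=1):
        if number == 1:
            if fields != HEADER:
                bad.append(number)
        elif len(fields) != 3:
            bad.append(number)
    return bad
```

Writing the file with csv.writer, rather than joining strings by hand, means values containing commas are quoted automatically. Running find_bad_records on text that has been passed through straighten_quotes gives a quick local check before SOCcer reports its own line errors.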
The SOCcer results are provided in a comma-separated file that contains the row number, job title, SIC, job tasks, and the ten highest-ranked SOC codes with their corresponding SOCcer scores.
Id | JobTitle | SIC | JobTask | SOC2010_1 | Score_1 | SOC2010_2 | Score_2 | ... | SOC2010_10 | Score_10 |
---|---|---|---|---|---|---|---|---|---|---|
1 | "Teacher, high school" | "8211" | "formulate lesson plans, teach 11th grade match" | 25-2031 | 0.979 | 25-2032 | 0.357 | ... | 25-1194 | 0.042 |
2 | "Java Developer" | "7371" | "Develop use cases, write computer software in java" | 15-1131 | 0.959 | 15-1132 | 0.717 | ... | 11-2031 | 0.034 |
NOTE: SOC3 through SOC9 were omitted and scores were rounded for display purposes. |
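The preprocessing step mentioned for SOCAssign (focusing expert review on tied or low-scoring job descriptions) could be sketched in Python as follows. This is an illustration under stated assumptions, not SOCcer's or SOCAssign's own logic: the column names are taken from the output table above, the function names are hypothetical, and the 0.5 cutoff for a "low" score is an arbitrary placeholder that you would choose for your own study.

```python
import csv
import io


def needs_review(row, low_score=0.5):
    """Flag a result row whose top score is low or tied with the second.

    low_score is an assumed, user-chosen cutoff, not a SOCcer default.
    """
    top = float(row["Score_1"])
    second = float(row["Score_2"])
    return top < low_score or top == second


def select_for_review(handle, low_score=0.5):
    """Return the Id values of result rows that warrant expert review."""
    reader = csv.DictReader(handle, skipinitialspace=True)
    return [row["Id"] for row in reader if needs_review(row, low_score)]
```

With a results file opened as a text handle, select_for_review(handle) returns the Ids to pass on for expert coding. A stricter or looser cutoff changes how many job descriptions are sent for review.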
If you have any questions, comments, or concerns, send us an e-mail at NCISOCcerWebAdmin@mail.nih.gov