SOCcer (Standardized Occupation Coding for Computer-assisted Epidemiological Research) is a publicly available application developed to help epidemiological researchers incorporate occupational risk into their studies. The application is not intended to replace expert coders; rather, it prioritizes the job descriptions that would benefit most from expert review. Low-scoring job descriptions are more likely to require expert review than high-scoring job descriptions. The coding is performed using an ensemble classifier, which combines the results of multiple classifiers to produce a single classifier that performs better than any individual classifier in the ensemble.

If you publish results that use SOCcer, please reference:
Russ DE, Ho K-Y, Colt JS, Armenti KR, Baris D, Wong-Ho C, Davis F, Johnson A, Purdue MP, Karagas MR, Schwarz K, Schwenn M, Silverman DT, Stewart PA, Johnson CA, Friesen MC. Computer-based coding of free-text job descriptions to efficiently identify occupations in epidemiological studies. Occup Environ Med 2016;73(6):417-24.

Run SOCcer

Note: Opening the output CSV file directly in Excel will incorrectly render numbers as dates.

SOCcer can use different models, depending on the type of data that you have. The table below describes the available models.

Model | Description
model 1.0 | This model codes job descriptions to the SOC 2010 classification system as described in Computer-Based Coding of Free-Text Job Descriptions to Efficiently and Reliably Incorporate Occupational Risk Factors into Large-Scale Epidemiological Studies. This model uses the variables JobTitle, SIC, and JobTask. This model can be used even if parts of the job description are not available (e.g. SIC or JobTask is missing). The classifiers will assign a '0' for the missing information in the calculation of the overall SOCcer score.

SOCAssign is an application that assists expert review of the top 10 SOCcer assignments for each job description in order to provide an expert SOC-2010 assignment. SOCAssign reads SOCcer output. Before importing, the SOCcer results can be preprocessed to focus the expert review on a subset of job descriptions, such as those with SOC codes that are tied for the highest score or those with low SOCcer scores. For each job description, SOCAssign allows the selection of up to 3 SOC-2010 codes. The codes can be selected from the SOCcer output list, selected from a list of all SOC-2010 codes, or entered manually. A validation check ensures that only valid SOC-2010 codes can be entered.


SOCAssign   run as Web Start
Download   run as Java Application

To run it as a Java application, double-click the downloaded SOCAssign.jar file (make sure the Java executable is on your path).

Input Data Format

Currently, the input for SOCcer is a comma-separated file with three columns: job title, SIC, and job tasks. SOCcer strictly enforces the format of the input file. The input file must begin with the following header line (the names are case-sensitive):

JobTitle,SIC,JobTask

After the header line, there is a separate row for each job description. There MUST be three comma-separated values on each line. If the job title or job task contains a comma, the value must be enclosed in quotation marks or an error will occur. An example of a valid job description is:

"Teacher, high school", 8211, "formulate lesson plans, teach 11th grade match"
Leaving out the quotation marks would make the line appear to have five values, and SOCcer would return an error. SOCcer lists all line numbers with errors and requires you to fix the input before proceeding. A value may be blank (missing information), but its position on the line must still be included. Valid examples with missing information are:
"Teacher, high school", , "formulate lesson plans, teach 11th grade match"
,8211, "formulate lesson plans, teach 11th grade match"
"Teacher, high school",,
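
The format rules above can be checked locally before uploading a file. The following Python code is a minimal sketch of such a pre-check, not SOCcer's own validation code. The function name find_format_errors is hypothetical, and the sketch assumes that spaces after a comma are ignored, as the examples above suggest.

```python
import csv
import io

# Header required by SOCcer; the comparison is case-sensitive.
EXPECTED_HEADER = ["JobTitle", "SIC", "JobTask"]


def find_format_errors(text):
    """Return the 1-based row numbers that break the documented format.

    Hypothetical helper: a local pre-check only, not SOCcer's own validator.
    """
    errors = []
    # skipinitialspace=True lets a quoted value follow a comma and a space.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for row_number, row in enumerate(reader, start=1):
        if row_number == 1:
            if row != EXPECTED_HEADER:
                errors.append(row_number)
        elif len(row) != 3:
            # Blank values are allowed, but all three positions must exist.
            errors.append(row_number)
    return errors


sample = (
    "JobTitle,SIC,JobTask\n"
    '"Teacher, high school", 8211, "formulate lesson plans, teach 11th grade match"\n'
    "Teacher, high school, 8211, formulate lesson plans, teach 11th grade match\n"
    '"Teacher, high school",,\n'
)
print(find_format_errors(sample))  # [3]
```

Row 3 of the sample omits the quotation marks, so it is read as five values and is reported. Row numbers equal line numbers as long as no quoted value spans more than one line.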

Warning: Microsoft Word and other word processors often use smart quotes (“ ”) instead of standard quotation marks (", ASCII character 34). This will cause problems with the word count and must be fixed before SOCcer can proceed.
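
One way to fix smart quotes before uploading is to replace them with the standard quotation mark. This is a small Python sketch under that assumption. The helper name straighten_quotes is hypothetical, and only the two curly double-quote characters are handled.

```python
# Map the curly double quotes (U+201C and U+201D) to the ASCII quotation mark (character 34).
SMART_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"'})


def straighten_quotes(text):
    """Return text with curly double quotes replaced by standard ones."""
    return text.translate(SMART_QUOTES)


line = "\u201cTeacher, high school\u201d, 8211, \u201cformulate lesson plans, teach 11th grade match\u201d"
print(straighten_quotes(line))
```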

The SOCcer results are provided in a comma-separated file that contains the row number, job title, SIC, job tasks, and the top ten highest ranked SOC codes, with corresponding SOCcer scores.

Id | JobTitle | SIC | JobTask | SOC2010_1 | Score_1 | SOC2010_2 | Score_2 | ... | SOC2010_10 | Score_10
1 | "Teacher, high school" | "8211" | "formulate lesson plans, teach 11th grade match" | 25-2031 | 0.979 | 25-2032 | 0.357 | ... | 25-1194 | 0.042
2 | "Java Developer" | "7371" | "Develop use cases, write computer software in java" | 15-1131 | 0.959 | 15-1132 | 0.717 | ... | 11-2031 | 0.034
NOTE: SOC3 through SOC9 were omitted and scores were rounded for display purposes.
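
The results file can be preprocessed to pick out job descriptions for expert review, for example those whose top score is low or tied with the second score. The Python code below is a sketch of that idea only. The function names and the min_score cutoff of 0.97 are hypothetical, the column names are taken from the header shown above, and the exact quoting used in a real output file is an assumption.

```python
import csv
import io


def needs_review(row, min_score):
    """True if the top score is below min_score or tied with the second score."""
    top = float(row["Score_1"])
    second = float(row["Score_2"])
    return top < min_score or top == second


def ids_for_review(csv_text, min_score):
    """Return the Id values of the rows that should go to an expert coder."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["Id"] for row in reader if needs_review(row, min_score)]


# The two example rows above, truncated to the first two code/score pairs.
example = (
    "Id,JobTitle,SIC,JobTask,SOC2010_1,Score_1,SOC2010_2,Score_2\n"
    '1,"Teacher, high school","8211","formulate lesson plans, teach 11th grade match",25-2031,0.979,25-2032,0.357\n'
    '2,"Java Developer","7371","Develop use cases, write computer software in java",15-1131,0.959,15-1132,0.717\n'
)
print(ids_for_review(example, min_score=0.97))  # ['2']
```

The cutoff is arbitrary here. A suitable value depends on the data set and on how much expert review time is available.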

If you have any questions, comments, or concerns, send us an e-mail at NCISOCcerWebAdmin@mail.nih.gov

References

  1. Russ DE, Ho K-Y, Colt JS, Armenti KR, Baris D, Wong-Ho C, Davis F, Johnson A, Purdue MP, Karagas MR, Schwarz K, Schwenn M, Silverman DT, Stewart PA, Johnson CA, Friesen MC, "Computer-based coding of free-text job descriptions to efficiently and reliably incorporate occupational risk factors into large-scale epidemiologic studies", Occup Environ Med 2016;73(6):417-24.
  2. Russ DE, Ho K-Y, Johnson CA, Friesen MC, "Computer-Based Coding of Occupation Codes for Epidemiological Analyses", Proc IEEE Int Symp Comput Based Med Syst 2014, pp. 347-350.
What can I do to improve the results of SOCcer?
To improve SOCcer’s performance, we are building a database of job descriptions linked to SOC-2010 (and other classification systems) that can be used to build and refine classifiers. If you have job descriptions that have been coded by an expert coder (or initially coded by SOCcer, then reviewed by an expert) and you are willing to provide them to us, we will be happy to include those job descriptions in our knowledge base for use in future versions of SOCcer. If the data are protected, a data use agreement may be possible. Your institute may provide guidance on data use agreements.
What about HIPAA concerns?
The data input file does not accept identifiers, which helps prevent you from uploading PII; however, we do not screen your data for PII. Please check your input file for PII before you upload your data to our server.
What are SOCcer scores?
Our classifier uses logistic regression to calculate the log-odds that an expert reviewer would have selected a SOC 2010 code. The SOCcer score is the log-odds transformed to a probability. In general, the higher the SOCcer score, the greater the probability of matching an expert review. Please see our paper for more details on the relationship between the SOCcer score and the probability of matching an expert coder's SOC assignment. Some data sets are more difficult to classify than others, depending on the quality of the data. The SOCcer score distribution provides an overview of how well SOCcer performed on your data set.
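
For readers unfamiliar with the transformation, the Python code below sketches the standard logistic function that converts log-odds to a probability. It assumes SOCcer applies the standard form with no further adjustment, which the text above does not confirm. The function name is hypothetical.

```python
import math


def log_odds_to_probability(log_odds):
    """Standard logistic (inverse-logit) transform: p = 1 / (1 + exp(-x))."""
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    # For negative inputs, use the equivalent form exp(x) / (1 + exp(x)),
    # which avoids overflow when the log-odds is very negative.
    e = math.exp(log_odds)
    return e / (1.0 + e)


print(log_odds_to_probability(0.0))  # 0.5
```

A log-odds of zero corresponds to a probability of 0.5, and larger log-odds move the probability toward 1.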