Content Categorization: A Road Map
Julia Marshall, USAID (Bridgeborn Inc.)
http://dec.usaid.gov
What Is Your Goal?
Write Down Your Goal
  Samples:
    Discover what topics are most mentioned in a set of documents and whether they change over time
    Create metadata records more quickly than we can with human catalogers/indexers
Flavors of Text Analytics
  Text Mining
    Discovering data
    Finding patterns within the data
  Auto-Categorization
    Structuring data to a schema
    Assigning pre-determined tags
Assess Your Resources
  People
  Materials to be categorized
  Processes
  Metadata schemas
  Systems
  Budget
People
  How many people will you have?
  What is their expertise?
    IT People
    Indexers/Subject Matter Experts
    Web Developers/Designers
    Project Manager
  How much time will they have to devote to this project?
Materials to be categorized
  How much material?
  What format is it in?
    Paper?
    Digital files?
    OCR’d files?
  What shape is it in?
Processes
  Are processes already in place for categorization?
  If so, how is the process done?
  Who does the process?
  How standardized is the process?
Metadata Schemas
  Does your organization have:
    Thesaurus of topics?
    Personal name authority files?
    Organizational name authority files?
    Gazetteers or geographic names?
    Standard list of types of documents?
    Standard way dates are handled?
Systems
  Will there be a system that consumes output from the SAS Content Categorization Studio?
  How will the system consume the SAS output?
  Will there need to be:
    Code to pull the text of the documents through SAS?
    Code to push the SAS output into your consuming system?
Budget
  How much money can you spend?
Assess the Costs
  Tools
  Application
  Server space/equipment
  Staff time
  Preparatory costs
Select a plan/tool that best fits your organization’s needs
  Revisit your original goal
  What do you have the resources to do?
  Revise your goal to fit your circumstances
  Find the best tool for the job
Strategize the Implementation
  What metadata/processes to automate?
  What are priorities for the above processes?
  What are the easiest to automate?
  How much time will it take?
  Who’s doing what?
Manage the Management
  Manage expectations
  Pick a “Quick Win” piece of the project
  Keep them informed at a level that they can understand
The SAS Content Categorization Studio Is Plugged In. Now What Do I Do?
Create Profiles
  For each piece of metadata to auto-categorize, write a profile that tells the application which terms to assign to each document.
  Each term will need its own set of rules that tell the application when to apply that particular term, and when not to.
Tips for Writing Profile Rules
  Simpler is better, at first.
  Analyze a sample of documents to be auto-categorized: what words show up with the term?
  Differentiate between “Concept” and “Context”.
  Document your rules and your updates as you write them.
Sample Profile Build
  Logging
    (OR, (MIN_2, “Logging”, “Selective logging”, “Illegal logging”, “Logging concession”, “Timber extraction”, “Sawmill”, (SENT, “logging”, “impact”), (SENT, “timber”, “harvest”)))
  Trees
    (OR, (MAXOC_50, (NOTIN, “Trees”, “Teak trees”), (NOTIN, “trees”, “fruit trees”), (NOTINSENT, “Trees”, “Timber”), (NOTINSENT, “trees”, “logging”)))
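To make the intent of the Logging rule more concrete, here is a rough Python sketch of the same idea. It is not SAS syntax and does not reproduce how the Studio matches or scores text. It assumes that MIN_2 means "at least two of the listed arguments match" and that SENT means "both words occur in the same sentence"; the function names and the naive sentence splitter are hypothetical.

```python
import re

# Phrases and sentence-level pairs copied from the Logging rule above.
PHRASES = ["logging", "selective logging", "illegal logging",
           "logging concession", "timber extraction", "sawmill"]
SENT_PAIRS = [("logging", "impact"), ("timber", "harvest")]


def split_sentences(text):
    """Naive splitter (assumption): break on '.', '!' and '?'."""
    return [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]


def contains(haystack, phrase):
    """Whole-word, case-sensitive search; callers pass lowercased text."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", haystack) is not None


def logging_rule_matches(text, minimum=2):
    """Toy reading of MIN_2: at least `minimum` arguments must match.

    Overlapping phrases each count, so "illegal logging" also
    satisfies the plain "logging" argument in this sketch.
    """
    lowered = text.lower()
    sentences = split_sentences(text)
    hits = sum(contains(lowered, p) for p in PHRASES)
    hits += sum(
        any(contains(s, a) and contains(s, b) for s in sentences)
        for a, b in SENT_PAIRS
    )
    return hits >= minimum


print(logging_rule_matches("Illegal logging threatens the park. A new sawmill opened."))
print(logging_rule_matches("The company keeps a logging system for server errors."))
```

The second sentence mentions "logging" only once and in an unrelated sense, which is the kind of false hit a minimum-match threshold is meant to filter out.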
Collect Sample Sets of Documents
  Need at least 3 sets (probably more).
    1st set for writing the profile
    2nd set for testing
    3rd set for the final test
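One possible way to divide a pool of documents into separate sample sets is a seeded random split, sketched below. The function name, the document IDs, and the choice of a random split are assumptions for illustration; a hand-picked or stratified selection may suit a given collection better.

```python
import random


def split_into_sets(doc_ids, n_sets=3, seed=42):
    """Shuffle document IDs reproducibly and deal them into n_sets groups."""
    shuffled = list(doc_ids)
    random.Random(seed).shuffle(shuffled)
    # Slice with a stride so group sizes differ by at most one.
    return [shuffled[i::n_sets] for i in range(n_sets)]


# Hypothetical document IDs.
docs = [f"doc-{i:03d}" for i in range(10)]
writing_set, testing_set, final_set = split_into_sets(docs)
print(len(writing_set), len(testing_set), len(final_set))
```

Fixing the seed keeps the three sets stable between runs, so the final test set is never seen while the rules are being written.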
Run the 1st Sample Set Against Your Profile
  Each document will have terms that SAS assigned to it
  Each term will have a relevancy score
  Rank the terms from highest to lowest relevancy score
  Look at the top 5-10 terms
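The ranking step can be sketched in a few lines of Python once the assigned terms and scores have been exported. The dictionary layout and the scores below are made up for illustration; the actual export format depends on how the output is pulled from SAS.

```python
def top_terms(term_scores, n=10):
    """Return the n highest-scoring (term, score) pairs, best first."""
    ranked = sorted(term_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical output for one document: term -> relevancy score.
scores = {"Logging": 0.82, "Trees": 0.40, "Forestry": 0.65, "Agriculture": 0.12}
for term, score in top_terms(scores, n=3):
    print(f"{term}\t{score:.2f}")
```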
Evaluate the Output
  Do the top 5-10 terms make sense?
  Are the terms too general?
  What phrases in the set of documents caused SAS to pick those terms?
  How do you need to rewrite the rules?
Repeat, Repeat, Repeat, Repeat, Repeat, Repeat, Repeat As Needed
I’ve Created the Profile. The Output Is the Way I Want. Now What?
Integrate the Output
  Design the workflow
  Interface design
  Connect to local systems
  Train staff
  More tests
Design the Workflow
  Where is the data in each step?
  Who is handling the data?
  What has to happen to move the data to the next step?
  [Diagram labels: Documents, SAS Profile, Java Code, XML Code, Metadata in DEC]
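As an illustration of the hand-off between steps, the sketch below serializes the terms assigned to one document as a small XML record that a consuming system could read. The slide mentions Java and XML; Python is used here only for brevity, and the element and attribute names (record, term, relevancy) are hypothetical, not the DEC schema.

```python
import xml.etree.ElementTree as ET


def build_record(doc_id, terms):
    """Serialize (term, score) pairs for one document as an XML string."""
    record = ET.Element("record", attrib={"id": doc_id})
    for name, score in terms:
        term = ET.SubElement(record, "term", attrib={"relevancy": f"{score:.2f}"})
        term.text = name
    return ET.tostring(record, encoding="unicode")


# Hypothetical document ID and scores.
xml_text = build_record("doc-001", [("Logging", 0.82), ("Forestry", 0.65)])
print(xml_text)
```

Writing down a record shape like this early helps answer the workflow questions above: each step knows what it receives and what it must pass on.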
Interface Design
  Sample:
    USAID Geographic Term(s):
    USAID Geographic Term(s) SAS values:
    SAS GeoDescriptor Run Date:
Train Staff
  IT staff
  Profile managers
  Output evaluators
Test the Integrated System
  Gather test samples, again!
  Run the profile in your test environment
  Does the output stay the same?
  Can you update the profiles?
  Are other users of the system able to use/update the output?
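The "does the output stay the same?" check can be automated by comparing the terms assigned in the test environment against a saved baseline. This is a minimal sketch; the function name and the mapping of document ID to assigned terms are assumptions about how the output has been exported.

```python
def changed_documents(baseline, current):
    """Report, per document, which terms were added or removed between runs."""
    changes = {}
    for doc_id in sorted(set(baseline) | set(current)):
        before = set(baseline.get(doc_id, ()))
        after = set(current.get(doc_id, ()))
        if before != after:
            changes[doc_id] = {
                "added": sorted(after - before),
                "removed": sorted(before - after),
            }
    return changes


# Hypothetical outputs from two runs of the same profile.
baseline = {"doc-001": ["Logging", "Forestry"], "doc-002": ["Trees"]}
current = {"doc-001": ["Logging", "Forestry"], "doc-002": ["Trees", "Logging"]}
print(changed_documents(baseline, current))
```

An empty result means the integrated system reproduces the earlier output for every document in the sample.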
If you answered yes: CELEBRATE!
Maintain the System
  Documentation
  Tests
  Staff training
  Follow-up evaluations
Lessons Learned the Hard Way
  Be careful using outside data
  Buy only what you need
Thank you! Email: jmmarshalljb@gmail.com