Automated Metadata Creation: Possibilities and Pitfalls

Presented by Wilhelmina Randtke • June 10, 2012 • Nashville, Tennessee • At the annual meeting of the North American Serials Interest Group • Materials posted at www.randtke.com/presentations/NASIG.html

Teaser: Preview of the sample project. http://www.fsulawrc.com

Background: What is “metadata”? • Metadata = any indexing information • Examples: • MARC records • Color, size, etc. to allow clothes shopping on a website • Writing on the spine of a book • Food labels

What we'll cover • Automated indexing: • Human vs machine indexing • Range of tools for automated metadata creation: Techy and less techy. • Sample projects • A little background on relational databases • Database design for a looseleaf (a resource that changes state over time). • Sample project: The Florida Administrative Code 1970-1983

Automated Indexing: What’s easy for computers? • Computers like black and white decisions. • Computers are bad with discretion.

Word search vs. Subject headings

One Trillion • 1,000,000,000,000 • webpages indexed in Google • … 4 years ago …

Nevertheless…… Human indexing is alive and well

How to fund indexing?

http://www.ebay.com/sch/Dresses-/63861/i.html?_nkw=summer+dress


Who made the metadata: Human or Machine? • How GoogleBooks gets its metadata: http://go-to-hellman.blogspot.com/2010/01/google-exposes-book-metadata-privates.html

Not automated indexing, but a related concept…. • Always try to think about how to reuse existing metadata.

High Tech automated metadata creation

The high end: Assigning subject headings with computer code • Some technologies: • UIMA (Unstructured Information Management Architecture) • GATE (General Architecture for Text Engineering) • KEA (Keyphrase Extraction Algorithm)

Person’s role: • Select an appropriate ontology. • Configure the program so that it’s looking at outside sources. • Review the results and make sure the assigned subject headings are good. Program’s role: • Take ontology or thesaurus and apply it to each item to give subject headings. [Diagram: ontology or thesaurus + item → computer program for automated indexing → subject headings]
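To make the program's role concrete, here is a minimal sketch in Python of applying a thesaurus to an item to produce subject headings. It is not how UIMA, GATE, or KEA work internally; it is a toy term matcher. The thesaurus contents, the `assign_headings` name, and the `min_hits` threshold are all assumptions made up for illustration.

```python
import re

# Hypothetical toy thesaurus (invented for this sketch):
# preferred subject heading -> word stems that suggest it.
THESAURUS = {
    "Education": ["school", "teacher", "student"],
    "Taxation": ["tax", "revenue"],
    "Public health": ["hospital", "disease", "sanitation"],
}

def assign_headings(text, thesaurus=THESAURUS, min_hits=2):
    """Return headings whose stems match at least min_hits words in text."""
    words = re.findall(r"[a-z]+", text.lower())
    headings = []
    for heading, stems in thesaurus.items():
        # Crude prefix match: "teachers" counts as a hit for "teacher".
        hits = sum(1 for w in words if any(w.startswith(s) for s in stems))
        if hits >= min_hits:
            headings.append(heading)
    return headings

# Invented sample item text.
sample = "Each school district shall certify teachers and report student counts."
print(assign_headings(sample))  # prints ['Education']
```

The person's role is still visible in the sketch: someone has to choose the thesaurus, set the threshold, and review whether the assigned headings are good. Prefix matching is deliberately crude and will produce false hits, which is exactly what the review step is for.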

http://www.nzdl.org/Kea/examples1.html

The lower end: Deterministic fields

There’s an app for that • Scripts for extracting fields from a thesis posted on GitHub: https://github.com/ao5357/thesisbot

Batch OCR

Many tools exist to extract text from PDFs to Excel

Walkthrough – examining the extracted spreadsheets • http://fsulawrc.com/excelVBAfiles/index.html

How to plan the program • Look for patterns • Write step-by-step instructions about how to process the Excel file • Remember, NO DISCRETION, computers do not take well to discretion. • Good steps: • Go to the last line of the worksheet • Look for the letter a or A • Copy starting from the first number in the cell, up to and including the last number in the cell. • Bad steps: • Find the author’s name (this step needs to be broken into small “stupid” steps)
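The "good steps" above can be sketched in a few lines of Python. This is not the Visual Basic script used in the project; it is an illustration under stated assumptions: the worksheet is represented as a plain list of cell strings (one per row), a missing "a" or "A" means the row is skipped, and the function names `extract_number_span` and `process_worksheet` are hypothetical.

```python
import re

def extract_number_span(cell):
    """Copy from the first digit in the cell up to and including the last digit."""
    # \d finds the first digit; the greedy .* then stretches to the last digit.
    match = re.search(r"\d(?:.*\d)?", cell, flags=re.DOTALL)
    return match.group(0) if match else None

def process_worksheet(rows):
    """Apply the three 'good steps' to a worksheet given as a list of strings."""
    # Step 1: go to the last line of the worksheet.
    if not rows:
        return None
    last_line = rows[-1]
    # Step 2: look for the letter a or A (assumption: skip the row if absent).
    if "a" not in last_line.lower():
        return None
    # Step 3: copy from the first number through the last number in the cell.
    return extract_number_span(last_line)

# Invented sample rows, not taken from the project's spreadsheets.
rows = ["Some heading text", "Rule 6A-1.04 adopted"]
print(process_worksheet(rows))  # prints 6A-1.04
```

Each step is a black-and-white decision with no discretion involved. A step like "find the author's name" has no such direct translation until it is broken down into checks of this kind.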

Writing the program • Identify appropriate advisors. • Remember, most IT staff on a campus just install computers in offices, etc. Programming and database planning are rare skills. The worst IT personnel will not realize that they do not have these skills. • If an IT staff member tells you they do not know how to do something, then go back to that person for advice on all future projects. • Try to find entry level material on coding. • (Sadly, most computer programming instructions already assume you know some programming.) • If outsourcing or collaborating, remember, the index is the ultimate goal. Understanding of the index needs to be in the picture. You probably have to bring it in.

Finding Advisors: Most campus IT is about carrying heavy objects

Perfection? • How close to perfection can you get? • Let’s run some code: • A spreadsheet with extracted text: http://fsulawrc.com/excelVBAfiles/23batch6A.xls • Visual Basic script: http://fsulawrc.com/excelVBAfiles/VBAscriptForFAC.docx • The files: You can retrieve some of these same files by searching 6A-1 in the main search for the database at www.fsulawrc.com

How much metadata was missing?
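One way to answer that question is to count, field by field, how many records came out of the script with nothing filled in. The sketch below is a hedged illustration in Python: the field names, the sample records, and the `missing_rates` function are invented for the example and are not the project's actual data or results.

```python
def missing_rates(records, fields):
    """Return, for each field, the fraction of records where it is empty."""
    total = len(records)
    rates = {}
    for field in fields:
        # A field counts as missing if it is absent, None, or only whitespace.
        missing = sum(1 for r in records if not (r.get(field) or "").strip())
        rates[field] = missing / total if total else 0.0
    return rates

# Invented sample records; field names are hypothetical.
records = [
    {"rule_number": "6A-1.01", "supplement": "23"},
    {"rule_number": "", "supplement": "23"},
    {"rule_number": "6A-1.03", "supplement": None},
    {"rule_number": "6A-1.04"},
]

for field, rate in missing_rates(records, ["rule_number", "supplement"]).items():
    print(f"{field}: {rate:.0%} missing")
```

A per-field breakdown is more useful than a single overall number, because it shows which extraction steps fail most often and therefore where human review should go first.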

Cheap and fast and incomplete • This is a search engine built on an index of the automated metadata only: • http://fsulawrc.com/automatedindex.php • It’s better than a shuffled pile of 30,000 pages. • It’s not very good. • If you are thousands of miles away, then this is better than print. If you are in the same room as organized print, print might be better.
