880 likes | 900 Views
This talk introduces the EPrints data model, significant objects, their relationships, and common methods. Learn how to find and extract documentation using perldoc, and explore scripting your archive.
E N D
About This Talk • Light on syntax • object->function(arg1, arg2) • Incomplete • Designed to • give you a feel for the EPrints data model • introduce you to the most significant objects • how they relate to one another • their most common methods • act as a jumping off point for exploring
Finding Documentation • EPrints modules have embedded documentation • Extract it using perldoc • perldoc perl_lib/EPrints/EPrint.pm
EPrints 3.0 • This talk based on EPrints 2.3 series • 3.0 API still being finalised • tidies up object hierarchy • resolves some of 2.3’s naming clashes • lots of extra functionality • but core data model remains the same • EPrints 3.0 is fully back-compatible • 2.3 scripts will work with EPrints 3.0
Roadmap • Data • EPrints, Users, Documents, Subjects, Subscriptions • Data collections • DataSets, MetaFields • Searching your data • SearchExpressions • Scripting your archive • Archives, Session
1. Data EPrints, Users, Documents, Subjects, Subscriptions
Data Model Sketch EPrint
Data Model Sketch Document PDF EPrint all documents Document HTML HTML HTML
Data Model Sketch Document PDF EPrint all documents Document owner HTML User HTML HTML
Data Model Sketch Document PDF EPrint all documents EPrint Document owned eprints owner HTML User HTML HTML
Data Model Sketch Document PDF EPrint all documents EPrint Document owned eprints owner HTML User HTML HTML subscriptions Subscription Subscription
Data Model Sketch Document PDF EPrint Subject all documents EPrint Document owned eprints owner HTML User HTML HTML subscriptions Subscription Subscription
Data Model Sketch Subject Document child PDF EPrint Subject all documents EPrint Document owned eprints parent owner Subject HTML User HTML HTML subscriptions Subscription Subscription
Data Model Sketch Subject EPrint Document child posted eprints PDF EPrint Subject all documents EPrint Document owned eprints parent owner Subject HTML User HTML HTML subscriptions Subscription Subscription
EPrint • An EPrint object represents a single deposit in your EPrints archive • has some metadata fields • has one or more documents • is owned by a user
Creating EPrints • new(session, id) • create an EPrint object for an existing deposit • create(session, dataset, data) • create a new EPrint object • More on sessions and datasets later!
Introducing DataObj • EPrint is a subclass of DataObj • DataObj provides common methods for • accessing metadata • rendering XHTML output
Inherited from DataObj • get_id • get_url(staff) • get the URL of an EPrint • e.g. URL to the abstract page of an eprint in the archive • if staff is true then returns the URL to the staff view, which shows more detail • get_type() • get the EPrint type • e.g. article, book, thesis, conference paper...
Inherited from DataObj • get_value(fieldname) • get the value of the named field • set_value(fieldname, value) • set the value of the named field • Remember to call commit() to make changes in database! • is_set(fieldname) • true if the named field has a value
EPrint Methods • remove() • erase the eprint and any associated records/files from the database and filesystem • this should only be called on EPrints in the "inbox" or "buffer" datasets • commit() • commit any changes made to the database • datestamp() • set the last modified date to today
Moving EPrints Around • move_to_deletion() • transfer the eprint to the deletion dataset • should only be called on eprints in the archive dataset • See also: • move_to_inbox() • move_to_buffer() • move_to_archive()
Rendering EPrints • generate_static() • generate the static abstract page for the eprint • in a multi-language archive this will generate a page in each language
Rendering - Inherited from DataObj • render_citation(style) • create an XHTML citation for the EPrint • if style is set then use the named citation style • defined in citations-en.xml • render_citation_link(style) • as above, but citation is linked to the EPrint’s abstract page
Rendering - Inherited from DataObj • render_value(fieldname, showall) • get an XHTML fragment containing the rendered version of the value of the named field • in the current language • if showall is true then all languages are rendered • usually used for staff viewing (checking) data
Rendering Tips • Most rendering methods return XHTML • but not a string! • XML Node objects • DocumentFragment, Element, TextNode... • In your scripts, build a document tree from these nodes • e.g. node1->appendChild(node2) • then flatten it to a string • Why? It’s easier to manipulate a tree than to manipulate a large string
More Rendering Tips • XML Node objects are not part of EPrints • XML::DOM or XML::GDOME libraries • explore these libraries using perldoc • XHTML is good for building Web pages • but not so good for command line output! • use tree_to_utf8() • extracts a string from the result of any rendering method • tree_to_utf8( eprint->render_citation)
Navigating to Related Objects • get_user() • get a User object representing the user to whom the EPrint belongs • get_all_documents() • get a list of all the Document objects associated with the EPrint • We will look at these objects next...
User • A User object represents a single registered user • Also a subclass of DataObj • inherits metadata access methods • get_url get_type get_value set_value is_set • inherits rendering methods • render_citation render_citation_link render_value • Also has commit and remove • inherited from DataObj in 3.0
Creating Users • new(session, id) • create a User object from an existing user record • user_with_email(session, email) • user_with_username(session, username) • create_user(session, access_level) • create a new User
User Accessors • get_editable_eprints() • get a list of EPrints that the user can edit • get_owned_eprints(dataset) • get a list of EPrints owned by the user in the dataset • is_owner(eprint) • true if the user is the owner of the EPrint • get_subscriptions() • get a list of Subscriptions associated with the user
Document • A single document associated with an eprint • may actually contain one or more physical files • PDF = 1 file • HTML + images = many files • Another subclass of DataObj
Creating a Document Object • new(session, docid) • create a Document object from an existing record • create(session, eprint) • create a new Document object for the given EPrint
Document Accessors • get_eprint() • get the EPrint object the document is associated with • local_path() • get the full path of the directory where the document is stored in the filesystem • files() • get a list of (filename, file size) pairs
Main File and Format • get_main() • set_main(main_file) • get/set the ‘main’ file for the document • e.g. if the document is multipage HTML with images, the main file needs to be set to the top index.html file • when rendering document links, EPrints always links to the main file in the document • set_format(format) • sets the document format
Adding Files to Documents • upload(filehandle, filename) • uploads the contents of the given file handle • adds the file to the document (using the given filename) • add_file(file, filename) • adds a file to the document (using the given filename) • file is the full path to the file
Adding Files to Documents • upload_url(url) • grab file(s) from given URL • in the case of HTML, only relative links will be followed • add_archive(file, format) • add files from a .zip or .tar.gz archive • remove_file(filename) • remove the named file from the Document
Subject • A single subject from the subject hierarchy • Another subclass of DataObj
Creating Subjects • new(session, subjectid) • create a Subject object from an existing subject • create(session, id, name, parent, depositable) • create a new Subject • depositable specifies whether or not users can deposit eprints in the subject
Subject Accessors • children() • get a list of Subjects which are the children of the subject • get_parents() • get a list of Subjects which are the parents of the subject • subject_label(session, subject_tag) • get the full label of a subject, including parents
Subject Accessors • count_eprints(dataset) • get the number of eprints associated with the subject • posted_eprints(dataset) • get a list of EPrints associated with the subject
Rendering Subjects • render_with_path(session, topsubjid) • get a DocumentFragment containing the subject path • example of a subject path: H Social Sciences > HD Industries. Land use. Labor > HD28 Management. Industrial Management
Subscription • A stored search which is performed every day/week/month on behalf of a user • get_user() • get the User who owns the subscription • Another subclass of DataObj
Creating Subscriptions • new(session, id) • create a Subscription object from an existing subscription • create(session, userid) • create a new Subscription object for the given user
Processing Subscriptions • send_out_subscription() • search for new items matching the subscription settings • email them to the user owning the subscription
So Far.. • We’ve looked at individual data objects • but an EPrints archive holds many eprints and documents, has many registered users etc. • how do we access them collectively? • We’ve seen the get_value and set_value methods for metadata • but an archive’s metadata is configurable • so how do we know what metadata fields an EPrint, User etc. has? • how do we access properties of the fields?
2. Data Collections DataSets and MetaFields
Dataset • A collection of data items • Tells us all the possible types in the collection • e.g. EPrints may be article, thesis • Tells us the fields in each type • e.g. article has title, authors, publication... • e.g. conference_item has title, authors, event_title, event_date.. • Can also tell us all the fields that apply to a dataset • title, authors, publication, event_title..
Dataset Configuration • ArchiveMetadataFieldsConfig.pm • fields in each dataset • additional system fields defined in EPrint.pm, User.pm etc. • metadata-types.xml • types in each dataset • fields that apply to each type
Datasets in EPrints • archive • EPrints that are live in the main archive • buffer • EPrints that have been submitted for editorial approval • deletion • EPrints that have been deleted from the archive • inbox • EPrints which users are still working on • eprint • All EPrints from archive, buffer, deletion and inbox