Explore the importance of ad-hoc corpora in linguistic research for technical terminology and lexicography needs. Learn how to use the AntConc program for specialized language tasks. Discover resources and tools for corpus linguistics. Hands-on practice with insulin analogs search.
Workshop: AntConc as a corpus-linguistic program for ad-hoc LSP purposes
Presentation for the NDSU Conference by Birthe Mousten, MA, Ph.D., 19 July 2016
Abstract
Corpus linguistics is usually associated with tracking language in mega-corpora such as those behind Oxford’s, Longman’s and Webster’s dictionaries. Even though such dictionaries are based on huge corpora, they often fail to answer even fairly simple questions about technical and scientific terminology and lexicography. This knowledge representation gap does not mean that corpus linguistics is useless for LSP purposes; it stems from the lack of LSP input in the corpora used for dictionary tracking. This is where an ad-hoc corpus comes into the picture. An ad-hoc corpus is collected on the fly, typically for a specific LSP task at a specific time. It can therefore be used as a tool by technical writers and translators who need to swiftly map the lexicographic, terminological and genre characteristics of a new, delimited field. The ad-hoc corpus tool used here is the freeware program AntConc. Join me for an ad-hoc LSP task.
Corpus linguistics - Overview
• Articles about ad-hoc corpus linguistics
• Mega-corpora
• Ad-hoc corpora
• Traditional corpus use
• Cross-linguistic corpus use
• Specialized language corpora
• Training with AntConc – freeware program
Corpus linguistics - articles
Our humble start:
Megacorpora
• American National Corpora: http://corpus.byu.edu/
• British National Corpus: http://www.natcorp.ox.ac.uk/
• Wortschatz: http://wortschatz.uni-leipzig.de/
• KorpusDK: http://ordnet.dk/
Task later: Knowledge about insulin analogs
You will be asked to work with insulin analogs later, so why not test that one right now? We will search the American megacorpora for insulin analog and see where it gets us. Let us try the American megacorpus first.
American Now corpus
Result: 8 hits = 8 texts; at a closer look maybe only 4, of which one is not American English but probably Indian English, and one is international English. The three hits from the FDA are probably from only one text. So in practice, from an American point of view, there are the FDA text and the Seeking Alpha text. Not very impressive: it shows the need for further searching to work with the area. However, why not take the FDA text for our corpus now that we have it? So we click the text and…
The first reference renders this:
… get this text, which we copy for our text corpus.
Our corpus text no. 1
The same text copied to Word.
Starting our collection for the corpus
- Copy the text
- Open Word
- Save the text in Word as a file in a folder:
  - Save as –> source/title/date (or your chosen parameters)
  - Save as .txt by choosing plain text => (all codes from html removed)
Save as My chosen folder Source text date Plain text
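If you prefer a script to the Word route, the same save step can be sketched in Python. This is only a sketch under my own assumptions: the function names, the `source_title_date.txt` naming pattern and the choice to write Latin 1 (to match the AntConc setting used later in the workshop) are illustrative and not part of any tool.

```python
from datetime import date
from pathlib import Path
import re


def corpus_filename(source: str, title: str, when: date) -> str:
    """Build a 'source_title_date.txt' file name (naming scheme is an assumption)."""
    def slug(text: str) -> str:
        # Keep letters and digits; collapse everything else into single hyphens.
        return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-").lower()

    return f"{slug(source)}_{slug(title)}_{when.isoformat()}.txt"


def save_corpus_text(folder: Path, source: str, title: str, when: date, text: str) -> Path:
    """Write the copied text as a plain-text file in the corpus folder."""
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / corpus_filename(source, title, when)
    # Latin 1 is assumed here; characters outside Latin 1 are replaced with '?'.
    path.write_text(text, encoding="latin-1", errors="replace")
    return path
```

Calling `save_corpus_text(Path("corpus"), "FDA", "Lantus label", date(2016, 7, 19), copied_text)` would then put one plain-text file per source into the folder you later load.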
Building up the corpus
I have to search elsewhere for text, and why not the largest big-data corpus in the world: Google. I use Google advanced search; it is easier and quicker. But a small step before that: I read my wiki.
https://www.google.ca/advanced_search
Wikipedia
To help me make a very precise search on Google, I sneak a peek at Wikipedia to see whether it knows anything about insulin analog:
An insulin analog is an altered form of insulin, different from any occurring in nature, but still available to the human body for performing the same action as human insulin in terms of glycemic control. Through genetic engineering of the underlying DNA, the amino acid sequence of insulin can be changed to alter its ADME (absorption, distribution, metabolism, and excretion) characteristics. Officially, the U.S. Food and Drug Administration (FDA) refers to these as "insulin receptor ligands", although they are more commonly referred to as insulin analogs. These modifications have been used to create two types of insulin analogs: those that are more readily absorbed from the injection site and therefore act faster than natural insulin injected subcutaneously, intended to supply the bolus level of insulin needed at mealtime (prandial insulin); and those that are released slowly over a period of between 8 and 24 hours, intended to supply the basal level of insulin during the day and particularly at nighttime (basal insulin). The first insulin analog approved for human therapy (insulin Lispro rDNA) was manufactured by Eli Lilly and Company.
Then Google Advanced
I want these parameters:
• My key concept
• Most recent year
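The two parameters above (key concept as an exact phrase, most recent year) can also be written down as a search URL, which is handy if you want to repeat the search later. The sketch below is hedged: the `google.com/search` endpoint and the parameter names `as_epq`, `lr` and `as_qdr` are what the advanced search form is commonly seen to produce, not something this workshop or Google guarantees, and the function name is my own.

```python
from urllib.parse import urlencode


def advanced_search_url(exact_phrase: str, language: str = "lang_en", period: str = "y") -> str:
    """Build a Google search URL for an exact phrase, limited to a recent period."""
    params = {
        "as_epq": exact_phrase,  # assumed: "this exact word or phrase"
        "lr": language,          # assumed: restrict results to one language
        "as_qdr": period,        # assumed: "y" = updated within the past year
    }
    return "https://www.google.com/search?" + urlencode(params)


print(advanced_search_url("insulin analog"))
```

Paste the printed address into the browser and copy texts from the results into your folder exactly as before.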
Google Advanced results
OK, let’s get to it then. Give yourself 10 minutes to copy-paste into your folder.
Ten minutes later I have my corpus
Then I am ready for my corpus work.
Now --- our task
We just got a task from a company, say Eli Lilly or Novo Nordisk, about writing or translating something about insulin analog.
How can a corpus help you?
• Register
• Collocations
• Definitions
• Synonyms
• Knowledge!
• Etc.
Getting the program
Find Laurence Anthony’s website: just google the name. By the way, the address is here: http://www.antlab.sci.waseda.ac.jp/software.html#antpconc
Press: AntConc 3.4.4 (or the Mac or other version that is compatible with your computer).
NB: The language encoding must be Western Latin 1! (Check under Global Settings – Language encoding – set it to Western Latin 1.)
Please join me here!
Loading your corpus into the program
Guide:
• Press AntConc 3.4.4 (the language setting must be Latin 1)
• Load your folder (Windows Explorer method)
• Then you are ready to search.
AntConc opened – but empty
AntConc is now open.
1) Press File
2) Open Directory
3) Load your corpus folder in the Windows way
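For those who want to see what "open directory" amounts to, here is a minimal Python sketch that reads every plain-text file in a folder. It does not reproduce AntConc's internals; the function name and the Latin 1 default are my assumptions, chosen to mirror the encoding setting mentioned above.

```python
from pathlib import Path


def load_corpus(folder: str, encoding: str = "latin-1") -> dict[str, str]:
    """Read every .txt file in the folder into a {file name: text} mapping."""
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        corpus[path.name] = path.read_text(encoding=encoding)
    return corpus
```

Files with other extensions are skipped, which is one reason to save everything as .txt when collecting.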
Lantus – what is that?
Only three sources: -> Product name? Check in texts.
Scroll down -> different sources use the word => ESP word
What is hypoglycemia?
If you press some of the words, you get directly into the texts.
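The search result you scroll through is a keyword-in-context (KWIC) list. A rough Python sketch of the idea follows; the function name is hypothetical, and the two sample sentences are invented for illustration and are not taken from the workshop corpus.

```python
import re


def kwic(corpus: dict[str, str], term: str, width: int = 40) -> list[tuple[str, str, str, str]]:
    """Return (file name, left context, hit, right context) for each match of term."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    lines = []
    for name, text in corpus.items():
        flat = " ".join(text.split())  # collapse line breaks and runs of spaces
        for m in pattern.finditer(flat):
            left = flat[max(0, m.start() - width):m.start()]
            right = flat[m.end():m.end() + width]
            lines.append((name, left, m.group(0), right))
    return lines


# Invented sample texts, only to show the output shape.
sample = {
    "a.txt": "Lantus is a long-acting insulin analog.",
    "b.txt": "Patients on Lantus should be monitored for hypoglycemia.",
}
for name, left, hit, right in kwic(sample, "lantus"):
    print(f"{name}: {left:>40}[{hit}]{right}")
```

Keeping the file name with every hit is what lets you count how many sources use a word, as with the three sources for Lantus.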
The or trick
Finding:
• Content knowledge
• Alternatives
• Synonyms
Try also: aka / referred to / known as / (
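The or trick is simply a search for small cue words that tend to sit between a term and its alternative. A hedged Python sketch, with a hypothetical function name, a cue list taken from this slide, and an invented sample sentence:

```python
import re

# Cue words from the slide; extend the list with your own.
CUES = ["or", "aka", "referred to", "known as"]


def find_alternatives(text: str, cues: list[str] = CUES, width: int = 30) -> list[tuple[str, str, str]]:
    """Return (left context, cue, right context) for every cue word found."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in cues) + r")\b", re.IGNORECASE
    )
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    results = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()].strip()
        right = flat[m.end():m.end() + width].strip()
        results.append((left, m.group(1).lower(), right))
    return results


# Invented sentence, only to show the output shape.
sentence = "Insulin glargine, also known as Lantus, is a basal or long-acting insulin analog."
for left, cue, right in find_alternatives(sentence):
    print(f"{left} [{cue}] {right}")
```

What stands on either side of the cue is your candidate pair of synonyms or alternatives; you still have to judge each pair yourself.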
The ( trick – finding parenthetical info
Which kind of info would you find in a parenthesis, and what does that data tell you?
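To see what the ( trick gives you, here is a small Python sketch that pulls out each parenthesis together with the words just before it. The function name and the 40-character context are my assumptions; the sample text is from the Wikipedia passage quoted earlier in this workshop.

```python
import re

# Up to 40 characters of preceding context, then the content of one parenthesis.
PAREN = re.compile(r"([^()]{0,40})\(([^()]+)\)")


def parentheticals(text: str) -> list[tuple[str, str]]:
    """Return (preceding context, parenthetical content) pairs."""
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    return [(m.group(1).strip(), m.group(2).strip()) for m in PAREN.finditer(flat)]


wiki = (
    "the amino acid sequence of insulin can be changed to alter its ADME "
    "(absorption, distribution, metabolism, and excretion) characteristics. "
    "Officially, the U.S. Food and Drug Administration (FDA) refers to these "
    "as insulin receptor ligands"
)
for before, inside in parentheticals(wiki):
    print(f"{before!r} -> {inside!r}")
```

In this passage the parentheses hold an expansion of an abbreviation in one case and the abbreviation itself in the other, which is typical of what the trick turns up.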
Finding collocations - right
Set the sorting parameters at the bottom: 1R 2R 3R
Findings:
• Metabolic changes
• Metabolic control
• Metabolic decompensation
• Metabolic deterioration
All of them concepts in their own right.
Finding collocations - left
Set the sorting parameters at the bottom: 1L 2L 3L
Findings:
• Good metabolic
• Poor metabolic
• Rapid metabolic
… but, for instance, not bad metabolic!
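Sorting on 1R or 1L boils down to looking at the word one position to the right or left of your search word. A minimal Python sketch of that counting, with a hypothetical function name, a deliberately simple tokenizer, and an invented sample text built around the findings listed on the slides:

```python
import re
from collections import Counter


def neighbours(text: str, node: str, offset: int) -> Counter:
    """Count the words found `offset` positions from the node word.

    offset=+1 corresponds to 1R, offset=-1 to 1L, +2 to 2R, and so on.
    """
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        j = i + offset
        if tok == node.lower() and 0 <= j < len(tokens):
            counts[tokens[j]] += 1
    return counts


# Invented text, only to show the output shape.
sample_text = (
    "Good metabolic control was achieved. Poor metabolic control led to "
    "metabolic decompensation. Rapid metabolic changes were seen."
)
print("1R:", neighbours(sample_text, "metabolic", 1).most_common())
print("1L:", neighbours(sample_text, "metabolic", -1).most_common())
```

A word that never shows up in the 1L list, such as bad in front of metabolic, is as informative as the ones that do.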
Statistics & collocation – for the nerds
Shows ranking, frequency left, frequency right, statistics and the collocate. Note for instance regimen, which can be used with dosing as both an L and an R collocate. The same meaning?
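For the nerds who want to see how such a table can be computed, here is a Python sketch that produces the same kind of columns: collocate, total frequency, frequency left, frequency right and a statistic. The statistic is a textbook mutual information score, log2(observed / expected); whether it matches AntConc's own calculation exactly is an assumption I have not verified. The function name and the sample text are invented.

```python
import math
import re
from collections import Counter


def collocates(text: str, node: str, span: int = 1) -> list[tuple[str, int, int, int, float]]:
    """Return (collocate, freq, freq left, freq right, MI), highest MI first."""
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", text.lower())
    node = node.lower()
    freq = Counter(tokens)
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left.update(tokens[max(0, i - span):i])      # words within span to the left
        right.update(tokens[i + 1:i + 1 + span])     # words within span to the right
    rows = []
    for word in set(left) | set(right):
        observed = left[word] + right[word]
        # Expected co-occurrences if the two words were spread independently.
        expected = freq[node] * freq[word] * 2 * span / len(tokens)
        rows.append((word, observed, left[word], right[word], math.log2(observed / expected)))
    return sorted(rows, key=lambda r: (-r[4], r[0]))


# Invented text, only to show the output shape.
text = "The dosing regimen was changed. A new regimen dosing chart was used."
for row in collocates(text, "dosing"):
    print(row)
```

With a corpus of ten texts the scores are unstable, so treat them as a ranking aid and go back to the concordance lines before drawing conclusions.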
Concordance plot for precautions
Shows in which texts a word exists and how the word use is distributed throughout the text. Precautions has a strong front tendency in the texts. A possible genre tool?
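The idea behind the plot can be sketched in a few lines of Python: find where each hit falls relative to the length of the text, then draw a bar at that point. The function names and the 40-character plot width are my assumptions; AntConc's own plot is graphical and may normalise differently.

```python
import re


def hit_positions(text: str, term: str) -> list[float]:
    """Relative position (0.0 = start of text) of each hit in one text."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return []
    return [i / len(tokens) for i, tok in enumerate(tokens) if tok == term.lower()]


def plot_line(positions: list[float], width: int = 40) -> str:
    """Draw a one-line text 'barcode' with a bar at each hit position."""
    cells = ["."] * width
    for p in positions:
        cells[min(int(p * width), width - 1)] = "|"
    return "".join(cells)
```

Running `plot_line(hit_positions(text, "precautions"))` on each text in the folder gives one line per text; bars crowding the left edge would correspond to the front tendency noted above.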
Your turn
Let your imagination loose. What do you find out? What can it be used for? Is it useful in the first place?
My opinion – Good
• The program is intuitive in use – I never learned it from anyone.
• General Microsoft processes work in AntConc.
• Good as an ad-hoc writing and translation tool.
• Good for register work.
• Good for collocations.
• Good for proof of what you are doing.
• Forget about your intuitive ideas and check how it works.
• Define a problem and devise your own solution method.
• Even 10 texts, as in our case, can provide a wealth of information.
• Ten times faster and 100 times more reliable than the tiresome open-and-read of x number of Google docs.
• Must nowadays necessarily replace any old-fashioned pencil-and-paper work.
My opinion - limitations
• The quality of the findings depends on the user
• Quality in => quality out
• You have got to get started to like it
• You have to learn approx. five shortcuts in order not to tire out
Thank you for joining me in this. I wish you the best of luck with the rest of the conference. If you want to contact me, please write to me: bmo@dac.au.dk / bmo@expo-com.dk