What is Information? • What will we retrieve with information retrieval? • There are several ways to define “information” • Subjective: People develop models of their environment. Information created by people makes those models more accurate. • Thing/artifact: Information is what’s captured in a book, web page, or other resource. • More information is digital & is increasing
Information - wikipedia • Information as a concept has a diversity of meanings, from everyday usage to technical settings. Generally speaking, the concept of information is closely related to notions of constraint, communication, control, data, form, instruction, knowledge, meaning, mental stimulus, pattern, perception, and representation. • Many people speak about the Information Age as the advent of the Knowledge Age or knowledge society, the information society, the Information revolution, and information technologies, and even though informatics, information science and computer science are often in the spotlight, the word "information" is often used without careful consideration of the various meanings it has acquired.
How much information is there in the world? • Informetrics: the measurement of information • Stored • What can we store? • What do we intend to store? • What is stored? • How do we use it? • Decision making • Knowledge discovery
Aspects of the Information & Data Age • Much information/data will/can be made and stored digitally • Information/data can be automatically processed, mined, and accessed • Why? Moore’s Law
Information Age & Data Age • We have entered the information & data age • What is the information age? • When do we leave it and where do we go next? • David Weinberger’s Too Big to Know • What information was
Digitization of Everything: the Zettabytes are coming • Soon most everything will be recorded and indexed • Much will remain local • Most bytes will never be seen by humans. • Search, data summarization, trend detection, information and knowledge extraction and discovery are key technologies • So will be infrastructure to manage this.
Digital Information Created, Captured, Replicated Worldwide • [Figure: worldwide digital information in exabytes, showing 10-fold growth in 5 years] • Contributing sources: DVD, RFID, digital TV, MP3 players, digital cameras, camera phones, VoIP, medical imaging, laptops, data center applications, games, satellite images, GPS, ATMs, scanners, sensors, digital radio, DLP theaters, telematics, peer-to-peer, email, instant messaging, videoconferencing, CAD/CAM, toys, industrial machines, security systems, appliances • Source: IDC, 2008
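The "10-fold growth in 5 years" figure can be turned into an equivalent yearly rate. The following is a minimal sketch, assuming steady compound growth over the period; the function name `annual_growth_rate` is illustrative and not from the slides.

```python
def annual_growth_rate(total_factor: float, years: float) -> float:
    """Constant yearly growth rate that compounds to total_factor over the given years."""
    return total_factor ** (1.0 / years) - 1.0


# 10-fold growth in 5 years, as stated on the slide
rate = annual_growth_rate(10.0, 5.0)
print(f"Equivalent annual growth: {rate:.1%}")
```

Under that assumption, a tenfold increase in five years corresponds to a little under 60% growth per year.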
How much information is there? • Soon most everything will be recorded and indexed • Most bytes will never be seen by humans. • Data summarization, trend detection, and anomaly detection are key technologies • See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html • See Lyman & Varian: How much information http://www.sims.berkeley.edu/research/projects/how-much-info/ • [Figure: scale from Kilo, Mega, Giga, Tera, Peta, Exa, Zetta to Yotta, with markers for A Book, A Photo, A Movie, All books (words), All Books MultiMedia, and Everything Recorded] • Small prefixes (power of ten): 3 milli, 6 micro, 9 nano, 12 pico, 15 femto, 18 atto, 21 zepto, 24 yocto
Big Data: Volume • One page of text: 30 KB • One song: 5 MB • One movie: 5 GB • 6 million books: 1 TB • 55 storeys of DVD: 1 PB • Data up to 2003: 5 EB • Data in 2011: 1.8 ZB • NSA data center: 1 YB • Units: Kilobyte (KB) = 1000 bytes • Megabyte (MB) = 1000 KB • Gigabyte (GB) = 1000 MB • Terabyte (TB) = 1000 GB • Petabyte (PB) = 1000 TB • Exabyte (EB) = 1000 PB • Zettabyte (ZB) = 1000 EB • Yottabyte (YB) = 1000 ZB
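The unit ladder on this slide, where each step is a factor of 1000, can be sketched in a few lines. This is an illustration only, assuming decimal (SI) prefixes as used on the slide; the helper names `to_bytes` and `humanize` are made up for the example.

```python
# Decimal (SI) byte units, each 1000 times the previous one
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit: str):
    """Convert a quantity in the given unit to bytes."""
    return value * 1000 ** UNITS.index(unit)


def humanize(num_bytes: float) -> str:
    """Express a byte count in the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


print(to_bytes(1, "ZB"))             # one zettabyte in bytes
print(humanize(to_bytes(5, "EB")))   # data up to 2003, per the slide
```

Working in integers for `to_bytes` keeps the large values exact, which matters once the counts pass what a float can represent precisely.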
Information Facts • Print, film, magnetic, and optical storage media produced about 5 exabytes of new information in 2002. Ninety-two percent of the new information was stored on magnetic media, mostly in hard disks. • How big is five exabytes? If digitized with full formatting, the seventeen million books in the Library of Congress contain about 136 terabytes of information; five exabytes of information is equivalent in size to the information contained in 37,000 new libraries the size of the Library of Congress book collections. • Hard disks store most new information. Ninety-two percent of new information is stored on magnetic media, primarily hard disks. Film represents 7% of the total, paper 0.01%, and optical media 0.002%. • The United States produces about 40% of the world's new stored information, including 33% of the world's new printed information, 30% of the world's new film titles, 40% of the world's information stored on optical media, and about 50% of the information stored on magnetic media. • How much new information per person? According to the Population Reference Bureau, the world population is 6.3 billion, thus almost 800 MB of recorded information is produced per person each year. It would take about 30 feet of books to store the equivalent of 800 MB of information on paper.
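The two headline ratios on this slide can be checked with simple arithmetic. The sketch below uses only the figures quoted above (5 exabytes, 136 terabytes, 6.3 billion people) and assumes decimal units; the variable names are illustrative.

```python
EXABYTE = 10**18
TERABYTE = 10**12
MEGABYTE = 10**6

new_info_bytes = 5 * EXABYTE             # new information stored in 2002
library_of_congress = 136 * TERABYTE     # digitized book collection
world_population = 6.3e9

# How many Library of Congress book collections fit in 5 exabytes?
libraries = new_info_bytes / library_of_congress

# New stored information per person per year, in megabytes
per_person_mb = new_info_bytes / world_population / MEGABYTE

print(f"{libraries:,.0f} libraries, {per_person_mb:.0f} MB per person")
```

Both results land close to the rounded values on the slide: roughly 37,000 libraries and almost 800 MB per person.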
Wikipedia: Zettabyte Era • The Zettabyte Era is a period of human and computer science history that started in one of two ways: when global IP traffic first exceeded one zettabyte, which happened in 2016; or when the amount of digital data in the world first exceeded a zettabyte, which happened in 2012. A zettabyte is a multiple of the unit byte that measures digital storage, and it is equivalent to 1,000,000,000,000,000,000,000 (10^21) bytes.
Wikipedia: Zettabyte Era Predictions • Global IP traffic will triple and is estimated to reach 3.3 ZB on a yearly basis • In 2016 video traffic (e.g. Netflix and YouTube) accounted for 73% of total traffic. In 2021 this will increase to 82% • The number of devices connected to IP networks will be more than three times the global population • The amount of time it would take for one person to watch the entirety of video that will traverse global IP networks in one month is 5 million years • PC traffic will be exceeded by smartphone traffic. PC traffic will account for 25% of total IP traffic while smartphone traffic will be 33% • There will be a twofold increase in broadband speeds
Comparisons at Scale • The world's technological capacity to receive information through one-way broadcast networks was 0.432 zettabytes of optimally compressed information in 1986, 0.715 ZB in 1993, 1.2 ZB in 2000, and 1.9 ZB in 2007, the informational equivalent of every person on Earth receiving 174 newspapers per day. • In 2003, Mark Liberman calculated the storage requirements for all human speech ever spoken at 42 zettabytes if digitized as 16 kHz 16-bit audio. He did this in response to a popular expression that states "all words ever spoken by human beings" could be stored in approximately 5 exabytes of data. Liberman confessed that "maybe the authors [of the exabyte estimate] were thinking about text". • In 2007, humankind successfully sent 1.9 zettabytes of information through broadcast technology such as televisions and GPS, per research from the University of Southern California. • In 2008, Americans alone consumed 3.6 zettabytes of information, per a 2009 study from the University of California, San Diego. • As of 2009, the entire World Wide Web was estimated to contain close to 500 exabytes, or half a zettabyte. • In 2011 the International Data Corporation expected the "total amount of global data" to grow to 2.7 zettabytes during 2012, an increase of 48% from 2011. • In 2012, Americans accessed 6.9 zettabytes of data, per a 2013 study. • In 2013, one expert estimated that the "amount of data generated worldwide" would reach 4 zettabytes by the end of the year. • By 2025, according to an IDC study commissioned by Seagate, "the global datasphere will grow to 175 zettabytes". • Source: Wikipedia
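To get a feel for the Liberman figure, the data rate of 16 kHz 16-bit audio can be converted into the amount of speech that 42 zettabytes would hold. This is a rough sketch, assuming uncompressed single-channel audio and a 365.25-day year; neither assumption is stated on the slide.

```python
SAMPLE_RATE_HZ = 16_000      # 16 kHz sampling, from the slide
BYTES_PER_SAMPLE = 2         # 16-bit samples
ZETTABYTE = 10**21

# Assumption: one channel, no compression
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE

total_bytes = 42 * ZETTABYTE
seconds_of_speech = total_bytes / bytes_per_second

SECONDS_PER_YEAR = 365.25 * 24 * 3600
years_of_speech = seconds_of_speech / SECONDS_PER_YEAR

print(f"{bytes_per_second} bytes/s, about {years_of_speech:.2e} years of continuous audio")
```

The same arithmetic shows why a text-based estimate is so much smaller: a second of speech is a handful of characters as text, but 32,000 bytes as audio at this fidelity.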
Moore's Law • Defined by Dr. Gordon Moore during the sixties. • Predicts an exponential increase in component density over time; the doubling time is commonly quoted as 18 months (Moore himself projected one year in 1965 and revised it to two years in 1975). • Applicable to microprocessors, DRAMs, DSPs and other microelectronics. • Monotonic increase in density observed since the 1960s.
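The doubling rule can be written as a one-line formula: after t years, density grows by a factor of 2^(12t / T), where T is the doubling time in months. The sketch below assumes the 18-month doubling time quoted on this slide; the function names are illustrative.

```python
def density_multiplier(years: float, doubling_months: float = 18.0) -> float:
    """Growth factor in component density after the given number of years."""
    return 2.0 ** (12.0 * years / doubling_months)


def annual_rate(doubling_months: float = 18.0) -> float:
    """Equivalent yearly growth rate for a given doubling time."""
    return density_multiplier(1.0, doubling_months) - 1.0


print(f"18-month doubling: {annual_rate(18):.1%} per year")
print(f"24-month doubling: {annual_rate(24):.1%} per year")
print(f"Factor after 10 years (18-month doubling): {density_multiplier(10):.0f}x")
```

An 18-month doubling time works out to just under 59% per year, which matches the "Moores law 58.70% /year" figure on the later storage slide.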
First Disk 1956 • IBM 305 RAMAC • 4 MB • 50 x 24” disks • 1200 rpm • 100 ms access • $35k/year rent • Included computer & accounting software (tubes, not transistors)
10 years later 30 MB 1.6 meters
Now - Terabytes on your desk Terabyte external drive for <$25 - 3 cents a gigabyte. In 10 years, 0.3 cent/gigabyte, $3 for a terabyte?
Storage capacity beating Moore’s law • Improvements: capacity 60%/y, bandwidth 40%/y, access time 16%/y • 1000 $/TB today • 100 $/TB in 2007 • Moore's law: 58.70%/year • TB growth: 112.30%/year since 1993 • Price decline: 50.70%/year since 1993 • Most (80%) data is personal (not enterprise). This will likely remain true.
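Yearly improvement rates like these are easier to compare as doubling times. A minimal sketch, assuming each rate compounds steadily year over year; the function name `doubling_time_years` is made up for the example.

```python
import math


def doubling_time_years(annual_rate: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2.0) / math.log(1.0 + annual_rate)


# Improvement rates quoted on the slide
rates = {"capacity": 0.60, "bandwidth": 0.40, "access time": 0.16}

for name, rate in rates.items():
    print(f"{name}: doubles in about {doubling_time_years(rate):.1f} years")
```

Seen this way, capacity doubles in about a year and a half, while access time takes more than four years, so the gap between how much can be stored and how fast it can be reached keeps widening.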
Digital Immortality Bell, Gray, CACM, ‘01 Requirements for storing various media for a single person’s lifetime at modest fidelity
What is Digital Immortality? • Preservation and interaction of digitized experiences for individuals and/or groups • Preservation and access • Active interaction with archives through queries and/or an avatar (agents) • Avatar interactions for group experiences • Issues: • Archiving • Indexing • Veracity • Access
All the world’s libraries on your iPod! NY Times Magazine And you thought finding that song was hard. SmartPhone • Storage is practically free • Much is mobile • Access is crucial • Moore’s law keeps on trucking
Why Put Everything in Cyberspace? • Low rent: min $/byte • Shrinks time: now or later • Shrinks space: here or there • Automate processing: knowbots • Point-to-point OR broadcast • Immediate OR time delayed • Locate, process, analyze, summarize
Progress of Science • Thousand years ago: science was empirical, describing natural phenomena • Last few hundred years: a theoretical branch using models, generalizations • Last few decades: a computational branch simulating complex phenomena • Today (big data/information): data and information exploration (eScience) unifies theory, experiment, and simulation; it is information driven • Data captured by sensors and instruments, or generated by simulators • Processed/searched by software • Information/knowledge stored in computers • Scientist analyzes databases/files using data management and statistics • Network Science • Cyberinfrastructure
People and Information • People process information based on their experience and context. • Human information processing is affected by emotions and needs. • Your data may be my information • Search engine relevance works the same way: what counts as relevant depends on the person searching
What is knowledge? • Data: facts, observations, or perceptions. • Information: a subset of data, including only those data that possess context, relevance, and purpose. • Knowledge: a simple view places knowledge at the highest level in a hierarchy, with data at the lowest level and information in the middle. • Data refers to bare facts void of context. • A telephone number. • Information is data in context. • A phone book. • Knowledge is information that facilitates action. • Recognizing that a phone number belongs to a good client, who needs to be called once per week to get his orders.
From Facts to Wisdom (Haeckel & Nolan, 1993): one example of the hierarchy
What is knowledge? • Knowledge: a more complex view considers knowledge as intrinsically different from information. Instead of considering knowledge as a richer or more detailed set of facts, we define knowledge in an area as justified beliefs about relationships among concepts relevant to that particular area.
Great Predictions • "Computers in the future may weigh no more than 1.5 tons." Popular Mechanics, forecasting the relentless march of science, 1949 • "I think there is a world market for maybe five computers." Thomas Watson, chairman of IBM, 1943 • "Heavier-than-air flying machines are impossible." Lord Kelvin, president, Royal Society, 1895 • "Man will never reach the moon regardless of all future scientific advances." Dr. Lee De Forest, inventor of the vacuum tube and father of television • "Everything that can be invented has been invented." Charles H. Duell, Commissioner, U.S. Office of Patents, 1899 • "Nobody would ever need more than 640 kilobytes of memory on their personal computer." Bill Gates, 1981 • Other predictions of Bill Gates?
Great Predictions RIGHT! • Artificial Intelligence: • speech recognition • Some reasoning; computer beats man in chess • Privacy and security problems • Computers can be a pain in the butt WRONG! • Missed Moore’s law and ubiquity of computers
Predicting the future • “The future ain’t what it used to be” Yogi Berra • Can we really predict the future? • Who predicted the implications of the web and search engines? • Social networking? • Can we understand power laws and their implications? • We have no examples of exponential growth in our evolution except plagues. • Can we understand the pervasiveness of computers?
Information Science and Data Generation Trends • What do large amounts of information provide? • New opportunities for search! • New discoveries • Business opportunities? • Research opportunities? • Problems? • Wisdom search engine?
Thanks to: • Jim Gray, Microsoft • L. Floridi, Hertfordshire • Robert Allen, Drexel • Wikipedia