320 likes | 332 Views
This proposal outlines a new approach to big data analytics for anomaly detection, aiming to surpass current NSA capabilities. It suggests using pTree SVD based analytics and introduces new data types such as connected, sparse, multi-entity relationship matrixes. The approach involves training concatenated feature vectors in the cloud and using them for analytics on the client side. The proposal also emphasizes the importance of unsupervised anomaly detection and scalability and performance testing. The response aims to be generic and adaptable to incremental changes.
E N D
Hi Mark, (2013/08/21) I wanted to return some preliminary thoughts early in case any of them are useful to you in writing the proposal to NSA. First of all, congratulations! I have NEVER been sent a RFP with my name right in it and with a comment to the effect "previous work was so excellent that...". Secondly, the wording as to what they are looking for is very much 35,000ft in the RFP, You probably know what will clinch it better than anyone (having worked with them and gotten the above comment back from them already ;-) They want big data analytics (I read "predictors" or classifiers) and anomaly detectors which will "work better than NSA's current ones". Not sure what that means exactly, but having read just a little about PRISM, I conclude that they're looking for a better (faster?) Big Data PRISM? If so, then the RFP section asking for 3 new analytics and 3 new data types, seems important. The attached slides may give some ideas as to new analytics (pTree SVD based analytics - which I am very excited about-for both classification and anomaly detection) and new data types (basically connected, sparse, multi-entity relationship matrixes, both 2D and 3D). See if there is anything you can use. If not just ignore it all - no problem. Again, you know best what will get the grant. The slides detail an analytics model that is specific to big data sets of emails, tweets, text or phone records (or all of them combined?). Broadly, the new idea (if it's new?) is to retain sender/receiver info but not by expanding the dimension of the information cube (to 4D: terms, docs, senders, receivers) because that may be the problem - massive size increase with redundancies. So wereplace 2D n-by-m Term-Document (tf-idf) matrix with a single well-trained n-vector and a single well-trained m-vector (the so-called main singular value vectors from SVD theory) (note that the TD matrix need not be tf-idf but any measurement that provides knowledge on how much a document is characterized by a term).replace 2D p-by-q User-Term matrix (which specifies how much a term characterizes a user) with a well trained single p-vector and single q-vector;replace the 3D r-by-s-by-t Document-Sender-Receiver matrix with a single well-trained r-vector, a single s-vector and a single t-vector (DSR is a binary or Yes/No matrix). We train the 1 concatenated feature vector 1 time, download it from the cloud to the client and use it there to do the analytics. If you have any questions at all on the slides (the first one is the most important), just call or exchange email. Mark's reply: Yes, was *very* flattering. I do the best I can to try and keep customers happy. Unfortunately there is very little to go on. I think we are only speculating about data types, which makes this response even more challenging…. From last years exercise, I think the focus will be on anomaly detection, particularly unsupervised where we did not fully get over the finish line. This is important There will be incremental work over the course of a year, so I think we need to plan kind of generically – that is, we will need to estimate how long it takes to produce a new method. We'll work on one method, test, and they will come back to us and say try these changes. So I’m going to be as generic as I can in the response. My focus will be on getting oblique working in Hadoop and IBM streams environs, scalability and performance testing, and data import types. I have zero info on what new data types they mean, but assume that it could include both structured and unstructured – just not another language
Dr. Wettstein's reply: I would echo Dr. Perrizo's congratulations on being the recipient of a directed RFP. I have some experience with the frustrations inherent in breaking through the beuracracy inherent in the agencies in particular and DC in general. I have found Dr. Perrizo's recent work interesting. My Ph.D. minor was in statistical theory and we spent a great deal of time working in linear and non-linear regression analysis. The focus of my primary research was how to optimize the multi-dim energy minimization prob inherent in modelling drug ligand interactions with protein based neuroreceptors. That problem resolves to locating the optimimum Intrinsic Reaction Coordinate (IRC) pathway, sometimes referred to as a ballistic trajectory path, through a multi-dimensional energy field. The problem ultimately boils down to optimizing away the protein-ligand configuration which lead to local rather then maximum energy minimums. I find particularly interesting similarities in the use of steepest descent prediction methods. The next iteration of the energy field calculation was made by estimating the coefficients to be assigned to the parameter field through steepest descent predictions. The obvious optimizations we ended up moving to involved developing assessment procedures for when to zero weight or reject parameters which appeared to not be contributing to or relevant to the next energy step prediction. As you probably know my research and development efforts shifted toward identity theory and its implications with respect to privacy. Interestingly enough my consulting group is involved in the security and privacy issues, or more correctly lack thereof, in the health insurance and information exchanges supporting the Affordable Care Act. I'm quite literally flattered Mark that you would consider that I might be able to have positive contributions to your work.Unfortunately due to the realities of the current national dialogue I would have ethical issues with respect to contributing to anything which the NSA is attempting to do at this point. Which is probably surprising to most people that know me since they would classify me as conservative thinking and strong on national security issues. Unfortunately the whole system has evolved without sufficient protections for privacy and data security concerns which means that this will end badly for everyone in the United States. Based on the experiences of my group I am increasingly suspicious that there is little interest on behalf of the government in protecting health related identity information. There is a veritable motherload of information other there and unfortunately that is all too seductive a siren. But I certainly wish you and Treeminer the best in your work on theseissues. I will re-double my design efforts on using large field theory to shroud sentinel identity generation.... :-) Good luck with your work and best wishes for a pleasant weekend.
New vertical Analytics for NSA's PRISM (for email classification and anomaly detection): FR 0 0 0 0 0 0 0 0 0 FR 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 FS F1,S F2,S 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 E.g., 2 features, using a 3D DocumentSenderReceiver matrix, DSR 0 1 0 1 0 0 0 1 2 3 DSR 4 0 1 0 1 FS 5 T TD rec 0 FU FD FT 0 1 1 0 0 sender 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 2 2 0 0 1 1 1 1 1 1 0 0 0 3 3 1 0 0 1 FD 0 0 0 0 0 0 4 4 FT 5 5 0 FT FD D D 0 1 0 0 1 1 0 0 1 1 2 2 0 0 0 0 0 0 0 1 1 2 3 3 3 UT 4 4 4 5 5 0 1 0 0 5 0 1 0 1 FD U U 0 0 0 1 T 0 0 1 0 0 0 0 1 BackPropagate+LineSearch to minimize sum of square errors (sse) using Gradient(sse), where sse sums errors over non-blanks in TD,UT,DSR. The question is: train feature segments separately (train FU summing errors over UT only, FS and FR summing errors over DSR only, or train all 3 summing errors over all non-blanks in UT and DSR?) to produce F= FU 0 0 0 1 0 1 1 0 0 1 0 1 0 0 1 1 0 1 0 1 0 1 1 <----FT----> <----FR----> <----FS----> <----FU----> <----FD----> 0 Training the User feature segment just once makes sense, F = 0 1 0 1 0 1 0 1 0 1 0 1 <----FT----> <----FD----> <FU=FS=FR> 0 1 0 1 FT Use pSVD. but distinguish Senders and Receivers in the email DocumentUser (DU) [binary?] matrix (i.e., replace DU with a 3D matrix, DSR) Email is emphasized, but the same analytics and data structures apply to phone_records/tweets/SMStext (distinguish senders/recieivers in each) 1 feature, the trained F=FDFTFUFSFR to predict blanks in DSR, TD and UT (without needing DSR,TD or UT). Prediction of blanks using pSVD can classify and anomaly detect. DSR TD UT Do conversions to pTrees and train F in the CLOUD. Then download the resulting F to user's personal devices for predictions, anomaly detections. The same setup should work for phone record Documents, tweet Documents (in the US Library of Congress) and text Documents.
The mathematics involved when there is a 3D relationship (e.g., DSR) is similar: e = p - r p(d,x,y) = dxy so e = dxy - r se = e2 = (dxy - r)2 sseDSR = nonblank(dxy)(dxy - r)2 sse/d = supp(d)2exy + supp(y)2edx + supp(x)2edy se/d = 2exy = 2(dxy-r)xy = 2dx2y2 - 2rxy se/x = 2edy = 2(dxy-r)dy = 2d2xy2 - 2rdy se/y = 2edx = 2(dxy-r)dx = 2d2x2y - 2rdx So we can factor out the 2 and then each round we find the s for which sse is minimized in: f + s = f + s ( exy, edy, edx ) We can do that by linear line search or, when there is one, use the formula to optimize over s. What is there is a 4D relationship (e.g., if time of email send becomes an issue, we have the 4D: DSRT)? Then similarly, we minimize, each round: f + s = f + s ( exyt, edyt, edxt, edxy ) Heaven forbid we should have a higher dimensional relationship (but we could). If we have a nD relationship, X1X2X3...Xn we minimize, each round, f + s = f + s ( e x2x3...xn, . . . e x1x2...xr-1xr+1...xn, . . . ex1x2...xn-1 ) or f + s = f + s ( e x2x3...xn-1xn, e x1 x3...xn-1xn, : e x1x2x3...xn-1 )
Ancillary Background stuff:(on the Twitter Archive at US Library of Congress, NSA's PRISM, ... (My omments in are in red)) Certainly the record of what millions of Americans say, think, and feel each day (their tweets) is a treasure-trove for historians [and for NSA?]. But is the technology feasible, and important for a federal agency> Is it cost-effective to handle the three V's that form the fingerprint of a Big Data project – volume, velocity, and variety? U.S. Library of Congress said yes and agreed to archive all tweets sent since 2006 for posterity. But the task is daunting. Volume? US LoC will archive 172 billion tweets in 2013 alone (~300 each from 500 million tweeters), so many trillions, 2006-2016? Velocity? currently absorbing > 20 million tweets/hour, 24 hours/day, seven days/week, each stored in a way that can last. Variety? tweets from a woman who may run for president in 2016 – and Lady Gaga. And they're different in other ways. "Sure, a tweet is 140 characters" says Jim Gallagher, US Library of Congress Director of Strategic Initiatives, "There are 50 fields. We need to record who wrote it. Where. When." Because many tweets seem banal, the project has inspired ridicule. When the library posted its announcement of the project, one reader wrote in the comments box: "I'm guessing a good chunk ... came from the Kardashians." But isn't banality the point? Historians want to know not just what happened in the past but how people lived. It is why they rejoice in finding a semiliterate diary kept by a Confederate soldier, or pottery fragments in a colonial town. It's as if a historian today writing about Lincoln could listen in on what millions of Americans were saying on the day he was shot. [Or NSA could classify TweetSendersReceivers over the past 10 years to profile the class of "likely terrorists". So the single feature is "ill-intent" and we use pSVD to predict it (trained on historical record of tweets from known or convicted ill-intenders.] Youkel and Mandelbaum might seem like an odd couple to carry out a Big Data project: One is a career Library of Congress researcher with an undergraduate degree in history, the other a geologist who worked for years with oil companies. But they demonstrate something Babson's Mr. Davenport has written about the emerging field of analytics: "hybrid specialization." [How about Data Mining and Mathematics?!?] For organizations to use the new technology well, traditional skills, like computer science, aren't enough. Davenport points out that just as Big Data combines many innovations, finding meaning in the world's welter of statistics means combining many different disciplines. Mandelbaum and Youkel pool their knowledge to figure out how to archive the tweets, how researchers can find what they want, and how to train librarians to guide them [but not how to data mine them!]. Even before opening tweets to the public, the library has gotten more than 400 requests from doctoral candidates, professors, and journalists. "This is a pioneering project," Mr. Dizard says. "It's helping us begin to handle large digital data." For "America's library," at this moment, that means housing a Gutenberg Bible and Lady Gaga tweets. What will it mean in 50 years? I ask Dizard. He laughs – and demurs. "I wouldn't look that far ahead." [I would! It will mean that data mining will drive a sea-change move to vertically storing all big data!]
NSA programs (PRISM (email/twitter/facebook analytics?) and ? (phone record analytics) Two US surveillance programs – one scooping up records of Americans' phone calls and the other collecting information on Internet-based activities (PRISM?) – came to public attention. The aim: data-mining to help NSA thwart terrorism. But not everyone is cool with it. In the name of fighting terrorism, the US gov has been mining data collected from phone companies such as Verizon for the past seven years and from Google, Facebook, and other social media firms for at least 4 yrs, according to gov docs leaked this week to news orgs. The two surveillance programs, one that collects detailed records of telephone calls, the other that collects data on Internet-based activities such as e-mail, instant messaging, and video conferencing [facetime, skype?], were publicly revealed in "top secret" docs leaked to the British newspaper the Guardian and the Washington Post. Both are run by the National Security Agency (NSA), the papers reported. The existence of the telephone data-mining program was previously known, and civil libertarians have for years complained that it represents a dangerous and overbroad incursion into the privacy of all Americans. What became clear this week were certain details about its operation – such as that the government sweeps up data daily and that a special court has been renewing the program every 90 days since about 2007. But the reports about the Internet-based data-mining program, called PRISM, represent a new revelation, to the public. Data-mining can involve the use of automated algorithms to sift through a database for clues as to the existence of a terrorist plot. One member of Congress claimed this week that the telephone data-mining program helped to thwart a significant terrorism incident in the United States "within the last few years," but could not offer more specifics because the whole program is classified. Others in Congress, as well as President Obama and the director of national intelligence, sought to allay concerns of critics that the surveillance programs represent Big Government run amok. But it would be wrong to suggest that every member of Congress is on board with the sweep of such data mining programs or with the amount of oversight such national-security endeavors get from other branches of government. Some have hinted for years that they find such programs disturbing and an infringement of people's privacy. Here's an overview of these two data-mining programs, and how much oversight they are known to have. Phone-record data mining On Thursday, the Guardian displayed on its website a top-secret court order authorizing the telephone data-collection prog. The order, signed by a federal judge on the mysterious Foreign Intelligence Surveillance Court, requires a subsidiary of Verizon to send to the NSA “on an ongoing daily basis” through July its “telephony metadata,” or communications logs, “between the United States and abroad” or “wholly within the United States, including local telephone calls.” Such metadata include the phone number calling and the number called, telephone calling card numbers, and time and duration of calls. What's not included is permission for the NSA to record or listen to a phone conversation. That would require a separate court order, federal officials said after the program's details were made public. After the Guardian published the court's order, it became clear that the document merely renewed a data-collection that has been under way since 2007 – and one that does not target Americans, federal officials said. “The judicial order that was disclosed in the press is used to support a sensitive intelligence collection op, on which members of Congress have been fully and repeatedly briefed,” said James Clapper, director of national intelligence, in a statement about the phone surveillance program. “The classified program has been authorized by all three branches of the Government.” That does not do much to assuage civil libertarians, who complain that the government can use the program to piece together moment-by-moment movements of individuals throughout their day and to identify to whom they speak most often. Such intelligence operations are permitted by law under Section 215 of the Patriot Act, so-called “business records” provision. It compels businesses to provide information about their subscribers to the government. Some companies responded, but obliquely, given that by law they cannot comment on the surveillance programs or even confirm their existence. Randy Milch, general counsel for Verizon, said in an e-mail to employees that he had no comment on the accuracy of the Guardian article, the Washington Post reported. The “alleged order,” he said, contains language that “compels Verizon to respond” to government requests and “forbids Verizon from revealing [the order's] existence.”
Will NSA leaks wake us from our techno-utopian dream?A vast surveillance state is being made possible by technologies that we were told would liberate us. Christian Science Monitor Dan Murphy, Staff writer, 6/10/13 They work a few hundred yards from one of the Library of Congress's most prized possessions: a vellum copy of the Bible printed in 1455 by Johann Gutenberg, inventor of movable type. But almost six centuries later, Jane Mandelbaum and Thomas Youkel have a task that would confound Gutenberg. The researchers are leading a team that is archiving almost every tweet sent out since Twitter began in 2006. A half-billion tweets stream into library computers each day. Their question: How can they store the tweets so they become a meaningful tool for researchers – a sort of digital transcript providing insights into the daily flow of history? Thousands of miles away, Arnold Lund has a different task. Mr. Lund manages a lab for General Electric, a company that still displays the desk of its founder, Thomas Edison at its research headquarters in Niskayuna, N.Y. But even Edison might need training before he'd grasp all the dimensions of one of Lund's projects. Lund's question: How can power companies harness the power of data to predict which trees will fall on power lines during a storm – thus allowing them to prevent blackouts before they happen? The work of Richard Rothman, a professor at Johns Hopkins University in Baltimore, is more fundamental: to save lives. The Centers for Disease Control and Prevention (CDC) in Atlanta predicts flu outbreaks, once it examines reports from hospitals. That takes weeks. In 2009, a study seemed to suggest researchers could predict outbreaks much faster by analyzing millions of Google searches. Spikes in queries like "My kid is sick" signaled a flu outbreak before the CDC knew there would be one. That posed a new question for Dr. Rothman and his colleague Andrea Dugas: Could Google help predict influenza outbreaks in time to allow hospitals like the one at Johns Hopkins to get ready? They ask different questions. But all of these researchers form part of the new world of Big Data – a phenomenon that may revolutionize every facet of life, culture, and, well, even the planet. From curbing urban crime to calculating the effectiveness of a tennis player's backhand, people are now gathering and analyzing vast amounts of data to predict human behaviors, solve problems, identify shopping habits, thwart terrorists – everything but foretell which Hollywood scripts might make blockbusters. Actually, there's a company poring through numbers to do that, too. Just four years ago, someone wanted to do a Wikipedia entry on Big Data. Wikipedia said no; there was nothing special about the term – it just combined 2 common words. Today, Big Data seems everywhere, ushering in what advocates consider the biggest changes since Euclid. Want to get elected to public office? Put a bunch of computer geeks in a room and have them comb through databases to glean who might vote for you – then target them with micro-tailored messages, as President Obama famously did in 2012. Want to solve poverty in Africa? Analyze text messages and social media networks to detect early signs of joblessness, epidemics, and other problems, as the United Nations is trying to do. Eager to find the right mate? Use algorithms to analyze an infinite number of personality traits to determine who's the best match for you, as many online dating sites now do. What exactly is Big Data? What makes it new? Different? What's the downside? Such questions have evoked intense interest, especially since June 5. On that day, former National Security Agency analyst Edward Snowden revealed that, like Ms. Mandelbaum or Rothman, the NSA had also asked a question: Can we find terrorists using Big Data – like the phone records of hundreds of millions of ordinary Americans? Could we get those records from, say, Verizon? Mr. Snowden's disclosures revealed that PRISM, the program the NSA devised, secretly monitors calls, Web searches, and e-mails, in the United States and other countries.
Short Message Service (SMS) is a text messaging service component of phone, web, or mobile communication systems, uses standardized communications protocols that allow the exchange of short text messages between fixed line or mobile phone devices. SMS is the most widely used data application in the world, with 3.5 billion active users, or 78% of all mobile phone subscribers. The term "SMS" is used for all types of short text messaging and the user activity itself in many parts of the world. SMS as used on modern handsets originated from radio telegraphy in radio memo pagers using standardized phone protocols. These were defined in 1985, as part of the Global System for Mobile Communications (GSM) series of standards as a means of sending messages of up to 160 characters to and from GSM mobile handsets. Though most SMS messages are mobile-to-mobile text messages, support for the service has expanded to include other mobile technologies, such as ANSI CDMA networks and Digital AMPS, as well as satellite and landline networks. Message size: Transmission of short messages between the SMSC and the handset is done whenever using the Mobile Application Part (MAP) of the SS7 protocol. Messages are sent with the MAP MO- and MT-ForwardSM operations, whose payload length is limited by the constraints of the signaling protocol to precisely 140 octets (140 octets = 140 * 8 bits = 1120 bits). Short messages can be encoded using a variety of alphabets: the default GSM 7-bit alphabet, the 8-bit data alphabet, and the 16-bit UCS-2 alphabet. Larger content (concatenated SMS, multipart or segmented SMS, or "long SMS") can be sent using multiple messages, in which case each message will start with a User Data Header (UDH) containing segmentation info. Text messaging, or texting, is the act of typing and sending a brief, electronic message between two or more mobile phones or fixed or portable devices over a phone network. The term originally referred to messages sent using the Short Message Service (SMS) only; it has grown to include messages containing image, video, and sound content (known as MMS messages). The sender of a text message is known as a texter, while the service itself has different colloquialisms depending on the region. It may simply be referred to as a text in North America, the United Kingdom, Australia and the Philippines, an SMS in most of mainland Europe, and a TMS or SMS in the Middle East, Africa and Asia. Text messages can be used to interact with automated systems to, for example, order products or services, or participate in contests. Advertisers and service providers use direct text marketing to message mobile phone users about promotions, payment due dates, etcetera instead of using mail, e-mail or voicemail. In a straight and concise definition for the purposes of this English Language article, text messaging by phones or mobile phones should include all 26 letters of the alphabet and 10 numerals, i.e., alpha-numeric messages, or text, to be sent by texter or received by the textee. Security concerns: Consumer SMS should not be used for confidential communication. The contents of common SMS messages are known to the network operator's systems and personnel. Therefore, consumer SMS is not an appropriate technology for secure communications. To address this issue, many companies use an SMS gateway provider based on SS7 connectivity to route the messages. The advantage of this international termination model is the ability to route data directly through SS7, which gives the provider visibility of the complete path of the SMS. This means SMS messages can be sent directly to and from recipients without having to go through the SMS-C of other mobile operators. This approach reduces the number of mobile operators that handle the message; however, it should not be considered as an end-to-end secure communication, as the content of the message is exposed to the SMS gateway provider. Failure rates without backward notification can be high between carriers (T-Mobile to Verizon is notorious in the US). International texting can be extremely unreliable depending on the country of origin, destination and respective carriers.
Twitteris an online social networking and microblogging service that enables its users to send and read text-based messages of up to 140 characters, known as "tweets". Twitter was created in March 2006 by Jack Dorsey and by July, the social networking site was launched. The service rapidly gained worldwide popularity, with over 500 million registered users as of 2012, generating over 340 million tweets daily and handling over 1.6 billion search queries per day. Since its launch, Twitter has become one of the ten most visited websites on the Internet, and has been described as "the SMS of the Internet. Unregistered users can read tweets, while registered users can post tweets thru the website, SMS, or a range of apps for mobiles. Twitter Inc. is in San Francisco, with servers and offices in NYC, Boston, San Antonio. Tweets are publicly visible by default, but senders can restrict message delivery to just their followers. Users can tweet via the Twitter website, compatible external apps (such as for smartphones), or by Short Message Service (SMS) available in certain countries. While the service is free, accessing it thru SMS has phone fees. Users may subscribe to other users' tweets – this is known as following and subscribers are known as followers or tweeps, a portmanteau of Twitter and peeps. Users can also check people who are un-subscribing them on Twitter (unfollowing). Also, users have the capability to block those who have followed them. Twitter allows users to update their profile via their mobile phone either by text messaging or by apps released for certain smartphones and tablets. Twitter has been compared to a web-based Internet Relay Chat (IRC) client. In a 2009 Time essay, described the basic mechanics of Twitter as "remarkably simple": As a social network, Twitter revolves around the principle of followers. When you choose to follow another Twitter user, that user's tweets appear in reverse chronological order on your main Twitter page. If you follow 20 people, you'll see a mix of tweets scrolling down the page: breakfast-cereal updates, interesting new links, music recommendations, even musings on the future of education. Pear Analytic analyzed 2,000 tweets (originating from US in English) over a 2-week period in 8/09 from 11:00 am to 5:00 pm (CST) and separated them into six categories: Pointless babble – 40% Conversational – 38% Pass-along value – 9% Self-promotion – 6% Spam – 4% News – 4% Social networking researcher Danah Boyd argues what Pear researchers labeled "pointless babble" is better characterized as "social grooming" and/or "peripheral awareness" (which she explains as persons "want[ing] to know what the people around them are thinking and doing and feeling, even when co-presence isn’t viable"). Format: Users can group posts by topic/type with hashtags – words or phrases prefixed with a "#" sign. Similarly, "@" sign followed by a username is used for mention/reply to other users. To repost a message from another Twitter user, and share it with one's own followers, retweet function, symbolized by "RT" in the message. In late 2009, the "Twitter Lists" feature was added, making it possible for users to follow (as well as mention and reply to) ad hoc lists of authors instead of individual authors Through SMS, users can communicate with Twitter thru 5 gateway numbers: short codes for US, Canada, India, New Zealand, Isle of Man-based number for international use. There is also a short code in the UK only accessible to those on the Vodafone, O2 and Orange networks. In India, since Twitter only supports tweets from Bharti Airtel an alternative platform called smsTweet was set up by a user to work on all networks - GladlyCast exists for mobile phone users in Singapore, Malaysia, Philippines. The tweets were set to a 140-character limit for compatibility with SMS messaging, introducing the shorthand notation and slang commonly used in SMS messages. The 140-character limit has also increased the usage of URL shortening services such as bit.ly, goo.gl, and tr.im, and content-hosting services, such as Twitpic, memozu.com and NotePub to accommodate multimedia content and text longer than 140 characters. Since June 2011, Twitter has used its own t.co domain for automatic shortening of all URLs posted on its website. Trending topics: A word, phrase or topic that is tagged at a greater rate than other tags is said to be a trending topic. Trending topics become popular either thru a concerted effort by users, or by an event prompts people to talk about one specific topic These topics help Twitter and their users understand what's happening in the world. Trending topics are sometimes the result of concerted efforts by fans of certain celebrities or cultural phenomena, particularly musicians like Lady Gaga (known as Little Monsters), Justin Bieber (Beliebers), and One Direction (Directioners), and fans of the Twilight (Twihards) and Harry Potter (Potterheads) novels. Twitter has altered the trend algorithm in the past to prevent manipulation of this type. Twitter's March 30, 2010 blog post announced that the hottest Twitter trending topics would scroll across the Twitter homepage. Controversies abound on Twitter trending topics: Twitter has censored hashtags other users found offensive. Twitter censored the #Thatsafrican and the #thingsdarkiessay hashtags after users complained they found the hashtags offensive. There are allegations that twitter removed #NaMOinHyd from trending list and added Indian National Congress sponsored hashtag.
Adding and following content There are numerous tools for adding content, monitoring content and conversations including Telly (video sharing, old name is Twitvid), TweetDeck, Salesforce.com, HootSuite, and Twitterfeed. As of 2009, fewer than half of tweets were posted using the web user interface with most users using third-party applications (based on analysis of 500 million tweets by Sysomos). Verified accounts In June 2008, Twitter launched a verification program, allowing celebrities to get their accounts verified.[97] Originally intended to help users verify which celebrity accounts were created by the celebrities themselves (and therefore are not fake), they have since been used to verify accounts of businesses and accounts for public figures who may not actually tweet but still wish to maintain control over the account that bears their name. Mobile Twitter has mobile apps for iPhone, iPad, Android, Windows Phone, BlackBerry, and Nokia There is also version of the website for mobile devices, SMS and MMS service. Twitter limits the use of third party applications utilizing the service by implementing a 100,000 user limit. Authentication As of August 31, 2010, third-party Twitter applications are required to use OAuth, an authentication method that does not require users to enter their password into the authenticating application. Previously, the OAuth authentication method was optional, it is now compulsory and the user-name/password authentication method has been made redundant and is no longer functional. Twitter stated that the move to OAuth will mean "increased security and a better experience". Related Headlines On August 19, 2013, Twitter announced Twitter Related Headlines. Usage Rankings Twitter is ranked as one of the ten-most-visited websites worldwide by Alexa'sweb traffic analysis. Daily user estimates vary as the company does not publish statistics on active accounts. A February 2009 Compete.com blog entry ranked Twitter as the third most used social network based on their count of 6 million unique monthly visitors and 55 million monthly visits. In March 2009, a Nielsen.com blog ranked Twitter as the fastest-growing website in the Member Communities category for February 2009. Twitter had annual growth of 1,382 percent, increasing from 475,000 unique visitors in February 2008 to 7 million in February 2009. In 2009, Twitter had a monthly user retention rate of forty percent. Demographics Twitter.com Top5 Global Markets by Reach (%) CountryPercent IndonesiaJun 2010 20.8%, Dec 2010 19.0% BrazilJun 2010 20.5%, Dec 2010 21.8% VenezuelaJun 2010 19.0%, Dec 2010 21.1% NetherlandsJun 2010 17.7%, Dec 2010 22.3% JapanJun 2010 16.8%, Dec 2010 20.0% Note: Visitor age 15+, home and work locations. Excludes visitation from public computers such as Internet cafes or access from mobile phones or PDAs. In 2009, Twitter was mainly used by older adults who might not have used other social sites before Twitter, said Jeremiah Owyang, an industry analyst studying social media. "Adults are just catching up to what teens have been doing for years," he said. According to comScore only eleven percent of Twitter's users are aged twelve to seventeen. comScore attributed this to Twitter's "early adopter period" when the social network first gained popularity in business settings and news outlets attracting primarily older users. However, comScore also stated in 2009 that Twitter had begun to "filter more into the mainstream", and "along with it came a culture of celebrity as Shaq, Britney Spears and Ashton Kutcher joined the ranks of the Twitterati." According to a study by Sysomos in June 2009, women make up a slightly larger Twitter demographic than men — fifty-three percent over forty-seven percent. It also stated that five percent of users accounted for seventy-five percent of all activity, and that New York City has more Twitter users than other cities. According to Quancast, twenty-seven million people in the US used Twitter as of September 3, 2009. Sixty-three percent of Twitter users are under thirty-five years old; sixty percent of Twitter users are Caucasian, but a higher than average (compared to other Internet properties) are African American/black (sixteen percent) and Hispanic (eleven percent); fifty-eight percent of Twitter users have a total household income of at least US$60,000. The prevalence of African American Twitter usage and in many popular hashtags has been the subject of research studies. On September 7, 2011, Twitter announced that it has 100 million active users logging in at least once a month and 50 million active users every day. In an article published on January 6, 2012, Twitter was confirmed to be the biggest social media network in Japan, with Facebook following closely in second. comScore confirmed this, stating that Japan is the only country in the world where Twitter leads Facebook. Finances Funding Twitter's San Francisco headquarters located at 1355 Market St. Twitter raised over US$57 million from venture capitalist growth funding, although exact numbers are not publicly disclosed. Twitter's first A round of funding was for an undisclosed amount that is rumored to have been between US$1 million and US$5 million. Its second B round of funding in 2008 was for US$22 million and its third C round of funding in 2009 was for US$35 million from Institutional Venture Partners and Benchmark Capital along with an undisclosed amount from other investors including Union Square Ventures, Spark Capital and Insight Venture Partners. Twitter is backed by Union Square Ventures, Digital Garage, Spark Capital, and Bezos Expeditions. In May 2008, The Industry Standard remarked that Twitter's long-term viability is limited by a lack of revenue. Twitter board member Todd Chaffee forecast that the company could profit from e-commerce, noting that users may want to buy items directly from Twitter since it already provides product recommendations and promotions. The company raised US$200 million in new venture capital in December 2010, at a valuation of approximately US$3.7 billion. In March 2011, 35,000 Twitter shares sold for US$34.50 each on Sharespost, an implied valuation of US$7.8 billion. In August, 2010 Twitter announced a "significant" investment lead by Digital Sky Tech that, at US$800M, was reported to be the largest venture round in history. Twitter has been identified as a possible candidate for an initial public offering by 2013. In December 2011, the Saudi prince Alwaleed bin Talal invested $300 million in Twitter. The company was valued at $8.4 billion at the time.
Revenue sources In July 2009, some of Twitter's revenue and user growth documents were published on TechCrunch after being illegally obtained by Hacker Croll. The documents projected 2009 revenues of US$400,000 in the third quarter and US$4 million in the fourth quarter along with 25 million users by the end of the year. The projections for the end of 2013 were US$1.54 billion in revenue, US$111 million in net earnings, and 1 billion users. No information about how Twitter planned to achieve those numbers was published. In response, Twitter co-founder Biz Stone published a blog post suggesting the possibility of legal action against hacker. On April 13, 2010, Twitter announced plans to offer paid advertising for companies that would be able to purchase "promoted tweets" to appear in selective search results on the Twitter website, similar to Google Adwords' advertising model. As of April 13, Twitter announced it had already signed up a number of companies wishing to advertise including Sony Pictures, Red Bull, Best Buy, and Starbucks. To continue their advertising campaign, Twitter announced on March 20, 2012, that it would be bringing its promoted tweets to mobile devices. Twitter generated US$139.5 million in advertising sales during 2011 and expects this number to grow 86.3% to US$259.9 million in 2012. The company generated US$45 million in annual revenue in 2010, after beginning sales midway through that year. The company operated at a loss through most of 2010. Revenues were forecast for US$100 million to US$110 million in 2011. Users' photos can generate royalty-free revenue for Twitter, with an agreement with WENN being announced in May 2011. In June 2011, Twitter announced that it would offer small businesses a self serve advertising system. In April 2013, Twitter announced that its Twitter Ads self-service ads platform was available to all US users without an invite. Technology Implementation Great reliance is placed on open-source software. The Twitter Web interface uses the Ruby on Rails framework, deployed on a performance enhanced Ruby Enterprise Edition implementation of Ruby. As of April 6, 2011, Twitter engineers confirmed they had switched away from their Ruby on Rails search-stack, to a Java server they call Blender. From spring 2007 to 2008 the messages were handled by a Ruby persistent queue server called Starling, but since 2009 implementation has been gradually replaced with software written in Scala. The service's application programming interface (API) allows other web services and applications to integrate with Twitter. Individual tweets are registered under unique IDs using software called snowflake and geolocation data is added using 'Rockdove'. The URL shortner t.co then checks for a spam link and shortens the URL. The tweets are stored in a MySQL database using Gizzard and acknowledged to users as having been sent. They are then sent to search engines via the Firehose API. The process itself is managed by FlockDB and takes an average of 350 ms. On August 16, 2013, Twitter’s Vice President of Platform Engineering Raffi Krikorian shared in a blog post that the company's infrastructure handled almost 143,000 tweets per second during that week, setting a new record. Krikorian explained that Twitter achieved this record by blending its homegrown and open source. Interface On April 30, 2009, Twitter adjusted its web interface, adding a search bar and a sidebar of "trending topics" — the most common phrases appearing in messages. Biz Stone explains that all messages are instantly indexed and that "with this newly launched feature, Twitter has become something unexpectedly important – a discovery engine for finding out what is happening right now." In March 2012, Twitter became available in Arabic, Farsi, Hebrew and Urdu, the first right-to-left language versions of the site. About 13,000 volunteers helped with translating the menu options. it is available in 33 different languages. Outages When Twitter experiences an outage, users see the "fail whale" error message image created by Yiying Lu, illustrating eight orange birds using a net to hoist a whale from the ocean captioned "Too many tweets! Please wait a moment and try again." Twitter had approximately ninety-eight percent uptime in 2007 (or about six full days of downtime). The downtime was particularly noticeable during events popular with the technology industry such as 2008 Macworld Conf & Expo keynote. Privacy and security Twitter messages are public but users can also send private messages. Twitter collects personally identifiable information about its users and shares it with third parties. The service reserves the right to sell this information as an asset if the company changes hands. While Twitter displays no advertising, advertisers can target users based on their history of tweets and may quote tweets in ads directed specifically to the user. A security vulnerability was reported on April 7, 2007, by Nitesh Dhanjani and Rujith. Since Twitter used the phone number of the sender of an SMS message as authentication, malicious users could update someone else's status page by using SMS spoofing. The vulnerability could be used if the spoofer knew the phone number registered to their victim's account. Within a few weeks of this discovery Twitter introduced an optional personal identification number (PIN) that its users could use to authenticate their SMS-originating messages. On January 5, 2009, 33 high-profile Twitter accounts were compromised after a Twitter administrator's password was guessed by a dictionary attack. Falsified tweets — including sexually explicit and drug-related messages — were sent from these accounts. Twitter launched the beta version of their "Verified Accounts" service on June 11, 2009, allowing famous or notable people to announce their Twitter account name. The home pages of these accounts display a badge indicating their status. In May 2010, a bug was discovered by İnci Sözlük, involving users that allowed Twitter users to force others to follow them without the other users' consent or knowledge. For example, comedian Conan O'Brien's account, which had been set to follow only one person, was changed to receive nearly 200 malicious subscriptions. In response to Twitter's security breaches, the US Federal Trade Commission brought charges against the service which were settled on June 24, 2010. This was the first time the FTC had taken action against a social network for security lapses. The settlement requires Twitter to take a number of steps to secure users' private information, including maintenance of a "comprehensive information security program" to be independently audited biannually. On 12/14/10, USDoJ issued a subpoena directing Twitter to provide information for accounts registered to or associated with WikiLeaks. Twitter decided to notify its users and said "...it's our policy to notify users about law enforcement and governmental requests for their information, unless we are prevented by law from doing so"....
Open source Twitter has a history of both using and releasing open source software while overcoming technical challenges of their service. A page in their developer documentation thanks dozens of open source projects which they have used, from revision control software like Git to programming languages such as Ruby and Scala. Software released as open source by the company includes the Gizzard Scala framework for creating distributed datastores, the distributed graph database FlockDB, the Finagle library for building asynchronous RPC servers and clients, the TwUI user interface framework for iOS, and the Bower client-side package manager. The popular Twitter Bootstrap web design library was also started at Twitter and is the most popular repository on GitHub. Innovators patent agreement On April 17, 2012, Twitter would implement an “Innovators Patent Agreement” which obligate Twitter to only use its patents for defense. URL shortener t.co is a URL shortening service created by Twitter. It is only available for links posted to Twitter and not available for general use. All links posted to Twitter use a t.co wrapper. Twitter hopes that the service will be able to protect users from malicious sites, and will use it to track clicks on links within tweets. Having previously used the services of third parties TinyURL and bit.ly. Twitter began experimenting with its own URL shortening service for private messages in March 2010 using the twt.tl domain, before it purchased the t.co domain. The service was tested on the main site using the accounts @TwitterAPI, @rsarver and @raffi. On Sept 2, 2010, an email from Twitter to users said they would be expanding the roll-out of the service to users. On June 7, 2011, Twitter was rolling out the feature. Integrated photo-sharing service On June 1, 2011, Twitter announced its own integrated photo-sharing service that enables users to upload a photo and attach it to a Tweet right from Twitter.com. Users now also have the ability to add pictures to Twitter's search by adding hashtags to the tweet. Twitter also plans to provide photo galleries designed to gather and syndicate all photos that a user has uploaded on Twitter and third-party services such as TwitPic. Use and social impact Dorsey said after a Twitter Town Hall with Barack Obama held in July 2011, that Twitter received over 110,000 #AskObama tweets. Main article: Twitter usage: Twitter has been used for a variety of purposes in many industries and scenarios. For example, it has been used to organize protests, sometimes referred to as "Twitter Revolutions", which include the Egyptian revolution, 2010–2011 Tunisian protests, 2009–2010 Iranian election protests, and 2009 Moldova civil unrest. The governments of Iran and Egypt blocked the service in retaliation. The Hill on February 28, 2011 described Twitter and other social media as a "strategic weapon ... which have the apparent ability to re-align the social order in real time, with little or no advanced warning." During the Arab Spring in early 2011, the number of hashtags mentioning the uprisings in Tunisia and Egypt increased. A study by the Dubai School of Government found that only 0.26% of the Egyptian population, 0.1% of the Tunisian population and 0.04% of the Syrian population are active on Twitter. The service is also used as a form of civil disobedience: in 2010, users expressed outrage over the Twitter Joke Trial by making obvious jokes about terrorism; and in the British privacy injunction debate in the same country a year later, where several celebrities who had taken out anonymised injunctions, most notably the Manchester United player Ryan Giggs, were identified by thousands of users in protest to traditional journalism being censored. Another, more real time and practical use for Twitter exists as an effective de facto emergency communication system for breaking news. It was neither intended nor designed for high performance communication, but the idea that it could be used for emergency communication certainly was not lost on the originators, who knew that the service could have wide-reaching effects early on when the San Francisco, California company used it to communicate during earthquakes. The Boston Police tweeted news of the arrest of the 2013 Boston Marathon Bombing suspect. A practical use being studied is Twitter's ability to track epidemics, how they spread. Twitter has been adopted as a communication and learning tool in educational settings mostly in colleges and universities. It has been used as a backchannel to promote student interactions, especially in large-lecture courses. Research has found that using Twitter in college courses helps students communicate with each other and faculty, promotes informal learning, allows shy students a forum for increased participation, increases student engagement, and improves overall course grades. In May 2008, The Wall Street Journal wrote that social networking services such as Twitter "elicit mixed feelings in the technology-savvy people who have been their early adopters. Fans say they are a good way to keep in touch with busy friends. But some users are starting to feel 'too' connected, as they grapple with check-in messages at odd hours, higher cellphone bills and the need to tell acquaintances to stop announcing what they're having for dinner." Television, rating Twitter is also increasingly used for making TV more interactive and social. This effect is sometimes referred to as the "virtual watercooler" or social television — the practice has been called "chatterboxing". Statistics Most popular accounts As of July 28, 2013, the ten accounts with the most followers belonged to the following individuals and organizations:[262] Justin Bieber (42.2 mil followers worldwide) Katy Perry (39.9m) Lady Gaga (39.2m) Barack Obama (34.5) - most followed account for politician Taylor Swift (31.4m) Rihanna (30.7m) YouTube (31m) - highest account not representing an individual Britney Spears (29.7m) Instagram (23.6m) Justin Timberlake (23.3m) Other selected accounts: 12. Twitter (21.6m) 16. Cristiano Ronaldo (20m) - highest account athlete 58. FC Barcelona (9.5m) - highest account representing a sports team Oldest accounts 14 accounts belonging to Twitter employees at the time and including @jack (Jack Dorsey), @biz (Biz Stone) and @noah (Noah Glass). Record tweets On February 3, 2013, Twitter announced that a record 24.1 million tweets were sent the night of Super Bowl XLVII. Future Twitter emphasized its news and information-network strategy in November 2009 by changing the question asked to users for status updates from "What are you doing?" to "What's happening?" On November 22, 2010, Biz Stone, a cofounder of the company, expressed for the first time the idea of a Twitter news network, a concept of a wire-like news service he has been working on for years.
The dark side of Big Data involves much more than Snowden's disclosure, or what the US does. And what made Big Data possible did not happen overnight. The term has been around for at least 15 years, though it's only recently become popular. "It will be quite transformational," says Thomas Davenport, an information technology expert at Babson College in Wellesley, Mass., who co-wrote the widely used book "Competing on Analytics: The New Science of Winning." Going back to the beginning. Big Data starts with ... a lot of data. Google executive chairman Eric Schmidt has said that we now uncover as much data in 48 hours – 1.8 zettabytes (that's 1,800,000,000,000,000,000,000 bytes) – as humans gathered from "the dawn of civilization to the year 2003." You read that right. The head of a company receiving 50 billion search requests a day believes people now gather in a few days more data than humans have done throughout almost all of history. Mr. Schmidt's claim has doubters. But similar assertions crop up from people not prone to exaggeration, such as Massachusetts Institute of Technology researcher Andrew McAfee and MIT professor Erik Brynjolfsson, authors of the new book "Race Against the Machine." "More data crosses the Internet every second," they write, "than were stored in the entire Internet 20 years ago." A key driver of the growth of data is the way we've digitized many of our everyday activities, such as shopping (increasingly done online) or downloading music. Another factor: our dependence on electronic devices, all of which leave digital footprints every time we send an e-mail, search online, post a message, text, or tweet. Virtually every institution in society, from government to the local utility, is churning out its own torrent of electronic digits – about our billing records, our employment, our electricity use. Add in the huge array of sensors that now exist, measuring everything from traffic flow to the spoilage of fruit during shipment, and the world is awash in information that we had no way to uncover before – all aggregated and analyzed by increasingly powerful computers. Most of this data doesn't affect us. Amassing information alone doesn't mean it's valuable. Yet the new ability to mine the right information, discover patterns and relationships, already affects our everyday lives. Anyone, for instance, who has a navigation screen on a car dashboard uses data streaming from 24 satellites 11,000 miles above Earth to pinpoint his or her exact location. People living in Los Angeles and dozens of other cities now participate, knowingly or not, in the growing phenomenon of "predictive policing" – authorities' use of algorithms to identify crime trends. Tennis fans use IBM SlamTracker, an online analytic tool, to find out exactly how many return of serves Andy Murray needed to win Wimbledon. When we use sites like SlamTracker, companies take note of our browsing habits and, through either the miracle or the meddling of Big Data, use that information to send us personal pitches. That's what happens when AOL greets you with a pop-up ad (Slazenger tennis balls – 70 percent off!). In their book, "Big Data: A Revolution That Will Transform How We Live, Work, and Think," Kenneth Cukier and Viktor Mayer-Schönberger mention Wal-Mart's discovery, gleaned by mining sales data, that people preparing for a hurricane bought lots of Pop-Tarts. Now, when a storm is on the way, Wal-Mart puts Pop-Tarts on the shelves next to the flashlights. But what excites and concerns people about Big Data is more far-reaching than that. Seeing the bigger picture: taking a closer look at some of the people in the digital trenches. I follow Mandelbaum and Mr. Youkel down a corridor of the Library of Congress, past exhibits redolent of history and what you might expect from what we call "America's library," with its 38 million books on 838 miles of shelving. They open a door. We pass behind people staring at huge computer screens and enter a room that doesn't look as if it belongs in a library at all. It's the size of a gym, with fluorescent lights overhead and tall metal boxes rising from the floor. "The tweets come here," Mandelbaum says. It's been three years since Twitter approached the library with a question. What the online networking service started in 2006 had become a new way of communicating. Would there, Twitter asked, be historical value in archiving tweets? "We saw the value right away," says Robert Dizard, deputy director of the library. "[Our] mission is, preserve the record of America."
Arnold Lund is looking ahead. Lund has a Ph.D. in experimental psychology. He holds 20 patents, has written a book on managing technology design, and directs a variety of projects for General Electric. Last year, a tree fell on power lines behind my house. As the local utility repaired things, an electrical surge crashed my computer, destroying all the contents. Lund's power line project has my attention. "For power companies, one of the largest expenses is managing foliage," he says. "We lay out the entire geography of a state – and the overlay of the power grid. We use satellite data to look at tree growth and cut back where there's most growth. Then [we] predict where the most likely [problem] is. We have 50 different variabilities to see the probability of outage." In that one compressed paragraph, I see three big changes Mr. Cukier and Mr. Mayer-Shönberger say Big Data brings to research. It's what we might call the three "nots." Size, not sample. For more than a century, statisticians have relied on small samples of data from which to generalize. They had to. They lacked the ability to collect more. The new technology means we can "collect a lot of data rather than settle for ... samples." Messy, not meticulous. Traditionally, researchers have insisted on "clean, curated data. When there was not that much data around, researchers [had to be as] exact as possible." Now, that's no longer necessary. "Accept messiness," they write, arguing that the benefits of more data outweigh our "obsession with precision." Correlation, not cause. While knowing the causes behind things is desirable, we don't always need to understand how the world works "to get things done," they note. Lund's lab exemplifies all three. First, his "entire geography" and 50 variables involve massive sets of data – information streaming in from sensors, satellites, and other sources about everything from forest density to prevailing wind direction to grid loads. Second, he looks for "probability" not "obsessive precision." Correlation? Lund values cause, but the reason behind, say, tree growth interests him less than spotting correlations that might spur action. "Ah – that tree," he exclaims, as if he is an engineer in the field. "Better get the trucks out ahead of the storm!" Cukier and Mayer-Schönberger cite the United Parcel Service to bolster their argument about correlation. UPS equips its trucks with sensors that identify vibrations and other things associated with breakdowns. "The data do not tell UPS why the part is in trouble. They reveal enough for the company to know what to do." Lund's boss, GE chief executive officer Jeff Immelt, also talks about sensor data. The company is now investing $1 billion in software and analytics, which includes putting sensors on its jet engines to help enhance fuel efficiency. Mr. Immelt has said that just a 1 percent change in "fuel burn" can be worth hundreds of millions of dollars to an airline. "You save an oil guy 1 percent," Immelt said at a conference this spring, "you're his friend for life." While Lund has talked glowingly about how much data his projects can collect, he wants to make sure I know data isn't everything. "As a scientist," he says, "I know the biggest challenge is finding the right questions. How do you find the questions important to business, society, and culture?" Rothman has questions, too. "We work in emergency rooms," he says about himself and Dr. Dugas. "We're the boots on the ground." Rothman's work has involved emergency medicine and the nexus between public health and epidemics, including influenza, which kills as many as 500,000 people a year around the world and about 45,000 in the US. The two researchers wanted to find out if the Google national study held lessons for Baltimore and their emergency room (ER). They studied Google queries for the Baltimore area – queries about flu symptoms, or chest congestion, or where to buy a thermometer. If they could spot spikes, that might help solve one crucial problem. "Crowding," Dugas says. "Huge issue."
When epidemics start, people rush to hospitals. Waiting rooms fill up. If Google trends showed a spike just as epidemics started, ERs could staff up and reserve more space for the surge of patients. The link between Google spikes and hospital visits in Baltimore turned out to be strong, especially for children. As soon as the first news reports surfaced about the 2009 H1N1 virus, pediatric ER visits at Hopkins increased – at the peak by as much as 75 percent. But when the two researchers looked closer, they found something unexpected. No flu. It turned out that news reports about H1N1 elsewhere fueled a rush to ERs in Baltimore – what one researcher called "fear week." "If you just looked at correlation for flu, you'd say it was a false trend," says Dugas. Even so, she and Rothman found data important for ERs: No matter why people are coming, they need to staff up. The Baltimore study also showed the importance of finding out what was behind all those medically related Google searches – in other words, not just correlation but cause. Like GE's Lund, Rothman emphasizes the value of "the questions you're asking." Evidence that Big Data promises enormous benefits is more than anecdotal. MIT's Mr. Brynjolfsson did a study in 2012 examining 179 companies. He found those whose decisions were "data-driven" had become 5 to 6 percent more productive in ways only the use of data could explain. On the other hand, consider just this one data point: If you type "Big Data Dark Side" into Google, you'll get 40 million results. Despite the potential, there's also peril. The dark side of Big Data concerns Laura DeNardis, Internet scholar, author of three books, and professor at American University's school of communication in Washington. She and others worry – not exclusively – about three questions. Does the new technology (1) erode privacy, (2) promote inequality, and (3) turn government into Big Brother? She points to public health data as one potential source of abuse. Her concern echoes that of critics who fear that supposedly anonymous patient records are not anonymous at all. As far back as the 1990s, a Massachusetts state commission gave researchers health data about state workers, believing this would help officials make better health-care decisions. William Weld, then governor of Massachusetts, assured workers their files had been scrubbed of the data that could identify them. Harvard University computer science graduate student took this promise of privacy as a challenge. Using just three bits of data, Latanya Sweeney showed how to identify everyone – including Weld, whose diagnoses, medications, and entire medical history Ms. Sweeney, now a professor at Harvard, gleefully sent to his office. Today there are powerful ways to identify people from records supposed to keep things private. And there are concerns other than our health records. Dr. DeNardis worries about how much companies know about our social media habits. "Take a look at the published privacy policies of Apple, Facebook, or Google," she says. "They know what you view, when you make a call, where you are. People consent to that by 'I agree' to privacy terms. But how carefully are they read?" She's not alone. Jay Stanley of the American Civil Liberties Union describes one example of what companies can do with what they know about us: "credit-scoring." "Credit card companies," he wrote in a blog, "sometimes lower a customer's credit card limit based on the repayment history of other customers at stores where a person shops." Do we want Master Card to lower our credit-card limits, thinking we're a risk, just because people who frequent the stores we do don't pay their bills? In addition to individual privacy, critics worry about Big Data's impact in more expansive ways, such as the growing gap between rich and poor nations. Large American companies can hire hundreds of data analysts. How can Bangladesh compete? Will this aggravate the global digital divide? Perhaps most worrisome to people at the moment is the government's use of Big Data to monitor its own citizens, or others, in the name of national security. "The American people," President Obama after the NSA story broke, "don't have a Big Brother who is snooping into their business."
Did Obama mean George Orwell's term doesn't include governments secretly monitoring calls, e-mails, audio, and video of citizens suspected of nothing? Commandeering information from firms like Yahoo and Google? The questions that arose from Snowden's revelations in June encompass issues of privacy, confidentiality, freedom, and, of course, security. The Obama administration argues that monitoring personal info keeps the country safe, asserting that PRISM has helped foil 54 separate terrorist plots against the US. Some lawmakers on Capitol Hill dispute that number, though, and in recent weeks momentum has been building in Washington to rein in the NSA. Not only has support increased on the left and right to adopt more oversight of its surveillance program, polls show a hardening of public opinion about snooping, too. Meanwhile, there is no doubt about the fury in other countries when the news broke – especially in Germany, where critics have compared American monitoring of foreigners' phone calls and e-mails with that of Stasi, the former hated East German secret police. In fact, some of those most upset about the NSA revelations include Americans alarmed about what the new technology means outside US borders. Suzanne Nossel, head of the PEN American Center, which works to free writers and artists around the world imprisoned for free speech, worries about the government use of data from private companies to stifle dissent. "It's not new," she says, citing the Chinese dissident Shi Tao, imprisoned by China in 2004 for posting political commentary on foreign websites, and still locked up. "Yahoo China had assisted the Chinese government. They used [Yahoo data] to convict him." But then Ms. Nossel talks about the recent unrest in Turkey, where the Turkish military shot and arrested dozens of protesters in Istanbul's Taksim Square. To find more of what they called "looters," the Turkish government went to Twitter and Facebook for help – and announced that Facebook was "responding positively," something Facebook has denied. And Nossel sees a difference between 2004 and now. Talking about the most repressive governments in the world, she argues that "the government ability to sweep and search is [now] so great, it tips the scale. No technology on the side of human rights advocates can confront it. That's new – and chilling." What have we learned? There's a notable "Sesame Street" episode from years back in which Cookie Monster wanders into a library and drives the librarian crazy by asking over and over for a cookie. "This is a LIBRARY!" the librarian finally screams, forgetting to whisper. "We have books! Just books!" That's certainly been our image of what libraries do. "You can still find books here," Mandelbaum reminds me, standing in a room full of processors. But figures over the past decade seem to show that books – those rectangular things with pages we turn – are slowly on the way out in the Digital Age. That's less significant than it might seem, though. After all, we value books because of the knowledge they hold. We've changed the way we convey knowledge many times. Big Data is another source of knowledge. Will it become a more integral part of tomorrow's libraries? It is perhaps fitting that one of the "Sesame Street" characters most in tune with the future is ... the Count. He counts everything. His role is to teach kids the importance of counting. Big Data allows us to count everything – and analyze what we find. But are numbers enough? Brynjolfsson and Mr. McAfee compare Big Data to Leeuwenhoek's development of the microscope in the 1670s. They are, after all, both tools. They let people see lots of things that have always been around. Of course, the microscope also prompted us to ask questions we could never ask before. Big Data does that, too. RECOMMENDED: How much do you know about cybersecurity? Take our quiz. Still, while Big Data can predict a flu outbreak or where trees fall, it can't, by itself, resolve the economic and moral dilemmas we have. Whether to keep power running, help patients faster, or preserve the record of America, Big Data teaches us what's out there, not what's right. There's nothing inherently wrong with Big Data. What matters, as it does for Arnold Lund in California or Richard Rothman in Baltimore, are the questions – old and new, good and bad – this newest tool lets us ask. • Robert A. Lehrman is a novelist and former White House chief speechwriter for Vice President Al Gore. Author of 'The Political Speechwriter's Companion,' he teaches at American University and co-runs a blog, PunditWire. Related stories
Hadoop/MapReduce • WPI, Mohamed Eltabakh • MapReduce computing paradigm (E.g., Hadoop) vs. TraditionalDBS • Many enterprises are turning to Hadoop • Especially applications generating big data • Web applications, social networks, scientific applications • Why Hadoop is able to compete? • Scalability (petabytes of data, thousands of machines) • Flexibility in accepting all data formats (no schema) • Efficient and simple fault-tolerant mechanism • Commodity inexpensive hardware • Database • Performance (tons of indexing, tuning, data org tech.) • Features: - Provenance tracking - Annotation management • Hadoop: swtwr for distr proc of large datasets across large clusters • Large datasets Terabytes or petabytes of data • Large clusters hundreds or thousands of nodes • Hadoop is open-source implementation for Google MapReduce • Hadoop is based on a simple programming model called MapReduce • Hadoop is based on a simple data model, any data will fit • Hadoop framework consists on two main layers Distributed file system (HDFS), Execution engine (MapReduce) Master node (single node) Many slave nodes Hadoop designed as master-slave shared-nothingarchi • DESIGN PRINCIPALS • Need to process big data • Need to parallelize computation across ~1000 nodes • Commodity hardware Many low-end cheap machines work in parallel to solve problem • This is in contrast to Parallel DBs • Small number of high-end expensive machines • Automatic parallelization & distribution Hidden from the end-user • Fault tolerance and automatic recovery Nodes/tasks fail and recover automatically • Clean and simple prog abstraction Users only provide 2 fctns map/reduce • WHO USES IT? • Google: Inventors of MapReduce compute paradigm • Yahoo: Develop Hadoop open-source MapReduce • IBM, Microsoft, Oracle • Facebook, Amazon, AOL, NetFlex • Many others + universities and research labs
Large: A HDFS instance may consist of thousands of server machines, each storing part of the file system’s data Replication: Each data block is replicated many times (default is 3) Failure: Failure is the norm rather than exception Fault Tolerance: Detection of faults and quick, automatic recovery is a goal of HDFS N amenode is consistently checking Datanodes Master node (single node) Blocks (64 MB) Many slave nodes Hadoop Architecture Distributed file system (HDFS) Execution engine (MapReduce) Hadoop Distributed File System (HDFS) Centralized namenode - Maintains metadata info about files 1 2 3 4 5 File F Many datanode (1000s) - Store the actual data - Files are divided into blocks - Each block is replicated N times (Default=3)
Map-Reduce Execution Engine (Example: Color Count) Input blocks on HDFS Node 3 Node 1 Node 2 Users only provide the “Map” and “Reduce” functions Produces (k, v) ( , 1) Shuffle & Sorting based on k Consumes(k, [v]) ( , [1,1,1,1,1,1..]) Produces(k’, v’) ( , 100) • Task Tracker is the slave node • (runs on each datanode) • Receives the task from Job Tracker • Runs the task until completion (either map or reduce task) • Always in communication with the Job Tracker reporting progress • In the top example, 1 map-reduce job consists of 4 map tasks and 3 reduce tasks • Properties of MapReduce Engine • Job Tracker is the master node (runs with the namenode) • Receives the user’s job • Decides on how many tasks will run (number of mappers) • Decides on where to run each mapper (concept of locality) • This file has 5 Blocks run 5 map tasks • Where to run the task reading block “1” • Try to run it on Node 1 or Node 3
KEY VALUE PAIRS • Mappers and Reducers are users’ code (provided functions) • Just need to obey the Key-Value pairs interface • Mappers: • Consume <key, value> pairs • Produce <key, value> pairs • Reducers: • Consume <key, <list of values>> • Produce <key, value> • Shuffling and Sorting: • Hidden phase between mappers and reducers • Groups all similar keys from all mappers, sorts and passes them to a certain reducer in the form of <key, <list of values>> Deciding on what will be the key and what will be the value developer’s resp Example 1: Word Count Job: Count the occurrences of each word in a data set
Example 2: Color Count Job: Count the number of each color in a data set Input blocks on HDFS Produces (k, v) ( , 1) Consumes(k, [v]) ( , [1,1,1,1,1,1..]) Shuffle & Sorting based on k Produces(k’, v’) ( , 100) Write to HDFS Part0001 Each map task will select only the blue or green colors No need for reduce phase Part0002 Write to HDFS Part0003 Write to HDFS That’s output file, it has 3 parts on probably 3 different machines Write to HDFS Example 3: Color Filter Job: Select only the blue and the green colors Part0001 Part0002 That’s the output file, it has 4 parts on probably 4 different machines Part0003 Part0004
Bigger Picture: Hadoop vs. Other Systems • Bigger Picture: Hadoop vs. Other Systems • Cloud Computing • A computing model where any computing infrastructure can run on the cloud • Hardware & Software are provided as remote services • Elastic: grows and shrinks based on the user’s demand • Example: Amazon EC2
HDFS (Hadoop Distributed File System) is a distr file sys for commodity hdwr. Differences from other distr file sys are few but significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS originally was infrastructure for Apache Nutch web search engine project, is part of Apache Hadoop Core http://hadoop.apache.org/core/ 2.1. Hardware Failure Hardware failure is the normal. An HDFS may consist of hundreds or thousands of server machines, each storing part of the file system’s data. There are many components and each component has a non-trivial prob of failure means that some component of HDFS is always non-functional. Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. 2.2. Streaming Data Access Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates. 2.3. Large Data Sets Apps on HDFS have large data sets, typically gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It provides high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It supports ~10 million files in a single instance. 2.4. Simple Coherency Model: HDFS apps need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A Map/Reduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in future [write once read many at file level] 2.5. “Moving Computation is Cheaper than Moving Data” A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the app is running. HDFS provides interfaces for applications to move themselves closer to where the data is located. 2.6. Portability Across Heterogeneous Hardware and Software Platforms: HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications. 3. NameNode and DataNodes: HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is 1 blocks stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode
The NameNode and DataNode are pieces of software designed to run on commodity machines, typically run GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case. The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode. 4. The File System Namespace: HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas or access permissions. HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features. The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This info is stored by NameNode. 5. Data Replication: HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode
5.1. Replica Placement: The First Baby Steps: The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies. Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks. The NameNode determines the rack id each DataNode belongs to via the process outlined in Rack Awareness: A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks. For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance. The current, default replica placement policy described here is a work in progress. 5.2. Replica Selection: To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If angg/ HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica. 5.3. Safemode: On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.
6. The Persistence of File System Metadata: The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode’s local file system too. The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is plenty to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new FsImage on disk. It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. In the current implementation, a checkpoint only occurs when the NameNode starts up. Work is in progress to support periodic checkpointing in the near future. The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory because the local file system might not be able to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files and sends this report to the NameNode: this is the Blockreport. 7. The Communication Protocols: All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the NameNode machine. It talks the ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients. 8. Robustness: The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures and network partitions. 8.1. Data Disk Failure, Heartbeats and Re-Replication: Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased. 8.2. Cluster Rebalancing: HDFS arch is compatible with data rebalancing . A scheme might automatically move data from 1 DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented. 8.3. Data Integrity: It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.
8.4. Metadata Disk Failure: The FsImage and EditLog are central data structures. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use. The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported. 8.5. Snapshots: Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots but will in a future release. 9. Data Organization 9.1. Data Blocks: HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode. 9.2. Staging: A client request to create a file does not reach the NameNode immediately. In fact, initially the HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it. The NameNode responds to the client request with the identity of the DataNode and the destination data block. Then the client flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode. The client then tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost. The above approach has been adopted after careful consideration of target applications that run on HDFS. These applications need streaming writes to files. If a client writes to a remote file directly without any client side buffering, the network speed and the congestion in the network impacts throughput considerably. This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have used client side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads. 9.3. Replication Pipelining: When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.
10. Accessibility: HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can also be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol. 10.1. FS Shell: HDFS allows user data to be organized in the form of files and directories. It provides a commandline interface called FS shell that lets a user interact with the data in HDFS. The /foodirbin/hadoop dfs -mkdir /foodir View the contents of a file named /foodir/myfile.txt bin/hadoop dfs -cat /foodir/myfile.txt FS shell is targeted for applications that need a scripting language to interact with the stored data. 10.2. DFSAdmin: The DFSAdmin command set is used for admin an HDFS cluster. These are commands used only by HDFS administrator. Here are some sample action/command pairs: Action Command Put the cluster in Safemode bin/hadoop dfsadmin -safemode enter Generate a list of DataNodes bin/hadoop dfsadmin -report Decommission DataNode datanodename bin/hadoop dfsadmin -decommission datanodename 10.3. Browser Interface: A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser. 11. Space Reclamation 11.1. File Deletes and Undeletes: When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS. user wants to undelete a file that he/she has deleted, he/she can navigate the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just like any other directory with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from /trash that are more than 6 hours old. In future, this policy will be configurable thru a well defined interface. 11.2. Decrease Replication Factor: When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks and the corresponding free space appears in the cluster. Once again, there might be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster. 12. References HDFS Java API: http://hadoop.apache.org/core/docs/current/api/ HDFS source code: http://hadoop.apache.org/core/version_control.html
Simon Funk: Netflix provided a database of 100M ratings (1 to 5) of 17K movies by 500K users. as a triplet of numbers: (User,Movie,Rating). The challenge: For (User,Movie,?) not in the database, predict how the given User would rate the given Movie. Think of the data as a big sparsely filled matrix, with userIDs across the top and movieIDs down the side (or vice versa then transpose everything), and each cell contains an observed rating (1-5) for that movie (row) by that user (column), or is blank meaning you don't know. This matrix would have 8.5B entries, but you are only given values for 1/85th of those 8.5B cells (or 100M of them). The rest are all blank. Netflix posed a "quiz" of a bunch of question marks plopped into previously blank slots, and your job is to fill in best-guess ratings in their place. Squared error (se) measures accuracy (You guess=1.5, actual=2, you get docked (2-1.5)2=.25. They use root mean squared error (rmse) but if we minimize mse, we minimize rmse. There is a date for ratings and question marks (so a cell can potentially have >=1 rating in it. Any movie can be described in terms of some features (or aspects) such as quality, action, comedy, stars (e.g., Pitt), producer, etc. A user's preferences can be described in terms how they rate the same features (quality/action/comedy/star/producer/etc.). Then ratings ought to be explainable by a lot less than 8.5 billion numbers (e.g., a single number specifying how much action a particular movie has may help explain why a few million action-buffs like that movie.). SVD:Assume 40 features. A movie, m, is described by mF[40] = how much that movie exemplifies each aspect. A user, u, is described by uF[40] = how much he likes each aspect. Pu,m=uFomFerru,m=Pu,m- ru,m m=1..17K; u=1..500K()2/8.5B k=1..40uFk*mFk - ru,m mse = k=1..40uFk*mFk - ru,m (2/8.5B) m=1..17K; u=1..500K (erru,m)[ mse/uFh = ( )/uFh] uFk ] (2/8.5B) m=1..17K; u=1..500K (erru,m)[ mse/mFh = mFh ] = (2/8.5B) m=1..17K; u=1..500K (erru,m)[ Pm1 m m17K u1 . . u500K UTa1 a40 u1 u500K Mm1 m m17K a1 mF a40 = o u uF u Pu,m = k=1..40uFk*mFk - ru,m So, we increment each uFh+ = 2mse * mFh and we increment each mFh+ = 2mse * uFh+ This is a big move and may overshoot the minimum, so the 2 is replaced by a smaller learning rate, lrate (e.g., Funk takes lrate=0.001) SVD is a trick which finds UT, M which minimize mse(k) (one k at a time). So, the rank=40 SVD of the 8.5B Training matrix, is the best (least error) approx we can get within limits of our user-movie-rating model. I.e., the SVD has found the "best" feature generalizations. To get the SVD matrixes we take the gradient of mse(k) and follow it.This has a bonus - we can ignore the unknown error on the 8.4B empty slots. Take gradient of mse(k) (just the given values, not empties), one k at a time. userValue[user] += lrate*err*movieValue[movie]; movieValue[movie] += lrate*err*userValue[user]; With Horizontal data, the code is evaluated for each rating. So, to train for one sample:real *userValue= userFeature[featureBeingTrained]; real *movieValue= movieFeature[featureBeingTrained]; real lrate = 0.001; More correctly: uv = userValue[user] += err * movieValue[movie]; movieValue[movie] += err * uv; finds the most prominent feature remaining (most reduces error). When it's good, shift it onto done features, start a new one (cache residuals of the 100M. "What does that mean for us???). This Gradient descent has no local minima, which means it doesn't really matter how it's initialized. ua+= lrate (u,i * iaT - * ua ) where u,i = pu,i - ru,i andru,i = actual rating
Refinements:Prior to starting SVD, Note: AvgRating(movie), AvgOffset(UserRating, MovieAvgRating), for every user. I.e.: static inline real predictRating_Baseline(int movie, int user) {return averageRating[movie] + averageOffset[user];} So, that's the return value of predictRating before the first SVD feature even starts training. You'd think avg rating for a movie would just be... its average rating! Alas, Occam's razor was a little rusty that day. If m only appears once with r(m,u)=1 say, AvgRating(m)=1? Probably not! View r(m,u)=1 as a draw from a true prob dist who's avg you want... View that true average itself as a draw from a prob dist of averages--the histogram of average movie ratings. Assume both distributions Gaussian, then the best-guess mean should be lin combo of observed mean and apriori mean, with a blending ratio equal to the ratio of variances. If Ra and Va are the mean and variance (squared standard deviation) of all of the movies' average ratings (which defines your prior expectation for a new movie's average rating before you've observed any actual ratings) and Vb is the average variance of individual movie ratings (which tells you how indicative each new observation is of the true mean--e.g,. if the average variance is low, then ratings tend to be near the movie's true mean, whereas if the avg variance is high, ratings tend to be more random and less indicative) then: BogusMean = sum(ObservedRatings)/count(ObservedRatings) K = Vb/Va BetterMean = [GlobalAverage*K + sum(ObservedRatings)] / [K + count(ObservedRatings)] The point here is simply that any time you're averaging a small number of examples, the true average is most likely nearer the apriori average than the sparsely observed average. Note if the number of observed ratings for a particular movie is zero, the BetterMean (best guess) above defaults to the global average movie rating as one would expect. Moving on: 20M free params is a lot for a 100M TrainSet. Seems neat to just ignore all blanks, but we have expectations about them. As-is, this modified SVD algorithm tends to make a mess of sparsely observed movies or users. If you have a user who has only rated 1 movie, say American Beauty=2 while the avg is 4.5, and further that their offset is only -1, we'd, prior to SVD, expect them to rate it 3.5. So the error given to the SVD is -1.5 (the true rating is 1.5 less than we expect). m(Action) is training up to measure the amount of Action, say, .01 for American Beauty (ust slightly more than avg). SVD optimize predictions, which it can do by eventually setting our user's preference for Action to a huge -150. I.e., the alg naively looks at the only example it has of this user's preferences and in the context of only the one feature it knows about so far (Action), determines that our user so hates action movies that even the tiniest bit of action in American Beauty makes it suck a lot more than it otherwise might. This is not a problem for users we have lots of observations for because those random apparent correlations average out and the true trends dominate. We need to account for priors. As with the average movie ratings, blend our sparse observations in with some sort of prior, but it's a little less clear how to do that with this incremental algorithm. But if you look at where the incremental algorithm theoretically converges, you get: userValue[user] = [sum residual[user,movie]*movieValue[movie]] / [sum (movieValue[movie]^2)] The numerator there will fall in a roughly zero-mean Gaussian distribution when charted over all users, which through various gyrations: userValue[user] = [sum residual[user,movie]*movieValue[movie]] / [sum (movieValue[movie]^2 + K)] And finally back to: userValue[user] += lrate * (err * movieValue[movie] - K * userValue[user]); movieValue[movie] += lrate * (err * userValue[user] - K * movieValue[movie]); This is equivalent to penalizing the magnitude of the features. To cut over fitting, allowing use of more features.
Moving on: Linear models are limiting. We've bastardized the whole matrix analogy so much that we aren't really restricted to linear models: We can add non-linear outputs such that instead of predicting with: sum (userFeature[f][user] * movieFeature[f][movie]) for f from 1 to 40. We can use: sum G(userFeature[f][user] * movieFeature[f][movie]) for f from 1 to 40. Two choices for G proved useful. 1. clip the prediction to 1-5 after each component is added. E.g., each feature is limited to only swaying rating within the valid range, and any excess beyond that is lost rather than carried over. So, if the first feature suggests +10 on a scale of 1-5, and the second feature suggests -1, then instead of getting a 5 for the final clipped score, it gets a 4 because the score was clipped after each stage. The intuitive rationale here is that we tend to reserve the top of our scale for the perfect movie, and the bottom for one with no redeeming qualities whatsoever, and so there's a sort of measuring back from the edges that we do with each aspect independently. More pragmatically, since the target range has a known limit, clipping is guaranteed to improve our perf, and having trained a stage with clipping on, use it with clipping on. I did not really play with this extensively enough to determine there wasn't a better strategy. A second choice for G is to introduce some functional non-linearity such as a sigmoid. I.e., G(x) = sigmoid(x). Even if G is fixed, this requires modifying the learning rule slightly to include the slope of G, but that's straightforward. The next question is how to adapt G to the data. I tried a couple of options, including an adaptive sigmoid, but the most general and the one that worked the best was to simply fit a piecewise linear approximation to the true output/output curve. That is, if you plot the true output of a given stage vs the average target output, the linear model assumes this is a nice 45 degree line. But in truth, for the first feature for instance, you end up with a kink around the origin such that the impact of negative values is greater than the impact of positive ones. That is, for two groups of users with opposite preferences, each side tends to penalize more strongly than the other side rewards for the same quality. Or put another way, below-average quality (subjective) hurts more than above-average quality helps. There is also a bit of a sigmoid to the natural data beyond just what is accounted for by the clipping. The linear model can't account for these, so it just finds a middle compromise; but even at this compromise, the inherent non-linearity shows through in an actual-output vs. average-target-output plot, and if G is then simply set to fit this, the model can further adapt with this new performance edge, which leads to potentially more beneficial non-linearity and so on... This introduces new free parameters and encourages over fitting especially for the later features which tend to represent small groups. We found it beneficial to use this non-linearity only for the first twenty or so features and to disable it after that. Moving on: Despite the regularization term in the final incremental law above, over fitting remains a problem. Plotting the progress over time, the probe rmse eventually turns upward and starts getting worse (even though the training error is still inching down). We found that simply choosing a fixed number of training epochs appropriate to the learning rate and regularization constant resulted in the best overall performance. I think for the numbers mentioned above it was about 120 epochs per feature, at which point the feature was considered done and we moved on to the next before it started over fitting. Note that now it does matter how you initialize the vectors: Since we're stopping the path before it gets to the (common) end, where we started will affect where we are at that point. I wonder if a better regularization couldn't eliminate overfitting altogether, something like Dirichlet priors in an EM approach--but I tried that and a few others and none worked as well as the above. Here is the probe and training rmse for the first few features with and w/o regularization term "decay" enabled. Same thing, just the probe set rmse, further along where you can see the regularized version pulling ahead: This time showing probe rmse (vertical) against train rmse (horizontal). Note how the regularized version has better probe performance relative to the training performance: Anyway, that's about it. I've tried a few other ideas over the last couple of weeks, including a couple of ways of using the date information, and while many of them have worked well up front, none held their advantage long enough to actually improve the final result. If you notice any obvious errors or have reasonably quick suggestions for better notation or whatnot to make this explanation more clear, let me know. And of course, I'd love to hear what y'all are doing and how well it's working, whether it's improvements to the above or something completely different. Whatever you're willing to share,
APPENDIX: Classification/Prediction: 1 Entity TrainingSet (e.g., IRIS(PL,PW,SL,SW) FAUST (max STD of dot prods), pCkNN 2 Entity TrainingSet (e.g., Rating(user, movie), Buys(cust, item), tfidf(doc, term) 3 Entity TrainingSet (e.g., ????(doc, term, user) Recommender Taxonomy 2 Entity Recommenders (e.g., Buys(cust, item) pSVD (min sse w gradient and line search) DU 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 2 3 4 5 FD T TD FU FT FD 0 1 0 1 FU 0 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 2 2 0 0 1 1 1 1 1 1 0 0 0 3 3 1 0 0 1 0 0 0 0 0 0 4 4 FT 5 5 0 FU FD FT D D 0 1 0 0 0 1 1 1 0 0 0 1 1 1 2 2 0 0 0 0 0 0 0 0 0 0 1 1 1 2 3 3 3 UT 4 4 4 DTU cube 5 5 0 1 0 0 5 0 1 0 1 FD U U 2 D 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 2 T 2 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 2 4 0 0 0 1 U 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 2 1 Use [cyclic] 3 hop rolodex model, TDDUUT (Instead of 3D DataCube, DTU (What's should DTU cell measure??) Do ARM. or pSVD on TD (train TF and FD using TD 1. use DF to cluster D 2. use FT to cluster T into TCL and then use TCL to cluster D into DCL(TCL) 2 3 4 5 T FU 0 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 (BackProp+LineSearch)-train F = <----FT----> <----FU----> <----FD----> 0 1 0 1 TF 3 Entity Recommenders (e.g., DTU(Doc,Term,User) (But cell value measure what?) Train one single Feature Vector? DU TD Use a 3-hop rolodex model instead of 3D data cube model? TD: Term-Document matrix (cells contain tf-idf?). DU: Doc-User matrix (Docs are bought/accessed/rated by Users (Docs = web pages?). UT UT: User-Term matrix (cell contains level of user's preference characterized by term.) Train FU(Are we training for best user feature twice? Do we get the same aupport vector? If not, does the combination do better than either pair? What would happen if we use one feature vector space (concatenate D,T,U)? Training for best feature amounts to starting with a DTU feature vector, back-propagating (and line searching) to minimize sse. If sse is driven very low, the feature F=(DF,TF,UF) should capture most of the information in DU, TD and UT combined? Use DU (Recommender), UT (Anomaly Detection), TD (Anomaly Detection?) Do the conversion to pTrees and train in the CLOUD. Then download F to an agen'ts (or soldier's) personal device for Classification / Anomaly Detections