Mining Twitter 1.9, 1.10 1131036001 김종명
1.9 Making Robust Twitter Requests • Problem • You want to write a long-running script that harvests large amounts of data, such as the friend and follower IDs for a very popular Twitterer; however, the Twitter API is inherently unreliable and imposes rate limits, so you must always expect the unexpected. • Solution • Write an abstraction for making Twitter requests that accounts for rate limiting and other types of HTTP errors, so that you can focus on the problem at hand. Rate limits are just a very specific kind of HTTP error.
Error Messages • {"errors":[{"message":"Sorry, that page does not exist","code":34}]} • <?xml version="1.0" encoding="UTF-8"?><errors><error code="34">Sorry, that page does not exist</error></errors>
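As a small illustration of working with the JSON error format shown above, the following sketch pulls the code and message out of an error body. The helper name `extract_errors` is hypothetical, and returning an empty list for anything that is not a JSON error object is a design choice made here, not something the API prescribes.

```python
import json


def extract_errors(body):
    """Return a list of (code, message) pairs from a JSON error body.

    Hypothetical helper: yields an empty list when the body is not JSON
    or does not contain an "errors" array.
    """
    try:
        data = json.loads(body)
    except ValueError:
        return []
    if not isinstance(data, dict):
        return []
    return [(err.get("code"), err.get("message"))
            for err in data.get("errors", [])]


payload = '{"errors":[{"message":"Sorry, that page does not exist","code":34}]}'
errors = extract_errors(payload)
# errors == [(34, "Sorry, that page does not exist")]
```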
URL Error • Switch to a different DNS server
1.10 • Problem • You want to harvest and store tweets from a collection of ID values, or harvest entire timelines of tweets. • Solution • Use the /statuses/show resource to fetch a single tweet by its ID value; the various /statuses/*_timeline methods can be used to fetch timeline data. CouchDB is a great option for persistent storage, and it also provides a map/reduce processing paradigm and built-in ways to share your analysis with others.
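One way to picture the storage step: CouchDB is driven over HTTP, and a document is created with a PUT to /{db}/{docid}. The sketch below only builds such a request with the standard library and does not send it. The database name `tweets`, the local URL on CouchDB's default port 5984, and the helper name `build_store_request` are assumptions; the tweet is represented as a plain dict with `id` and `text` fields.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_store_request(tweet, db_url="http://localhost:5984/tweets"):
    """Build (but do not send) a PUT request that would store one tweet.

    Hypothetical helper: the tweet's id is used as the document id.
    """
    doc_id = quote(str(tweet["id"]), safe="")
    body = json.dumps(tweet).encode("utf-8")
    return Request(
        f"{db_url}/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = build_store_request({"id": 12345, "text": "hello"})
# To actually store the document, a running CouchDB server is needed:
#   from urllib.request import urlopen
#   urlopen(req)
```

Using the tweet ID as the document ID means that storing the same tweet twice is reported as a conflict rather than silently creating a duplicate.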
Document-oriented distributed database • Cluster Of Unreliable Commodity Hardware
Document-oriented DB • MongoDB (C++) • RavenDB (C#) • CouchDB (Erlang)
Document { "_id": "tansac", "_rev": "1", "profile": { "nickname": "tansanc", "name": { "firstname": "종명", "lastname": "김" }, "birthdate": "1987-05-31" } }
Schema Free { "_id": "tansac", "_rev": "2", "profile": { "nickname": "tansanc", "name": { "firstname": "종명", "lastname": "김" }, "birthdate": "1987-05-31", "hasBrother": true } }
No Locking • Multi-Version Concurrency Control (MVCC)
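To make the "no locking" idea concrete, here is a toy in-memory model of the revision check that lets writers proceed without locks: an update is accepted only if it names the revision it was based on. This is a simplified sketch, not CouchDB's implementation. The class names are hypothetical, revisions are plain counters stored as strings (as in the slide's examples), and the toy keeps only the latest version of each document.

```python
import copy


class ConflictError(Exception):
    """Raised when an update is based on a stale revision."""


class TinyDocStore:
    """Toy document store with revision-checked writes (illustrative only)."""

    def __init__(self):
        self._docs = {}

    def get(self, doc_id):
        # Readers get their own copy and never block writers.
        return copy.deepcopy(self._docs[doc_id])

    def put(self, doc):
        current = self._docs.get(doc["_id"])
        if current is None:
            new_rev = "1"
        elif doc.get("_rev") != current["_rev"]:
            raise ConflictError(f"stale revision for {doc['_id']}")
        else:
            new_rev = str(int(current["_rev"]) + 1)
        stored = copy.deepcopy(doc)
        stored["_rev"] = new_rev
        self._docs[doc["_id"]] = stored
        return new_rev


store = TinyDocStore()
store.put({"_id": "tansac", "profile": {"nickname": "tansanc"}})

# Two clients read the same revision "1".
a = store.get("tansac")
b = store.get("tansac")

a["profile"]["hasBrother"] = True
rev_after_a = store.put(a)          # accepted, revision becomes "2"

b["profile"]["birthdate"] = "1987-05-31"
try:
    store.put(b)                    # rejected: b is still based on "1"
    conflict = False
except ConflictError:
    conflict = True

# The losing client re-reads, reapplies its change, and retries.
b = store.get("tansac")
b["profile"]["birthdate"] = "1987-05-31"
rev_after_b = store.put(b)
```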
/statuses/show • public_timeline() • user_timeline() • home_timeline()
tweepy get timeline • API.public_timeline() • Returns the 20 most recent statuses from non-protected users who have set a custom user icon. The public timeline is cached for 60 seconds so requesting it more often than that is a waste of resources. • Parameters: None • Returns: list of class:Status objects • API.home_timeline() • Returns the 20 most recent statuses, including retweets, posted by the authenticating user and that user’s friends. This is the equivalent of /timeline/home on the Web. • Parameters: since_id, max_id, count, page • Returns: list of class:Status objects • API.friends_timeline() • Returns the 20 most recent statuses posted by the authenticating user and that user’s friends. • Parameters: since_id, max_id, count, page • Returns: list of class:Status objects • API.user_timeline() • Returns the 20 most recent statuses posted from the authenticating user. It’s also possible to request another user’s timeline via the id parameter. • Parameters: (id or user_id or screen_name), since_id, max_id, count, page • Returns: list of class:Status objects • http://pythonhosted.org/tweepy/html/api.html#timeline-methods
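Since each timeline call returns only a page of statuses, harvesting an entire timeline means walking backwards with the max_id parameter listed above. The sketch below shows that loop against any callable that accepts `count` and `max_id` keywords and returns objects with an `id` attribute. The helper name `harvest_timeline` is hypothetical, and the stand-in `fake_fetch` replaces a real tweepy call so the example runs offline; it assumes max_id is inclusive, which is why the loop subtracts one.

```python
from types import SimpleNamespace


def harvest_timeline(fetch_page, count=20, max_pages=5):
    """Collect up to max_pages pages of statuses, newest first.

    fetch_page is any callable taking count/max_id keyword arguments,
    for example a small wrapper around a timeline method.
    """
    statuses = []
    max_id = None
    for _ in range(max_pages):
        kwargs = {"count": count}
        if max_id is not None:
            kwargs["max_id"] = max_id
        page = fetch_page(**kwargs)
        if not page:
            break  # nothing older is available
        statuses.extend(page)
        # Ask only for statuses strictly older than the oldest one seen.
        max_id = min(s.id for s in page) - 1
    return statuses


# Stand-in for a real timeline: ten statuses with ids 10 down to 1.
_timeline = [SimpleNamespace(id=i, text=f"tweet {i}") for i in range(10, 0, -1)]


def fake_fetch(count, max_id=None):
    older = [s for s in _timeline if max_id is None or s.id <= max_id]
    return older[:count]


harvested = harvest_timeline(fake_fetch, count=3, max_pages=2)
harvested_ids = [s.id for s in harvested]
# harvested_ids == [10, 9, 8, 7, 6, 5]
```

With tweepy, `fetch_page` could be something like `lambda **kw: api.user_timeline(screen_name=name, **kw)`, where `api` and `name` are supplied by the caller.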
API.friends_timeline() • API.public_timeline()
API.user_timeline • API.mention_timeline