To mine a social network, the most important thing is the data (in my opinion, at least…).
I have been mining location-based social networks for a while. I have used other people’s data and I have also collected data myself. With some experience, I think I can summarize the area a bit.
First of all, I’m just a student from a university most people have never heard of, so there is no way for me to get high-quality data from big companies like Facebook, Twitter, and Instagram.
There are two ways to work around this. The first is using other researchers’ data. In the world of location-based networks, there are two famous baseline datasets. One is from SNAP at Stanford, where Prof. Leskovec and his students collected users’ check-in data from Gowalla and Brightkite. Even though both companies are dead, the data is still quite valuable. The other is from Texas A&M: Dr. Zhiyuan Cheng collected about five months of geo-tagged tweets from Twitter, about 10M in total. Both datasets are at a global level and have been used widely by researchers (see their papers’ citation counts).
The second way is to collect data ourselves. Nowadays, most social networks publish APIs that allow any user to extract data. This is where a PhD student should cut in…
Twitter. Twitter has a very strong API that lets you extract real-time tweets within a geo bounding box. I have been using this API for about five months, and it indeed gives you a lot of data. BTW, I discovered that if you set your search area at the city level, you get much more data in that city than you would at the global level. We collected about 3M geo-tagged tweets in New York alone, while the whole global Gowalla dataset is about 6M. But this tip doesn’t matter anymore. Since April 27th, 2015, Twitter modified (or “improved”, in their words) its service: now when you decide to share a geo-tagged tweet, you can directly choose the exact venue (with data from Foursquare), and sharing the exact lat/lng is optional. As a result, Twitter’s API no longer yields nearly as many geo-tagged tweets. After April 27, I can only get about 10% of the data I used to get. So Twitter’s route is dead. Luckily and sadly, I got the last five months of data…
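The city-level trick above boils down to passing a bounding box to the streaming endpoint and then filtering client-side. Here is a minimal sketch: the NYC coordinates are approximate values I picked for illustration, and the parameter format (sw_lng,sw_lat,ne_lng,ne_lat for the streaming `locations` parameter) should be double-checked against Twitter’s docs. The in-box check matters because the stream can also match on a tweet’s “place”, so some tweets with coordinates outside the box slip through.

```python
# Approximate bounding box around New York City: (sw_lng, sw_lat, ne_lng, ne_lat).
# These coordinates are illustrative, not official.
NYC_BOX = (-74.26, 40.48, -73.70, 40.92)

def locations_param(box):
    """Format a bounding box for the streaming API's 'locations' parameter."""
    return ",".join("%.2f" % c for c in box)

def in_box(lng, lat, box):
    """Client-side check that a tweet's exact coordinates fall inside the box."""
    sw_lng, sw_lat, ne_lng, ne_lat = box
    return sw_lng <= lng <= ne_lng and sw_lat <= lat <= ne_lat

print(locations_param(NYC_BOX))        # -74.26,40.48,-73.70,40.92
print(in_box(-73.98, 40.75, NYC_BOX))  # True: Midtown Manhattan
```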
Foursquare. I have never used Foursquare’s API to collect user check-ins, since I don’t think it is a general social network service. But I do use it a lot to query a location’s information, including its name, category, rating, tips, and so on. Foursquare indeed has an excellent API with generous rate limits; I barely ever need to wait through the whole sliding window to get the data.
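A venue-details query like the ones above is just one GET request per venue. A sketch of building that request, assuming Foursquare’s v2 venue-details endpoint with its required `client_id`, `client_secret`, and `v` (API version date) parameters; the credentials and venue id below are placeholders you must supply yourself:

```python
from urllib.parse import urlencode

API_ROOT = "https://api.foursquare.com/v2"

def venue_details_url(venue_id, client_id, client_secret, version="20150601"):
    """Build the URL for a venue-details request (name, category, rating, tips)."""
    params = urlencode({
        "client_id": client_id,          # placeholder: your app's credentials
        "client_secret": client_secret,
        "v": version,                    # API version date, required by v2
    })
    return "%s/venues/%s?%s" % (API_ROOT, venue_id, params)

# Placeholder venue id; fetch the URL with any HTTP client to get the JSON.
url = venue_details_url("VENUE_ID", "CLIENT_ID", "CLIENT_SECRET")
print(url)
```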
Instagram. More recently, I started using Instagram to collect data, and it is awesome. I only use the static REST API to query check-ins, since I haven’t figured out how to use its Streaming API. I discovered that Instagram returns more recent data than old data, so to get more of it, it is better to query a city multiple times a day. In New York, I can get about 20k check-ins a day. Instagram’s API policy is also quite generous; the only thing that bothers me is querying friends: the API only gives you about 50 friends per page, so you need to query many pages to extract all of a user’s friends. And some users have a lot of friends…
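Collecting a full friends list at ~50 per page is a cursor-following loop. Here is a sketch: `fetch_page` stands in for one HTTP call to the follows endpoint and is my own assumption, not Instagram’s actual client API; the fake three-page backend just demonstrates the pagination logic.

```python
def fetch_all(fetch_page):
    """Follow pagination cursors until the last page, collecting all items."""
    items, cursor = [], None
    while True:
        page, cursor = fetch_page(cursor)  # one API call per ~50 friends
        items.extend(page)
        if cursor is None:                 # no next cursor: last page reached
            return items

# Usage with a fake "API" serving 120 friend ids in pages of 50:
friends = list(range(120))

def fake_fetch(cursor):
    start = cursor or 0
    page = friends[start:start + 50]
    nxt = start + 50 if start + 50 < len(friends) else None
    return page, nxt

print(len(fetch_all(fake_fetch)))  # 120, gathered over three pages
```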
That’s all I know about these APIs; I mainly focus on Instagram nowadays. I hope they won’t change as much in the future as Twitter did.
Urban informatics rocks!