BeautifulSouping Twitter
I’m here with Aral Balkan and we’re working on scraping Twitter to do things that the Twitter API doesn’t currently support. Aral just released TwitAPI, a PHP screen scraper based on regular expressions.
Aral’s written some regular expressions to pull the data out of the direct messages. I’m doing it with Python’s BeautifulSoup.
Here are the BeautifulSoup recipes (‘n’ is the B.S. instance, x is to be looped over).
User URL: n.findAll(True, {"class": "status_actions"})[x]\
.parent.contents[5].contents[1].contents[0]['href']
User Name: n.findAll(True, {"class": "status_actions"})[x]\
.parent.contents[5].contents[1].contents[0].contents
Comment: n.findAll(True, {"class": "status_actions"})[x]\
.parent.contents[5].contents[2].string.strip()
Fucked-up Twitter timecode: n.findAll(True, {"class": "status_actions"})[x]\
.parent.contents[5].contents[3].contents[1].string.strip()
Once I’ve figured out how to do HTTP Basic authorisation using urllib2, the Twitter parser can be released unto the world!
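For what it's worth, here is a minimal sketch of one way the Basic auth part could go. It assumes Python 3, where urllib2 became urllib.request (the Request and add_header calls exist under the same names in urllib2). The helper names, the URL and the credentials are all made up for illustration. It sets the Authorization header by hand so the credentials go out with the first request, rather than waiting for the server to ask.

```python
import base64
import urllib.request  # Python 3 home of what used to be urllib2


def basic_auth_header(username, password):
    """Return the value for an HTTP Basic 'Authorization' header."""
    credentials = "{}:{}".format(username, password).encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


def build_request(url, username, password):
    """Build a request that carries the credentials up front."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


# Hypothetical URL and credentials, for illustration only.
request = build_request("http://example.com/direct_messages", "user", "secret")
# html = urllib.request.urlopen(request).read()  # uncomment to fetch the page
```

The fetched HTML could then be handed to BeautifulSoup to get the ‘n’ instance used in the recipes above.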
Tags: barcamplondon2, screen scraping, beautifulsoup, python, aral balkan, twitter
|
Jeremy Keith: “Right now the BT Centre has become Werewolf Central. There are two or three concurrent games running at any one time. It’s three in the morning now and the games show no sign of stopping.” They stopped at about 5.30am - I moderated the last two games, and slept for all of about quarter-of-an-hour afterwards.
It’s time for macroformats
About an hour ago we had an utterly humorous play-fight between Jeremy Keith, Brian Suda, Tom Hughes-Croucher, Ian Forrester and me over microformats and the SemWeb.
One of the conclusions was that RDF and the Semantic Web need much better marketing, and should take what works from the marketing of microformats - namely, the plethora of great examples, well-written specifications and tools.
The only difference between us and the microformats folk (who we know and love) is that we prefer the endgame of RDF and the Semantic Web.
HTML is not the end, and we cannot simply observe.
“Macroformats” is a reaction to the accusation that RDF isn’t sexy. We want to make RDF sexy, and the Semantic Web fun.
To explain what macroformats are, the important thing is to say what they are not.
Firstly, macroformats are not a replacement for microformats. If you use microformats and they solve problems for you, that is great.
Secondly, macroformats are not anything new. There is absolutely zero new technology. What the macroformat movement does is different - it helps people use pre-existing technology to do new and interesting things. Through a Darwinian process, new and interesting uses will bubble up that we can’t yet imagine.
We have already seen this with microformats - people are doing things with microformats that are not within the specified ‘problem’. Why should we not let the whole web get involved with the development of the web of tomorrow?
The technology that is in development need not be complicated. We need to stop talking about RDF and OWL and SPARQL. We really need to stop talking about ontology development and inference engines and SemWeb research, not because that stuff isn’t important - but because that stuff isn’t all-important.
Technology doesn’t matter, data matters.
Keep a watch out - we have registered usemacroformats.com. The macroformats are coming. Semantic Web for the rest of us.
Tags: macroformats, microformats, rdf, barcamplondon2
|