green_amber ([personal profile] green_amber) wrote2006-01-03 02:24 pm

RSS quick query

I know I asked this before, but could someone remind me: if you are an online creator (e.g. the PvP people or Boing Boing, say), can you STOP your work being syndicated by RSS using particular code? If so, how? Does it have any drawbacks? Thanks!!! Could you allow some people to RSS it and not others? Would that effectively require password protection?

[identity profile] surliminal.livejournal.com 2006-01-03 02:53 pm (UTC)(link)
Well, there's nothing to stop someone scraping your site HTML and building a feed from it.
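
To make the scraping point concrete, here is a rough sketch of how little it takes to turn page HTML into an RSS feed. Everything specific in it is an assumption for illustration: the sample markup, the idea that entries are links inside `<h2>` headings, and the `example.com` address. A real scraper would fetch the page over HTTP and adapt the parsing to the site's actual layout.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical page markup; a real scraper would download this instead.
SAMPLE_HTML = """
<html><body>
<h2><a href="/comic/101">Strip 101</a></h2>
<h2><a href="/comic/102">Strip 102</a></h2>
</body></html>
"""


class HeadlineLinks(HTMLParser):
    """Collect (title, href) pairs for links found inside <h2> headings."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.href = None
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False
        elif tag == "a":
            self.href = None

    def handle_data(self, data):
        # Only keep text that sits inside a link inside a heading.
        if self.in_h2 and self.href and data.strip():
            self.items.append((data.strip(), self.href))


def build_feed(html, base_url):
    """Return an RSS 2.0 document (as a string) built from the page's headline links."""
    parser = HeadlineLinks()
    parser.feed(html)
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Scraped feed"
    ET.SubElement(channel, "link").text = base_url
    ET.SubElement(channel, "description").text = "Built from page HTML"
    for title, href in parser.items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = base_url + href
    return ET.tostring(rss, encoding="unicode")


feed_xml = build_feed(SAMPLE_HTML, "http://example.com")
print(feed_xml)
```

Nothing in this needs the site's cooperation, which is why there is no "particular code" that reliably prevents it.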

So there's no code you can insert, a bit like the robots.txt, that stops people making an RSS feed out of your site? Can you build your site not in XML, just in HTTP?

[identity profile] sbisson.livejournal.com 2006-01-03 02:57 pm (UTC)(link)
Not everyone respects robots.txt :-)

The thing is, once you have content in an open format like HTML, anyone can do anything with it. You'd need to put your site content in Flash or similar.

One option would be to build your site as a content-negotiated CMS and just block out the IP addresses or HTTP User Agents of scraping tools. That would work...
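
As a rough sketch of the blocking idea, assuming the server can see each request's User-Agent header and remote IP address: the agent names and the address range below are made up for illustration (the range is one reserved for documentation), and `is_blocked` is a hypothetical helper, not part of any particular CMS.

```python
import ipaddress

# Hypothetical blocklists; a real site would maintain these from its own logs.
BLOCKED_AGENT_SUBSTRINGS = ("examplescraper", "feedbuilder")
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def is_blocked(user_agent, remote_ip):
    """Return True if the request matches a blocked User-Agent or IP range."""
    ua = (user_agent or "").lower()
    if any(fragment in ua for fragment in BLOCKED_AGENT_SUBSTRINGS):
        return True
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


print(is_blocked("ExampleScraper/1.0", "198.51.100.7"))
print(is_blocked("Mozilla/4.0 (compatible; MSIE 6.0)", "198.51.100.7"))
```

A check like this only catches tools that identify themselves or that come from a known, stable address range.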

[personal profile] drplokta 2006-01-03 03:02 pm (UTC)(link)
Not if the scraper is using a user agent like "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" and coming via an ADSL connection with dynamic IP addressing or a megaproxy farm from an ISP like AOL or NTL. I spend a fair amount of time trying to block screen-scraping spiders, and it's not a trivial exercise if they don't want to be blocked.
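
To illustrate why the user agent is a weak signal, here is a minimal sketch using Python's standard `urllib.request`. The `looks_like_scraper` check is a hypothetical stand-in for a naive server-side filter, and `example.com` is a placeholder; building the request object does not contact the network.

```python
import urllib.request

BROWSER_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"


def looks_like_scraper(user_agent):
    """Naive filter: flags only clients that announce themselves as tools."""
    ua = user_agent.lower()
    return "python-urllib" in ua or "scraper" in ua


# The client chooses whatever User-Agent string it likes.
req = urllib.request.Request(
    "http://example.com/", headers={"User-Agent": BROWSER_UA}
)
sent_ua = req.get_header("User-agent")  # urllib stores the key as "User-agent"
print(looks_like_scraper(sent_ua))
```

The header is entirely under the client's control, so a filter keyed on it stops only the scrapers that are honest about what they are.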

[identity profile] sbisson.livejournal.com 2006-01-03 03:04 pm (UTC)(link)
True.

The problem is that the bad guys have access to the same technologies as you do. It's like dealing with spam...

[identity profile] surliminal.livejournal.com 2006-01-03 03:05 pm (UTC)(link)
Wow. Thanks guys. V helpful..

[personal profile] drplokta 2006-01-03 03:00 pm (UTC)(link)
A robots.txt doesn't physically prevent anything from happening, it just politely asks robots not to index certain pages. It's like putting up a "keep out" sign on an unfenced piece of land -- people know they're not supposed to trespass, but there's no physical barrier preventing them.
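
A small sketch of what "politely asks" means in practice, using Python's standard `urllib.robotparser`. The rules, the `PoliteBot` name and the `example.com` URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all robots to stay out of /comics/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /comics/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler asks before fetching and skips what is disallowed.
allowed = rp.can_fetch("PoliteBot", "http://example.com/comics/today.html")
print(allowed)
```

The check runs on the crawler's side. A scraper that never calls `can_fetch`, or ignores the answer, can request the page all the same.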

HTTP is a protocol, not a markup language -- I assume you meant HTML. XML or HTML makes no difference: anything that is human-readable is also machine-readable unless you hide it behind something that needs human-level pattern-matching skills, like a CAPTCHA image.