Netzsphaere

Conversation

kaia

I need to scrape an internal website and turn it into JSON. They don't offer an API and the data isn't behind an endpoint, it's in the HTML. What tool should I use?

lain, author of the quixote

lain@lain.com

24 days ago

Reply to @kaia@brot.eus

@kaia codex?

GNUkko Sauvage (eris-ng)

eris@p.enes.lv

24 days ago

Reply to @kaia@brot.eus

Python & regex (only half-joking: if you need to do it once or twice and they don't change their HTML to fuck with scrapers, it can be sufficient)

But I don't know enough about dedicated libraries
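
For a one-off job on markup that stays put, the regex route could look roughly like the sketch below. The HTML fragment, the class names, and the `scrape` helper are all made up for illustration; the real page will need its own pattern, and the pattern breaks as soon as attribute order or nesting changes.

```python
import json
import re

# Hypothetical page fragment; the real markup will differ.
HTML = """
<ul id="hosts">
  <li class="host" data-id="1"><span class="name">alpha</span> <span class="ip">10.0.0.1</span></li>
  <li class="host" data-id="2"><span class="name">beta</span> <span class="ip">10.0.0.2</span></li>
</ul>
"""

# One named group per field we want in the JSON output.
ROW = re.compile(
    r'<li class="host" data-id="(?P<id>\d+)">\s*'
    r'<span class="name">(?P<name>[^<]*)</span>\s*'
    r'<span class="ip">(?P<ip>[^<]*)</span>'
)


def scrape(html):
    """Return one dict per matched <li>; all values stay strings."""
    return [match.groupdict() for match in ROW.finditer(html)]


records = scrape(HTML)
print(json.dumps(records, indent=2))
```

This only holds up while the markup matches the pattern exactly, which is why it suits a throwaway script more than anything long-lived.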

monkee

monkee@other.li

24 days ago

Reply to @kaia@brot.eus

@kaia
Beautiful Soup?

https://beautiful-soup-4.readthedocs.io/en/latest/

Snacks

snacks

24 days ago

Reply to @kaia@brot.eus

@kaia regex or an actual html parser library

vaartis of the ratular bells

vaartis@pl.kotobank.ch

24 days ago

Reply to @kaia@brot.eus

@kaia something like beautifulsoup which is an html parsing library, or the equivalent for your preferred programming language

GNUkko Sauvage (eris-ng)

eris@p.enes.lv

24 days ago

Reply to @vaartis@pl.kotobank.ch

I had a feeling they were named "soup" but all I could remember was tag soup

CC: @kaia@brot.eus

Oblomov

oblomov@sociale.network

23 days ago

Reply to @u0421793@pikopublish.ing

@u0421793 @kaia it depends on how well-formed the HTML is and how much conversion is needed. If the HTML is NOT well-formed (as it usually isn't in these cases), XSLT cannot process it, but there are libraries for scripting languages that do a pretty good job of selecting and extracting data (Beautiful Soup, for example)
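
To illustrate the point about HTML that is not well-formed, here is a minimal sketch that pulls a table into JSON even though the cells and rows are never closed. It uses Python's standard-library `html.parser` as a stand-in for Beautiful Soup so that it runs without installing anything. The table contents, the `TableRows` class, and the `table_to_records` helper are hypothetical, and it assumes the first row holds the column names.

```python
import json
from html.parser import HTMLParser

# Hypothetical, deliberately sloppy markup: no </th>, </td> or </tr>.
SLOPPY_HTML = """
<table>
<tr><th>name<th>ip
<tr><td>alpha<td>10.0.0.1
<tr><td>beta<td>10.0.0.2
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()  # a new <tr> implicitly ends the previous one
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._flush_cell()  # a new cell implicitly ends the previous one
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._flush_row()  # keep the last row even if </table> is missing

    def _flush_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _flush_row(self):
        self._flush_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None


def table_to_records(html):
    """Turn the table into a list of dicts keyed by the header row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


print(json.dumps(table_to_records(SLOPPY_HTML), indent=2))
```

A dedicated library such as Beautiful Soup would replace the hand-written state tracking with selectors, which is usually worth it once more than one kind of element has to be extracted.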

bovaz@misskey.social

23 days ago

Reply to @kaia@brot.eus

@kaia@brot.eus a paddle to smack whoever asked this, to make sure they really need it.

Oblomov

oblomov@sociale.network

23 days ago

Reply to @kaia@brot.eus

@kaia awk

Oblomov

oblomov@sociale.network

23 days ago

Reply to @oblomov@sociale.network

@kaia wait, you mean for the scraping or for the conversion?

Ian K Tindale

u0421793@pikopublish.ing

23 days ago

Reply to @oblomov@sociale.network

@oblomov @kaia is this a job for XSLT ?
