e12e 11 hours ago

Nice!

> However, for reasons unknown to me, they wrap these neatly separated rows with brackets ([ and ]) and add a comma to each line

Well, the reason (misguided or not) is as you say, I imagine:

> so it’s a valid JSON array containing 100+ million items.

> We are not going to attempt to load this massive array. Instead, we’re running this command:

    zcat ../latest-all.json.gz | sed 's/,$//' | split -l 100000 - wd_items_cw --filter='gzip > $FILE.gz'
That's one approach - I'm always a little wary of treating a rich format like JSON as <something>-delimited text - so I'd be curious whether using jq in streaming mode is much different in run time. I believe this snippet (the core of which we lifted from Stack Overflow or somewhere) does the same thing: split a valid JSON array into NDJSON, with tweaks to hopefully generate similar splits:

    # stream-parse the top-level array and emit one object per line (NDJSON)
    gunzip -c ../latest-all.json.gz \
      | jq -cn --stream \
          'fromstream(inputs | (.[0] |= .[1:]) | select(. != [[]]))' \
      | split -l 100000 - wd_items_cw --filter='gzip > $FILE.gz'
Note: on macOS, zcat isn't a synonym for gunzip -c (the BSD zcat expects .Z files), hence the change.
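
Either way you end up with gzipped NDJSON chunks, so the consuming side can stay simple. A minimal Ruby sketch of reading one chunk (the file name and the filter are hypothetical; note the sed variant leaves the bare [ and ] lines in the first and last chunk, so we skip any line that doesn't parse):

    require 'zlib'
    require 'json'

    # wd_items_cwaa.gz would be the first chunk split(1) emits (hypothetical here)
    Zlib::GzipReader.open('wd_items_cwaa.gz') do |gz|
      gz.each_line do |line|
        item = JSON.parse(line) rescue next   # skips stray "[" / "]" lines
        puts item['id'] if item['type'] == 'item'   # stand-in filter
      end
    end
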
Alifatisk 20 hours ago

Interesting article! I think this is the first time I've seen someone pick Ractors over the Parallel gem, cool!

I love seeing these quick-and-dirty Ruby scripts used for data processing / filtering or whatever; this is what Ruby is good at!

  • dbreunig 16 hours ago

    Thanks! This is a near-perfect use case for Ractors since we chunked all the files and there's no need for the file-processing function to share any context.
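
    The shape is roughly this (simplified; the glob and per-file filter here are stand-ins, not the actual code from the post): one Ractor per chunk file, nothing shared, results collected with take:

        require 'zlib'
        require 'json'

        # One Ractor per chunk file; each receives its own path and shares no state.
        ractors = Dir.glob('wd_items_cw*.gz').map do |path|
          Ractor.new(path) do |p|
            count = 0
            Zlib::GzipReader.open(p) do |gz|
              gz.each_line do |line|
                item = JSON.parse(line) rescue next
                count += 1 if item['type'] == 'item'   # stand-in for the real filter
              end
            end
            count   # last expression becomes the Ractor's result
          end
        end

        puts ractors.sum(&:take)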

nighthawk454 12 hours ago

Hey, cool article, thanks! Might be time to finally dive into DuckDB.