New version of sselph/scraper v0.9.0-beta

This topic has 58 replies, 20 voices, and was last updated 9 years, 11 months ago by sselph.

07/04/2015 at 21:38 #101350
sselph
Participant
Hi Everyone,

I’ve been refactoring much of my scraper’s code to make it easier to add features, and I’ve added a few features since I last posted.
https://github.com/sselph/scraper

New Features:
- MAME/Arcade descriptions – I added in information from arcade-history so that MAME and other arcade systems should have more complete data.
- PSX Support – I added support for bin/cue PSX games from redump dat files. It will create a single entry for each cue file.
- Dreamcast Support – I added support for gdi/bin games from redump dat files. It seems reicast supports this format but it isn’t enabled in es_systems.cfg.
- Zip/Gzip support – since retroarch added zip/gzip support I now scan inside zip files for the first file that looks like a rom and scan it.
- More accurate and complete scraping on several systems. Thanks to @robertybob for adding literally ~1000 games to thegamesdb.
- Ability to append to a gamelist – You can now use -append to skip files that are already in the gamelist.xml file.
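
As a rough illustration of the zip scanning described in the list above, here is a minimal Python sketch (the scraper itself is written in Go, so this is not its actual code). The extension list, the use of SHA1, and the function name `first_rom_in_zip` are all assumptions made for the example.

```python
import hashlib
import zipfile

# Hypothetical set of extensions treated as "looks like a ROM".
ROM_EXTENSIONS = {".nes", ".sfc", ".smc", ".gb", ".gbc", ".gba", ".md", ".bin"}


def first_rom_in_zip(zip_source):
    """Return (member name, SHA1 hex digest) of the first ROM-like member.

    zip_source may be a path or a file-like object. Returns None when no
    member has an extension from ROM_EXTENSIONS.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            dot = name.rfind(".")
            if dot != -1 and name[dot:].lower() in ROM_EXTENSIONS:
                # Hash the uncompressed bytes, not the archive itself.
                return name, hashlib.sha1(archive.read(name)).hexdigest()
    return None
```

Hashing the uncompressed member rather than the zip means the same ROM should produce the same hash whether or not it is zipped.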
Guide:
Thanks to Floob there is a very nice video guide that is still valid:

Issues:
Since I’ve changed most of the code and don’t have a lot of tests, I’m sure I have created bugs. Please create issues here:
https://github.com/sselph/scraper/issues
07/04/2015 at 23:18 #101360

Floob
Member

Thanks very much for the update. It’s great!
Loving the extra description detail on mame roms.

I’ve added an error I found on your issue list, it may just be me doing something odd though.

I like the PSX support, although as I have very few PSX games and I don’t use a .cue for single-track games, I’ll probably still use the ad-hoc built-in scraper for those.

Thanks again for all the work you put into this, it makes Emulation Station so much nicer to use.

I’ll try to sort a new video for all these updates!

07/05/2015 at 03:05 #101368

sselph
Participant

Thanks for the report. I’ll release a fix soon if I don’t hear any other issues.

Regarding bin/cue: the scraper will still scrape the bin file if there isn’t a cue file. It looks for cue files, parses them, and gets a list of the associated bin files. It then hashes the files in order (cue, track 1, track 2, etc.) until it finds a match and uses that. If there isn’t a cue, it just treats the .bin as a binary and hashes it as normal.
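
The cue handling above could be sketched like this in Python. It is only an illustration under assumptions, not the scraper's Go code: the function names, the use of SHA1, the `known_hashes` dictionary, and the restriction to quoted `FILE "..."` lines are all made up for the example.

```python
import hashlib
import os
import re

# Matches cue sheet lines such as: FILE "Game (Track 1).bin" BINARY
CUE_FILE_LINE = re.compile(r'^\s*FILE\s+"([^"]+)"', re.IGNORECASE)


def bins_from_cue(cue_text):
    """List the file names referenced by quoted FILE lines, in order."""
    names = []
    for line in cue_text.splitlines():
        match = CUE_FILE_LINE.match(line)
        if match:
            names.append(match.group(1))
    return names


def sha1_of(path):
    """SHA1 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identify_cue(cue_path, known_hashes):
    """Hash the cue, then each track in order; return the first known ID.

    known_hashes is a hypothetical {sha1 hex digest: game ID} mapping.
    """
    with open(cue_path, "r", encoding="utf-8", errors="replace") as handle:
        tracks = bins_from_cue(handle.read())
    folder = os.path.dirname(cue_path)
    candidates = [cue_path] + [os.path.join(folder, t) for t in tracks]
    for path in candidates:
        if os.path.exists(path):
            game_id = known_hashes.get(sha1_of(path))
            if game_id is not None:
                return game_id
    return None
```

Stopping at the first match means a multi-track game is identified as soon as any one of its files is known, so only one entry is produced per cue file.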

07/05/2015 at 03:29 #101369

Floob
Member

Ah I see – that’s great. I’ll give it a go.

Can you remind me how the mame lookup works – which database does it check?

For example, I’ve got ddp2.zip which is:
http://www.progettoemma.net/index.php?gioco=ddp2&lang=en

but nothing scraped?

07/05/2015 at 03:39 #101370

Floob
Member

Scrap that – it found it this time – just no image returned.

One it didn’t find was wyvernf0.zip
http://www.progettoemma.net/index.php?gioco=wyvernf0&lang=en

07/05/2015 at 03:48 #101371

sselph
Participant

It uses mamedb.com. It strips off the file extension and pulls the url http://www.mamedb.com/game/wyvernf0

mamedb.com uses the MAME 0.147 set and wyvernf0 is from 0.154, so there is no page for it.
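
For illustration, the URL construction described above amounts to something like the following Python sketch. The function name `mamedb_url` is made up for the example; only the base URL comes from this post.

```python
import os

# Base URL as given in the post above.
MAMEDB_GAME_URL = "http://www.mamedb.com/game/"


def mamedb_url(rom_filename):
    """Strip any directory and the file extension, then append to the base URL."""
    name, _ext = os.path.splitext(os.path.basename(rom_filename))
    return MAMEDB_GAME_URL + name


print(mamedb_url("wyvernf0.zip"))  # http://www.mamedb.com/game/wyvernf0
```

Because the lookup is purely by name, a rom that is not in the site's set has no page to fetch, whatever its contents.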

07/05/2015 at 03:49 #101372
Floob
Member
Also, when processing mame4all roms I seem to periodically get these errors.

I don’t think it’s rom-specific though, as it’s a consecutive batch; on the next scrape they are fine and others complain.
```
2015/07/05 01:47:12 INFO: Starting: bosco.zip
2015/07/05 01:47:12 ERR: error processing bosco.zip: ILM Bad HTML
2015/07/05 01:47:12 INFO: Starting: bouldash.zip
2015/07/05 01:47:12 ERR: error processing bouldash.zip: ILM Bad HTML
2015/07/05 01:47:12 INFO: Starting: bouldash.zip
2015/07/05 01:47:12 ERR: error processing bouldash.zip: ILM Bad HTML
2015/07/05 01:47:12 INFO: Starting: bouldash.zip
2015/07/05 01:47:12 ERR: error processing bouldash.zip: ILM Bad HTML
2015/07/05 01:47:12 INFO: Starting: brain.zip
2015/07/05 01:47:13 ERR: error processing brain.zip: ILM Bad HTML
2015/07/05 01:47:13 INFO: Starting: brain.zip
2015/07/05 01:47:13 ERR: error processing brain.zip: ILM Bad HTML
2015/07/05 01:47:13 INFO: Starting: brain.zip
2015/07/05 01:47:13 ERR: error processing brain.zip: ILM Bad HTML
2015/07/05 01:47:13 INFO: Starting: breakers.zip
2015/07/05 01:47:13 ERR: error processing breakers.zip: ILM Bad HTML
2015/07/05 01:47:13 INFO: Starting: breakers.zip
2015/07/05 01:47:13 ERR: error processing breakers.zip: ILM Bad HTML
2015/07/05 01:47:13 INFO: Starting: breakers.zip
2015/07/05 01:47:14 ERR: error processing breakers.zip: ILM Bad HTML
2015/07/05 01:47:14 INFO: Starting: brkthru.zip
2015/07/05 01:47:14 ERR: error processing brkthru.zip: ILM Bad HTML
2015/07/05 01:47:14 INFO: Starting: brkthru.zip
2015/07/05 01:47:14 ERR: error processing brkthru.zip: ILM Bad HTML
2015/07/05 01:47:14 INFO: Starting: brkthru.zip
2015/07/05 01:47:14 ERR: error processing brkthru.zip: ILM Bad HTML
2015/07/05 01:47:14 INFO: Starting: brubber.zip
2015/07/05 01:47:15 ERR: error processing brubber.zip: ILM Bad HTML
2015/07/05 01:47:15 INFO: Starting: brubber.zip
```
07/05/2015 at 03:49 #101373

Floob
Member

[quote=101371]It uses mamedb.com. It strips off the file extension and pulls the url http://www.mamedb.com/game/wyvernf0

[/quote]

Ah – ok, that explains it. Thanks.

07/05/2015 at 03:56 #101375

sselph
Participant

Hmm, those errors are from the mame scraper fetching the URL and getting a response it can’t parse. Since it happens with different roms and in bursts, it might be some throttling or issues with the website.

07/05/2015 at 03:58 #101376

Floob
Member

Could a backupdb query work like this?

http://www.progettoemma.net/gioco.php?game=wyvernf0

with the image being:
http://www.progettoemma.net/snap/wyvernf0/0000.png

Just a thought. I’m more than impressed with what it does already!

07/05/2015 at 04:02 #101377

sselph
Participant

Yeah, we can create a backup DB. For the metadata, I could probably download another dat file, parse it, and shove it in the same data store I’m using for history, then point to images on another site or see how taxing it would be to host them.

07/05/2015 at 04:03 #101378

Floob
Member

[quote=101375]Hmm those errors are from the mame scraper trying to parse the result of getting the URL and getting a response it can’t parse. Since it happens with different roms and in bursts might be some throttling or issues with the website.

[/quote]

Just tried it again, and it’s fine now. Must have been a temporary bottleneck like you said.

07/05/2015 at 13:25 #101396
Floob
Member
Just had a major meltdown with some atarilynx rom scraping which seemed fine before. Can you see where the issue may be?
```
github.com/sselph/scraper/ds.(*Hasher).Hash(0x1080aa90, 0x10f1d320, 0x23, 0x0, 0x0, 0x0, 0x0)
        /home/sselph/go/src/github.com/sselph/scraper/ds/hasher.go:32 +0x170 fp=0x1a462a4c sp=0x1a4629e0
github.com/sselph/scraper/ds.(*Hasher).Hash(0x1080aa90, 0x10f1d320, 0x23, 0x0, 0x0, 0x0, 0x0)
        /home/sselph/go/src/github.com/sselph/scraper/ds/hasher.go:32 +0x170 fp=0x1a462ab8 sp=0x1a462a4c
github.com/sselph/scraper/ds.(*Hasher).Hash(0x1080aa90, 0x10f1d320, 0x23, 0x0, 0x0, 0x0, 0x0)
        /home/sselph/go/src/github.com/sselph/scraper/ds/hasher.go:32 +0x170 fp=0x1a462b24 sp=0x1a462ab8
...additional frames elided...
created by main.CrawlROMs
        /home/sselph/go/src/github.com/sselph/scraper/scraper.go:173 +0x5e4

goroutine 1 [chan send]:
main.CrawlROMs(0x11522cc0, 0x10a48010, 0x1, 0x1, 0x10810140, 0x1080aa88, 0x0, 0x0)
        /home/sselph/go/src/github.com/sselph/scraper/scraper.go:184 +0xf98
main.Scrape(0x10a48010, 0x1, 0x1, 0x10810140, 0x1080aa88, 0x0, 0x0)
        /home/sselph/go/src/github.com/sselph/scraper/scraper.go:285 +0x194
main.main()
        /home/sselph/go/src/github.com/sselph/scraper/scraper.go:414 +0xf54

goroutine 5 [syscall]:
os/signal.loop()
        /usr/local/go/src/os/signal/signal_unix.go:21 +0x1c
created by os/signal.init·1
        /usr/local/go/src/os/signal/signal_unix.go:27 +0x40

goroutine 15 [chan receive]:
main.func·003()
        /home/sselph/go/src/github.com/sselph/scraper/scraper.go:187 +0x60
created by main.CrawlROMs
        /home/sselph/go/src/github.com/sselph/scraper/scraper.go:184 +0x938

goroutine 14 [chan receive]:
main.func·002()
        /home/sselph/go/src/github.com/sselph/scraper/scraper.go:177 +0x94
created by main.CrawlROMs
        /home/sselph/go/src/github.com/sselph/scraper/scraper.go:180 +0x6b8

goroutine 10 [select]:
net.func·019()
        /usr/local/go/src/net/dnsclient_unix.go:241 +0x310
```
07/05/2015 at 15:39 #101406

sselph
Participant

Thanks!

I think I see the error. I have submitted a fix and am releasing a new version. Hopefully I get all the issues before I hit 1.0.0 :)

07/06/2015 at 23:05 #101522

robertybob
Participant

Keep up the great work Sselph! If ever you want to add more systems and want someone to help you match up IDs or whatever, just ask me :)

07/08/2015 at 14:01 #101618

ekstreme
Participant

Working well for me. Just scraped my GnGeo set.

07/19/2015 at 08:18 #102257

socalretrogamer
Participant

Thanks for this scraper! It works great! Much, much better than the scraper on Emulation Station. There are still a lot of games it didn’t scrape, but I think that’s because some of the ROM file names are truncated. For example, “Zelda2” (no space) didn’t scrape for NES. At some point, I plan on renaming all the files that didn’t scrape, so would you have any tips to ensure the scraper recognizes the title? Particularly a sequel game like Zelda 2? Thanks again!

07/19/2015 at 13:17 #102273
sselph
Participant
Hi socalretrogamer,

To minimize false positives, on consoles I’m not actually using the name of the file, only the extension, so you could name them 1.nes, 2.nes and it should still work. The scraper uses the rom data itself: it hashes it and compares the result against a hash-to-ID mapping database I generated by hand for each system it supports.

So there are several reasons it may not have scraped a rom:
- Different ROM dump and therefore different hash. The Zelda2 could be bad, hacked, overdumped, or a rev not in the No-Intro hashes I used.
- No entry in thegamesdb. For SNES there are 3385 No-Intro roms and only 1055 games in thegamesdb. With clones I matched 2434.
- No entry in my DB. Because I have to manually add the hash>ID mapping, new entries don’t appear automatically.
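
To make the hash lookup concrete, here is a minimal Python sketch (not the scraper's actual Go code). It assumes a SHA1 hash and a plain dictionary as the hash>ID database; the function name and the example ID are hypothetical.

```python
import hashlib


def lookup_game_id(rom_path, hash_to_id):
    """Identify a ROM purely from its contents; the file name is never used.

    hash_to_id is a hypothetical {sha1 hex digest: game ID} mapping.
    Returns None when the dump's hash is not in the mapping.
    """
    with open(rom_path, "rb") as handle:
        digest = hashlib.sha1(handle.read()).hexdigest()
    return hash_to_id.get(digest)
```

With this approach, renaming a file changes nothing, while changing even one byte of the dump (a hack, a translation patch, a different revision) produces a different hash and therefore no match.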
07/19/2015 at 17:02 #102294

gutossn
Participant

The scraper is amazing! Very fast and doesn’t freeze ES. Could you add WonderSwan and Neo Geo Pocket (and Color too) to the database? Thank you.

07/19/2015 at 17:59 #102297

sselph
Participant

I have issues on GitHub tracking the addition of new systems. It is a function of: are there available hashes, what are the file formats, are there entries in thegamesdb.net, how many games, how busy I am, etc.

Feel free to add issues for each system but I can’t make any promises until I look more closely.

08/02/2015 at 02:46 #103154

Anonymous
Inactive

Hi sselph!!
First of all, many thanks for this awesome scraper!!!

I have one question I can’t find a solution to (maybe I’m too much of a newbie ;)

If I start a scraper session and, for any reason (I abort it with Ctrl+C, or the scraper shows errors and exits), it doesn’t finish a complete rom directory, how can I continue the session without re-analyzing all the roms that were already scraped correctly?

Thanks again for your hard work with this great super-tool!! :)

*EDIT*
Ok, I think I need to use -append=true param…

08/02/2015 at 03:03 #103156

sselph
Participant

Hi,

Yes the -append flag should be what you are looking for, although the scraper will skip downloading any images that already exist so should be fast to catch back up either way.
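
Conceptually, the skipping that -append does can be pictured like the Python sketch below. This is an assumption-laden illustration rather than the scraper's real logic: it assumes gamelist.xml stores each rom under a `<path>` element inside `<game>`, and the function names are made up.

```python
import os
import xml.etree.ElementTree as ET


def paths_in_gamelist(xml_text):
    """Collect the normalized rom paths already listed in a gamelist."""
    root = ET.fromstring(xml_text)
    return {
        os.path.normpath(node.text.strip())
        for node in root.iter("path")
        if node.text
    }


def roms_to_scrape(rom_names, xml_text):
    """Return only the roms that do not yet have an entry in the gamelist."""
    done = paths_in_gamelist(xml_text)
    return [name for name in rom_names if os.path.normpath(name) not in done]


existing = "<gameList><game><path>./a.nes</path><name>A</name></game></gameList>"
print(roms_to_scrape(["a.nes", "b.nes"], existing))  # ['b.nes']
```

Normalizing both sides lets `./a.nes` in the gamelist match `a.nes` on disk.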

I have too many flags :)

08/17/2015 at 16:08 #104126

Omnija
Participant

Will there be support for psx .pbp formats?

08/18/2015 at 03:45 #104178

sselph
Participant

I don’t know enough about the pbp file format to know if I could translate the information it contains to what would have been in the original bin file to match it against the hash in redump.

08/19/2015 at 10:35 #104268

Anonymous
Inactive

Great work on version 1.0.0 sselph!!

I have a question. I have a complete collection of PAL Megadrive boxart. Why, you may ask? Well, the PAL look of the boxart is much more appealing to me (being from the UK) and it actually has MegaDrive on the boxart. Is there a way we can implement scraping just PAL box art for the Megadrive? I can upload these images to a place of your choosing if that would help bring this idea into reality.

08/19/2015 at 17:08 #104281

sselph
Participant

There are a couple of issues with the whole megadrive/genesis situation. The first is that when I did the mapping from hash to thegamesdb ID, I didn’t really care which version I chose as long as there was a match. So if there were a US version and an EU version I just chose one at random; sometimes I looked to see which one had the best description or clearer image. The other issue is data quality from thegamesdb: there are several megadrive games that have genesis art and possibly vice versa.

When I have time to remap MD and GEN, I’ll take more care to only give an MD version a GEN match if there isn’t an MD entry in the DB, and vice versa. Ideally we could get the entries in thegamesdb fixed and improved so that other projects benefit as well.

I have tinkered with the idea of setting up a repository of my own to improve some of the MAME stuff but haven’t had time. If I do, I’ll see if I could do something similar for other systems but I imagine the cost would be prohibitive and I won’t actually do any of it :)

08/23/2015 at 14:17 #104523

greyhulk
Participant

Hi guys, I’m using the inbuilt scraper on PSX games. It finds the relevant artwork etc., but when I restart my Pi it’s all missing again. Any advice?

thanks
steve

08/23/2015 at 17:18 #104533

herbfargus
Member

It may not be writing manual changes unless you cleanly exit emulationstation. So select Quit EmulationStation from the start menu and when it reloads see if your changes are saved.

08/28/2015 at 17:52 #104907

Anonymous
Inactive

Is there a build for windows at all?

08/29/2015 at 02:43 #104946

sselph
Participant

I make several prebuilt binaries available at https://github.com/sselph/scraper/releases

or if you’re the type that likes compiling it yourself, there are no special instructions for doing it on windows.

08/31/2015 at 14:08 #105092

Anonymous
Inactive

Nice!, thanks

11/12/2015 at 19:53 #109770

phantom27
Participant

Ok… So I might be dumb…. No… I’m pretty sure I am… but I need help.

I have a ROM database that I tried running this on. I did it on my mac. It looked like it worked. Even said saving session… etc. But I can’t find the gamelist.xml file. I even searched my mac for it.

I’m probably doing something wrong.

11/12/2015 at 19:55 #109771

phantom27
Participant

Yep, I’m an idiot apparently. I didn’t realize it would put it in my ‘home’ folder. Found it.

Ok, stupid question. If I put this file in my ROM folder on my Pi, will it work or are the paths all messed up since I ran it on my mac?

11/16/2015 at 14:48 #110041

sselph
Participant

Hmm, the gamelist should be in the same directory where the script was run. I’ve heard some other complaints about this so maybe something has changed.

Anyway, if you ran the script from inside a folder with a bunch of roms and didn’t change any of the flags, all the paths should be correct. Just put the gamelist in the rom folder along with all the roms and the images folder.

01/23/2016 at 03:41 #114810

proxycell
Participant

Hey Steven,
Long time since I last used your scraper

I hope this thread is the one to be used for such things:

How would I go about ADDING to this database? I have every fan-translated game there is and I would love for them to be scraped as the original game