Time to discover Apache Spark... 🧐
@angristan hey there :) Same thing here, for some time now. It's a pretty great tool, though a bit complex
@angristan why didn't you just use pandas from the beginning?
@alexcleac I don't know man, the teacher wants us to use Spark...
@angristan just to be sure that you understand the code correctly: you read a file, push it onto the executor nodes, and then pull everything back to the driver by calling "toPandas". So... you could do this faster without using Spark at all 😂
@angristan after actually reading the code, I think it is possible to make use of Spark's parallel nature, but that would make the code less pythonic and more Scala-like
@alexcleac I guess it's more performant with these 300k lines of JSON I have to import
@angristan probably, because the JSON file will be read on the Java side rather than the Python side, which is (probably) faster. Also, you can get rid of "drop_duplicates" on the driver by doing "spark_df.distinct().toPandas()". That will do the de-duplication in parallel. Java-to-Python conversion of data is pretty slow, so it is better to convert less data :)
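For anyone following along, here is a rough sketch of that suggestion. The helper names (`make_session`, `load_distinct`), the app name, and the file name are made up for illustration; only `spark.read.json`, `distinct()` and `toPandas()` come from the thread and the PySpark API. It also assumes the file holds one JSON object per line, which is what `spark.read.json` expects by default.

```python
def make_session(app_name="dedup-sketch"):
    # Hypothetical helper: the import is lazy so the function below
    # can be read (and exercised) without a Spark install.
    from pyspark.sql import SparkSession
    return SparkSession.builder.appName(app_name).getOrCreate()


def load_distinct(spark, path):
    """Read JSON with Spark, de-duplicate there, then convert to pandas."""
    spark_df = spark.read.json(path)  # parsed on the JVM side, not in Python
    # distinct() runs on the executors, so only unique rows
    # go through the slow JVM-to-Python conversion in toPandas().
    return spark_df.distinct().toPandas()
```

Something like `df = load_distinct(make_session(), "data.json")` would then replace the "toPandas, then drop_duplicates" step, with `data.json` standing in for the real file.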
@alexcleac thank you! I'm not experienced at all in that field
@angristan happy to help ^_^
Though I'm not very experienced with this tech either, just sharing things I've had issues with lately.