@angristan hey there :) Same thing here, for some time. It's a pretty great tool, though a bit complex

@angristan why didn't you just use pandas from the beginning?

@alexcleac i don't know man, the teacher wants us to use Spark...

@angristan just to be sure that you understand the code correctly: you just read a file, push it onto an executor node, and then pull everything back to the driver with "toPandas". So... you could do this faster without using Spark at all 😂

@angristan after actually reading the code, I think it is possible to use the parallel nature of Spark, but that will make this code less Pythonic and more Scala-like

@alexcleac I guess it's more performant with these 300k lines of JSON I have to import

@angristan probably, because the JSON file will be read not on the Python side but on the Java side, which is (probably) faster. Also, you can get rid of "drop_duplicates" on the driver by doing "spark_df.distinct().toPandas()". That will do the de-duplication in parallel. Java-to-Python conversion of data is pretty slow, so it is better to convert less data :)
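The trade-off described above can be sketched in plain Python, with no Spark involved, just the principle: deduplicating *before* an expensive conversion step means fewer rows pay that cost. The records and field names here are made up for illustration; in real PySpark, the "convert everything first" pattern corresponds to `spark_df.toPandas().drop_duplicates()` and the "deduplicate first" pattern to `spark_df.distinct().toPandas()`.

```python
import json

# Hypothetical records, standing in for lines of a JSON-lines file.
raw_lines = [
    '{"user": "a", "value": 1}',
    '{"user": "a", "value": 1}',  # exact duplicate
    '{"user": "b", "value": 2}',
]

# Pattern 1: convert everything, then deduplicate.
# (Analogous to toPandas().drop_duplicates(): every row, duplicates
# included, pays the slow Java-to-Python conversion cost first.)
converted_then_deduped = []
seen = set()
for line in raw_lines:
    record = json.loads(line)              # every row is converted
    key = tuple(sorted(record.items()))
    if key not in seen:
        seen.add(key)
        converted_then_deduped.append(record)

# Pattern 2: deduplicate first, then convert.
# (Analogous to distinct().toPandas(): duplicates are dropped before
# conversion, so fewer rows cross the JVM/Python boundary.)
deduped_then_converted = [json.loads(line) for line in dict.fromkeys(raw_lines)]

# Same result either way -- pattern 2 just converts less data.
assert converted_then_deduped == deduped_then_converted
print(len(deduped_then_converted))  # 2 unique records
```

The simplification in this sketch is that pattern 2 dedupes on the raw strings, which only works because the duplicates are byte-identical; Spark's `distinct()` compares row values, so it also catches duplicates that are formatted differently.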

@alexcleac thank you! I'm not experienced at all in that field

@angristan happy to help ^_^

Though I'm not very experienced with this tech either, just sharing things I've run into lately.
